Collaboration & evaluation for LLM apps

Released Tuesday, 23rd January 2024

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements may have changed.

0:06

Welcome to Practical AI. If

0:09

you work in artificial intelligence, aspire

0:12

to, or are curious

0:14

how AI-related tech is changing the

0:17

world, this is the show for

0:19

you. Thank you to

0:21

our partners at fly.io, the home

0:23

of changelog.com. Fly

0:26

transforms containers into microvms that run

0:28

on their hardware in 30 plus

0:30

regions on six continents so you

0:32

can launch your app near your

0:35

users. Learn more at fly.io.

0:43

Welcome to another episode of

0:45

Practical AI. This is Daniel

0:47

Whitenack. I am CEO and

0:49

founder at Prediction Guard and

0:51

really excited today to be joined by

0:53

Dr. Raza Habib, who is

0:55

CEO and co-founder at Humanloop. How are

0:57

you doing, Raza? Hi Daniel, it's a

1:00

pleasure to be here. I'm doing very

1:02

well. Yeah, thanks for having me on.

1:04

Yeah, yeah, I'm super excited to

1:06

talk with you. I'm mainly

1:09

excited to talk with you selfishly

1:11

because I see the amazing things

1:13

that Humanloop is doing and the

1:15

really critical problems that you're thinking

1:18

about and every day of my

1:20

life. It's like, how am I

1:22

managing prompts and how does

1:24

this next model that I'm upgrading to,

1:27

how do my prompts do in that

1:29

model and how am I

1:32

constructing workflows around using LLMs,

1:34

which it definitely seems to

1:36

be the main thrust of some of

1:38

the things that you're thinking about at

1:40

Humanloop. Before we get into

1:42

the specifics of those things at Humanloop,

1:45

would you mind setting the context

1:47

for us in terms of workflows

1:49

around these LLMs, collaboration on teams?

1:51

How did you start thinking about

1:54

this problem and what

1:56

does that mean in reality for

1:58

those working in industry right now,

2:00

maybe more generally than at Humanloop. Yeah, absolutely.

2:02

So I guess on the question of how

2:05

I came to be working on this problem,

2:07

it was really something that my

2:09

co-founders, Peter and Jordan, and I had been working on for

2:11

a very long time, actually. So previously,

2:13

Peter and I did PhDs together around

2:15

this area. And then when we started

2:17

the company, it was a little while

2:19

after transfer learning had started to work

2:21

in NLP for the first time. And

2:24

we were mostly helping companies fine tune

2:26

smaller models. But then sometime midway through

2:28

2022, we became absolutely convinced

2:30

that the rate of progress for these larger

2:32

models was so high, it was going to

2:34

start to eclipse essentially everything else

2:36

in terms of performance. But more importantly, in

2:38

terms of usability, right, it was the first

2:41

time that instead of having to like hand

2:43

annotate a new data set for every new

2:45

problem, there was this new way of customizing

2:47

AI models, which was that you could write

2:49

instructions in natural language, and have a reasonable

2:51

expectation that the model would then do that

2:54

thing. And that was unthinkable, you know, at

2:56

the start of 2022, I would say, or

2:58

maybe a little bit earlier. And

3:00

so that's really what made us want to

3:02

go work on this, because we realized that

3:05

the potential impact of NLP was already there.

3:07

But the accessibility had been expanded so far,

3:09

and the capabilities of the models have increased

3:11

so much that there was a particular moment

3:14

to go do this. But

3:16

at the same time, it introduced a whole bunch

3:18

of new challenges, right? So I guess historically, the

3:20

people who are building AI systems were machine learning

3:22

experts, the way that you would do it is

3:25

you would collect annotated data, you'd fine-tune

3:27

a custom model, it was typically being

3:29

used for like one specific task at a

3:31

time, there was a correct answer, so

3:33

it was easy to evaluate. And with

3:35

LLMs, the power also brings new challenges.

3:37

So the way that you customize these

3:40

models is by writing these natural language

3:42

instructions, which are prompts. And

3:44

typically, that means that the people involved don't

3:46

need to be as technical. And usually, we

3:48

see actually that the best people to

3:51

do prompt engineering tend to have domain expertise.

3:53

So often it's a product manager or someone

3:55

else within the company who is leading the

3:57

prompt engineering efforts. But you also have

3:59

this new artifact lying around, which is

4:01

the prompt. And it has a similar impact

4:04

to code on your end application. So it

4:06

needs to be versioned and managed and treated

4:08

with the same level of respect and rigor

4:10

that you would treat normal code. But somehow

4:13

you also need to have the right workflows

4:15

and collaboration that lets the non-technical people work

4:17

with the engineers on the product or the

4:19

less technical people. And then the extra

4:22

challenge that comes with it as well is

4:24

that it's very subjective to measure performance

4:26

here. So in traditional code, we're used

4:28

to running unit tests, integration tests, regression

4:30

tests. We know what good looks like

4:33

and how to measure it. And even

4:35

in traditional machine learning, there's

4:37

a ground truth data set, people

4:39

calculate metrics. But once you go

4:41

into generative AI, it tends

4:43

to be harder to say what is the

4:45

correct answer. And so when that becomes difficult,

4:48

then measuring performance becomes hard. If measuring performance

4:50

is hard, how do you know when you

4:52

make changes if you're gonna cause regressions? Or

4:55

all the different design choices you have in developing

4:57

an app. How do you make those design choices

5:00

if you don't have good metrics of performance?

5:02

And so those are the problems that motivated

5:04

what we've built and really human

5:07

loop exists to solve both of these

5:09

problems. So to help companies with the

5:11

task of finding the best prompts, managing,

5:13

versioning them, dealing with collaboration, but then

5:15

also helping you do the evaluation that's

5:17

needed to have confidence that

5:19

the models are gonna behave as you expect in production.
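
To make the idea of the prompt as a versioned artifact concrete, here is a minimal sketch of what such an artifact might look like if you managed it yourself. The PromptVersion class and its fields are illustrative assumptions, not Humanloop's actual schema.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PromptVersion:
    """One saved version of a prompt: the template plus the settings it was tested with."""
    name: str
    template: str          # e.g. "Summarize this call for a salesperson:\n{transcript}"
    model: str             # e.g. "gpt-4" or a self-hosted model name
    temperature: float = 0.7

    @property
    def version_id(self) -> str:
        # Hash the content so any edit (even whitespace) yields a new version id.
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]

def save(prompt: PromptVersion, path: str) -> None:
    """Serialize the prompt so it can be reviewed, diffed, and deployed like code."""
    with open(path, "w") as f:
        json.dump({"version_id": prompt.version_id, **asdict(prompt)}, f, indent=2)

# A domain expert edits the template; the change is captured as a new, named version.
v1 = PromptVersion(name="call-summary", template="Summarize this call:\n{transcript}", model="gpt-4")
save(v1, "call-summary.prompt.json")
```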

5:22

And as related to these things, maybe

5:24

you can start with one that you

5:26

would like to start with and go

5:29

to the others. But in terms

5:31

of managing, versioning prompts,

5:33

evaluating the performance of these

5:35

models, dealing with regressions, as

5:38

you've kind of seen people try to

5:41

do this across probably a lot of

5:43

different clients, a lot of different industries,

5:46

how are people trying to manage

5:49

this in maybe some good ways

5:51

and some bad ways? Yeah, I think we see a

5:53

lot of companies go on a bit of a journey.

5:55

So early on, people are

5:58

excited about generative AI and LLMs. There's

6:00

a lot of hype around it now,

6:02

so some people in the company just

6:04

go try things out, and often they'll

6:06

start off using one of the large

6:08

publicly available models: OpenAI, Anthropic,

6:10

Cohere, one of these. They prototype on

6:12

their own in a playground environment that

6:14

those providers have, still eyeball a few

6:16

samples, maybe grab a couple of

6:18

libraries that support orchestration and they'll put

6:20

together a prototype and the first version

6:22

is fairly easy to build. It's,

6:24

you know, it's very quick to get

6:26

to, like, the first wow moments. And

6:29

then as people start moving towards production, and

6:31

they start iterating from that, you know,

6:33

maybe eighty percent good enough version to something

6:35

that they really trust, they start to run

6:37

into these problems of like, oh, I've got

6:39

like twenty different versions of this prompt, and

6:41

I'm storing it as a string in code, and

6:43

actually I wanna be able to collaborate with

6:45

a colleague on it, and so now we're

6:45

sharing things, you know, either by screen sharing or

6:47

the like. You know, we've had some

6:49

serious companies who you would have heard of who

6:52

were sending their model configs to each other

6:54

via a mix of sheets. And obviously

6:56

you wouldn't send someone an important piece of

6:58

code over Slack or Teams or something

7:02

like this. But because the collaboration software

7:04

isn't there to bridge the technical and non-

7:06

technical divide, that was the kind of

7:08

problems we see. And so at this

7:10

point typically a year ago people would

7:12

start building their own solution. More

7:14

often than not, like this is when

7:16

people would start building in house tools

7:18

Increasingly, because there are companies like Human-

7:20

loop around, that's usually when someone books

7:22

a demo with us and they say

7:24

hey, you know we've reached this point

7:26

where actually managing these artifacts has become

7:28

cumbersome, we're worried about the quantity of

7:30

what we're producing, do you have a solution

7:33

to help? And the way that Human-

7:35

loop helps at least on the prompt

7:37

management side is we have this interactive

7:39

environment. It's a little bit like those

7:41

OpenAI playgrounds or the Anthropic

7:43

Playground, but a lot more fully featured

7:45

and designed for actual development. So it's

7:47

collaborative, it has history built in, and you

7:49

can connect variables and datasets and so

7:51

it becomes like a development environment for

7:53

your sort of LLM application. You

7:55

can prototype the application, interact with it,

7:57

try out a few things, and then

7:59

people progress from that development

8:01

environment into production through

8:03

evaluation and monitoring.
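
As a rough sketch of what progressing from a shared development environment into production can look like in application code, the snippet below pulls whatever prompt version is currently deployed to an environment instead of hard-coding the string. The PromptRegistry client here is a hypothetical in-memory stand-in, not Humanloop's actual SDK.

```python
from dataclasses import dataclass

@dataclass
class DeployedPrompt:
    version_id: str
    template: str
    model: str

class PromptRegistry:
    """Toy registry mapping (prompt name, environment) -> the deployed version."""
    def __init__(self):
        self._deployments = {}

    def deploy(self, name: str, environment: str, prompt: DeployedPrompt) -> None:
        self._deployments[(name, environment)] = prompt

    def get(self, name: str, environment: str) -> DeployedPrompt:
        return self._deployments[(name, environment)]

registry = PromptRegistry()
registry.deploy(
    "support-answer", "production",
    DeployedPrompt(
        version_id="abc123",
        template="Answer using only this context:\n{context}\n\nQ: {question}",
        model="gpt-4",
    ),
)

# Application code asks for whatever is deployed to production at request time,
# so a new prompt version can be promoted without an engineer editing code.
prompt = registry.get("support-answer", "production")
print(prompt.version_id, prompt.model)
```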

8:06

You mentioned this kind of in passing. I'd

8:08

love to dig into at a little

8:10

bit more. You mentioned kind of the

8:12

types of people that are coming you

8:14

know are at the table in designing

8:16

these systems, and oftentimes domain experts.

8:18

You know, previously, in working as a

8:20

data scientist it was always kind of

8:22

assumed oh you need to talk to

8:24

the domain experts but it's sort of

8:26

like at least for many years it

8:28

was like data scientist talk to the

8:30

domain experts and then go off and

8:32

build their thing. The domain experts were

8:34

not involved in the sort of building

8:36

of the system and even then

8:39

like the data scientists were maybe

8:41

building things that were kind of

8:43

foreign to software engineers, and

8:45

what I'm hearing you say is

8:47

you kind of got like these

8:49

multiple layers you have like domain

8:51

experts who might not be that

8:53

technical you've got may be a

8:55

I, and data people who are

8:57

using this kind of unique set

8:59

of tools, maybe even they're hosting

9:02

their own models and then you've

9:04

got like product, software engineering people.

9:06

It seems like a much more

9:08

complicated landscape of interactions. Have

9:10

you seen this kind of

9:12

play out in reality? In

9:14

terms of non-technical people

9:16

and technical people both working

9:18

together on something that is

9:20

ultimately something implemented in code

9:23

and run as an application.

9:25

Absolutely. One of the most exciting

9:27

things about LLMs and the generative

9:29

AI era in general is that product managers

9:32

and subject matter experts can for the

9:34

first time be very directly involved in

9:36

implementing these applications. So I think it's

9:38

always been the case that the PM

9:40

or someone like that, you know, is

9:42

the person who distills the problem, speaks

9:44

to the customers, produces the spec, but

9:47

there's a translation step where they sort

9:49

of produce a document and

9:51

then someone else goes off and implements

9:53

it. And because we're now able to

9:55

program some of the application

9:57

in natural language, actually it's accessible to

9:59

those people very directly. And it's worth

10:01

giving a concrete example. Like, I

10:03

use an AI note taker for

10:05

a lot of my sales calls, and it

10:08

records the call and then I get

10:10

a summary afterwards and the app actually

10:12

allows you to choose a lot of different

10:14

types of summary. So you can say,

10:16

hey, I'm a salesperson, I want a summary

10:18

that will extract budget and authority and

10:20

need and timeline, versus you can say,

10:22

oh, actually I had a product interview and

10:25

I want a different type of summary

10:27

and if you think about developing that

10:29

application, the person who has the knowledge that's

10:31

needed to say what a good summary

10:33

is and write the prompt for the

10:35

model is the person with the domain expertise,

10:37

not the software engineer. But obviously

10:39

the prompt is only one piece of

10:41

the application, right? If you've got a question

10:43

answering system, there's usually retrieval as part

10:45

of this. There may be other components; usually

10:48

the LLM is a block in a wider

10:50

application. You obviously still need the software

10:52

engineers around because they're implementing the bulk of

10:54

the application, but the product managers are

10:56

much more directly involved and then you

10:58

know, actually we see increasingly less

11:00

involvement from machine learning or AI experts,

11:02

and fewer people are fine-tuning their

11:04

own models. For the majority of product

11:07

teams we're seeing, there is an

11:09

AI platform team that maybe is facilitating,

11:11

setting things up, but the bulk of

11:13

the work is led by the product

11:15

managers and then the engineers and one

11:17

interesting example of this, on the extreme

11:20

end, is one of our customers, a

11:22

very large tech company: they actually

11:24

do not let their engineers edit

11:26

the prompts. They have a team

11:28

of linguists who do prompt development.

11:31

The linguists finalize the prompts, they're saved in

11:33

a serialized format, and they go

11:35

to production, but it's a one-

11:37

way transfer, so the engineers can't

11:39

edit them, because they're not considered

11:41

able to assess the actual

11:43

outputs, even though they are responsible for the

11:45

rest of the application. Just thinking

11:47

about how teams interact and who's

11:50

doing what it seems like the

11:52

problems that you've laid out are, I

11:54

think, very clear and worth solving,

11:56

but it's probably hard to think

11:58

about, oh, am I building a

12:00

developer tool or am I building

12:02

something that these non-technical people interact

12:04

with or is it both? How

12:06

did you think about that as

12:08

you entered into the stages of

12:11

bringing Humanloop into existence? I

12:13

think it has to be both.

12:16

And the honest answer is it evolved organically

12:18

by going to customers, speaking to them about

12:20

their problems and trying to figure out what

12:22

the best version of a solution looks like.

12:24

So we didn't set out to build a

12:26

tool that needed to do both of these

12:28

things. But I think the reality is, given

12:31

the problems that people face, you do need both. An

12:34

analogy to think about might be something

12:37

like Figma. Figma is somewhere

12:39

where multiple different stakeholders come together

12:41

to iterate on things and to develop them and

12:43

provide feedback. And I think you need something analogous

12:46

to that for Gen AI, although it's not

12:48

an exact analogy because we also need to attach

12:50

the evaluation to this. So it's

12:52

almost by necessity that we've had to do that.

12:55

But I also think that it's very

12:57

exciting. And the reason I think it's

12:59

exciting is because it is expanding who

13:01

can be involved in developing these applications.

13:22

If you're listening, you know software is built

13:24

from thousands of small technical choices. And

13:27

some of these seemingly inconsequential choices

13:29

can have a profound impact on the

13:31

economics of internet services, who gets to

13:33

participate in them, build them and profit

13:36

from them. This is especially true

13:38

for artificial intelligence, where the decisions we

13:40

make today can determine who can

13:42

have access to world changing technologies and

13:44

who can decide their future. Read, write,

13:47

own, building the next era of the

13:49

internet is a new book from

13:51

startup investor Chris Dixon that explores the

13:53

decisions that took us from open

13:55

networks governed by communities to massive social

13:58

networks run by internet giants. This

14:00

book, Read Write Own, is a

14:02

call to action for building a

14:04

new era of the internet that

14:06

puts people in charge. From AI

14:08

projects that compensate creators for their

14:11

work to protocols that fund open

14:13

source contributions, this is our chance

14:15

to build the internet we want,

14:17

not the one we inherited. Order

14:19

your copy of Read Write

14:22

Own today or go to

14:24

readwriteown.com to learn more. You

14:40

mentioned how this environment

14:42

of domain experts coming together

14:44

and technical teams coming together

14:47

in a collaborative environment opens

14:49

up new possibilities for both

14:52

collaboration and innovation. I'm wondering if at

14:54

this point you could kind of just

14:56

lay out, we've talked about

14:58

the problems, we've talked about those involved

15:00

and those kind of that would use

15:02

such a system or a platform to

15:04

enable these kinds of workflows. Could

15:07

you describe a little bit more what

15:09

human loop is specifically in

15:11

terms of both what it

15:14

can do and kind of

15:16

how these different personas engage

15:18

with the system? Yes,

15:20

I guess in terms of what it can do

15:22

concretely, it's firstly helping

15:24

you with prompt iteration, versioning and management

15:26

and then with evaluation and monitoring and

15:28

the way it does that is there's

15:30

a web app and there's a web

15:32

UI where people are coming in and

15:34

in that UI is an

15:36

interactive playground-like environment where they

15:38

try out different prompts, they can compare

15:40

them side by side with different models,

15:42

they can try them with different inputs

15:45

when they find versions that they think

15:47

are good, they save them and

15:49

then those can be deployed from that environment

15:51

to production or even to a development

15:53

or staging environment so that's

15:56

the kind of development stage and then

15:58

once you have something that's developed what's

16:00

very, very typical is people then want

16:02

to put in evaluation steps into place.

16:04

So you can define gold standard test

16:06

sets, and then you can define

16:08

evaluators within human loop. And evaluators

16:10

are ways of scoring the outputs of

16:12

a model or a sequence of models

16:14

because oftentimes the LLM is part

16:16

of a wider application. And so

16:18

the way that scoring works is there's

16:21

very traditional metrics that you would have in code

16:23

for any machine learning system. So

16:25

precision, recall, ROUGE, BLEU, these kinds of

16:27

scores that anyone from a machine learning

16:29

background would already be familiar with. But

16:31

what's new in the kind of LLM space

16:34

is also things that help when things are

16:36

more subjective. So we have the ability to

16:38

do model as judge, where you might actually

16:40

prompt another LLM to score the output in

16:42

some way. And this can be particularly useful

16:45

when you're trying to measure things like hallucination.

16:48

So a very common thing to do is to

16:50

ask the model, is the

16:52

final answer contained within the retrieved context?

16:55

Or is it possible to infer the answer

16:57

from the retrieved context? And you can calculate

16:59

those scores.
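
To make the model-as-judge idea concrete, here is a minimal sketch of a hallucination evaluator. The call_llm function is a placeholder for whichever model client you actually use, and the judge prompt wording is an illustrative assumption.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your model client (OpenAI, Anthropic, a self-hosted model, ...)."""
    raise NotImplementedError

JUDGE_TEMPLATE = """You are grading a question-answering system for hallucination.
Context:
{context}

Answer to grade:
{answer}

Is the answer fully supported by the context? Reply with exactly "yes" or "no"."""

def grounded_in_context(answer: str, context: str) -> bool:
    """Model-as-judge evaluator: ask a second model whether the answer is supported."""
    verdict = call_llm(JUDGE_TEMPLATE.format(context=context, answer=answer))
    return verdict.strip().lower().startswith("yes")

def hallucination_rate(rows: list[dict]) -> float:
    """Score a batch of logged generations to get a rate you can track over time."""
    flagged = [r for r in rows if not grounded_in_context(r["answer"], r["context"])]
    return len(flagged) / max(len(rows), 1)
```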

17:01

And then the final way is we also support human evaluation. So

17:04

in some cases, you really do want either

17:06

feedback from an end user or

17:08

from an internal annotator involved as well.

17:11

And so we allow you to gather

17:13

that feedback either from your live production application

17:16

and have it logged against your

17:18

data. Or you can cue internal

17:20

annotation tasks from a team. And

17:22

I can maybe tell you a little bit more about

17:24

sort of in-production feedback, because that's something that's

17:26

actually where we started. Yeah, yeah, go ahead. I would

17:29

love to hear more. Yes, I think that because

17:31

it's so subjective for a lot of the

17:33

applications that people are building, whether it be

17:36

email generation, question answering, a

17:38

language learning app, there isn't

17:40

a correct answer, quote unquote.

17:42

And so people want to measure how things

17:44

are actually performing with their end users. And

17:47

so human loop makes it very easy to

17:49

capture different sources of end user feedback. And

17:51

that might be explicit feedback, things like thumbs

17:54

up, thumbs down votes that you see in

17:56

chat GPT, but it can also be more

17:58

implicit signals. So how did the

18:00

user behave after they were

18:02

shown some generated content? Did they progress to

18:04

the next stage of the application? Did they

18:06

send the generated email? Did they

18:09

edit the text? All of

18:11

that feedback data becomes useful both

18:13

for debugging and also for

18:15

fine-tuning the model later on. That

18:18

evaluation data becomes this rich resource that

18:20

allows you to continuously improve your application

18:22

over time.
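
A minimal sketch of what capturing that feedback can look like in code: every generation gets an id, and explicit or implicit signals are logged against it. The store here is a toy in-memory dict; in practice this would be your logging backend or a platform that does it for you.

```python
import time
import uuid
from collections import defaultdict

GENERATIONS: dict[str, dict] = {}
FEEDBACK: dict[str, list] = defaultdict(list)

def log_generation(inputs: dict, output: str) -> str:
    """Store every model call with an id so later feedback can be tied back to it."""
    gen_id = str(uuid.uuid4())
    GENERATIONS[gen_id] = {"inputs": inputs, "output": output, "ts": time.time()}
    return gen_id

def log_feedback(gen_id: str, kind: str, value) -> None:
    """kind can be explicit ('vote') or implicit ('sent_email', 'edited_text', ...)."""
    FEEDBACK[gen_id].append({"kind": kind, "value": value, "ts": time.time()})

# The app logs the generation, then records what the user actually did with it.
gid = log_generation({"prospect": "Acme"}, "Hi Acme team, following up on our call...")
log_feedback(gid, "vote", "thumbs_up")    # explicit signal
log_feedback(gid, "sent_email", True)     # implicit signal
log_feedback(gid, "edited_text", False)
```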

18:25

Yeah, that's awesome. I know that that fits in. Maybe

18:28

you could talk a little bit about how

18:31

one of the things that you mentioned

18:33

earlier is you're seeing fewer people do

18:35

fine-tuning, which I see this

18:38

very commonly as a... It's

18:40

not an irrelevant point, but it's maybe

18:42

a misconception where a lot of teams

18:45

come into this space and they just

18:47

assume they're going to be fine-tuning their

18:49

models. And what

18:51

they end up doing is fine-tuning

18:53

their workflows or their language model

18:55

chains or their retrieval, the data

18:58

that they're retrieving, or their prompt

19:00

formats or their templates or that

19:02

sort of thing. They're not really

19:04

fine-tuning. And I think there's this

19:06

really blurred line right now for

19:09

many teams that are adopting

19:12

AI into their organization where they'll

19:14

frequently just use the term, oh,

19:16

I'm training the AI to do

19:18

this and now it's better,

19:20

right? But all they've really done is just

19:22

inject some data into their prompts

19:25

or something like that. So could

19:27

you maybe help clarify

19:30

that distinction and also

19:32

in reality what you're seeing people

19:34

do with this capability of evaluation,

19:37

both online and offline, and

19:39

how that's filtering back into

19:42

upgrades to the system or

19:44

actual fine-tunes of models? Yeah.

19:47

So I guess you're right. And

19:49

especially for people who are new to the field,

19:51

the word fine-tuning has a colloquial meaning and then

19:53

it has a technical meaning in machine learning and

19:56

the two end up being blurred. So

19:58

fine-tuning in a machine learning context usually

20:00

means doing some extra training on the

20:03

base model, where you're actually changing

20:05

the weights of the model, given

20:07

some sets of example pairs of inputs, outputs

20:09

that you want. And then

20:11

obviously there's like prompt engineering and

20:14

maybe context engineering, where you're changing the

20:16

instructions to the language model, or you're

20:18

changing the data that's sent into the context,

20:20

or how the, you know, an agent

20:22

system might be set up. And both

20:24

are really important. Typically the

20:26

advice we give the majority of our

20:29

customers and what we see play out

20:31

in practice is that people should first

20:33

push the limits of prompt engineering. Because

20:36

it's very fast, it's easy to do,

20:38

and it can have like very high

20:40

impact, especially around changing the sort of

20:42

outputs and also in helping the model

20:44

have the right data that's needed to

20:47

answer the question. So prompt engineering is

20:49

kind of usually where most people start

20:51

and sometimes where people finish as well.

20:53

And fine tuning tends to be

20:55

useful either if people are trying

20:57

to improve latency or cost, or

21:00

if they have like a particular tone of voice

21:02

or output constraint that they want to enforce. So,

21:04

you know, if people want their

21:06

model to output valid JSON, then fine

21:09

tuning might be a great way to achieve that. Or

21:11

if they want to use a local private model cause

21:13

it needs to run on an edge device or something

21:15

like this, then fine tuning I think is a great

21:17

candidate. And it can also let you

21:20

reduce costs because oftentimes you can fine tune a

21:22

smaller model to get similar performance.
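
For contrast with prompt engineering, here is a sketch of the kind of input/output pairs a fine-tuning job consumes, using a JSON Lines file of chat-style examples. The exact schema varies by provider, so treat the field names as an assumption and check the docs for whichever API or training stack you use.

```python
import json

# Pairs that pin down a tone of voice and a strict output format (always valid JSON).
examples = [
    {
        "messages": [
            {"role": "system", "content": "Extract fields and reply with valid JSON only."},
            {"role": "user", "content": "Call notes: budget 50k, decision maker is the CTO."},
            {"role": "assistant", "content": '{"budget": 50000, "authority": "CTO"}'},
        ]
    },
    # ...typically a few hundred more examples like this
]

with open("finetune_train.jsonl", "w") as f:
    for row in examples:
        f.write(json.dumps(row) + "\n")
```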

21:25

The analogy I like to use is fine tuning is

21:27

a bit like compilation, right? You have a, you've

21:29

already sort of built your first version of the

21:31

language. When you want to optimize it, you might

21:33

use a compiled language and you've got a kind

21:35

of compiled binary. I think

21:37

there was a second part to your question, but

21:39

just remind me, actually, I've lost the second part.

21:41

Yeah, basically you mentioned that

21:43

maybe fewer people are doing

21:46

fine tunes. Maybe

21:48

you could comment on, I

21:50

don't know if you have a sense

21:52

of why that is or how you

21:55

would see that sort of progressing into

21:57

this year as more and more people

21:59

adopt. this technology and maybe get

22:01

better tooling around the, let's

22:04

not call it fine tuning so we don't

22:06

mix all the jargon, but the iterative

22:09

development of these systems, do

22:11

you see that trend continuing

22:13

or how do you see

22:15

that kind of going into maybe larger

22:18

or wider adoption in 2024?

22:21

Yeah, so I think that we've definitely seen

22:23

less fine tuning than we thought we would see

22:25

when we started, you know, when we launched human

22:28

loop, this version of Humanloop, back in 2022.

22:31

And I think that's been true of others

22:33

as well. Like I've spoken to friends at

22:35

OpenAI and OpenAI is expecting there will be

22:37

more fine tuning in the future, but they've

22:39

been surprised that there wasn't more initially. I

22:42

think some of that is because prompt engineering has turned

22:44

out to be remarkably powerful. And

22:46

also because some of the changes that people want

22:48

to do to these models are more about getting

22:51

factual context into the model. So

22:53

one of the downsides of LLMs

22:55

today is they're obviously trained on

22:57

the public internet. So they don't necessarily know private

22:59

information about your company. They tend not

23:02

to know information past the training date of the

23:04

model. And you know, one way

23:06

you might have thought you could overcome that is I'm

23:08

going to fine tune the model on my company's data.

23:11

But I think in practice, what people are finding is

23:13

a better solution to that is to

23:15

use a hybrid system of search

23:17

or information retrieval plus generation. So

23:19

what's come to be known as

23:21

like RAG or retrieval augmented generation has turned out

23:24

to be a really good solution to this problem.
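
A minimal sketch of that hybrid retrieval-plus-generation shape is below. The retriever here is a toy word-overlap ranker standing in for embeddings and a vector database, and call_llm is a placeholder for whichever model API you use.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your model client."""
    raise NotImplementedError

DOCS = [
    "Our enterprise plan includes SSO and a 99.9% uptime SLA.",
    "Support hours are 9am to 6pm UTC on weekdays.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the question."""
    q = set(question.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question, DOCS))
    prompt = (
        "Answer the question using only the context below. "
        "If the context is not enough, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```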

23:27

And so the main reasons to fine

23:29

tune now are more about optimizing cost

23:31

and latency and maybe a little bit

23:33

tone of voice, but they're

23:35

not needed so much to adapt the model

23:37

to a specific use case. And

23:40

fine tuning is a heavier duty operation

23:42

because it takes longer. You

23:44

can edit a prompt very quickly and then see what

23:46

the impact is. Fine tuning, you need to

23:48

have the data set that you want to fine tune on, and

23:51

then you need to run a training job and then

23:53

evaluate that job afterwards. So there are

23:55

certainly circumstances where it's going to make sense. I

23:57

think especially anyone who wants to go the

24:00

private open source model route will likely find themselves

24:02

wanting to do more fine tuning. But

24:04

the quality of prompt engineering and the distance you

24:06

can go with it, I think took a lot of people

24:08

by surprise. And on that

24:11

note, you mentioned the closed proprietary

24:13

model ecosystem versus open models that

24:15

people might host in their own

24:17

environment and or fine tune on

24:20

their own data. I

24:22

know that human loop, like you explicitly

24:24

say that you kind of have

24:27

all of the models you're integrating these

24:29

sort of closed models and integrate with

24:32

open models. Why and

24:34

how is that kind of decided to

24:37

kind of include all of those?

24:39

And in terms of the mix

24:42

of what you're seeing with people's

24:44

implementations, how do you

24:46

see this sort of proliferation of

24:48

open models impacting the workflows that

24:50

you're supporting in the future? So

24:53

the reason for supporting them again is largely

24:55

customer pull, right? What we were finding is

24:58

that many of our customers were

25:00

using a mixture of models for

25:02

different use cases, either because the

25:04

large proprietary ones had slightly different

25:06

performance trade offs or because

25:08

there were use cases where they cared about privacy

25:11

or they cared about latency. And so they couldn't

25:13

use a public model for those

25:15

instances. And so we had to

25:17

support all of them. It really was something that it

25:20

would it wouldn't be a useful product to our customers if

25:22

they could only use it for one particular model. And

25:25

the way we've got around this is that we try

25:27

to integrate all of the publicly available ones, but we

25:29

also make it easy for people to connect their own

25:31

models so they don't necessarily need

25:33

us. As long as they expose

25:36

the appropriate API, you can plug in any model

25:38

to human loop. That would be a matter of

25:41

hosting the model and making sure

25:43

that the API contract that you're

25:45

expecting in terms of responses from

25:48

a model server that maybe someone's

25:50

running in their own AWS or

25:52

wherever would fulfill that contract.
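
As a sketch of the kind of contract being described, here is a tiny model server that exposes an OpenAI-style chat-completions shape over HTTP. The path and response fields are an assumed example, not a published spec; the real contract depends on the platform you are plugging the model into.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CompletionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/chat/completions":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        messages = request.get("messages") or [{}]
        user_text = messages[-1].get("content", "")

        # Replace this echo with a call into your actual model runtime.
        reply = f"(model output for: {user_text})"

        body = json.dumps({
            "model": request.get("model", "my-private-model"),
            "choices": [{"index": 0, "message": {"role": "assistant", "content": reply}}],
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), CompletionHandler).serve_forever()
```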

25:55

That's exactly right. And

25:58

in terms of the proliferation... of

26:00

open source and how that's going. I think

26:03

there's still a performance gap at the moment

26:05

between the very best closed models, so between

26:07

GPT-4 or some of the better models

26:09

from Anthropic and the best open

26:11

source, but it is closing, right? So the latest

26:13

models from say Mistral have

26:16

proved to be very good, Llama 2 was

26:18

very good. Increasingly, you're not

26:20

paying as big a performance gap, although

26:22

there is still one, but you

26:24

need to have high volumes for it to

26:26

be economically competitive to host your own model.

26:28

So the main reasons we see people doing

26:30

it are related to data privacy.

26:33

Companies that for whatever reason cannot

26:36

or don't want to send data to

26:38

a third party end up using

26:40

open source, and then also anyone who's

26:43

doing things on edge and who

26:45

wants real-time or very low latency ends

26:47

up using open source. This

26:54

is a changelog newsbreak. Vanna.AI

26:57

is a Python RAG

26:59

framework for accurate text

27:01

to SQL generation. It

27:04

lets you chat with

27:06

any relational database by

27:08

accurately generating SQL queries

27:10

trained via RAG, which

27:12

stands for retrieval augmented

27:14

generation, to use with

27:17

any LLM that you want. You

27:19

load up your data definitions, your

27:21

documentation, and any raw SQL queries

27:23

you have laying around into Vanna,

27:26

and then you're off to the

27:28

races. Vanna boasts high accuracy on

27:30

complex datasets, excellent security and

27:32

privacy because your database contents are never

27:35

sent to the LLM or

27:37

a vector DB. It boasts

27:39

the ability to self-learn by

27:41

choosing to auto train on

27:43

successful queries, and a choose

27:46

your own front end approach

27:48

with front ends provided for

27:50

Jupyter Notebook, Streamlit, Flask, and

27:52

Slack. You just heard

27:54

one of our five top stories

27:56

from Monday's changelog news. Subscribe to

27:59

the podcast. to get

28:01

all of the week's top stories

28:03

and pop your email address in

28:05

at changelog.com/news to also receive our

28:07

free companion email with even more

28:09

developer news worth your attention. Once

28:12

again, that's changelog.com/news.

28:20

Well, Raza, I'd love for you to

28:22

maybe describe if you can, we've kind

28:24

of talked about the problems that you're

28:27

addressing. We've talked about the

28:29

sort of workflows that you're enabling the

28:31

evaluation and some trends that you're seeing.

28:33

But I'd love for you to describe

28:36

if you can maybe for like

28:38

a non-technical persona, like a domain

28:40

expert who's engaging with the human

28:42

loop system. And maybe

28:45

for a more technical person

28:47

who's integrating, you know, data

28:49

sources or other things, what

28:51

does it look like to

28:54

use the human loop system,

28:56

maybe describe the roles

28:58

in which these people are like

29:01

what they're trying to do from each

29:03

perspective, because I think that might be

29:05

instructive for people that are trying to

29:07

engage domain experts and technical people in

29:10

a collaboration around these problems. Absolutely. So

29:12

maybe it might be helpful to have

29:14

a kind of imagined concrete example. So

29:16

a very common example we see is

29:18

people building some kind of question answering

29:20

system, maybe it's for their internal customer

29:22

service staff, or maybe they want to

29:25

replace an FAQ that, so

29:27

I'm just gonna drink water. Maybe they're trying to

29:29

build some kind of internal question answering

29:31

system to replace something, or an

29:33

FAQ or that kind of thing. So there's a set

29:35

of documents or questions going to come in, there'll be

29:37

a retrieval step and then they want to generate an

29:39

answer. So, typically

29:42

the PMs or the domain experts will be figuring out, you

29:44

know, what are the requirements of the system? What does good

29:46

look like? What do we want to build? And

29:49

the engineers will be building the

29:51

retrieval part, orchestrating all the model calls

29:53

and code, integrating the human loop API

29:55

into their system. And also,

29:57

usually they lead on setting

30:00

up evaluation. So maybe once

30:02

it's set up, the domain experts might continue

30:04

to do the evaluation themselves, but

30:06

the engineers tend to set it up the first

30:09

time. So if you're the domain expert, typically, you

30:11

would start off in our playground environment where you

30:13

can just try things out. So

30:15

the engineers might connect a database to human loop

30:17

for you. So maybe they'll store the data in

30:20

a vector database and connect that

30:22

to human loop. And then once you're in

30:24

that environment, you could try different prompts to the models,

30:26

you could try them with GPT-4, with Cohere, with an

30:29

open source model, see what impact that

30:31

has, see if you're getting answers that you

30:33

like, right? Oftentimes early on, it's not in

30:35

the right tone of voice, or the retrieval

30:37

system is not quite right. And so the

30:39

model is not giving factually correct answers. So

30:41

it takes a certain amount of iteration to

30:44

get to the point where even when you

30:46

eyeball it, it's looking appropriate. And usually at

30:48

that point, people then move to doing a

30:50

little bit more of a rigorous evaluation. So

30:52

they might generate either automatically or internally

30:55

a set of test cases. And

30:57

they'll also come up with a set of evaluation

30:59

criteria that matter to them in their context, they'll

31:02

set up that evaluation, run it,

31:04

and then usually at that point, they might

31:06

deploy to production. So that's the point at

31:08

which things would end up with

31:10

real users, they started gathering user feedback. And

31:12

usually the situation is not finished at that

31:14

point, because people then look at the production

31:16

logs, or they look at the real usage

31:18

data, and they will filter based on the

31:21

evaluation criteria. And they might say, Hey, show

31:23

me the ones that didn't result in a

31:25

good outcome. And then they'll try and debug

31:27

them in some way, maybe make a change

31:29

to a prompt, rerun the evaluation and submit

31:31

it. And so the engineers

31:33

are doing the orchestration of the code,

31:36

they're typically making the model calls, they'll

31:38

add logging calls to human loop. So

31:40

the way that works, there's

31:42

a couple of ways between the integration, but you

31:44

can imagine every time you call the model, you're

31:46

effectively also logging back to human loop, what the

31:48

inputs and outputs were, as well as any user

31:51

feedback data. And then the domain

31:53

experts are typically looking at the data,

31:55

analyzing it, debugging, making decisions about how

31:57

to improve things. And they're able to

32:00

actually take some of those actions themselves

32:02

in the UI. Yeah. And

32:04

so if I just kind of

32:06

abstract that a bit to maybe

32:09

give people a frame of thinking, it

32:11

sounds like there's kind of this framework

32:13

set up where there's data sources, there's

32:17

maybe logging calls within a

32:19

version of an application. If

32:23

you're using a hosted model or if you're

32:25

using a proprietary

32:27

API, you decide

32:30

that. And so it's kind of set

32:32

up and then there's maybe an evaluation

32:35

or prototyping phase, let's call it

32:37

where the domain experts try their

32:39

prompting. Eventually, they find prompts that

32:41

they think will work well for

32:43

these various steps in a

32:45

workflow or something like that. Those

32:47

are pushed, as you said, I

32:49

think one way into the actual

32:51

code or application such that

32:53

the domain experts are in charge

32:56

of the prompting to some degree.

32:58

And as you're logging feedback into

33:01

the system, the domain experts

33:03

are able to iterate on their prompts, which

33:05

hopefully then improve the system. And those are

33:07

then pushed back into the production

33:10

system maybe after an evaluation or

33:12

something. Is that a fair representation?

33:15

Yeah, I think that's a great representation. Thanks

33:17

for articulating it so clearly. And the kinds

33:19

of things that the evaluation becomes useful for

33:21

is avoiding regressions, say. So people might notice

33:23

one type of problem, they go in and

33:25

they change a prompt or they change the

33:28

retrieval system and they want to make sure

33:30

they don't break what was already working. And

33:32

so having good evaluation in place helps with

33:35

that. And then maybe it's also worth, because

33:37

I think we didn't sort of

33:39

do this at the beginning, just thinking about

33:41

what are the components of these LLM applications?

33:43

So I think you're exactly right. We sort of

33:45

think of the blocks of LLM app being composed

33:48

of a base model. So that might be a

33:50

private fine tune model or one of these large

33:52

public ones. A prompt template,

33:54

which is usually an instruction to the model that

33:56

might have gaps in it for

33:59

retrieved data or context; a

34:01

data collection strategy. And

34:04

then that whole thing of data collection,

34:06

prompt template, and model might

34:08

be chained together in a loop or

34:10

might be repeated one after another.

34:13

And there's an extra complexity, which is

34:15

the models might also be allowed to

34:17

call tools or APIs. But

34:20

I think those pieces to get taken

34:22

together more or less comprehensively cover things.

34:24

So tools, data retrieval, prompt template, and

34:27

base model are the main components.

34:29

But then within each of those, you have a lot of

34:32

design choices and freedom. And so you

34:34

have a combinatorially large number of decisions to

34:36

get right when building one of these applications.
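
One way to make those blocks explicit in code is sketched below: each block (base model, prompt template, data retrieval, tools) is a separate field you can version and evaluate independently, and steps can be chained. The class and field names are illustrative, not any particular framework's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class LLMStep:
    base_model: str                                   # e.g. "gpt-4" or a private fine-tune
    prompt_template: str                              # instruction with gaps for retrieved data
    retriever: Optional[Callable[[str], list[str]]] = None  # data-collection strategy
    tools: dict[str, Callable] = field(default_factory=dict)  # APIs the model may call

@dataclass
class Pipeline:
    steps: list[LLMStep]                              # blocks chained one after another

qa_app = Pipeline(steps=[
    LLMStep(
        base_model="gpt-4",
        prompt_template="Context:\n{context}\n\nAnswer the question: {question}",
        retriever=lambda q: ["...retrieved passages..."],
        tools={},  # e.g. a "create_ticket" function for agent-style use
    ),
])
```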

34:39

One of the things that you mentioned is this

34:42

evaluation phase of what goes

34:44

on as helping prevent regressions.

34:47

Because in testing, behaviorally,

34:49

the output of the models, you

34:51

might make one change on a

34:53

small set of examples that looks

34:56

like it's improving things, but has

34:58

different behavior across a wide range

35:00

of examples. I'm wondering

35:02

also, I could imagine

35:04

two scenarios. Models are

35:07

being released all the time, whether it's upgrading

35:09

from this version of a GPT

35:11

model to the next version or this

35:13

Mistral fine tune to this one over

35:15

here. I'm thinking even in the

35:18

past few days, we've been

35:20

using the neural chat model from Intel

35:22

a good bit. And there's a version

35:24

of that, the Neural Magic release, that's

35:26

a sparsified version of that,

35:29

where they pruned out some

35:31

of the weights and the layers to

35:33

make it more efficient and to

35:35

run on better, or not better

35:37

hardware, but more commodity hardware that's more

35:39

widely available. And so one of the

35:41

questions that we were discussing is, well,

35:44

we could flip the version of this

35:46

model to the sparse one, but we

35:48

have to decide on how

35:50

to evaluate that over the use cases

35:52

that we care about. Because you could

35:54

look at the output for a few

35:56

test prompts, and it might

35:58

look similar, or good, or even

36:01

better, but on a wider scale

36:03

might be quite different in ways

36:05

that you don't expect. So I

36:07

could see that the evaluation also being used for

36:09

that, but I could also see where if you're

36:12

upgrading to a new model, it

36:14

could just throw everything up in the air

36:16

in terms of like, oh,

36:18

this is an entirely different prompt format,

36:21

or this is a whole

36:23

new behavior from this new

36:25

model that is distinct from

36:27

an old model. So how are

36:29

you seeing people navigate that landscape

36:32

of model upgrades? I think

36:34

you should just view it as a change, as you

36:36

would, to any other part of the system. And hopefully

36:38

the desired behavior of the model is not changing. So

36:41

even if the model is changed, you

36:43

still want to run your regression test and

36:46

say, are we meeting a minimum threshold that

36:48

we had on these gold standard test set

36:50

before? In general, I think

36:52

evaluation, we see it happening at three

36:54

different stages during development. There is

36:57

during this interactive stage very early on,

36:59

when you're prototyping, you want fast feedback,

37:01

you're just looking to get a sense

37:03

of is this even working appropriately? At

37:06

that stage, eyeballing examples and looking at

37:08

things side by side in a very

37:10

interactive way can be helpful. And interactive

37:13

testing can also be helpful for adversarial

37:15

testing. So a fixed test set

37:17

doesn't tell you what will happen when

37:19

a user who actually wants to break the system

37:21

comes in. So a concrete example of this, one

37:24

of our customers has children

37:27

as their end users, and they want to

37:29

make sure that things are age appropriate. So they

37:31

have guardrails in place. But when

37:33

they come to test the system, they

37:35

don't want to just test it

37:38

against an input that's benign.

37:40

They want to see if we try, if we

37:42

really red team this, can we break it? And

37:45

their interactive testing can be very helpful. And

37:48

then the next place where you want testing in

37:50

place is this regression testing, where you

37:52

have a fixed set of evaluators on a test set, and

37:54

you want to know when I make a change, does it

37:56

get worse? And the final place we see

37:58

people using it is actually for monitoring. So, okay,

38:01

I'm in production now. There's new

38:03

data flowing through. I may not have the ground

38:05

truth answer, but I can still set up different

38:07

forms of evaluator. And I want

38:09

to be alerted if the performance drops below

38:11

some threshold.
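
A minimal sketch of the regression-testing and alerting idea: run your evaluators over a gold-standard test set whenever something changes (a prompt, a retriever, or the model itself), and refuse to ship if the average score falls below an agreed threshold. The test cases and evaluator here are toy examples.

```python
GOLD_TEST_SET = [
    {"question": "What is the uptime SLA?", "must_mention": "99.9%"},
    {"question": "When is support available?", "must_mention": "weekdays"},
]

def mentions_expected_fact(output: str, case: dict) -> float:
    """A trivial code evaluator; in practice you would mix code checks,
    model-as-judge evaluators, and human review."""
    return 1.0 if case["must_mention"].lower() in output.lower() else 0.0

def run_regression(generate, evaluators, threshold: float = 0.9) -> bool:
    scores = []
    for case in GOLD_TEST_SET:
        output = generate(case["question"])
        scores.extend(ev(output, case) for ev in evaluators)
    mean = sum(scores) / len(scores)
    print(f"mean score: {mean:.2f}")
    return mean >= threshold

# Usage with a stand-in generate() function; swap in your real pipeline.
ok = run_regression(lambda q: "Support is available on weekdays; the SLA is 99.9%.",
                    evaluators=[mentions_expected_fact])
assert ok, "score fell below threshold; do not deploy"
```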

38:14

So one of the things that I've been thinking about

38:16

throughout our conversation here, and that's

38:18

I think highlighted by what you just mentioned and

38:20

sort of the upgrades to

38:22

one's workflow and the various

38:25

levels at which such a

38:27

platform can benefit teams.

38:30

And it made me think of, you

38:32

know, used to, I have a background

38:35

in physics and there were plenty of

38:37

physics teams or collaborators that we

38:39

worked with, you know, we were writing code and

38:42

not doing great sort of

38:44

version control practices and not

38:46

everyone was using GitHub. And

38:48

there's sort of collaboration

38:51

challenges associated with

38:53

that, which are obviously

38:55

solved by great code collaboration systems

38:57

that are of various forms that

38:59

have been developed over time. And

39:03

I think there's probably a parallel

39:05

here with some of the collaboration

39:07

systems that are being built around

39:09

both playgrounds and prompts and evaluation.

39:12

I'm wondering if you could, if

39:15

there's any examples from clients

39:17

that you've worked with, or

39:19

maybe it's just interesting use cases

39:21

of surprising things they've been able to

39:23

do when going from sort

39:25

of doing things ad hoc

39:28

and maybe versioning prompts in spreadsheets

39:30

or whatever it might be to

39:33

actually being able to work in

39:35

a more seamless way between domain

39:37

experts and technical staff. Are

39:39

there any clients or use

39:41

cases or surprising stories that

39:44

come to mind? Yeah, it's a good question. I'm

39:46

kind of thinking through them to see, you know,

39:48

what the more interesting examples might be. I

39:51

think that fundamentally, it's not

39:53

necessarily enabling completely new behavior,

39:55

right? But it's making the

39:57

old behavior significantly faster. less

40:00

error prone. So, you know,

40:02

certainly fewer mistakes and less time

40:04

spent, you know, one, okay, so

40:06

surprising example, publicly listed company, and

40:08

they told me that one of

40:10

the issues they were having is

40:13

because they were sharing these prompt

40:15

configs in Teams, they were

40:17

having differences in behavior based on white space

40:19

being copied. So, you know, someone was

40:21

like, playing around with the OpenAI playground,

40:24

they'd copy-pasted into Teams, that person

40:26

would copy-paste from Teams into code. And

40:29

there were small white space differences, and

40:31

you wouldn't expect it to affect

40:33

the model, but it actually did. And so

40:35

they would then get performance differences they

40:37

couldn't explain. And actually, it just turned

40:39

out that, you know, you shouldn't be

40:41

sharing your code via... Right. So

40:44

I guess that's one surprising example.

40:46

I think another thing as well is

40:48

the complexity of apps that people are

40:50

now beginning to be able to build.

40:53

So increasingly, I think people

40:55

are building simple agents,

40:57

right, I think more complex agents are still

41:00

not super reliable. But a trend that we've

41:02

been hearing a lot about from our customers

41:04

recently, is people trying to

41:06

build systems that can use their

41:09

existing software. So you know,

41:11

an example of this is, you know,

41:13

Ironclad is a company that's added a

41:15

lot of LLM based features to their

41:17

product, and they actually are able to

41:19

automate a lot of workflows that were

41:21

previously being done by humans,

41:24

because the models can use the API that

41:26

exists within the ironclad software. So they're actually,

41:28

you know, able to leverage their existing infrastructure.

41:30

But to get that to work, they had

41:33

to innovate quite a lot in tooling. And

41:35

in fact, you know, this isn't a plug

41:37

for Humanloop; Ironclad, in this case, built

41:40

a system called Rivet, which is their

41:42

own open source, you know, prompt engineering

41:44

and iteration framework. But I think it's

41:46

a good example of, you know, in

41:48

order to achieve the complexity of that

41:50

use case, this happened to be

41:52

before tools like Humanloop were around, they had to build

41:54

something themselves. And it's quite sophisticated

41:56

tooling, actually. Rivet is great. So people should check

41:58

that out as well. well, it's an open source

42:01

library, anyone can go and get the tool. So

42:03

yeah, I think the surprising things are like

42:05

how error prone things are without good tooling

42:07

and, and the crazy ways in which

42:09

people are solving problems. Another example of a mistake that

42:12

we saw someone do is two

42:14

different people triggered exactly the same annotation

42:16

job. So they had annotation in

42:18

spreadsheets. And they both outsourced

42:20

the same job to different annotation

42:22

team, which is obviously an

42:24

expensive mistake to make. So very

42:26

error prone. And then I think also just

42:28

like impossible to scale to

42:31

more complex, agentic use cases. Well,

42:33

you already kind of alluded to

42:35

some trends that you're seeing moving

42:38

forward, as we kind of draw

42:40

to a close here, I'd love

42:43

to know from someone who's seeing

42:45

a lot of different use cases

42:48

being enabled through human loop and

42:50

your platform, what's exciting for you

42:53

as you move into this next year

42:55

in terms of maybe it's

42:57

things that are happening in AI more broadly,

43:00

or things that are being enabled

43:02

by human loop or things that are

43:04

on your roadmap that you can't wait

43:07

for them to go live. What, as

43:09

you're lying in bed at night and getting

43:11

excited for for the next day of AI

43:14

stuff, what's on your mind? So

43:16

AI more broadly, I just feel

43:18

the rate of progress in capabilities is

43:20

both exciting and scary, right? It's

43:22

extremely fast multimodal models, better generative

43:25

models, models with increased reasoning. I

43:27

think the range of possible applications

43:29

is expanding very quickly as the

43:31

capabilities of the models expand. I

43:34

think people have been excited about agent use

43:36

cases for a while, right systems that

43:38

can act on their own and go off

43:41

and achieve something for you. But in

43:43

practice, we've not seen that many people succeed

43:45

in production with those. There are a couple of examples,

43:48

ironclad being a good one. But it

43:50

feels like we're still at the very beginning of

43:52

that. And I think I'm excited about seeing more

43:54

people get to success with that. I'd

43:56

say that the most common, you know, successful

43:59

applications we've seen today are mostly

44:01

either retrieval augmented applications

44:03

or more simple LLM

44:05

applications. But increasingly, I'm

44:08

excited about seeing agents in production and

44:10

also multimodal models in production. In

44:12

terms of things that I'm particularly excited

44:14

about from Humanloop is I think us

44:17

becoming a proactive rather than a passive

44:19

platform. So today, the product

44:21

managers and the engineers drive the changes

44:23

on Humanloop. But I think that's

44:25

something that we're going to hopefully release later this year

44:27

is actually the system,

44:30

Humanloop itself can start proactively suggesting improvements

44:32

to your application. Because we have the

44:34

evaluation data, because we have all the

44:37

prompts, we can start saying things to

44:39

you like, hey, we have a

44:41

new prompt for this application. It's a lot shorter than

44:43

the one you have. It scores similarly on eval data.

44:45

If you upgrade, we think we can cut your costs

44:47

by 40% and allowing

44:50

people to then accept that change. And

44:52

so going from a system that is

44:54

observing to a system that's actually intervening.

44:56

That's awesome. I definitely look

44:59

forward to seeing how that rolls out

45:01

and really appreciate the work that you

45:03

and the team at Humanloop are doing

45:05

to help us upgrade our workflows and

45:08

enable these sort of more complicated use

45:10

cases. So thank you so much for

45:12

taking time out of that work to

45:14

join us. It's been a pleasure. Really

45:16

enjoyed the conversation. Thanks so much for

45:19

having me, Daniel. All right.

45:28

That is Practical AI for this week.

45:31

Subscribe now. If you haven't

45:34

already, head to practicalai.fm for

45:37

all the ways. And join our

45:39

free Slack team, where you can hang

45:41

out with Daniel, Chris, and the entire

45:43

change log community. Sign

45:45

up today at

45:47

practicalai.fm slash community.

45:50

Thanks again to our partners at

45:52

fly.io, to our beat freak in residence,

45:54

Breakmaster Cylinder, and to you for

45:56

listening. We appreciate you spending time

45:58

with us. That's great. That's all for

46:00

now, we'll talk to you again next time.
