Episode Transcript
0:06
Welcome to Practical AI. If
0:09
you work in artificial intelligence, aspire
0:12
to, or are curious
0:14
how AI-related tech is changing the
0:17
world, this is the show for
0:19
you. Thank you to
0:21
our partners at fly.io, the home
0:23
of changelog.com. Fly
0:26
transforms containers into microvms that run
0:28
on their hardware in 30 plus
0:30
regions on six continents so you
0:32
can launch your app near your
0:35
users. Learn more at fly.io.
0:43
Welcome to another episode of
0:45
Practical AI. This is Daniel
0:47
Whitenack. I am CEO and
0:49
founder at Prediction Guard and
0:51
really excited today to be joined by
0:53
Dr. Reza Habib, who is
0:55
CEO and co-founder at Humanloop. How are
0:57
you doing Reza? Hi Daniel, it's a
1:00
pleasure to be here. I'm doing very
1:02
well. Yeah, thanks for having me on.
1:04
Yeah, yeah, I'm super excited to
1:06
talk with you. I'm mainly
1:09
excited to talk with you selfishly
1:11
because I see the amazing things
1:13
that Humanloop is doing and the
1:15
really critical problems that you're thinking
1:18
about. And every day of my
1:20
life, it's like, how am I
1:22
managing prompts and how does
1:24
this next model that I'm upgrading to,
1:27
how do my prompts do in that
1:29
model and how am I
1:32
constructing workflows around using LLMs,
1:34
which it definitely seems to
1:36
be the main thrust of some of
1:38
the things that you're thinking about at
1:40
Humanloop. Before we get into
1:42
the specifics of those things at Humanloop,
1:45
would you mind setting the context
1:47
for us in terms of workflows
1:49
around these LLMs, collaboration on teams?
1:51
How did you start thinking about
1:54
this problem and what
1:56
does that mean in reality for
1:58
those working in industry right now,
2:00
maybe more generally than at Humanloop. Yeah, absolutely.
2:02
So I guess on the question of how
2:05
I came to be working on this problem,
2:07
it was really something that my
2:09
co-founders, Peter and Jordan, and I had been working on for
2:11
a very long time, actually. So previously,
2:13
Peter and I did PhDs together around
2:15
this area. And then when we started
2:17
the company, it was a little while
2:19
after transfer learning had started to work
2:21
in NLP for the first time. And
2:24
we were mostly helping companies fine tune
2:26
smaller models. But then sometime midway through
2:28
2022, we became absolutely convinced
2:30
that the rate of progress for these larger
2:32
models was so high, it was going to
2:34
start to eclipse essentially everything else
2:36
in terms of performance. But more importantly, in
2:38
terms of usability, right, it was the first
2:41
time that instead of having to like hand
2:43
annotate a new data set for every new
2:45
problem, there was this new way of customizing
2:47
AI models, which was that you could write
2:49
instructions in natural language, and have a reasonable
2:51
expectation that the model would then do that
2:54
thing. And that was unthinkable, you know, at
2:56
the start of 2022, I would say, or
2:58
maybe a little bit earlier. And
3:00
so that's really what made us want to
3:02
go work on this, because we realized that
3:05
the potential impact of NLP was already there.
3:07
But the accessibility had been expanded so far,
3:09
and the capabilities of the models have increased
3:11
so much that there was a particular moment
3:14
to go do this. But
3:16
at the same time, it introduced a whole bunch
3:18
of new challenges, right? So I guess historically, the
3:20
people who are building AI systems were machine learning
3:22
experts. The way that you would do it is
3:25
you would collect annotated data, you'd fine-tune
3:27
a custom model, it was typically being
3:29
used for like one specific task at a
3:31
time, there was a correct answer, so
3:33
it was easy to evaluate. And with
3:35
LLMs, the power also brings new challenges.
3:37
So the way that you customize these
3:40
models is by writing these natural language
3:42
instructions, which are prompts. And
3:44
typically, that means that the people involved don't
3:46
need to be as technical. And usually, we
3:48
see actually that the best people to
3:51
do prompt engineering tend to have domain expertise.
3:53
So often it's a product manager or someone
3:55
else within the company who is leading the
3:57
prompt engineering efforts. But you also have
3:59
this new artifact lying around, which is
4:01
the prompt. And it has a similar impact
4:04
to code on your end application. So it
4:06
needs to be versioned and managed and treated
4:08
with the same level of respect and rigor
4:10
that you would treat normal code. But somehow
4:13
you also need to have the right workflows
4:15
and collaboration that let the non-technical people,
4:17
or the less technical people, work with the
4:19
engineers on the product. And then the extra
4:22
challenge that comes with it as well is
4:24
that it's very subjective to measure performance
4:26
here. So in traditional code, we're used
4:28
to running unit tests, integration tests, regression
4:30
tests. We know what good looks like
4:33
and how to measure it. And even
4:35
in traditional machine learning, there's
4:37
a ground truth data set, people
4:39
calculate metrics. But once you go
4:41
into generative AI, it tends
4:43
to be harder to say what is the
4:45
correct answer. And so when that becomes difficult,
4:48
then measuring performance becomes hard. If measuring performance
4:50
is hard, how do you know when you
4:52
make changes if you're gonna cause regressions? Or
4:55
all the different design choices you have in developing
4:57
an app. How do you make those design choices
5:00
if you don't have good metrics of performance?
5:02
And so those are the problems that motivated
5:04
what we've built, and really Humanloop
5:07
exists to solve both of these
5:09
problems. So to help companies with the
5:11
task of finding the best prompts, managing,
5:13
versioning them, dealing with collaboration, but then
5:15
also helping you do the evaluation that's
5:17
needed to have confidence that
5:19
the models are gonna behave as you expect in production.
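To make that concrete in code: below is a minimal Python sketch of treating a prompt as a versioned, content-addressed artifact, the way a commit hash versions code. The class and field names are purely illustrative assumptions, not Humanloop's actual schema or API.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    """Illustrative versioned prompt artifact (invented, not Humanloop's schema)."""
    name: str
    model: str
    template: str
    temperature: float = 0.7

    @property
    def version_id(self) -> str:
        # Content-addressed version: any change to the prompt yields a new ID,
        # the same way a commit hash versions code.
        payload = f"{self.model}|{self.temperature}|{self.template}"
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

v1 = PromptVersion("support-answer", "gpt-4", "Answer using only: {context}\n\nQ: {question}")
v2 = PromptVersion("support-answer", "gpt-4", "Answer using only: {context}\n\nQ: {question}", temperature=0.2)
print(v1.version_id != v2.version_id)  # True: the change is tracked, not lost in a chat thread
```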
5:22
And in relation to these things, maybe
5:24
you can start with one that you
5:26
would like to start with and go
5:29
to the others. But in terms
5:31
of managing, versioning prompts,
5:33
evaluating the performance of these
5:35
models, dealing with regressions, as
5:38
you've kind of seen people try to
5:41
do this across probably a lot of
5:43
different clients, a lot of different industries,
5:46
how are people trying to manage
5:49
this in maybe some good ways
5:51
and some bad ways? Yeah, I think we see a
5:53
lot of companies go on a bit of a journey.
5:55
So early on, people are
5:58
excited about generative AI and LLMs. There's a
6:00
lot of hype around it now,
6:02
so some people in the company just
6:04
go try things out, and often they'll
6:06
start off using one of the large
6:08
publicly available models: OpenAI, Anthropic,
6:10
Cohere, one of these. They prototype on
6:12
their own in a playground environment that
6:14
those providers have, still eyeball a few
6:16
samples, maybe grab a couple of
6:18
libraries that support orchestration, and they'll put
6:20
together a prototype. And the first version
6:22
is fairly easy to build. It's,
6:24
you know, it's very quick to get
6:26
to, like, the first wow moments. And
6:29
then as people start moving towards production, and
6:31
they start iterating from that, you know,
6:33
maybe eighty-percent-good-enough version to something
6:35
that they really trust, they start to run
6:37
into these problems of, like, oh, I've got
6:39
like twenty different versions of this prompt and
6:41
I'm storing it as a string in code, and
6:43
actually I wanna be able to collaborate with
6:45
a colleague on it. And so now we're
6:47
sharing things, you know, either by screen sharing or,
6:49
like, both. You know, we've had some
6:52
serious companies, who you would have heard of, who
6:54
were sending their prompt configs to each other
6:56
via a mix of chats. And obviously you
6:58
wouldn't send someone an important piece of
7:00
code over Slack or Teams or something
7:02
like this. But because the collaboration software
7:04
isn't there to bridge the technical/non-
7:06
technical divide, those are the kinds of
7:08
problems we see. And so at this
7:10
point, typically, a year ago people would
7:10
start building their own solution more
7:12
often than not; like, this is when
7:14
people would start building in-house tools.
7:16
Increasingly, because there are companies like
7:18
Humanloop around, that's usually when someone books
7:22
a demo with us and they say
7:24
hey, you know, we've reached this point
7:26
where actually managing these artifacts has become
7:28
cumbersome. We're worried about the quality of
7:30
what we're producing. Do you have a solution
7:33
to help? And the way that
7:35
Humanloop helps, at least on the prompt
7:37
management side is we have this interactive
7:39
environment. It's a little bit like those
7:41
OpenAI playgrounds or the Anthropic
7:43
playground, but a lot more fully featured
7:45
and designed for actual development. So it's
7:47
collaborative, it has history built in, and you
7:49
can connect variables and datasets, and so
7:51
it becomes like a development environment for
7:53
your sort of LLM application. You
7:55
can prototype the application, interact with it,
7:57
try out a few things, and then
7:59
people progress from that development
8:01
environment into production through
8:03
evaluation and monitoring.
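As a companion sketch for the promotion step just described: a hypothetical in-memory registry showing how a saved prompt version might be deployed to staging and then production. Humanloop's real deployment API is not shown here; every name below is an assumption for illustration.

```python
# Hypothetical in-memory deployment registry; all names invented for illustration.
deployments: dict[str, str] = {}  # environment -> prompt version_id

def deploy(environment: str, version_id: str) -> None:
    """Point an environment at a saved prompt version."""
    deployments[environment] = version_id

def active_version(environment: str) -> str:
    """Look up which prompt version an environment is currently serving."""
    return deployments[environment]

deploy("staging", "a1b2c3d4e5f6")     # try it out first
deploy("production", "a1b2c3d4e5f6")  # promote once it passes evaluation
print(active_version("production"))
```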
8:06
You mentioned this kind of in passing. I'd
8:08
love to dig into it a little
8:10
bit more. You mentioned kind of the
8:12
types of people that are, you
8:14
know, at the table in designing
8:16
these systems and often times domain experts
8:18
you know Previously in working as a
8:20
data scientist it was always kind of
8:22
assumed oh you need to talk to
8:24
the domain experts but it's sort of
8:26
like at least for many years it
8:28
was like data scientist talk to the
8:30
domain experts and then go off and
8:32
build their thing. The domain experts were
8:34
not involved in the sort of building
8:36
of the system. And even then,
8:39
like, the data scientists were maybe
8:41
building things that were kind of
8:43
foreign to software engineers. And
8:45
what I'm hearing you say is
8:47
you kind of got like these
8:49
multiple layers you have like domain
8:51
experts who might not be that
8:53
technical, you've got maybe
8:55
AI and data people who are
8:57
using this kind of unique set
8:59
of tools, maybe even they're hosting
9:02
their own models and then you've
9:04
got like product, software engineering people.
9:06
It seems like a much more
9:08
complicated landscape of interactions. Have
9:10
you seen this kind of
9:12
play out in reality? In
9:14
terms of non-technical people
9:16
and technical people, both working
9:18
together on something that is
9:20
ultimately something implemented in code
9:23
and run as an application.
9:25
Exactly. One of the most exciting
9:27
things about LLMs and the generative
9:29
era in general is that product managers
9:32
and subject matter experts can for the
9:34
first time be very directly involved in
9:36
implementing these applications. So I think it's
9:38
always been the case that the PM,
9:40
or someone like that, you know, is
9:42
the person who distills the problem, speaks
9:44
to the customers, produces the spec. But there
9:47
was a translation step where they sort
9:49
of produced that as a document, and
9:51
then someone else often implemented
9:53
it. And because we're now able to
9:55
program some of the application
9:57
in natural language, actually it's accessible to
9:59
those people very directly. And it's worth
10:01
giving a concrete example. For instance, I
10:03
use an AI note-taker for
10:05
a lot of my sales calls, and it
10:08
records the call and then I get
10:10
a summary afterwards. And the app actually
10:12
allows you to choose a lot of different
10:14
types of summary. So you can say,
10:16
hey, I'm a salesperson, I want a summary
10:18
that will extract budget and authority and
10:20
need and timeline; versus you can say,
10:22
oh, actually I had a product interview and
10:25
I want a different type of summary
10:27
and if you think about developing that
10:29
application, the person who has the knowledge that's
10:31
needed to say what a good summary
10:33
is, and write the prompt for the
10:35
model, is the person with the domain expertise,
10:37
not the software engineer. But obviously
10:39
the prompt is only one piece of
10:41
the application, right? If you've got a question
10:43
answering system, there's usually retrieval as part
10:45
of this. There may be other components; usually the
10:48
LLM is a block in a wider
10:50
application. You obviously still need the software
10:52
engineers around, because they're implementing the bulk of
10:54
the application, but the product managers may be
10:56
much more directly involved. And then, you
10:58
know, actually we see increasingly less
11:00
involvement from machine learning or AI experts,
11:02
and fewer people are fine-tuning their
11:04
own models. For the majority of product
11:07
teams we're seeing, there is an
11:09
AI platform team that provides facilities,
11:11
setting things up, but the bulk of
11:13
the work is led by the product
11:15
managers and then the engineers. And one
11:17
interesting example, on the extreme
11:20
end, is one of our customers. It's a
11:22
very large tech company; they actually
11:24
do not let their engineers edit
11:26
the prompts. They have a team
11:28
of linguists to do prompt development.
11:31
The linguists finalize the prompts, they're saved in
11:33
a serialized format, and they go
11:35
to production. But it's a one-
11:37
way transfer, so the engineers can't
11:39
edit them, because they're not considered
11:41
able to assess the actual
11:43
outputs, even though they are responsible for the
11:45
rest of the application. Just thinking
11:47
about how teams interact and who's
11:50
doing what, it seems like the
11:52
problems that you've laid out are, I
11:54
think, very clear and worth solving.
11:56
But it's probably hard to think
11:58
about: am I building a
12:00
developer tool or am I building
12:02
something that these non-technical people interact
12:04
with or is it both? How
12:06
did you think about that as
12:08
you entered into the stages of
12:11
bringing Humanloop into existence? I
12:13
think it has to be both.
12:16
And the honest answer is it evolved organically
12:18
by going to customers, speaking to them about
12:20
their problems and trying to figure out what
12:22
the best version of a solution looks like.
12:24
So we didn't set out to build a
12:26
tool that needed to do both of these
12:28
things. But I think the reality is, given
12:31
the problems that people face, you do need both. An
12:34
analogy to think about might be something
12:37
like Figma. Figma is somewhere
12:39
where multiple different stakeholders come together
12:41
to iterate on things and to develop them and
12:43
provide feedback. And I think you need something analogous
12:46
to that for Gen AI, although it's not
12:48
an exact analogy because we also need to attach
12:50
the evaluation to this. So it's
12:52
almost by necessity that we've had to do that.
12:55
But I also think that it's very
12:57
exciting. And the reason I think it's
12:59
exciting is because it is expanding who
13:01
can be involved in developing these applications.
13:22
If you're listening, you know software is built
13:24
from thousands of small technical choices. And
13:27
some of these seemingly inconsequential choices
13:29
can have a profound impact on the
13:31
economics of internet services, who gets to
13:33
participate in them, build them and profit
13:36
from them. This is especially true
13:38
for artificial intelligence, where the decisions we
13:40
make today can determine who can
13:42
have access to world changing technologies and
13:44
who can decide their future. Read, write,
13:47
own, building the next era of the
13:49
internet is a new book from
13:51
startup investor Chris Dixon that explores the
13:53
decisions that took us from open
13:55
networks governed by communities to massive social
13:58
networks run by internet giants. This
14:00
book, Read Write Own, is a
14:02
call to action for building a
14:04
new era of the internet that
14:06
puts people in charge. From AI
14:08
projects that compensate creators for their
14:11
work to protocols that fund open
14:13
source contributions, this is our chance
14:15
to build the internet we want,
14:17
not the one we inherited. Order
14:19
your copy of Read Write
14:22
Own today or go to
14:24
readwriteown.com to learn more. You
14:40
mentioned how this environment
14:42
of domain experts coming together
14:44
and technical teams coming together
14:47
in a collaborative environment opens
14:49
up new possibilities for both
14:52
collaboration and innovation. I'm wondering if at
14:54
this point you could kind of just
14:56
lay out, we've talked about
14:58
the problems, we've talked about those involved
15:00
and those kind of that would use
15:02
such a system or a platform to
15:04
enable these kinds of workflows. Could
15:07
you describe a little bit more what
15:09
human loop is specifically in
15:11
terms of both what it
15:14
can do and kind of
15:16
how these different personas engage
15:18
with the system? Yes,
15:20
I guess in terms of what it can do
15:22
concretely, it's firstly helping
15:24
you with prompt iteration, versioning and management
15:26
and then with evaluation and monitoring and
15:28
the way it does that is there's
15:30
a web app and there's a web
15:32
UI where people are coming in, and
15:34
in that UI is an
15:36
interactive, playground-like environment where people
15:38
try out different prompts, they can compare
15:40
them side by side with different models,
15:42
they can try them with different inputs
15:45
when they find versions that they think
15:47
are good, they save them and
15:49
then those can be deployed from that environment
15:51
to production or even to a development
15:53
or staging environment. So that's
15:56
the kind of development stage, and then
15:58
once you have something that's developed, what's
16:00
very typical is people then want
16:02
to put evaluation steps into place.
16:04
So you can define gold standard test
16:06
sets, and then you can define
16:08
evaluators within human loop. And evaluators
16:10
are ways of scoring the outputs of
16:12
a model or a sequence of models
16:14
because oftentimes the LLM is part
16:16
of a wider application. And so
16:18
the way that scoring works is there's
16:21
very traditional metrics that you would have in code
16:23
for any machine learning system. So
16:25
precision, recall, ROUGE, BLEU, these kinds of
16:27
scores that anyone from a machine learning
16:29
background would already be familiar with.
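For a sense of what a code-based evaluator looks like, here is a small, illustrative token-overlap precision/recall function of the kind that might run against a gold standard test set; it is a generic sketch, not a library call.

```python
def precision_recall(output: str, reference: str) -> tuple[float, float]:
    """Token-overlap precision/recall against a gold standard reference."""
    out_tokens = set(output.lower().split())
    ref_tokens = set(reference.lower().split())
    overlap = out_tokens & ref_tokens
    precision = len(overlap) / len(out_tokens) if out_tokens else 0.0
    recall = len(overlap) / len(ref_tokens) if ref_tokens else 0.0
    return precision, recall

p, r = precision_recall("Paris is the capital of France",
                        "The capital of France is Paris")
print(f"precision={p:.2f} recall={r:.2f}")
```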
16:31
But what's new in the kind of LLM space
16:34
is also things that help when things are
16:36
more subjective. So we have the ability to
16:38
do model as judge, where you might actually
16:40
prompt another LLM to score the output in
16:42
some way. And this can be particularly useful
16:45
when you're trying to measure things like hallucination.
16:48
So a very common thing to do is to
16:50
ask the model, is the
16:52
final answer contained within the retrieved context?
16:55
Or is it possible to infer the answer
16:57
from the retrieved context? And you can calculate
16:59
those scores.
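A hedged sketch of the model-as-judge idea just described: a second model grades whether an answer is grounded in the retrieved context. It assumes the OpenAI Python SDK (v1); the judge model name and prompt wording are illustrative choices, not a prescribed setup.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK (v1)

client = OpenAI()

JUDGE_PROMPT = """You are grading a RAG answer for groundedness.
Context:
{context}

Answer:
{answer}

Is the answer contained in, or directly inferable from, the context?
Reply with exactly one word: YES or NO."""

def groundedness_judge(context: str, answer: str) -> bool:
    """Model-as-judge: a second LLM scores the first one's output."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```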
17:01
And then the final way is we also support human evaluation. So
17:04
in some cases, you really do want either
17:06
feedback from an end user or
17:08
from an internal annotator involved as well.
17:11
And so we allow you to gather
17:13
that feedback either from your live production application
17:16
and have it logged against your
17:18
data. Or you can queue internal
17:20
annotation tasks for a team. And
17:22
I can maybe tell you a little bit more about
17:24
sort of in-production feedback, because that's something that's
17:26
actually where we started. Yeah, yeah, go ahead. I would
17:29
love to hear more. Yes, I think that because
17:31
it's so subjective for a lot of the
17:33
applications that people are building, whether it be
17:36
email generation, question answering, a
17:38
language learning app, there isn't
17:40
a correct answer, quote unquote.
17:42
And so people want to measure how things
17:44
are actually performing with their end users. And
17:47
so Humanloop makes it very easy to
17:49
capture different sources of end user feedback. And
17:51
that might be explicit feedback, things like thumbs
17:54
up, thumbs down votes that you see in
17:56
ChatGPT, but it can also be more
17:58
implicit signals. So how did the
18:00
user behave after they were
18:02
shown some generated content? Did they progress to
18:04
the next stage of the application? Did they
18:06
send the generated email? Did they
18:09
edit the text? All of
18:11
that feedback data becomes useful both
18:13
for debugging and also for
18:15
fine-tuning the model later on.
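To illustrate the shape of that feedback capture, here is a small sketch of logging a generation along with explicit and implicit end-user signals. The record structure and field names are assumptions for illustration, not Humanloop's logging API.

```python
import time
from dataclasses import dataclass, field

@dataclass
class GenerationLog:
    """Illustrative record tying end-user feedback back to one model output."""
    prompt_version: str
    inputs: dict
    output: str
    feedback: list = field(default_factory=list)

    def record(self, kind: str, value) -> None:
        # kind is explicit ("vote") or implicit ("sent_email", "edited_text", ...)
        self.feedback.append({"kind": kind, "value": value, "ts": time.time()})

log = GenerationLog("a1b2c3d4e5f6", {"question": "..."}, "Draft email ...")
log.record("vote", "up")          # explicit: thumbs up
log.record("sent_email", True)    # implicit: the user actually sent the draft
log.record("edited_text", False)  # implicit: no manual correction was needed
```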
18:18
That evaluation data becomes this rich resource that
18:20
allows you to continuously improve your application
18:22
over time. Yeah, that's awesome. I
18:25
know that that fits in. Maybe
18:28
you could talk a little bit about how
18:31
one of the things that you mentioned
18:33
earlier is you're seeing fewer people do
18:35
fine-tuning, which I see
18:38
very commonly as a... It's
18:40
not an irrelevant point, but it's maybe
18:42
a misconception where a lot of teams
18:45
come into this space and they just
18:47
assume they're going to be fine-tuning their
18:49
models. And what
18:51
they end up doing is fine-tuning
18:53
their workflows or their language model
18:55
chains or their retrieval, the data
18:58
that they're retrieving, or their prompt
19:00
formats or their templates or that
19:02
sort of thing. They're not really
19:04
fine-tuning. And I think there's this
19:06
really blurred line right now for
19:09
many teams that are adopting
19:12
AI into their organization where they'll
19:14
frequently just use the term, oh,
19:16
I'm training the AI to do
19:18
this and now it's better,
19:20
right? But all they've really done is just
19:22
inject some data into their prompts
19:25
or something like that. So could
19:27
you maybe help clarify
19:30
that distinction and also
19:32
in reality what you're seeing people
19:34
do with this capability of evaluation,
19:37
both online and offline, and
19:39
how that's filtering back into
19:42
upgrades to the system or
19:44
actual fine-tunes of models? Yeah.
19:47
So I guess you're right. And
19:49
especially for people who are new to the field,
19:51
the word fine-tuning has a colloquial meaning and then
19:53
it has a technical meaning in machine learning and
19:56
the two end up being blurred. So
19:58
fine-tuning in a machine learning context usually
20:00
means doing some extra training on the
20:03
base model, where you're actually changing
20:05
the weights of the model, given
20:07
some sets of example pairs of inputs, outputs
20:09
that you want.
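For reference, such input/output pairs often take a JSONL chat form like the sketch below (the format used by OpenAI's fine-tuning API, for instance); the example content itself is invented.

```python
import json

# Invented example pairs in the JSONL chat format commonly used for fine-tuning
# (e.g. by OpenAI's fine-tuning API). Each line is one input/output example.
examples = [
    {"messages": [
        {"role": "user", "content": "Summarize: the meeting covered Q3 budget and hiring."},
        {"role": "assistant", "content": '{"topics": ["Q3 budget", "hiring"]}'},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```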
20:11
And then obviously there's prompt engineering and
20:14
maybe context engineering, where you're changing the
20:16
instructions to the language model, or you're
20:18
changing the data that's fed into the context,
20:20
or how, you know, an agent
20:22
system might be set up. And both
20:24
are really important. Typically the
20:26
advice we give the majority of our
20:29
customers and what we see play out
20:31
in practice is that people should first
20:33
push the limits of prompt engineering. Because
20:36
it's very fast, it's easy to do,
20:38
and it can have like very high
20:40
impact, especially around changing the sort of
20:42
outputs and also in helping the model
20:44
have the right data that's needed to
20:47
answer the question. So prompt engineering is
20:49
kind of usually where most people start
20:51
and sometimes where people finish as well.
20:53
And fine tuning tends to be
20:55
useful either if people are trying
20:57
to improve latency or cost, or
21:00
if they have like a particular tone of voice
21:02
or output constraint that they want to enforce. So,
21:04
you know, if people want their
21:06
model to output valid JSON, then fine
21:09
tuning might be a great way to achieve that. Or
21:11
if they want to use a local private model cause
21:13
it needs to run on an edge device or something
21:15
like this, then fine tuning I think is a great
21:17
candidate. And it can also let you
21:20
reduce costs because oftentimes you can fine tune a
21:22
smaller model to get similar performance.
21:25
The analogy I like to use is fine tuning is
21:27
a bit like compilation, right? You've
21:29
already sort of built your first version of the
21:31
application. When you want to optimize it, you might
21:33
use a compiled language and you've got a kind
21:35
of compiled binary. I think
21:37
there was a second part to your question, but
21:39
just remind me, actually, I've lost the second part.
21:41
Yeah, basically you mentioned that
21:43
maybe fewer people are doing
21:46
fine tunes. Maybe
21:48
you could comment on, I
21:50
don't know if you have a sense
21:52
of why that is or how you
21:55
would see that sort of progressing into
21:57
this year as more and more people
21:59
adopt this technology and maybe get
22:01
better tooling around the, let's
22:04
not call it fine tuning so we don't
22:06
mix all the jargon, but the iterative
22:09
development of these systems, do
22:11
you see that trend continuing
22:13
or how do you see
22:15
that kind of going into maybe larger
22:18
or wider adoption in 2024?
22:21
Yeah, so I think that we've definitely seen
22:23
less fine tuning than we thought we would see
22:25
when we started, you know, when we launched
22:28
Humanloop, this version of Humanloop, back in 2022.
22:31
And I think that's been true of others
22:33
as well. Like I've spoken to friends at
22:35
OpenAI and OpenAI is expecting there will be
22:37
more fine tuning in the future, but they've
22:39
been surprised that there wasn't more initially. I
22:42
think some of that is because prompt engineering has turned
22:44
out to be remarkably powerful. And
22:46
also because some of the changes that people want
22:48
to do to these models are more about getting
22:51
factual context into the model. So
22:53
one of the downsides of LLMs
22:55
today is they're obviously trained on
22:57
the public internet. So they don't necessarily know private
22:59
information about your company. They tend not
23:02
to know information past the training date of the
23:04
model. And you know, one way
23:06
you might have thought you could overcome that is I'm
23:08
going to fine tune the model on my company's data.
23:11
But I think in practice, what people are finding is
23:13
a better solution to that is to
23:15
use a hybrid system of search
23:17
or information retrieval plus generation. So
23:19
what's come to be known as
23:21
like RAG or retrieval augmented generation has turned out
23:24
to be a really good solution to this problem.
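A minimal sketch of the RAG pattern being described: retrieve the most relevant document, then stuff it into the prompt as context. Real systems use embeddings and a vector store; this toy version uses keyword overlap purely to show the structure.

```python
# Toy corpus standing in for a vector store.
DOCS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
]

def retrieve(question: str) -> str:
    """Pick the document with the most word overlap (a stand-in for embedding search)."""
    q = set(question.lower().split())
    return max(DOCS, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(question: str) -> str:
    """Stuff the retrieved document into the prompt as grounding context."""
    context = retrieve(question)
    return (f"Answer using only the context below. If the answer is not there, say so.\n\n"
            f"Context: {context}\n\nQuestion: {question}")

print(build_prompt("How many days do I have to return an item?"))
```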
23:27
And so the main reasons to fine
23:29
tune now are more about optimizing cost
23:31
and latency and maybe a little bit
23:33
tone of voice, but they're
23:35
not needed so much to adapt the model
23:37
to a specific use case. And
23:40
fine tuning is a heavier duty operation
23:42
because it takes longer. You
23:44
can edit a prompt very quickly and then see what
23:46
the impact is. Fine tuning, you need to
23:48
have the data set that you want to fine tune on, and
23:51
then you need to run a training job and then
23:53
evaluate that job afterwards. So there are
23:55
certainly circumstances where it's going to make sense. I
23:57
think especially anyone who wants to go the
24:00
private open source model route will likely find themselves
24:02
wanting to do more fine tuning. But
24:04
the quality you can get out of prompt engineering and the distance you
24:06
can go with it, I think, took a lot of people
24:08
by surprise. And on that
24:11
note, you mentioned the closed proprietary
24:13
model ecosystem versus open models that
24:15
people might host in their own
24:17
environment and or fine tune on
24:20
their own data. I
24:22
know that Humanloop, like, you explicitly
24:24
say that you kind of have
24:27
all of the models: you're integrating these
24:29
sort of closed models and integrating with
24:32
open models. Why and
24:34
how did you kind of decide to
24:37
include all of those?
24:39
And in terms of the mix
24:42
of what you're seeing with people's
24:44
implementations, how do you
24:46
see this sort of proliferation of
24:48
open models impacting the workflows that
24:50
you're supporting in the future? So
24:53
the reason for supporting them again is largely
24:55
customer pull, right? What we were finding is
24:58
that many of our customers were
25:00
using a mixture of models for
25:02
different use cases, either because the
25:04
large proprietary ones had slightly different
25:06
performance trade offs or because
25:08
there were use cases where they cared about privacy
25:11
or they cared about latency. And so they couldn't
25:13
use a public model for those
25:15
instances. And so we had to
25:17
support all of them. It really
25:20
wouldn't be a useful product to our customers if
25:22
they could only use it for one particular model. And
25:25
the way we've got around this is that we try
25:27
to integrate all of the publicly available ones, but we
25:29
also make it easy for people to connect their own
25:31
models so they don't necessarily need
25:33
us. As long as they expose
25:36
the appropriate API, you can plug in any model
25:38
to Humanloop. That would be a matter of
25:41
hosting the model and making sure
25:43
that the API contract that you're
25:45
expecting in terms of responses from
25:48
a model server that maybe someone's
25:50
running in their own AWS or
25:52
wherever would fulfill that contract.
25:55
That's exactly right.
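A hedged sketch of what such an API contract might look like: an OpenAI-style /v1/chat/completions endpoint served with FastAPI. The route shape mirrors that common convention; run_local_model is a placeholder for your own inference code, and none of this is Humanloop-specific.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    model: str
    messages: list[dict]

def run_local_model(messages: list[dict]) -> str:
    """Placeholder for your own inference code (vLLM, llama.cpp, etc.)."""
    return "stub reply"

@app.post("/v1/chat/completions")
def chat(req: ChatRequest):
    # Mirror the OpenAI-style response shape so callers can treat this
    # self-hosted server like any other provider.
    return {
        "object": "chat.completion",
        "model": req.model,
        "choices": [{"index": 0,
                     "message": {"role": "assistant",
                                 "content": run_local_model(req.messages)}}],
    }
```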
25:58
And in terms of the proliferation of
26:00
open source and how that's going. I think
26:03
there's still a performance gap at the moment
26:05
between the very best closed models, so between
26:07
GPT-4 or some of the better models
26:09
from Anthropic and the best open
26:11
source, but it is closing, right? So the latest
26:13
models from say Mistral have
26:16
proved to be very good, Llama 2 was
26:18
very good. Increasingly, you're not
26:20
paying as big a performance gap, although
26:22
there is still one, but you
26:24
need to have high volumes for it to
26:26
be economically competitive to host your own model.
26:28
So the main reasons we see people doing
26:30
it are related to data privacy.
26:33
Companies that for whatever reason cannot
26:36
or don't want to send data to
26:38
a third party end up using
26:40
open source, and then also anyone who's
26:43
doing things on edge and who
26:45
wants real-time or very low latency ends
26:47
up using open source. This
26:54
is a changelog newsbreak. Vanna.AI
26:57
is a Python RAG
26:59
framework for accurate text
27:01
to SQL generation. It
27:04
lets you chat with
27:06
any relational database by
27:08
accurately generating SQL queries
27:10
trained via RAG, which
27:12
stands for retrieval augmented
27:14
generation, to use with
27:17
any LLM that you want. You
27:19
load up your data definitions, your
27:21
documentation, and any raw SQL queries
27:23
you have laying around into Vanna,
27:26
and then you're off to the
27:28
races. Vanna boasts high accuracy on
27:30
complex datasets, excellent security and
27:32
privacy because your database contents are never
27:35
sent to the LLM or
27:37
a vector DB. It boasts
27:39
the ability to self-learn by
27:41
choosing to auto train on
27:43
successful queries, and a choose
27:46
your own front end approach
27:48
with front ends provided for
27:50
Jupyter Notebook, Streamlit, Flask, and
27:52
Slack. You just heard
27:54
one of our five top stories
27:56
from Monday's changelog news. Subscribe to
27:59
the podcast to get
28:01
all of the week's top stories
28:03
and pop your email address in
28:05
at changelog.com/news to also receive our
28:07
free companion email with even more
28:09
developer news worth your attention. Once
28:12
again, that's changelog.com/news.
28:20
Well, Reza, I'd love for you to
28:22
maybe describe if you can, we've kind
28:24
of talked about the problems that you're
28:27
addressing. We've talked about the
28:29
sort of workflows that you're enabling the
28:31
evaluation and some trends that you're seeing.
28:33
But I'd love for you to describe
28:36
if you can maybe for like
28:38
a non-technical persona, like a domain
28:40
expert who's engaging with the Humanloop
28:42
system. And maybe
28:45
for a more technical person
28:47
who's integrating, you know, data
28:49
sources or other things, what
28:51
does it look like to
28:54
use the Humanloop system,
28:56
maybe describe the roles
28:58
in which these people are like
29:01
what they're trying to do from each
29:03
perspective, because I think that might be
29:05
instructive for people that are trying to
29:07
engage domain experts and technical people in
29:10
a collaboration around these problems. Absolutely. So
29:12
maybe it might be helpful to have
29:14
a kind of imagined concrete example. So
29:16
a very common example we see is
29:18
people building some kind of question answering
29:20
system, maybe it's for their internal customer
29:22
service staff, or maybe they want to
29:25
replace an FAQ. (Sorry, I'm
29:27
just gonna drink some water.) Maybe they're trying to
29:29
build some kind of internal question answering
29:31
system to replace something, or an
29:33
FAQ, or that kind of thing. So there's a set
29:35
of documents or questions going to come in, there'll be
29:37
a retrieval step and then they want to generate an
29:39
answer. So, typically
29:42
the PMs or the domain experts will be figuring out, you
29:44
know, what are the requirements of the system? What does good
29:44
look like? What do we want to build? And
29:49
the engineers will be building the
29:51
retrieval part, orchestrating all the model calls
29:53
and code, integrating the Humanloop API
29:55
into their system. And also,
29:57
usually they lead on setting
30:00
up evaluation. So maybe once
30:02
it's set up, the domain experts might continue
30:04
to do the evaluation themselves, but
30:06
the engineers tend to set it up the first
30:09
time. So if you're the domain expert, typically, you
30:11
would start off in our playground environment where you
30:13
can just try things out. So
30:15
the engineers might connect a database to Humanloop
30:17
for you. So maybe they'll store the data in
30:20
a vector database and connect that
30:22
to Humanloop. And then once you're in
30:24
that environment, you could try different prompts with the models;
30:26
you could try them with GPT-4, with Cohere, with an
30:29
open source model, see what impact that
30:31
has, see if you're getting answers that you
30:33
like, right? Oftentimes early on, it's not in
30:35
the right tone of voice, or the retrieval
30:37
system is not quite right. And so the
30:39
model is not giving factually correct answers. So
30:41
it takes a certain amount of iteration to
30:44
get to the point where even when you
30:46
eyeball it, it's looking appropriate. And usually at
30:48
that point, people then move to doing a
30:50
little bit more of a rigorous evaluation. So
30:52
they might generate either automatically or internally
30:55
a set of test cases. And
30:57
they'll also come up with a set of evaluation
30:59
criteria that matter to them in their context, they'll
31:02
set up that evaluation, run it,
31:04
and then usually at that point, they might
31:06
deploy to production. So that's the point at
31:08
which things would end up with
31:10
real users, and they start gathering user feedback. And
31:12
usually the situation is not finished at that
31:14
point, because people then look at the production
31:16
logs, or they look at the real usage
31:18
data, and they will filter based on the
31:21
evaluation criteria. And they might say, Hey, show
31:23
me the ones that didn't result in a
31:25
good outcome. And then they'll try and debug
31:27
them in some way, maybe make a change
31:29
to a prompt, rerun the evaluation and submit
31:31
it. And so the engineers
31:33
are doing the orchestration of the code,
31:36
they're typically making the model calls, they'll
31:38
add logging calls to Humanloop. So
31:40
the way that works, there's
31:42
a couple of ways of doing the integration, but you
31:44
can imagine every time you call the model, you're
31:46
effectively also logging back to Humanloop what the
31:48
inputs and outputs were, as well as any user
31:51
feedback data. And then the domain
31:53
experts are typically looking at the data,
31:55
analyzing it, debugging, making decisions about how
31:57
to improve things. And they're able to
32:00
actually take some of those actions themselves
32:02
in the UI. Yeah. And
32:04
so if I just kind of
32:06
abstract that a bit to maybe
32:09
give people a frame of thinking, it
32:11
sounds like there's kind of this framework
32:13
set up where there's data sources, there's
32:17
maybe logging calls within a
32:19
version of an application. If
32:23
you're using a hosted model or if you're
32:25
using a proprietary
32:27
API, you decide
32:30
that. And so it's kind of set
32:32
up and then there's maybe an evaluation
32:35
or prototyping phase, let's call it
32:37
where the domain experts try their
32:39
prompting. Eventually, they find prompts that
32:41
they think will work well for
32:43
these various steps in a
32:45
workflow or something like that. Those
32:47
are pushed, as you said, I
32:49
think one way into the actual
32:51
code or application such that
32:53
the domain experts are in charge
32:56
of the prompting to some degree.
32:58
And as you're logging feedback into
33:01
the system, the domain experts
33:03
are able to iterate on their prompts, which
33:05
hopefully then improve the system. And those are
33:07
then pushed back into the production
33:10
system maybe after an evaluation or
33:12
something. Is that a fair representation?
33:15
Yeah, I think that's a great representation. Thanks
33:17
for articulating it so clearly. And the kinds
33:19
of things that the evaluation becomes useful for
33:21
is avoiding regressions, say. So people might notice
33:23
one type of problem, they go in and
33:25
they change a prompt or they change the
33:28
retrieval system and they want to make sure
33:30
they don't break what was already working. And
33:32
so having good evaluation in place helps with
33:35
that. And then maybe it's also worth, because
33:37
I think we didn't sort of
33:39
do this at the beginning, just thinking about
33:41
what are the components of these LLM applications?
33:43
So I think you're exactly right. We sort of
33:45
think of the blocks of an LLM app being composed
33:48
of a base model. So that might be a
33:50
private fine tune model or one of these large
33:52
public ones. A prompt template,
33:54
which is usually an instruction to the model that
33:56
might have gaps in it for
33:59
retrieved data or context; a
34:01
data collection strategy. And
34:04
then that whole thing of data collection,
34:06
prompt template, and model might
34:08
be chained together in a loop or
34:10
might be repeated one after another.
34:13
And there's an extra complexity, which is
34:15
the models might also be allowed to
34:17
call tools or APIs. But
34:20
I think those pieces taken
34:22
together more or less comprehensively cover things.
34:24
So tools, data retrieval, prompt template, and
34:27
base model are the main components.
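One way to picture those components is as a simple data structure; the sketch below is an illustrative composition of base model, prompt template, retrieval, and tools, with all names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class LLMAppStep:
    """One block of an LLM app: base model, prompt template with gaps for
    retrieved context, an optional retrieval strategy, and callable tools."""
    base_model: str
    prompt_template: str
    retriever: Optional[Callable[[str], str]] = None
    tools: list = field(default_factory=list)

# Steps can be chained (or looped) one after another to form the wider app.
pipeline = [
    LLMAppStep("gpt-4", "Rewrite the question for search: {question}"),
    LLMAppStep("gpt-4", "Answer from context: {context}\n\nQ: {question}",
               retriever=lambda q: "retrieved documents go here"),
]
print(len(pipeline))
```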
34:29
But then within each of those, you have a lot of
34:32
design choices and freedom. And so you
34:34
have a combinatorially large number of decisions to
34:36
get right when building one of these applications.
34:39
One of the things that you mentioned is this
34:42
evaluation phase of what goes
34:44
on as helping prevent regressions.
34:47
Because in testing, behaviorally,
34:49
the output of the models, you
34:51
might make one change on a
34:53
small set of examples that looks
34:56
like it's improving things, but has
34:58
different behavior across a wide range
35:00
of examples. I'm wondering
35:02
also, I could imagine
35:04
two scenarios. Models are
35:07
being released all the time, whether it's upgrading
35:09
from this version of a GPT
35:11
model to the next version or this
35:13
Mistral fine tune to this one over
35:15
here. I'm thinking even in the
35:18
past few days, we've been
35:20
using the Neural Chat model from Intel
35:22
a good bit. And there's a version
35:24
of that, that Neural Magic released, that's
35:26
a sparsified version,
35:29
where they pruned out some
35:31
of the weights and the layers to
35:33
make it more efficient and to
35:35
run on, not better
35:37
hardware, but more commodity hardware that's more
35:39
widely available. And so one of the
35:41
questions that we were discussing is, well,
35:44
we could flip the version of this
35:46
model to the sparse one, but we
35:48
have to decide on how
35:50
to evaluate that over the use cases
35:52
that we care about. Because you could
35:54
look at the output for a few
35:56
test prompts, and it might
35:58
look similar, or good, or even
36:01
better, but on a wider scale
36:03
might be quite different in ways
36:05
that you don't expect. So I
36:07
could see that the evaluation also being used for
36:09
that, but I could also see where if you're
36:12
upgrading to a new model, it
36:14
could just throw everything up in the air
36:16
in terms of like, oh,
36:18
this is an entirely different prompt format,
36:21
or this is a whole
36:23
new behavior from this new
36:25
model that is distinct from
36:27
an old model. So how are
36:29
you seeing people navigate that landscape
36:32
of model upgrades? I think
36:34
you should just view it as a change, as you
36:36
would a change to any other part of the system. And hopefully
36:38
the desired behavior of the model is not changing. So
36:41
even if the model is changed, you
36:43
still want to run your regression test and
36:46
say, are we meeting a minimum threshold that
36:48
we had on these gold standard test sets
36:50
before?
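A minimal sketch of that regression gate: rerun a gold standard test set after any change and fail if the mean evaluator score drops below the previously accepted bar. score_fn is a stand-in for calling the candidate model plus an evaluator.

```python
def regression_gate(score_fn, test_set, threshold: float) -> float:
    """Rerun the gold standard set after a change; fail if the mean score
    drops below the bar the previous version met."""
    mean = sum(score_fn(case) for case in test_set) / len(test_set)
    if mean < threshold:
        raise SystemExit(f"Regression: mean score {mean:.2f} < {threshold:.2f}")
    return mean

# Toy usage; a real score_fn would call the candidate model plus an evaluator.
test_set = [{"question": "2+2?", "answer": "4"}]
print(regression_gate(lambda case: 1.0, test_set, threshold=0.8))
```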
36:52
In general, I think with evaluation, we see it happening at three
36:54
different stages during development. There is
36:57
during this interactive stage very early on,
36:59
when you're prototyping, you want fast feedback,
37:01
you're just looking to get a sense
37:03
of is this even working appropriately? At
37:06
that stage, eyeballing examples and looking at
37:08
things side by side in a very
37:10
interactive way can be helpful. And interactive
37:13
testing can also be helpful for adversarial
37:15
testing. So a fixed test set
37:17
doesn't tell you what will happen when
37:19
a user who actually wants to break the system
37:21
comes in. So a concrete example of this, one
37:24
of our customers has children
37:27
as their end users, and they want to
37:29
make sure that things are age appropriate. So they
37:31
have guardrails in place. But when
37:33
they come to test the system, they
37:35
don't want to just test it
37:38
against an input that's benign.
37:40
They want to see if we try, if we
37:42
really red team this, can we break it? And
37:45
there, interactive testing can be very helpful. And
37:48
then the next place where you want testing in
37:50
place is this regression testing, where you
37:52
have a fixed set of evaluators on a test set, and
37:54
you want to know when I make a change, does it
37:56
get worse? And the final place we see
37:58
people using it is actually for monitoring. So, okay,
38:01
I'm in production now. There's new
38:03
data flowing through. I may not have the ground
38:05
truth answer, but I can still set up different
38:07
forms of evaluator. And I want
38:09
to be alerted if the performance drops below
38:11
some threshold.
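Sketched in code, that production monitoring might look like the snippet below: evaluators score a window of live traffic (no ground truth required, e.g. a model-as-judge score) and an alert fires when the mean drops below a threshold. The function and names are illustrative.

```python
import statistics

def monitor(window_scores: list[float], threshold: float, alert) -> None:
    """Score a window of live traffic (e.g. with a model-as-judge evaluator,
    since production data has no ground truth) and alert on a drop."""
    mean = statistics.mean(window_scores)
    if mean < threshold:
        alert(f"evaluation score {mean:.2f} fell below {threshold:.2f}")

monitor([0.9, 0.4, 0.5], threshold=0.7, alert=print)
```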
38:14
So one of the things that I've been thinking about
38:16
throughout our conversation here, and that's
38:18
I think highlighted by what you just mentioned and
38:20
sort of the upgrades to
38:22
one's workflow and the various
38:25
levels at which such a
38:27
platform can benefit teams.
38:30
And it made me think of, you
38:32
know... I have a background
38:35
in physics and there were plenty of
38:37
physics teams or collaborators that we
38:39
worked with, you know, we were writing code and
38:42
not doing great sort of
38:44
version control practices and not
38:46
everyone was using GitHub. And
38:48
there's sort of collaboration
38:51
challenges associated with
38:53
that, which are obviously
38:55
solved by great code collaboration systems
38:57
that are of various forms that
38:59
have been developed over time. And
39:03
I think there's probably a parallel
39:05
here with some of the collaboration
39:07
systems that are being built around
39:09
both playgrounds and prompts and evaluation.
39:12
I'm wondering if you could, if
39:15
there's any examples from clients
39:17
that you've worked with, or
39:19
maybe it's just interesting use cases
39:21
of surprising things they've been able to
39:23
do when going from sort
39:25
of doing things ad hoc
39:28
and maybe versioning prompts in spreadsheets
39:30
or whatever it might be to
39:33
actually being able to work in
39:35
a more seamless way between domain
39:37
experts and technical staff. Are
39:39
there any clients or use
39:41
cases or surprising stories that
39:44
come to mind? Yeah, it's a good question. I'm
39:46
kind of thinking through them to see, you know,
39:48
what the more interesting examples might be. I
39:51
think that fundamentally, it's not
39:53
necessarily enabling completely new behavior,
39:55
right? But it's making the
39:57
old behavior significantly faster and less
40:00
error prone. So, you know,
40:02
certainly fewer mistakes and less time
40:04
spent. You know, one, okay, so a
40:06
surprising example: a publicly listed company, and
40:08
they told me that one of
40:10
the issues they were having is,
40:13
because they were sharing these prompt
40:15
configs in Teams, they were
40:17
having differences in behavior based on white space
40:19
being copied. So, you know, someone was
40:21
like playing around in the OpenAI playground,
40:24
they'd copy-pasted into Teams, that person
40:26
would copy-paste from Teams into code. And
40:29
there were small white space differences, and
40:31
you wouldn't expect it to affect
40:33
the models, but it actually did. And so
40:35
they would then get performance differences they
40:37
couldn't explain. And actually, it just turned
40:39
out that, you know, you shouldn't be
40:41
sharing your prompts via chat.
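That whitespace bug in miniature: two prompts that look identical in a chat window but fingerprint (and can behave) differently. Content-addressing prompt versions makes that drift visible immediately; the snippet is illustrative.

```python
import hashlib

def fingerprint(prompt: str) -> str:
    """Content-address a prompt so even invisible edits change its version."""
    return hashlib.sha256(prompt.encode()).hexdigest()[:8]

a = "Summarize the call.\nBe concise."
b = "Summarize the call. \nBe concise."  # trailing space picked up in copy-paste

print(a == b, fingerprint(a), fingerprint(b))  # False, and different fingerprints
```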
40:44
Right. So I guess that's one surprising example.
40:46
I think another thing as well is
40:48
the complexity of apps that people are
40:50
now beginning to be able to build.
40:53
So increasingly, I think people
40:55
are building simple agents,
40:57
right, I think more complex agents are still
41:00
not super reliable. But a trend that we've
41:02
been hearing a lot about from our customers
41:04
recently, is people trying to
41:06
build systems that can use their
41:09
existing software. So you know,
41:11
an example of this is, you know,
41:13
Ironclad is a company that's added a
41:15
lot of LLM based features to their
41:17
product, and they actually are able to
41:19
automate a lot of workflows that were
41:21
previously being done by humans,
41:24
because the models can use the API that
41:26
exists within the Ironclad software. So they're actually,
41:28
you know, able to leverage their existing infrastructure.
41:30
But to get that to work, they had
41:33
to innovate quite a lot in tooling. And
41:35
in fact, you know, this isn't a plug
41:37
for Humanloop: Ironclad, in this case, built
41:40
a system called Rivet, which is their
41:42
own open source, you know, prompt engineering
41:44
and iteration framework. But I think it's
41:46
a good example of, you know, in
41:48
order to achieve the complexity of that
41:50
use case, this happened to be
41:52
before tools like Humanloop were around; they had to build
41:54
something themselves. And it's quite sophisticated
41:56
tooling; actually, Rivet is great, so people should check
41:58
that out as well. It's an open source
42:01
library; anyone can go and get the tool. So
42:03
yeah, I think the surprising things are like
42:05
how error prone things are without good tooling
42:07
and, and the crazy ways in which
42:09
people are solving problems. Another example of a mistake that
42:12
we saw someone make is two
42:14
different people triggered exactly the same annotation
42:16
job. So they had annotation in
42:18
spreadsheets, and they both outsourced
42:20
the same job to different annotation
42:22
teams, which is obviously an
42:24
expensive mistake to make. So very
42:26
error prone. And then I think also just
42:28
like impossible to scale to
42:31
more complex, agentic use cases. Well,
42:33
you already kind of alluded to
42:35
some trends that you're seeing moving
42:38
forward, as we kind of draw
42:40
to a close here, I'd love
42:43
to know from someone who's seeing
42:45
a lot of different use cases
42:48
being enabled through Humanloop and
42:50
your platform, what's exciting for you
42:53
as you move into this next year
42:55
in terms of maybe it's
42:57
things that are happening in AI more broadly,
43:00
or things that are being enabled
43:02
by human loop or things that are
43:04
on your roadmap that you can't wait
43:07
for them to go live. What, as
43:09
you're lying in bed at night and getting
43:11
excited for the next day of AI
43:14
stuff, what's on your mind? So
43:16
AI more broadly, I just feel
43:18
the rate of progress of capabilities is
43:20
both exciting and scary, right? It's
43:22
extremely fast: multimodal models, better generative
43:25
models, models with increased reasoning. I
43:27
think the range of possible applications
43:29
is expanding very quickly as the
43:31
capabilities of the models expand. I
43:34
think people have been excited about agent use
43:36
cases for a while, right? Systems that
43:38
can act on their own and go off
43:41
and achieve something for you. But in
43:43
practice, we've not seen that many people succeed
43:45
in production with those. There are a couple of examples,
43:48
ironclad being a good one. But it
43:50
feels like we're still at the very beginning of
43:52
that. And I think I'm excited about seeing more
43:54
people get to success with that. I'd
43:56
say that the most common, you know, successful
43:59
applications we've seen today are mostly
44:01
either retrieval augmented applications
44:03
or more simple LLM
44:05
applications. But increasingly, I'm
44:08
excited about seeing agents in production and
44:10
also multimodal models in production. In
44:12
terms of things that I'm particularly excited
44:14
about from Humanloop is I think us
44:17
becoming a proactive rather than a passive
44:19
platform. So today, the product
44:21
managers and the engineers drive the changes
44:23
on Humanloop. But I think that's
44:25
something that we're going to hopefully release later this year
44:27
is actually the system,
44:30
Humanloop itself can start proactively suggesting improvements
44:32
to your application. Because we have the
44:34
evaluation data, because we have all the
44:37
prompts, we can start saying things to
44:39
you like, hey, we have a
44:41
new prompt for this application. It's a lot shorter than
44:43
the one you have. It scores similarly on eval data.
44:45
If you upgrade, we think we can cut your costs
44:47
by 40% and allowing
44:50
people to then accept that change. And
44:52
so going from a system that is
44:54
observing to a system that's actually intervening.
44:56
That's awesome. I definitely look
44:59
forward to seeing how that rolls out
45:01
and really appreciate the work that you
45:03
and the team at Humanloop are doing
45:05
to help us upgrade our workflows and
45:08
enable these sort of more complicated use
45:10
cases. So thank you so much for
45:12
taking time out of that work to
45:14
join us. It's been a pleasure. Really
45:16
enjoyed the conversation. Thanks so much for
45:19
having me, Daniel. All right.
45:28
That is Practical AI for this week.
45:31
Subscribe now. If you haven't
45:34
already, head to practicalai.fm for
45:37
all the ways. And join our
45:39
free Slack team, where you can hang
45:41
out with Daniel, Chris, and the entire
45:43
change log community. Sign
45:45
up today at
45:47
practicalai.fm slash community.
45:50
Thanks again to our partners at
45:52
fly.io, to our beat freak in residence,
45:54
Breakmaster Cylinder, and to you for
45:56
listening. We appreciate you spending time
45:58
with us. That's all for
46:00
now, we'll talk to you again next time.