Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements may have changed.
0:00
Machine learning models learn patterns and
0:02
relationships from data to make predictions
0:04
or decisions. The quality of
0:06
the data influences how well these models
0:08
can represent and generalize from the data.
0:11
Nihit Desai is the cofounder
0:13
and CTO at Refuel AI.
0:16
The company is using LLMs for
0:18
tasks such as data labeling, cleaning,
0:20
and enrichment. He joins the show
0:22
to talk about the platform and how to manage
0:25
data in the current AI era. This
0:27
episode of Software Engineering Daily is
0:29
hosted by Sean Falconer. Check
0:32
the show notes for more information on Sean's work
0:34
and where to find him.
0:46
Welcome to the show, Nihit.
0:48
Yeah, thank you so
0:51
much for being here. I'm really excited to talk
0:53
about Refuel and some of the cool things that
0:55
you guys are doing over there. And, you know,
0:57
as I was sort of preparing for us
0:59
having this conversation, I think
1:01
generally people can understand now that we're
1:03
entering this AI revolution, like
1:06
everyone was talking about AI, generative AI, Gen
1:08
AI, and LLMs over the last year
1:10
and a half and calling this the AI
1:12
era. But really, there's no AI
1:14
without data, and particularly high quality data. And
1:16
as you've stated, and I probably stole
1:18
this from someone, but data's sort of the
1:21
language of AI. In
1:23
a lot of ways the less sexy headline is
1:25
that we're entering the massive quantity of quality AI
1:27
data era, which probably sounds less
1:30
exciting than a headline that just wants clicks, but it's kind of
1:32
the reality. So can you talk a little
1:34
bit about why data for AI is so
1:36
important, and some of the challenges with
1:38
accessing quality data?
1:40
So in some sense, data quality is everything. It's
1:42
the source of knowledge and behavior that
1:45
the model will learn from, that any
1:47
AI system will learn from. And it
1:50
bounds the performance of it. Any
1:53
AI system is limited by how good
1:55
the data is and how much of it
1:57
is there and how representative the data
1:59
is of the final application, the
2:01
final use cases, that the system
2:03
will be powering. There's a few
2:05
challenges when it comes to acquiring good
2:08
quality data for AI systems today.
2:11
And roughly, like, maybe I can
2:13
walk through what the challenges are
2:15
at each step of the pipeline. I think
2:17
right at the top of that is collection
2:19
or acquisition of data. There's a
2:21
wide range of sources that data
2:24
can come from. There's the public web data, where
2:26
there's challenges around the scale of it,
2:28
the freshness, assessing the reliability
2:30
of data sources. There's user
2:32
data that's publicly available but of
2:34
course might be gated. There's
2:36
some challenges around, I think, like
2:38
privacy policies on various platforms,
2:41
platforms like Twitter and
2:43
YouTube, et cetera. There's a lot
2:45
of creative work, art, other kinds
2:47
of data like images, music,
2:49
books, copyright, et cetera. But even when
2:52
your project is collecting and acquiring all
2:54
of this data, it's expensive, right? It's
2:56
expensive to acquire it, to store it,
2:58
to keep it fresh. And this is literally
3:00
just the first step of the process.
3:03
We haven't done anything meaningful with this
3:05
data yet. And then the second step
3:08
is cleaning and curation. Here
3:10
there's a second set of challenges and questions around,
3:13
okay, how do we ensure that this
3:15
data is representative in some sense? Across
3:17
geographies, across cultural nuances, across
3:20
languages, and so on. A lot of the
3:22
questions here tend to be focused on,
3:24
like, how do we ensure that the data that
3:27
we're feeding to these AI systems is
3:29
representative of the audience, or like
3:31
the user base, and the kind
3:34
of applications people will be using this
3:36
for in the future. And then, like,
3:38
data curation is important for
3:40
efficient training. A lot of
3:42
the data sets that are big in
3:44
the real world tend to be not
3:46
curated. There's overrepresentation and under representation of
3:49
various slices of it, and it's really
3:51
important to do things like
3:53
deduplicate and normalize it, so as to
3:55
make sure that your model isn't wasting time,
3:57
in some sense, looking at and learning
3:59
from data that's quite redundant.
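As a rough illustration of that deduplication and normalization step, here is a minimal sketch, assuming a simple exact/near-exact notion of duplicates (real curation pipelines also do fuzzy and semantic dedup):

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace so trivially
    # different copies of the same record hash to the same key.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(records: list[str]) -> list[str]:
    seen, unique = set(), []
    for record in records:
        key = hashlib.sha256(normalize(record).encode("utf-8")).hexdigest()
        if key not in seen:            # keep only the first copy of each record
            seen.add(key)
            unique.append(record)
    return unique

docs = ["The cat sat on the mat.", "the cat  sat on the mat", "A different sentence."]
print(deduplicate(docs))               # the near-identical second copy is dropped
```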
4:02
And then there's the last kind of challenge of
4:04
just enrichment and labeling, right? Like you've collected all
4:06
of your data, you've cleaned it, normalized it, curated
4:09
it. And now there's the question of,
4:11
okay, how do we label this data so
4:13
that the model can actually
4:16
learn something meaningful from it? Traditionally, like
4:18
all the data labeling has been kind
4:20
of very, very human, operationally intensive manual
4:22
labeling. It's kind of both time consuming,
4:24
it's prone to errors, it's prone to
4:26
biases. And there's like the question
4:28
of how do we ensure that these human
4:30
preferences again are representative of the entire user
4:32
base. These are like some of
4:35
the kind of challenges I would highlight. Yeah,
4:37
so there's a lot of impact there. So just going
4:39
back to like data
4:41
quality, would you see that the quality of
4:43
the data is also one of the things
4:45
that's like sort of like a competitive edge?
4:48
If we're thinking about like LLMs and particular,
4:50
like there's all these, you know, there's a
4:52
ton of models available and like is really
4:54
the sort of separator in some
4:56
sense between, you know, I don't know, a
4:58
Llama 2 and a Mistral, the quality of
5:00
the inputs, because this is kind of like,
5:03
you know, garbage in garbage out. You
5:05
describe it very well. In some sense, like we
5:08
can think of the two
5:10
axes for improving performance
5:12
of any AI system are the data
5:15
axis and the model axis. And in
5:17
the limit, broadly, you know,
5:19
the kinds of model architectures,
5:22
the training schemes, just broadly,
5:24
how do we get these models to learn all
5:26
of that, we see a convergence of
5:28
right, everything from GPT-3.5
5:30
to Claude to Llama to Mistral,
5:32
all of them are probably the
5:35
same architectures. And it's the
5:37
same architecture. It's based on this paper that
5:39
came out in 2017 2018,
5:41
I believe from Google that introduced the
5:43
transformer architecture. Right.
5:46
Yeah, exactly. Yeah, attention is all you need.
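For reference, the core operation that paper introduced, scaled dot-product attention, is small enough to sketch directly. This is a toy NumPy version that leaves out multi-head projections and masking:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d) arrays of query, key, and value vectors.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # pairwise similarity of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # weighted mix of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```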
5:49
And so, really, the biggest axis
5:51
that we I mean, we've seen kind of in
5:54
our user customer base that people and probably what
5:56
we see at large in the ecosystem is that
5:58
the data axis, like, how do you
6:00
acquire, collect, clean
6:02
data at scale, and then
6:04
that becomes like the differentiator
6:06
over time, because that's what
6:08
leads to differentiated performance and behavior
6:11
in the model. So this
6:13
seems like there's a lot of, I
6:16
guess like problems with actually getting the data to
6:18
a state where you would want to train on, because
6:20
part of it's like, okay, first how do you
6:22
get access to the data? But then
6:24
there's also you need to navigate potentially things
6:26
like, especially if you're scraping data from the
6:28
web, copyright, I know
6:30
for example, there's been issues
6:33
around like pulling source code
6:35
from GitHub and what open source licenses it
6:37
on, and how does that impact
6:39
code that gets generated? Is it under the
6:41
same license as the code that inspired it? Or
6:43
if I pull information from a book that
6:45
was scraped, is that, you know, copyright infringement
6:48
and stuff. And then there's ethical issues on
6:50
top of that. And then there's also all
6:52
the labeling data. So what are some of
6:54
the things that essentially companies are doing today
6:56
to try to like navigate this, the data
6:58
collection and cleaning and labeling
7:00
process? Like, how is that done essentially?
7:03
Are they using tools, or is it
7:05
mostly like a manual process in
7:07
some sense? Yeah, it's a good question. So I
7:09
would say there's a few categories of problems that
7:11
we just highlighted here. Everything to do
7:14
with collection of
7:16
vast quantities of data, some of which is
7:19
copyrighted, some of which has licensing issues, some of
7:21
it like has kind of questions associated with them
7:23
as well. I think a lot of this is
7:26
in the domain of what we think
7:28
of as language model pre-training. We're
7:31
starting completely from scratch in
7:34
the sort of model parameters or model weights.
7:37
And we're feeding it just trillions
7:39
of tokens of these kind of,
7:41
in some sense, human generated data, because
7:44
that is ultimately kind of what the sum total
7:46
of the internet represents. And
7:48
we're just training the model
7:50
to learn this representation. It's
7:52
completely task-agnostic, it's use-case-agnostic. We're
7:54
just getting it to learn our
7:57
language and hence the language model. And
8:00
of course, I mean, I use the term
8:02
language model a little bit loosely. Yeah,
8:05
it is meant to cover other modalities of data
8:07
as well. Audio, video, text,
8:09
images, all of them broadly work the same
8:11
way as the transformer side. So
8:13
I think there's a specific set of,
8:15
well, there aren't that many companies doing
8:18
large scale model pre-training yet. So I think
8:20
those challenges do tend to be focused and
8:22
limited to the OpenAIs, the Anthropics,
8:25
the Metas of the world. And I
8:28
believe, yeah, that it's of course, like an
8:30
active ongoing, I think area of discussion and
8:32
debate around what's okay, what's not okay, how
8:34
should artists, writers, et
8:37
cetera, whose work is being used directly
8:39
or indirectly, how should they be compensated, and
8:42
should they be asked for permission, et cetera. And
8:44
then there's like a set of problems that's
8:46
a little bit downstream of that, which where
8:49
the prevalence of that problem is much
8:51
more widespread. It's pretty much every organization,
8:53
every team that wants to use AI
8:55
systems in some way, which is around
8:58
how do we, once we
9:00
take this pre-trained model off the shelf, chances
9:02
are it's going to be great
9:04
for prototyping, but then it's not going to
9:06
be good enough to take into production as
9:08
it is, right? And it typically needs some
9:10
tweaking, some customization, either in the form
9:12
of what we call in
9:14
context, learning, fine tuning, potentially some
9:16
reward modeling on top of it,
9:18
where the type of data that
9:20
we need to label and collect
9:22
is much more use case specific.
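As a concrete, hypothetical illustration of what "use case specific" labeled data often looks like, here is a single instruction-tuning style record; the field names and labels are made up for the example rather than any provider's required schema:

```python
# One hypothetical fine-tuning example: the instruction encodes the task,
# the input is the raw text, and the output is the label we want the model
# to reproduce. Thousands of such pairs make up a task-specific dataset.
example = {
    "instruction": "Classify the support ticket into one of: billing, bug, feature_request.",
    "input": "I was charged twice for my subscription this month.",
    "output": "billing",
}
```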
9:25
And there are of course a set of challenges that are a little
9:27
bit different in terms of how do you
9:29
label this data? How do you ensure that the
9:31
evaluation that we're doing is good and relevant to
9:33
your task? So maybe those are like
9:35
the two sets of things there. I'm a lot
9:37
more familiar with the latter, so I'm happy to dig into that.
9:41
Yeah. I mean, I think you raise the point
9:43
there in terms of the thing with the foundation
9:45
model is probably going to be only
9:47
like an activity that's taken on by very
9:50
specialized companies that are able to do that
9:52
at scale, have the means to do it.
9:54
Kind of like public cloud, like not everybody's
9:56
building a public cloud. It's like four companies that are doing
9:58
that, right? So eventually, there'll probably
10:00
be even more convergence from the foundation models
10:03
where maybe there's only going to be, you
10:05
know, sort of four or five companies that
10:07
are really doing that, that have the means
10:09
to keep that going and continually update them
10:12
and do that at scale and purchase the GPUs
10:14
and all that sort of stuff. So, but
10:17
a lot of companies are going to be able to
10:19
benefit from those and will be starting with that as a base. And then
10:21
they're sort of, you know, modifying them through
10:23
fine tuning or other means to
10:25
build more domain specific things that solve
10:28
like application problems that are
10:31
for their companies. As
10:38
a listener of software engineering daily, you
10:40
understand the impact of generative AI. On
10:43
the podcast, we've covered many exciting aspects
10:46
of Gen AI technologies, as well as
10:48
the new vulnerabilities and risks they bring.
10:51
HackerOne's AI Red teaming addresses the
10:53
novel challenges of AI safety and
10:55
security for businesses launching new AI
10:58
deployments. Their approach involves stress
11:00
testing AI models and deployments to make
11:02
sure they can't be tricked into providing
11:04
information beyond their intended use and that
11:07
security flaws can't be exploited to access
11:09
confidential data or systems. Within
11:11
the HackerOne community, over 750
11:14
active hackers specialize in prompt hacking and
11:16
other AI security and safety testing. In
11:19
a single recent engagement, a team of
11:21
18 HackerOne hackers quickly
11:23
identified 26 valid findings
11:25
within the initial 24 hours
11:27
and accumulated over 100 valid findings
11:30
in the two week
11:32
engagement. HackerOne offers strategic
11:34
flexibility, rapid deployment, and
11:36
a hybrid talent strategy.
11:38
Learn more at hackerone.com/AI.
11:40
That's hackerone.com/AI.
11:43
So
11:53
we've been talking a lot about like, you know,
11:55
labeled data and some of these other challenges, but
11:57
like for those that maybe are less, you know,
11:59
familiar with the world
12:01
of AI and how training works. Can you give a
12:03
little bit more of an explanation
12:05
of what is labeled data and why it's
12:07
important for AI? It's
12:10
best to think of these AI
12:12
models or the architectures behind them,
12:14
probably, that these are senior networks,
12:16
as function
12:19
approximation machines. What I mean
12:21
by that is, let's imagine that all
12:23
of these AI systems fundamentally take as
12:26
input some observations about the real
12:28
world, about users, about customers, about
12:31
something. Then, they're trying to make
12:34
some meaningful prediction from that. This
12:37
relationship between what we
12:39
observe and what we want the model to
12:41
predict is, in some sense, it
12:43
can be, think of it conceptually as some
12:46
function. This function could be very
12:48
high dimensional, it could be indeterministic, it
12:50
could be, and sometimes it
12:52
can't even be enumerated in many cases.
12:56
That is the function that we're trying these
12:58
models to get to approximate as best
13:00
as they can. Really,
13:03
the only way to do this is for
13:06
the model, for these AI systems to see
13:09
kinds of labeled data and then have
13:12
some systematic way that they can learn from it,
13:14
which is probably what the training and the optimization
13:16
for these models is about. At
13:18
the core of it, that's why labeled data, vast
13:20
quantities of it and good quality of
13:22
it is important because that is what
13:24
they feed to these AI systems so
13:27
that they can learn from it and
13:30
approximate this functional mapping and generalize
13:32
it to unseen, to unknown use
13:34
cases. Conceptually, that's why
13:36
labeled data is important.
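A toy version of that idea, under the simplifying assumption that the unknown function is just a line: fit it from labeled (observation, target) pairs by gradient descent, standing in for the far higher-dimensional mappings he describes:

```python
import numpy as np

# Labeled data: observations x and the values y we want the model to predict.
# The true relationship (unknown to the model) is y = 3x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3 * x + 1 + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0                      # model parameters, starting from scratch
for _ in range(500):                 # training: nudge parameters toward the labels
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)
    grad_b = 2 * np.mean(pred - y)
    w -= 0.1 * grad_w
    b -= 0.1 * grad_b

print(round(w, 2), round(b, 2))      # approaches 3.0 and 1.0, approximating the mapping
```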
13:39
There's many instantiations of it. If you want
13:42
to train a self-driving car, it has
13:44
to learn from millions of
13:46
hours of humans driving cars and
13:49
seeing what is the right next move, how
13:51
do they anticipate the next action and reactions
13:53
down the line and so on. You
13:56
want to train a really good
13:58
chatbot to reply to customer support tickets.
14:00
Okay. Yeah. It has to learn that
14:02
behavior by seeing tons of it in
14:05
action. Right. We want to, well, I mean, we want
14:07
to build the best search engine in the world. Okay.
14:10
Yeah. Actually that happens because Google
14:12
has just billions and billions of search
14:15
and user action data, right? That it can learn
14:17
from. I mean, what counts
14:19
as a label, I should say is like
14:21
it varies a little bit. And that's what
14:23
differs in the realm of pre-training versus
14:25
fine-tuning. With pre-training, we're starting
14:28
with kind of this very large
14:30
corpus of what is unlabeled
14:32
data and then transforming it so that
14:34
there is still some supervision from it,
14:37
right? So there's like a few different
14:39
ways to do pre-training, but a common
14:41
way to do this is to mask
14:43
out some specific parts
14:45
of that input, right? Could be
14:47
words, characters, tokens, entire sentences, and
14:49
then ask the model
14:52
to predict what is
14:54
the thing that is masked out, right? And
14:56
in some sense, then you have the model's
14:58
predictions and then the ground truth,
15:01
which is kind of what is actually known,
15:03
but just masked out from the model and that
15:05
gives, that is some way of creating labeled
15:07
data, in some sense, for the models to learn
15:09
from, right?
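A minimal sketch of how masking turns raw, unlabeled text into (input, target) pairs; the whitespace tokenizer here is a deliberate simplification of what real pre-training pipelines use:

```python
import random

def make_masked_example(sentence: str, mask_rate: float = 0.25):
    tokens = sentence.split()                      # toy tokenizer: whitespace split
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok                       # the hidden token is the "label"
        else:
            masked.append(tok)
    return masked, targets

random.seed(0)
inputs, labels = make_masked_example("the quality of data bounds model performance")
print(inputs)   # e.g. ['the', 'quality', '[MASK]', 'data', ...]
print(labels)   # ground truth for the masked positions, created from unlabeled text
```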
15:11
And then when it comes to fine tuning and
15:13
just training anything task specific, that's where
15:15
a lot of the things like human
15:17
labeling and expert human judgment and data
15:19
generated from that kind of comes into
15:21
the picture. WorkOS
15:30
is a modern identity platform built for
15:32
B2B SaaS. It provides seamless
15:35
APIs for authentication, user identity, and
15:37
complex enterprise features like SSO and
15:39
skin provisioning. It's a drop in
15:41
replacement for Auth0 and supports up
15:44
to 1 million monthly active users
15:46
for free. It's perfect for B2B
15:48
SaaS companies frustrated with high costs,
15:50
opaque pricing, and lack of enterprise
15:52
capabilities supported by legacy auth vendors.
15:54
The APIs are flexible and easy
15:57
to use designed to provide an
15:59
effortless experience, from your first
16:01
user all the way to your largest
16:03
enterprise customer. Today, hundreds of high-growth scale-ups
16:05
are already powered by WorkOS, including
16:08
ones you probably know like
16:10
Vercel, Webflow, and Loom. Check
16:12
out workos.com/SED to learn more.
16:21
So a lot of times labeling sort
16:24
of falls into like basically
16:26
giving a categorization for something. So like
16:28
if you take the autonomous vehicle example,
16:30
like maybe I have I
16:32
don't know, footage of an accident and it got
16:34
labeled as an accident and essentially use
16:37
that as a way to train the autonomous
16:39
vehicle to maybe avoid accidents or those types
16:41
of situations. So maybe
16:43
in that example what labeling would mean
16:45
is, okay, for sure, there's labeling
16:48
for specific objects
16:51
and parts of what you know what driving on the
16:53
road would look like. So here's
16:56
the road, here's the kind of pedestrian sidewalk,
16:58
here's the tree, here's other cars, here's the
17:00
truck, etc. There's a lot of
17:02
that kind of labeling and then there's like labeling
17:04
for specific events or scenarios, right? Which is kind
17:06
of very much like the kind of thing that you're
17:08
highlighting. So I want to start
17:10
to talk a little bit about some of the
17:13
stuff that you're doing over at Refuel. So a
17:15
lot of companies are like sitting on like mountains
17:17
of data that they don't really know how
17:19
to use. It's unstructured, maybe it's
17:21
encrypted, stuck in, you know, S3 buckets
17:23
somewhere. And there's data lakes and
17:25
warehouses and so forth, but those take
17:28
a lot of initial work and maintenance to
17:30
actually get up and drive value from them.
17:32
So how do you get a
17:34
mountain of data into a form that's immediately useful
17:37
without engineering and manual work and what
17:39
are some of the things that Refuel is doing
17:41
to assist with that workflow? So
17:44
at a high level, like Refuel is a
17:47
platform to help enterprises, teams label,
17:49
clean, enrich their data at scale with the
17:52
power of LLMs, right? And
17:54
so we can think of working
17:56
with Refuel as a three-step process
17:58
where you point
18:01
us to where your data sits. It could
18:03
be a database, it could be a
18:06
data lake, it could be a set
18:08
of objects sitting in S3. Typically, this
18:10
data is unstructured, it's kind of just
18:12
coming from either some production system that's
18:14
logging the data there, or it's some
18:17
dump of data that you're getting from
18:19
the external source. And this
18:21
is typically the starting point for most teams
18:23
that want to use leverage refuel in some
18:25
way. And the first step there is, define
18:28
the thing that you want to do in
18:30
natural language. It could be something
18:32
as simple as, classify
18:34
the sentiment in this piece of
18:36
text into one of these three
18:38
categories, but it could be arbitrarily
18:40
complex, right? Imagine you have a
18:43
large taxonomy of hundreds of different
18:45
classes, and you want to make
18:47
a determination for, yeah, there's three
18:49
layers of the taxonomy to first
18:51
categorize this input into layer one,
18:54
then depending on what that answer looks
18:56
like, maybe do something conditionally downstream. But
18:58
broadly think of this as very much
19:01
guidelines that you would describe for
19:04
a domain expert, or
19:06
like for a human reviewer, if they were
19:08
going ahead and labeling this data, right? That
19:10
is what it feels like. Just like define the rules.
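For a sense of what such natural-language guidelines can look like, here is a hypothetical example for the sentiment task mentioned earlier; the wording and categories are illustrative, not taken from Refuel's interface:

```python
# Hypothetical labeling guidelines, written the way you might brief a human annotator.
guidelines = """
Task: classify the sentiment of a product review.
Labels: positive, negative, neutral.
- positive: the reviewer is satisfied or recommends the product.
- negative: the reviewer reports a problem or advises against buying.
- neutral: purely factual statements, questions, or mixed feedback with no clear lean.
If the review covers several products, label the sentiment toward the main one.
"""
```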
19:13
How do you define those rules? Broadly, like just
19:15
natural language. Within the product, like
19:17
we have an interface where it looks, feels
19:20
very much like you're writing guidelines for a
19:22
human reviewer, or like for a
19:24
human annotator to write it. In fact, the annotator
19:26
really happens to be an LLM that we've trained,
19:28
that we've customized for these kinds of
19:30
tasks. But that's what it starts
19:33
with, right? Which is, so the assumption is that
19:35
you as a user, as a domain expert have
19:37
a very good idea of like, what do you
19:39
want to do with this data? And so like
19:41
just help us kind of codify some of that
19:44
expertise in the form of a set of
19:46
guidelines. And then we'll take
19:48
these guidelines, we'll take the data
19:50
that you pointed us to, and
19:53
the tooling on our end will start running
19:56
the labeling job and produce a set of
19:58
initial outputs, right? Along with
20:00
this, like we'll do a few things
20:02
like, okay, flag things that are potentially
20:05
low confidence, we'll flag things where
20:07
the input is maybe weird
20:09
or outlier or noisy in some way.
20:12
So essentially like looking at bubbling up things
20:14
that would be good for you to review
20:16
and provide feedback on. And
20:18
once you give us like this initial round of
20:20
feedback, we use it in
20:23
real time to improve the model's predictions
20:25
for the remainder of the data, right?
20:27
So almost think of this active labeling
20:29
type approach where you define a set
20:31
of guidelines, you produce a set of
20:33
results, you give us feedback, you correct
20:35
potentially some of them after your review
20:37
of the low confidence labels, and you
20:39
potentially correct ones where the LLM
20:41
might've made a mistake. We collect all
20:43
that feedback and then label the next
20:45
batch of data. And typically
20:48
like what you've seen is within 30,
20:50
45 minutes of interacting with the system, we
20:53
can get most teams, most use
20:55
cases to a place where the result
20:57
is at parity, potentially better
21:00
than human annotators, right? And in some
21:02
sense, like this is the first part
21:04
of the kind of workflow where teams
21:06
see a lot of value where, traditionally,
21:08
if you're doing this
21:11
with hiring, maintaining training,
21:13
like a team of human reviewers,
21:15
that process of defining the initial
21:17
guidelines, getting them to label it,
21:19
reviewing that work, sharing
21:21
some mistakes and getting, you know, like those
21:23
iterations typically tend to be on the order
21:26
of days to weeks. Whereas, you know, given
21:28
that this is many, many times faster, you can get
21:30
that process done in something like an hour.
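A rough sketch of that active-labeling loop; label_with_llm and get_human_feedback are placeholders for whatever model call and review step sit behind them, so this is the shape of the workflow rather than Refuel's actual implementation:

```python
def active_labeling(records, guidelines, label_with_llm, get_human_feedback,
                    threshold=0.8, batch_size=100):
    """Label data in batches, routing low-confidence items to a human reviewer."""
    examples = []                                    # human-corrected (text, label) pairs
    labeled = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # The LLM labels the batch using the guidelines plus feedback so far.
        results = [label_with_llm(r, guidelines, examples) for r in batch]
        uncertain = [(r, res) for r, res in zip(batch, results)
                     if res["confidence"] < threshold]
        corrections = get_human_feedback(uncertain)  # review only the flagged items
        examples.extend(corrections)                 # corrections become few-shot examples
        labeled.extend(results)
    return labeled, examples
```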
21:33
That's the first part. And then like, depending
21:35
on kind of exactly what the final goal
21:37
is, typically like most teams can do one
21:40
or the other, which is sometimes like
21:42
this task that they've built, they'll
21:45
want to deploy that and just
21:47
start using it online in some
21:49
fashion and start collecting, you know,
21:51
telemetry and usage data. And
21:53
then at some frequency, people want to review
21:55
this data within the platform, again
21:58
provide feedback. And this becomes
22:00
this data engine where data is
22:02
being collected in real time. Some
22:04
frequency you're reviewing it, you're providing
22:06
it feedback. And we're using all
22:08
of this to improve the model's
22:11
output on an ongoing basis. And
22:13
then there's like another set of
22:15
use cases where just deploying this task
22:17
with like a fairly big, what ultimately
22:19
is like a fairly big model, multiple
22:21
billions of parameters that has some implications
22:23
in terms of just what we can
22:25
do there in terms of
22:27
latency, supporting throughput and so on. So
22:30
if that is not something that is
22:32
feasible for many applications, then teams would
22:34
want to distill that, probably all of
22:36
that knowledge into like a much smaller,
22:38
task-specific model. And that's where a
22:40
lot of the kind of fine tuning
22:42
comes in as well. RudderStack
22:56
is the warehouse native customer
22:58
data platform. With RudderStack,
23:00
you can collect data from every source,
23:02
unified in your data warehouse or data
23:04
lake to create a customer 360 and
23:07
deliver it to every team and every tool
23:09
for activation. RudderStack provides
23:11
tools to help you guarantee data quality at
23:13
the source, ensure compliance across
23:16
the data lifecycle and create model
23:18
ready data for AI and ML
23:20
teams. With RudderStack,
23:22
you can spend less time on
23:24
low value work and more
23:26
time driving better business outcomes. Visit
23:29
rudderstack.com/SED to learn more. As
23:42
a user, what is the output of this process
23:45
that I'm getting? And then how do I know
23:47
when I'm done? How do I know essentially
23:50
how good the resulting output is? So
23:52
the output at the end of this process
23:54
is transformed and labeled
23:56
data that's ready to be used, that's
23:58
ready to be fed to some
24:00
downstream application that you have in mind as
24:02
a user. It could be for
24:05
training downstream models. It could
24:07
be for powering a set of product
24:09
features in whatever product that you're building.
24:12
In some cases, it could be a lot
24:14
of our users, customers are data providers in
24:16
some ways, where the data that
24:19
they clean and enrich using
24:21
the Refuel platform is valuable just as
24:23
a product offering for them. How
24:26
do I know it's any good? It's broadly
24:28
like this area of LLM evals, as
24:30
it's called colloquially in the LLM
24:33
ecosystem. I'd say it's a fairly
24:36
active area of discussion, debate,
24:38
development. There's
24:40
a few reasons for that. It's a fairly
24:42
new area, I think we're just beginning to
24:44
learn how
24:49
do we evaluate LLMs. But
24:52
at the core of it, it does have to be,
24:54
at least nowadays, it has to be some
24:57
comparison in some fashion to what
24:59
the expected output is. In
25:02
some cases, humans should be the judge at the end of
25:04
it. And there is
25:06
the question of what data do you evaluate
25:08
it on? What set of metrics do you
25:10
use? Because there is
25:12
a set of broad LLM
25:15
benchmarks that are publicly
25:17
used. And we think
25:19
that those are often not the most
25:21
helpful when it comes to evaluating how
25:24
good is this model and the data
25:26
that it did produce for my specific
25:28
task and for my specific use case.
25:31
There are a bunch of different LLM
25:33
leaderboards, benchmarks that are
25:35
used publicly. And those are good, I think,
25:38
for if you want to get a very
25:40
high level, low granularity view,
25:42
I would say, distinguishing
25:44
the good candidate models from the not
25:46
so good ones, which is
25:48
fine. I think it's a good first filter. But
25:51
then often, either the set
25:53
of metrics that these measure are not
25:55
super well aligned with your specific task
25:57
or they're not discriminative enough. right? Like
25:59
more, like, okay, what does the performance
26:02
difference of like one, one and a
26:04
half percent mean on the specific benchmark,
26:06
like for my use case? So yeah,
26:08
we're big fans of tasks,
26:10
like what we think of as task specific
26:12
evaluations, where depending
26:14
on the kind of task that, that
26:17
you're asking the LLM to do, is it
26:19
a classification task, extraction? Is it a free
26:21
form generation? Is it? Yeah, like, what is
26:23
the expected output would depend on
26:25
that task. And then like, there's a question of like,
26:28
what is the set of things that we should measure
26:30
there? Typically, like, yeah, think of this as something
26:33
that measures quality, and then
26:35
something that measures how faithful
26:37
is the model, like, basically, like, what
26:39
is the likelihood that it's hallucinating? So
26:42
those are like the two kind of important considerations for
26:44
teams that are using Refuel.
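As a minimal example of the kind of task-specific evaluation he describes, one might compare predictions for a classification task against a human-labeled golden set, plus a crude check that the model sticks to the allowed labels (a very rough proxy for faithfulness); real setups are considerably richer:

```python
def evaluate(predictions, golden_labels, allowed_labels):
    """Task-specific eval: label accuracy plus a crude 'stayed on task' check."""
    assert len(predictions) == len(golden_labels)
    correct = sum(p == g for p, g in zip(predictions, golden_labels))
    on_task = sum(p in allowed_labels for p in predictions)  # no invented labels
    n = len(predictions)
    return {"accuracy": correct / n, "valid_label_rate": on_task / n}

preds  = ["positive", "neutral", "banana", "negative"]
golden = ["positive", "negative", "neutral", "negative"]
print(evaluate(preds, golden, {"positive", "negative", "neutral"}))
# {'accuracy': 0.5, 'valid_label_rate': 0.75}
```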
26:46
Yeah. And even hallucinations are sort of
26:48
context dependent, right? Like if you're
26:51
writing a story, maybe a hallucination is fine.
26:53
But if I'm trying to pull Mark
26:55
Twain quotes, then maybe it's not
26:58
okay. And also, like, I mean, I think you're kind
27:00
of getting into this, like, there's a lot of nuance
27:02
in terms of like, how do you actually measure quality?
27:04
Because a lot of it
27:06
is task specific. And some
27:09
of it is probably dependent on like, you
27:11
actually need like human feedback in terms of,
27:13
does this meet the quality bar for the
27:16
thing that I'm trying to accomplish essentially? Yeah,
27:18
absolutely. So why can't I just,
27:20
you know, take my data and
27:23
use something like OpenAI's API
27:25
directly to do something like this? It's
27:28
a great question. To be honest, like,
27:30
a lot of users, customers
27:32
that come to us do start there. And
27:34
I would say it's a very fun place
27:36
to start. To some extent, it's a testament
27:38
to like, just how good these models
27:40
are out of the box, right? And how
27:43
easy they are to use, just literally a
27:45
sign up at an API call away. So
27:47
it's a really great place to really understand
27:49
like, hey, is this LLM within
27:51
the realm of kind of potential solutions candidates for
27:53
like the use case that I have right there,
27:56
if I want to prototype something real
27:58
quick, that's often a really great start.
28:01
I think when you spend some time
28:04
with these systems, typically we're seeing that
28:06
users run into one
28:08
or more of the following challenges, right?
28:10
Which is there's a challenge of
28:13
output quality. Okay, yes,
28:15
open AI, I think probably all of
28:17
these kind of state-of-the-art but closed behind
28:19
API-only kind of models have more or
28:21
less like the same strengths and weaknesses,
28:23
right? They're very good generalists
28:26
at thousands of possible tasks,
28:28
but they aren't specialists at
28:30
that one specific task or like
28:32
a few specific tasks that you care about. And
28:35
so they're great to, you know, go from zero
28:37
to 75, 80, 85% accuracy, but then how do
28:39
you get it to 95, 96, 98% accuracy and
28:46
reliably that you actually need to put this
28:48
in front of your users, right? Or like
28:51
to actually plug it into production. So
28:53
that's one challenge that we've seen. The other
28:55
challenge is scale and throughput. GPT-4,
28:57
Claude 3, especially like some of the
29:00
most powerful Claude 3 models, they
29:02
ultimately like they are multiple
29:04
tens of billions of parameters, even with
29:07
like mixture of experts type architecture and
29:09
so on, where yeah, there's this, I
29:12
mean, no getting around the fact that, you
29:14
know, it costs a certain amount of money
29:16
to, you know, to run them. And it
29:18
has some implications in terms of latency and
29:20
like the scale of throughput that they can
29:22
support. And then like, yeah,
29:25
I mean, one other kind of challenge, or
29:27
consideration, that we've seen is just
29:29
around privacy and security, especially for some domains. So
29:32
yeah, these are like a few kind of
29:34
observation that we've seen in terms of like,
29:36
okay, it's a great start there, but then
29:39
oftentimes it's not enough just to do that.
29:41
And you need like a set of layers on
29:43
top of these kind of core LLM APIs. Yeah,
29:46
so I would think also beyond just
29:49
the sort of tuning of the LLM
29:51
for this specific task, you also have
29:53
like the workflow support. Like, exactly. Yeah,
29:55
like sure, I can go to chat
29:57
GPT and even had it like help
30:00
me write code rather than using a
30:02
coding code pilot, but it's more effort
30:04
through the browser, like it's a less
30:07
integrated experience, and it's not really designed
30:09
for that specific workflow of in this
30:11
example, writing code, or in your example,
30:13
I think the workflow is probably even
30:15
more complicated where there's more tuning
30:18
feedback loop, and ultimately I need
30:20
to produce some sort of asset
30:23
that I could actually use for going and then
30:25
fine tuning my model or doing whatever it is
30:27
I need to do with it. Yeah, exactly, exactly,
30:30
as you're saying, Sean. So we think of
30:32
the core infrastructure in three layers. There's
30:34
the core kind of base LLM, that's
30:37
where its interface is pretty simple,
30:39
but it's very, very powerful at what it
30:42
does, right? It's input prompt, output tokens. Then
30:45
there's the data management layer on
30:47
top of it, right? That is
30:49
actually doing this collection of
30:51
feedback, indexing it, sampling
30:53
from it in real time for things
30:56
like few-shot learning. It's
30:58
doing the job of maintaining this data set that
31:00
is the evaluation data set, and
31:03
a lot of this, there's the integrations into
31:05
a bunch of external stores and so on.
31:07
So there is that layer. And then there's
31:09
the core product of the workflow layer on
31:11
top of it, right? Which is how users
31:14
mostly interact with this, which is that
31:16
is where you define the task, you
31:18
see these, yeah, you can iterate on
31:20
guidelines, you can provide feedback, you can
31:22
understand which predictions changed from one version
31:24
of the prompt to the next one.
31:26
And so a lot of this kind
31:28
of just data tooling that has catered
31:30
and tailored towards the kinds of use
31:32
cases that people want to use Refuel for.
31:36
What can you share about how the
31:39
LLM part of the infrastructure works?
31:41
Like how did you, were you
31:43
basically fine tuning a more foundational
31:45
model to be specifically
31:47
built for data cleaning and labeling? Or like
31:49
how does some of that stuff work? Yeah,
31:52
totally. I'll say there's
31:55
two components there. So
31:57
we have our own LLM. And
32:00
we'll share some details about that. But at the end of
32:02
the day, we do support any of the state
32:04
of the art LLMs that people
32:07
might want to use, explore, try out,
32:09
including OpenAI, Claude,
32:11
Gemini, and so on.
32:14
That said, like, yes, what we've found
32:16
what we've seen is that none of
32:18
these models provide state
32:21
of the art performance when it comes to this
32:23
very specific set of tasks, right? Like around data
32:25
labeling and cleaning, which is why, I mean, we
32:27
have to basically go out and build our own
32:29
model to do this well. And to do this
32:32
also, there's like the quality consideration,
32:34
but there's also the consideration of, you know,
32:36
how do we get it to scale and
32:38
how do we build something that is like
32:40
that can then be customized further for specific
32:42
customers and use cases. So
32:45
yeah, this is ours, we call it Refuel LLM.
32:47
That is something that we built and released
32:49
a few months ago. We're training actually a
32:51
new version of that right now. So we'll
32:53
see, maybe by the time this episode comes
32:55
out, it might already be released or might
32:57
be on the verge of releasing. I don't
32:59
know. But yeah, that's part of building this
33:01
out. Like we don't start model pre-training from
33:03
scratch. We do start with a powerful base
33:06
model. Think of a Llama 2 or a
33:08
mixture of experts type architecture, but
33:10
then we do extensive instruction tuning on top
33:12
of it. So the
33:14
kinds of data sets that we've collected
33:16
amount to something like, in
33:18
the previous iteration, this was, I think, about 2,500
33:21
different tasks, data sets
33:24
that are very much in this kind of domain of
33:27
labeling, but the problem areas are
33:29
quite varied. So there are data sets from
33:31
public internet, from law,
33:33
from finance, from e-commerce, credit cards,
33:35
et cetera. Most of it is
33:38
publicly available. So like, yeah, we just had to go
33:40
out and license that data so that we can use
33:42
it. But yeah, that becomes sort of
33:44
like the raw kind of basis. And then
33:46
of course there's some amount of curation cleaning,
33:49
some amount of labeling that we do internally
33:51
as well to create this data asset that
33:53
we then use to tune like this base
33:55
model and then have it be purpose-built for
33:58
labeling, enrichment, and cleaning type tasks. What's
34:01
your toolchain behind the scenes in
34:03
order to go from creating a
34:06
new version of the model to
34:08
actually pushing it to production? What
34:10
is that, MLOps toolchain? Are
34:13
you using a combination of existing stuff
34:15
or have you had to build some
34:17
stuff to support the actual productionization and
34:19
pushing these models to production and using
34:22
them? Yeah, that's a great question.
34:25
I think the answer is a little
34:27
bit different for two workloads. There's
34:30
a workflow for training and
34:32
building Refuel LLM, which
34:34
is not a daily
34:36
or weekly type activity. It's at
34:38
least with a size and
34:40
scale that we're at, we'll probably do that once
34:42
every few months. Then
34:45
there's the customer-specific fine-tuning workflow that
34:47
is very much something that is
34:50
part of our product where customers
34:52
will, using the data that they
34:54
have collected within the platform and
34:57
label, they'll want to use all of
34:59
that to further customize any of the base
35:01
models that they're using, including the Refuel
35:03
LLM. That is a lot more
35:05
frequent. I think
35:07
the answer is a little bit different for both of these. Probably
35:10
the latter is the one that's probably more relevant
35:12
here. For that, we do
35:14
rely a lot on open source kind
35:17
of tools to do this. Like
35:19
we use Transformers from Hugging Face as
35:21
their base library to be able to
35:23
train these models along
35:26
with Accelerate, DeepSpeed, and
35:28
FSDP. Our training
35:30
infrastructure is, we use a
35:32
combination of training GPU providers, but
35:34
all of our core infrastructure is
35:36
on AWS. If
35:39
GPUs are available there, we'll train there. If
35:41
not, there's a few other different providers that
35:43
we use as well. And
35:45
then in terms of actually serving these models,
35:48
We use TGI, like text
35:50
generation inference engine, which is an
35:52
open source project again by Hugging Face. We
35:55
leverage that quite extensively. And
35:57
then there's a bunch of other tools for things like monitoring
36:00
training runs with tools like Weights & Biases.
36:02
And yeah, the thing that we have
36:04
had to build custom ourselves
36:06
is everything to do with the evaluation
36:09
because yeah, like it's the one
36:11
that is very
36:14
task specific. And yeah, I mean, there really
36:16
isn't something that, at least we've found, in
36:18
the current ecosystem that you can plug in
36:20
as it is directly. And that said, like
36:22
it is one of the more important problems
36:24
to solve for on behalf of our customers.
36:27
So yeah, that's something that we have to
36:29
do a little bit. So
36:31
the customer specific, you know, modeling that
36:33
you're doing, is there some like, you
36:36
know, versioning of those models as well? Yes,
36:39
absolutely. So any specific application
36:41
for which your customer might want to
36:44
fine tune a model, think of
36:46
like a lineage of models that
36:48
we maintain. And these are snapshots
36:50
and they're yeah, like with the
36:52
snapshot, there's some understanding of exactly
36:54
what data went into training these,
36:57
what are the kind of performance and
36:59
And, as a user, you can decide
37:01
to, yeah, roll back, delete, switch
37:03
to older snapshots, et cetera. Okay.
37:05
And then sort of like outside of
37:07
exactly what you're doing, but just based
37:09
on your experience and maybe some of
37:11
the customers that you're working with, like
37:14
what are some of the trends you're seeing
37:16
in terms of how companies are building with
37:18
LLMs? Like are most companies starting to
37:20
use multiple models, private models, are they
37:22
sticking primarily with like public models, like a
37:24
GPT? Yeah, that's a great
37:26
question. I'll say
37:29
like, we'll probably see two broad categories
37:31
of use cases emerge that are
37:33
powered by LLMs. There's the generative
37:36
and maybe I'm not sure if this is probably
37:38
there's better terminology for this, but I think of
37:40
it as the generative use cases and the predictive
37:43
use cases. Generative use cases are,
37:45
you know, the classic kind of
37:47
co-pilot for X kinds of use case, right,
37:49
where there is typically a human in the
37:51
loop. These are ultimately meant
37:53
for human consumption and you know, it's for
37:55
augmenting knowledge work, right? It's one way to
37:58
put it. Think of, you know. Coding
38:00
copilots, think of writing copilots, things
38:03
where it's traditionally the
38:05
domain of knowledge work, and we're
38:07
supercharging this by having this very
38:10
powerful assistant. The predictive
38:12
use cases are almost, in
38:14
some sense, these are problems that are
38:16
a lot more closer and relevant to
38:18
Refuel, but these are typically completely
38:20
automated. And almost like
38:23
these used to be done
38:25
traditionally either with large armies
38:27
of human operators or some
38:29
system of rules, like rule
38:31
engines, or even traditional ML
38:33
models. And we're seeing all of
38:35
that converge to basically just be
38:37
built on this new substrate of LLMs. And
38:40
these, compared to generative use cases,
38:42
These tend to be fairly high volume
38:45
and they're definitely completely automated. They're
38:47
for, yeah, it's fairly infeasible for, if there
38:49
were to be a human in the loop,
38:51
for example, for reviewing every single prediction that
38:53
comes out of the model that is
38:56
powering that kind of use case. So
38:58
this is maybe one cut on kind of what
39:00
we're seeing as a trend and
39:02
how companies are leveraging LLMs. In
39:05
terms of like the how of it, yeah, you
39:08
started quite gradually. I think we're moving to a
39:10
multi-model world where there's
39:13
going to be always some
39:15
powerful kind of frontier models. Think
39:17
of the GPT-4, Cloud 3, Gemini's of
39:19
the world. And then at the same
39:21
time, there is this very rich and
39:24
vibrant ecosystem of open source models that
39:27
are still very good out of the box.
39:29
But really like the value addition, I think
39:31
of, is like that there are a lot
39:33
more customizable and you have a lot more
39:35
control over that, right? So yeah, you'll typically
39:37
start with something that is very powerful and
39:39
then build on top of it for again,
39:41
like leveraging your own enterprise and proprietary
39:43
data. And yeah, of course, that buys you
39:45
things like control and potentially better cost
39:47
and so on. Isn't that a little
39:50
bit though, like running like
39:52
installed software on-prem versus using like
39:54
a managed service on the cloud?
39:57
I think so. So yes, in some
39:59
ways for sure, at least on the axis of control. I think that
40:01
is like a very good way to look at it. I
40:03
do think that is, at least with where
40:05
the models are currently at, I
40:08
do think the kind of customizability piece
40:10
is quite important to get quality right
40:12
in most cases, where maybe
40:14
we don't quite see that
40:16
much need for customizability, let's say, you know, for
40:19
like traditional software, right? Where the only axis kind
40:21
of to think about is like, hey, am I
40:23
okay with like a SaaS hosted somewhere else kind
40:25
of use of the software versus do I need
40:27
it to be on my premises? So
40:30
maybe we're, you know, a few years out,
40:32
like when these models get super super powerful,
40:34
like we might see the need to
40:36
do that less, but at least today things tend to be,
40:38
I mean, most of these models do
40:40
need some form of customizability. Now,
40:43
I'm not saying fine-tuning is the only way to do it, but it
40:45
does tend to be a very powerful way at least today. There's
40:48
of course other options around like what
40:50
is thought of as retrieval augmented generation,
40:52
where we're not modifying
40:55
any model parameters, but rather we're just focusing
40:57
on supplying it the right context that
41:00
it can reason over.
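A bare-bones sketch of that retrieval-augmented pattern; embed and ask_llm stand in for whatever embedding model and LLM are being called, so treat it as the shape of the approach rather than a working recipe:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    # Cosine similarity between the query and every document embedding.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    top = np.argsort(sims)[::-1][:k]
    return [docs[i] for i in top]

def answer(question, docs, doc_vecs, embed, ask_llm):
    context = "\n".join(retrieve(embed(question), doc_vecs, docs))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return ask_llm(prompt)   # no model parameters change; we only supply context
```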
41:02
Yeah, I think that makes a lot of sense. It's sort of like a byproduct of
41:04
the immaturity of the market. Like I guess
41:07
it would be similar to even in
41:09
the sort of managed service versus like,
41:11
I need to get in and tweak
41:13
parameters. Yeah. So in my experience, it's like
41:15
companies that are operating at a certain
41:18
level of scale, like let's say you're
41:20
running a Postgres database, then running that
41:22
on AWS RDS is probably going to
41:24
get you to a certain like level of scale.
41:26
But at some point when you reach beyond that,
41:28
then you have to do something that's a little
41:31
bit more custom where you can actually get in
41:33
like customize it. So you can do like, you
41:35
know, horizontal sharding or maybe even use something like
41:37
Postgres extensions to extend the database for your specific
41:40
use case or whatever. But you don't need to
41:42
do that from day one in sort of the
41:44
database world today, because databases has been around for
41:46
50 years. So they've done a lot of work
41:49
to make it work for most people, but LLMs
41:51
haven't been around that long. They've only been around
41:53
for like less than 10 years. So
41:55
there's a lot more work to be
41:57
done to get them to a place where they
41:59
just kind of work out of the box for
42:02
most people. Yeah, yeah, exactly. That's exactly how I
42:04
would put it. So with
42:06
LLMs, they're basically kind of slurping
42:08
up a lot of human-created content
42:10
for training material. But
42:12
now, LLMs are capable of generating
42:14
like a tremendous amount of content.
42:16
So at some point, the LLM-generated
42:19
content that exists on the internet
42:21
is gonna dwarf the amount of
42:23
human-generated content. So I was
42:25
curious about your thoughts on this. Is this
42:27
feedback loop where AI is trained mostly on AI-generated
42:29
content going to be a problem at some
42:31
point? Yeah, it's such a great
42:33
question. So, okay, maybe let me answer the
42:37
more kind of, a slightly more constrained version
42:39
of that question, which is, is
42:41
there value in LLM-generated or
42:43
like synthetic data broadly to improve
42:45
model performance? And
42:47
yeah, this is kind of one of
42:49
these very active discussion debate areas within
42:51
the LLM community, I would say, but
42:53
we've definitely seen good signs of this.
42:56
And maybe to share a couple of examples,
42:58
like, so there's this paper that came out,
43:00
I believe last year or the year before,
43:02
called Textbooks Are All You Need. And
43:05
then it's kind of, you know, play on attention
43:07
is all you need kind of paper from
43:09
a few years back. But basically it made
43:11
the case that like, high quality training data
43:13
is important, and if you prompt it correctly,
43:15
you can get LLMs to generate this high
43:17
quality data. And of course it needs some
43:19
amount of curation and post-processing downstream
43:21
of it, things like removal of
43:24
duplicates, removing things that are very likely
43:26
to be hallucinations. So there's of course
43:28
like some amount of expert kind
43:31
of curation involved downstream, but this problem
43:33
of kind of, like is there some
43:35
value in generated synthetic data? I
43:37
think it's like, I would say like the answer
43:39
to that is more like yes. I
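A sketch of that generate-then-curate pattern; generate_with_llm is a placeholder for whichever model is being prompted, and the filters are deliberately simplistic stand-ins for the expert curation he mentions:

```python
def synthesize_examples(topics, generate_with_llm, per_topic=50):
    """Generate candidate training examples with an LLM, then curate them."""
    candidates = []
    for topic in topics:
        prompt = f"Write {per_topic} short, factual Q&A pairs about {topic}."
        candidates.extend(generate_with_llm(prompt))

    seen, curated = set(), []
    for example in candidates:
        key = example["question"].strip().lower()
        if key in seen:                          # drop duplicates
            continue
        if len(example["answer"].split()) < 3:   # crude quality filter
            continue
        seen.add(key)
        curated.append(example)
    return curated                               # still needs expert review downstream
```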
43:41
And yeah, there have been subsequently a
43:43
few other papers kind of, you know, that just
43:46
research efforts that broadly kind of point to this
43:48
direction, like the Self-Instruct
43:50
paper, where the takeaway
43:52
was very similar,
43:55
and like, yeah, a few other things as well. Even if
43:57
that's the case, it's a bit hard to know exactly what
43:59
is the impact of. this at scale, which
44:01
is kind of where the original question was, like,
44:03
which is, okay, let's, you know, let's play this
44:05
outside 10 years in the future, or maybe in
44:07
two years in the future, where, you know, just
44:10
creating content, like, used
44:12
to be a lot more friction, a lot
44:14
more effort, and needed a lot more kind
44:16
of just creativity and like human hours. And
44:18
yeah, now that's kind of, you know, that
44:21
can be multiplied by a factor of 10,
44:23
100,000. And what does
44:25
that do? It's, I
44:27
wish I knew the answer to that question affirmatively.
44:29
I'm not sure. I think
44:31
what's more likely is that just the ways
44:33
in which we collect and curate
44:36
this data, prepare this for subsequent
44:38
kind of, you know, future trainings will need
44:41
to evolve and adapt to account for this.
44:43
Because yeah, like the distribution of data, the
44:45
kind of properties that it has, the kind
44:47
of strengths and weaknesses that it has, like,
44:49
is going to be different compared to human
44:51
generated data. And like, we do see that,
44:54
I mean, you know, probably, yeah, it was
44:56
like this fun study I recently came across
44:58
that kind of tried to look at, like
45:00
that tried to study what percentage of peer
45:02
reviews tend to be like, chat GPD generated
45:05
at these kind of academic conferences. And, you
45:07
know, it was like, quite evident, like,
45:09
by just the distribution of tokens, for example,
45:11
in the peer reviews, that, okay, there's this
45:13
massive spike since like last year or so.
45:15
But yeah, I mean, probably, I think what
45:17
this will mean in practical terms, at least
45:19
that's my kind of assumption is that we'll
45:21
just evolve how we collect and
45:24
kind of parse curate the data
45:26
so that ultimately it is, it does still
45:28
end up being useful for a lot of
45:30
training. Yeah, I think like at
45:32
the moment, anyway, like based on my own experience,
45:34
like the thing that you're talking about with the
45:37
peer reviews, like, if you're like a heavy chat
45:39
GPT user, I think you can see certain patterns
45:41
in there. Yeah, exactly. The structure of
45:43
a paragraph, of a sentence, certain words come
45:45
up more frequently than probably like a human would
45:47
write. So there are signals at
45:49
least today, and who knows, like in a
45:52
few years from now, the models get better,
45:54
will the variance level in terms of the
45:56
output get better. But today, there's definitely patterns
45:58
that are recognizable as like an
46:00
LLM generated piece of content. Yeah,
46:02
yeah, absolutely. So as we start to
46:05
wrap up, what's next for Refuel and
46:07
is there anything else you'd like to share? Yeah, absolutely.
46:09
I mean, first of all team, we're eight people now.
46:11
And I mean, since the start of the year, we've
46:14
had something like a close to
46:16
like a thousand X growth in terms of like just
46:18
the volume of data that we're processing on
46:20
a monthly basis. So yeah, just like
46:23
a lot of the team's efforts
46:25
today, at least on product and infrastructure
46:27
side are focused on scaling stability and
46:29
just ensuring that
46:31
we can manage some of
46:33
this growth and our users
46:36
and customers don't have to kind of bear the
46:38
brunt of it, which occasionally happens. And sorry folks
46:40
for that, but yeah, we're doing the best we
46:42
can. So I think that's part of
46:45
it. And we see that kind of just being
46:47
an important area where the team invest in for
46:49
like the next three to six months. So
46:51
there's a lot of research happening
46:53
in the field of improving LLM
46:55
output quality, reliability, and kind of
46:57
training efficiency. Things like low rank
46:59
adapters, for example, that make training
47:01
a lot more parameter efficient. There's
47:04
things around like reduced precision inference
47:07
that basically where you can very
47:09
aggressively quantize like the model weight
47:11
and still get to a good
47:14
output.
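To make the low-rank adapter idea concrete, a toy forward pass: the large weight matrix W stays frozen and only the two small matrices A and B would be trained, which is where the parameter savings come from. This is a sketch of the idea, not any particular library's implementation:

```python
import numpy as np

d, r = 1024, 8                      # model dimension vs. adapter rank (r << d)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))         # frozen pretrained weight: d*d = ~1M parameters
A = rng.normal(size=(r, d)) * 0.01  # trainable low-rank factors:
B = np.zeros((d, r))                # 2*d*r = ~16K parameters, ~64x fewer

def lora_forward(x):
    # Output = frozen path + low-rank update path (B @ A acts like a small delta on W).
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
print(lora_forward(x).shape)        # (1024,)
```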
47:16
So yeah, what we can learn and incorporate
47:18
into our product and infrastructure is one
47:20
of the kind of ongoing area of
47:23
investment for us. And yeah,
47:25
then there is kind of training future
47:27
and better versions of our own LLM
47:29
that really powers a lot of our
47:31
product use cases. These are
47:33
like a few areas where I foresee the
47:36
team investing in the product and infrastructure. On
47:38
the infrastructure side, are there unique
47:40
like scaling challenges or maybe scale
47:43
challenges that get introduced earlier due
47:45
to the nature of doing work
47:47
with these AI models? Like
47:49
earlier compared to like building. Yeah, if you were doing,
47:51
you know, pretty much anything else like a B2B, like
47:54
the standard non-AI based like
47:57
application, or is it sort of just business
47:59
as usual, like, we need to scale our
48:01
infrastructure. We're going to need more servers. We need
48:03
to run in more regions or something like that.
48:06
Yeah, I mean, okay. So I would say it
48:08
is ultimately like the same kinds of challenges,
48:10
which is like, yeah, just around resources and
48:13
managing resources to like match with throughput and
48:15
so on. But I think
48:17
just given how kind of how
48:19
young the kind of, you know, probably the ecosystem
48:21
is, it does tend to be
48:23
harder. And you do have to face it
48:26
a lot earlier. Because, like,
48:28
for example, like cloud providers have fairly good,
48:30
I think managed offerings for a lot of
48:32
different software and infrastructure things.
48:34
Like a database, like Kafka. You need a
48:39
queue solution, okay, you just use
48:41
Kinesis. You need a massively
48:43
scaling key value store, you just use Dynamo. Of
48:46
course, like if you're actually innovating as
48:48
a business in those areas, of course, it would make sense to
48:50
like not use those, at least with
48:52
like when it comes to supporting LLM training
48:54
and inference at scale, like unfortunately, there just
48:57
isn't too many that are good out
48:59
of the box solutions. There's tools that
49:01
we can rely on leverage, which we
49:03
do, but yeah, beyond that, it's yeah,
49:06
like things like, okay, how do
49:08
you even benchmark LLM throughput, right? It's like,
49:10
it's not a very trivial question, I would
49:12
say, because there is kind of, yes, like
49:15
there's things at the level of requests, but
49:17
then you have to account for like, okay,
49:19
what are each of these types of requests,
49:21
like in, you know, in terms of input
49:23
and output tokens. Input tokens are much
49:26
easier to scale. Output tokens are very
49:28
hard, you know, almost like the latency
49:30
increases linearly as a function of like the
49:32
length of the output.
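A back-of-the-envelope way to see that linear relationship; the per-token costs below are made-up placeholders rather than measurements of any particular model:

```python
def estimated_latency_ms(input_tokens, output_tokens,
                         prefill_ms_per_token=0.2, decode_ms_per_token=20.0):
    # Input tokens are processed in parallel (cheap prefill); output tokens are
    # generated one at a time, so latency grows linearly with output length.
    return input_tokens * prefill_ms_per_token + output_tokens * decode_ms_per_token

print(estimated_latency_ms(1000, 10))   # ~400 ms
print(estimated_latency_ms(1000, 500))  # ~10200 ms: output length dominates
```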
49:35
And then, yeah, anyway, that's an example of, okay, these
49:37
are some of the best practices for how to
49:39
do this and how to do this well are
49:42
still being figured out and written. And so,
49:44
that said, it's also, you know,
49:47
part of the fun, I guess. Yeah, absolutely.
49:49
Well, Nihit, thanks so much for
49:51
being here. This was a really interesting conversation
49:54
and I'm excited to see what Refuel continues
49:56
to develop and come out with. Certainly, yeah.
49:58
Thanks so much, Sean. Great chatting with you
50:00
and see you around. Yes, cheers. All right,
50:02
thank you.