Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements may have changed.
0:00
Machine learning models learn patterns and
0:02
relationships from data to make predictions
0:04
or decisions. The quality of
0:06
the data influences how well these models
0:08
can represent and generalize from the data.
0:11
Nihit Desai is the cofounder
0:13
and CTO at Refuel AI.
0:16
The company is using LLMs for
0:18
tasks such as data labeling, cleaning,
0:20
and enrichment. He joins the show
0:22
to talk about the platform and how to manage
0:25
data in the current AI era. This
0:27
episode of Software Engineering Daily is
0:29
hosted by Sean Falconer. Check
0:32
the show notes for more information on Sean's work
0:34
and where to find him.
0:46
Welcome to the show, Nihit.
0:48
Yeah, thank you so
0:51
much for being here. I'm really excited to talk
0:53
about Refuel and some of the cool things that
0:55
you guys are doing over there. And, you know,
0:57
as I was sort of preparing for us
0:59
having this conversation, I think
1:01
generally people can understand now that we're
1:03
entering this AI revolution, like
1:06
everyone was talking about AI, generative AI, Gen
1:08
AI, and LLMs over the last year
1:10
and a half and calling this the AI
1:12
era. But really, there's no AI
1:14
without data, and particularly high quality data. And
1:16
as you've stated, and I probably stole
1:18
this from someone, but data's sort of the
1:21
language of AI. In
1:23
a lot of ways the less sexy headline is
1:25
that we're entering the massive quantity of quality AI
1:27
data era, which probably sounds less
1:30
exciting than a headline that just wants clicks, but it's kind of
1:32
the reality. So can you talk a little
1:34
bit about why data for AI is so
1:36
important, and some of the challenges with
1:38
accessing quality data?
1:40
So in some sense, data quality is everything. It's
1:42
the source of knowledge and behavior that
1:45
the model will learn from, that any
1:47
AI system will learn from. And it
1:50
bounds the performance of it. Any
1:53
AI system is limited by how good
1:55
the data is and how much of it
1:57
is there and how representative the data
1:59
is of the final application, the
2:01
final use cases, that the system
2:03
will be powering. There's a few
2:05
challenges when it comes to acquiring good
2:08
quality data for AI systems today.
2:11
And roughly, like, maybe I can
2:13
walk through what the challenges are
2:15
at each step of the pipeline. I think
2:17
right at the top of that is collection
2:19
or acquisition of data. There's a
2:21
wide range of sources that data
2:24
can come from. There's the public web data, where
2:26
there's challenges around the scale of it,
2:28
the freshness, assessing the reliability
2:30
of data sources. There's user
2:32
data that's publicly available but of
2:34
course might be gated. There's
2:36
some challenges around, I think, like
2:38
privacy policies on various platforms,
2:41
platforms like Twitter and
2:43
YouTube, et cetera. There's a lot
2:45
of creative work, art, other kinds
2:47
of data like images, music,
2:49
books, copyright, et cetera. But even when
2:52
your project is collecting and acquiring all
2:54
of this data, it's expensive, right? It's
2:56
expensive to acquire it, to store it,
2:58
to keep it fresh. And this is literally
3:00
just the first step of the process.
3:03
We haven't done anything meaningful with this
3:05
data yet. And then the second step
3:08
is cleaning and curation. Here
3:10
there's a second set of challenges and questions around,
3:13
okay, how do we ensure that this
3:15
data is representative in some sense? Across
3:17
geographies, across cultural nuances, across
3:20
languages, and so on. A lot of the
3:22
questions here tend to be focused on,
3:24
like, how do we ensure that the data that
3:27
we're feeding to these AI systems is
3:29
representative of the audience, or like
3:31
the user base, and the kind
3:34
of applications people will be using this
3:36
for in the future. And then, like,
3:38
data curation is important for
3:40
efficient training. A lot of
3:42
the data sets that are big in
3:44
the real world tend to be not
3:46
curated. There's overrepresentation and under representation of
3:49
various slices of it, and it's really
3:51
important to do things like
3:53
deduplicate and normalize it, so as to
3:55
make sure that your model isn't wasting time,
3:57
in some sense, looking at and learning
3:59
from data that's quite redundant.
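As a rough illustration of that deduplication and normalization step, here is a minimal sketch, assuming a simple exact/near-exact notion of duplicates (real curation pipelines also do fuzzy and semantic dedup):

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace so trivially
    # different copies of the same record hash to the same key.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(records: list[str]) -> list[str]:
    seen, unique = set(), []
    for record in records:
        key = hashlib.sha256(normalize(record).encode("utf-8")).hexdigest()
        if key not in seen:            # keep only the first copy of each record
            seen.add(key)
            unique.append(record)
    return unique

docs = ["The cat sat on the mat.", "the cat  sat on the mat", "A different sentence."]
print(deduplicate(docs))               # the near-identical second copy is dropped
```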
4:02
And then there's the last kind of challenge of
4:04
just enrichment and labeling, right? Like you've collected all
4:06
of your data, you've cleaned it, normalized it, curated
4:09
it. And now there's the question of,
4:11
okay, how do we label this data so
4:13
that the model can actually
4:16
learn something meaningful from it? Traditionally, like
4:18
all the data labeling has been kind
4:20
of very, very human, operationally intensive manual
4:22
labeling. It's kind of both time consuming,
4:24
it's prone to errors, it's prone to
4:26
biases. And there's like the question
4:28
of how do we ensure that these human
4:30
preferences again are representative of the entire user
4:32
base. These are like some of
4:35
the kind of challenges I would highlight. Yeah,
4:37
so there's a lot of impact there. So just going
4:39
back to like data
4:41
quality, would you see that the quality of
4:43
the data is also one of the things
4:45
that's like sort of like a competitive edge?
4:48
If we're thinking about like LLMs and particular,
4:50
like there's all these, you know, there's a
4:52
ton of models available and like is really
4:54
the sort of separator in some
4:56
sense between, you know, I don't know, a
4:58
Llama 2 and a Mistral, the quality of
5:00
the inputs, because this is kind of like,
5:03
you know, garbage in garbage out. You
5:05
describe it very well. In some sense, like we
5:08
can think of the two
5:10
axes for improving performance
5:12
of any AI system are the data
5:15
axis and the model axis. And in
5:17
the limit, broadly, you know,
5:19
the kinds of model architectures,
5:22
the training schemes, just broadly,
5:24
how do we get these models to learn all
5:26
of that, we see a convergence of
5:28
right, everything from GPT-3.5
5:30
to Claude to Llama to Mistral,
5:32
all of them are probably the
5:35
same architectures. And it's the
5:37
same architecture. It's based on this paper that
5:39
came out in 2017 2018,
5:41
I believe from Google that introduced the
5:43
transformer architecture. Right.
5:46
Yeah, exactly. Yeah, attention is all you need.
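For reference, the core operation that paper introduced, scaled dot-product attention, is small enough to sketch directly. This is a toy NumPy version that leaves out multi-head projections and masking:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d) arrays of query, key, and value vectors.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # pairwise similarity of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # weighted mix of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```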
5:49
And so, really, the biggest axis
5:51
that we I mean, we've seen kind of in
5:54
our user customer base that people and probably what
5:56
we see at large in the ecosystem is that
5:58
the data axis, like, how do you
6:00
acquire, collect, clean
6:02
data at scale, and then
6:04
that becomes like the differentiator
6:06
over time, because that's what
6:08
leads to differentiated performance and behavior
6:11
in the model. So this
6:13
seems like there's a lot of, I
6:16
guess like problems with actually getting the data to
6:18
a state where you would want to train on, because
6:20
part of it's like, okay, first how do you
6:22
get access to the data? But then
6:24
there's also you need to navigate potentially things
6:26
like, especially if you're scraping data from the
6:28
web, copyright, I know
6:30
for example, there's been issues
6:33
around like pulling source code
6:35
from GitHub and what open source licenses it
6:37
on, and how does that impact
6:39
code that gets generated? Is it under the
6:41
same license as the code that inspired it? Or
6:43
if I pull information from a book that
6:45
was scraped, is that, you know, copyright infringement
6:48
and stuff. And then there's ethical issues on
6:50
top of that. And then there's also all
6:52
the labeling data. So what are some of
6:54
the things that essentially companies are doing today
6:56
to try to like navigate this, the data
6:58
collection and cleaning and labeling
7:00
process? Like, how is that done essentially?
7:03
Are they using tools, or is it
7:05
mostly like a manual process in
7:07
some sense? Yeah, it's a good question. So I
7:09
would say there's a few categories of problems that
7:11
we just highlighted here. Everything to do
7:14
with collection of
7:16
vast quantities of data, some of which is
7:19
copyrighted, some of which has licensing issues, some of
7:21
it like has kind of questions associated with them
7:23
as well. I think a lot of this is
7:26
in the domain of what we think
7:28
of as language model pre-training. We're
7:31
starting completely from scratch in
7:34
the sort of model parameters or model weights.
7:37
And we're feeding it just trillions
7:39
of tokens of these kind of,
7:41
in some sense, human generated data, because
7:44
that is ultimately kind of what the sum total
7:46
of the internet represents. And
7:48
we're just training the model
7:50
to learn this representation. It's
7:52
completely task-agnostic, it's use-case-agnostic. We're
7:54
just getting it to learn our
7:57
language and hence the language model. And
8:00
of course, I mean, I use the term
8:02
language model a little bit loosely. Yeah,
8:05
it is meant to cover other modalities of data
8:07
as well. Audio, video, text,
8:09
images, all of them broadly work the same
8:11
way as the transformer side. So
8:13
I think there's a specific set of,
8:15
well, there aren't that many companies doing
8:18
large scale model pre-training yet. So I think
8:20
those challenges do tend to be focused and
8:22
limited to the OpenAIs, the Anthropics,
8:25
the Metas of the world. And I
8:28
believe, yeah, that it's of course, like an
8:30
active ongoing, I think area of discussion and
8:32
debate around what's okay, what's not okay, how
8:34
should artists, writers, et
8:37
cetera, whose work is being used directly
8:39
or indirectly, how should they be compensated, and
8:42
should they be asked for permission, et cetera. And
8:44
then there's like a set of problems that's
8:46
a little bit downstream of that, which where
8:49
the prevalence of that problem is much
8:51
more widespread. It's pretty much every organization,
8:53
every team that wants to use AI
8:55
systems in some way, which is around
8:58
how do we, once we
9:00
take this pre-trained model off the shelf, chances
9:02
are it's going to be great
9:04
for prototyping, but then it's not going to
9:06
be good enough to take into production as
9:08
it is, right? And it typically needs some
9:10
tweaking, some customization, either in the form
9:12
of what we call in
9:14
context, learning, fine tuning, potentially some
9:16
reward modeling on top of it,
9:18
where the type of data that
9:20
we need to label and collect
9:22
is much more use case specific.
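As a concrete, hypothetical illustration of what "use case specific" labeled data often looks like, here is a single instruction-tuning style record; the field names and labels are made up for the example rather than any provider's required schema:

```python
# One hypothetical fine-tuning example: the instruction encodes the task,
# the input is the raw text, and the output is the label we want the model
# to reproduce. Thousands of such pairs make up a task-specific dataset.
example = {
    "instruction": "Classify the support ticket into one of: billing, bug, feature_request.",
    "input": "I was charged twice for my subscription this month.",
    "output": "billing",
}
```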
9:25
And there are of course a set of challenges that are a little
9:27
bit different in terms of how do you
9:29
label this data? How do you ensure that the
9:31
evaluation that we're doing is good and relevant to
9:33
your task? So maybe those are like
9:35
the two sets of things there. I'm a lot
9:37
more familiar with the latter, so I'm happy to dig into that.
9:41
Yeah. I mean, I think you raise the point
9:43
there in terms of the thing with the foundation
9:45
model is probably going to be only
9:47
like an activity that's taken on by very
9:50
specialized companies that are able to do that
9:52
at scale, have the means to do it.
9:54
Kind of like public cloud, like not everybody's
9:56
building a public cloud. It's like four companies that are doing
9:58
that, right? So eventually, there'll probably
10:00
be even more convergence from the foundation models
10:03
where maybe there's only going to be, you
10:05
know, sort of four or five companies that
10:07
are really doing that, that have the means
10:09
to keep that going and continually update them
10:12
and do that at scale and purchase the GPUs
10:14
and all that sort of stuff. So, but
10:17
a lot of companies are going to be able to
10:19
benefit from those and will be starting with that as a base. And then
10:21
they're sort of, you know, modifying them through
10:23
fine tuning or other means to
10:25
build more domain specific things that solve
10:28
like application problems that are
10:31
for their companies. As
10:38
a listener of software engineering daily, you
10:40
understand the impact of generative AI. On
10:43
the podcast, we've covered many exciting aspects
10:46
of Gen AI technologies, as well as
10:48
the new vulnerabilities and risks they bring.
10:51
HackerOne's AI Red teaming addresses the
10:53
novel challenges of AI safety and
10:55
security for businesses launching new AI
10:58
deployments. Their approach involves stress
11:00
testing AI models and deployments to make
11:02
sure they can't be tricked into providing
11:04
information beyond their intended use and that
11:07
security flaws can't be exploited to access
11:09
confidential data or systems. Within
11:11
the HackerOne community, over 750
11:14
active hackers specialize in prompt hacking and
11:16
other AI security and safety testing. In
11:19
a single recent engagement, a team of
11:21
18 HackerOne hackers quickly
11:23
identified 26 valid findings
11:25
within the initial 24 hours
11:27
and accumulated over 100 valid findings
11:30
in the two week
11:32
engagement. HackerOne offers strategic
11:34
flexibility, rapid deployment, and
11:36
a hybrid talent strategy.
11:38
Learn more at hackerone.com/AI.
11:40
That's hackerone.com/AI.
11:43
So
11:53
we've been talking a lot about like, you know,
11:55
labeled data and some of these other challenges, but
11:57
like for those that maybe are less, you know,
11:59
familiar with the world
12:01
of AI and how training works. Can you give a
12:03
little bit more of an explanation
12:05
of what is labeled data and why it's
12:07
important for AI? It's
12:10
best to think of these AI
12:12
models or the architectures behind them,
12:14
probably, that these are senior networks,
12:16
as function
12:19
approximation machines. What I mean
12:21
by that is, let's imagine that all
12:23
of these AI systems fundamentally take as
12:26
input some observations about the real
12:28
world, about users, about customers, about
12:31
something. Then, they're trying to make
12:34
some meaningful prediction from that. This
12:37
relationship between what we
12:39
observe and what we want the model to
12:41
predict is, in some sense, it
12:43
can be, think of it conceptually as some
12:46
function. This function could be very
12:48
high dimensional, it could be indeterministic, it
12:50
could be, and sometimes it
12:52
can't even be enumerated in many cases.
12:56
That is the function that we're trying these
12:58
models to get to approximate as best
13:00
as they can. Really,
13:03
the only way to do this is for
13:06
the model, for these AI systems to see
13:09
kinds of labeled data and then have
13:12
some systematic way that they can learn from it,
13:14
which is probably what the training and the optimization
13:16
for these models is about. At
13:18
the core of it, that's why labeled data, vast
13:20
quantities of it and good quality of
13:22
it is important because that is what
13:24
they feed to these AI systems so
13:27
that they can learn from it and
13:30
approximate this functional mapping and generalize
13:32
it to unseen, to unknown use
13:34
cases. Conceptually, that's why
13:36
labeled data is important.
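A toy version of that idea, under the simplifying assumption that the unknown function is just a line: fit it from labeled (observation, target) pairs by gradient descent, standing in for the far higher-dimensional mappings he describes:

```python
import numpy as np

# Labeled data: observations x and the values y we want the model to predict.
# The true relationship (unknown to the model) is y = 3x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3 * x + 1 + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0                      # model parameters, starting from scratch
for _ in range(500):                 # training: nudge parameters toward the labels
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)
    grad_b = 2 * np.mean(pred - y)
    w -= 0.1 * grad_w
    b -= 0.1 * grad_b

print(round(w, 2), round(b, 2))      # approaches 3.0 and 1.0, approximating the mapping
```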
13:39
There's many instantiations of it. If you want
13:42
to train a self-driving car, it has
13:44
to learn from millions of
13:46
hours of humans driving cars and
13:49
seeing what is the right next move, how
13:51
do they anticipate the next action and reactions
13:53
down the line and so on. You
13:56
want to train a really good
13:58
chatbot to reply to customer support tickets.
14:00
Okay. Yeah. It has to learn that
14:02
behavior by seeing tons of it in
14:05
action. Right. We want to, well, I mean, we want
14:07
to build the best search engine in the world. Okay.
14:10
Yeah. Actually that happens because Google
14:12
has just billions and billions of search
14:15
and user action data, right? That it can learn
14:17
from. I mean, what counts
14:19
as a label, I should say is like
14:21
it varies a little bit. And that's what
14:23
differs in the realm of pre-training versus
14:25
fine-tuning. With pre-training, we're starting
14:28
with kind of this very large
14:30
corpus of what is unlabeled
14:32
data and then transforming it so that
14:34
there is still some supervision from it,
14:37
right? So there's like a few different
14:39
ways to do pre-training, but a common
14:41
way to do this is to mask
14:43
out some specific parts
14:45
of that input, right? Could be
14:47
words, characters, tokens, entire sentences, and
14:49
then ask the model
14:52
to predict what is
14:54
the thing that is masked out, right? And
14:56
in some sense, then you have the model's
14:58
predictions and then the ground truth,
15:01
which is kind of what is actually known,
15:03
but just masked out from the model and that
15:05
gives, that is some way of creating labeled
15:07
data, in some sense, for the models to learn
15:09
from, right?
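A minimal sketch of how masking turns raw, unlabeled text into (input, target) pairs; the whitespace tokenizer here is a deliberate simplification of what real pre-training pipelines use:

```python
import random

def make_masked_example(sentence: str, mask_rate: float = 0.25):
    tokens = sentence.split()                      # toy tokenizer: whitespace split
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok                       # the hidden token is the "label"
        else:
            masked.append(tok)
    return masked, targets

random.seed(0)
inputs, labels = make_masked_example("the quality of data bounds model performance")
print(inputs)   # e.g. ['the', 'quality', '[MASK]', 'data', ...]
print(labels)   # ground truth for the masked positions, created from unlabeled text
```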
15:11
And then when it comes to fine tuning and
15:13
just training anything task specific, that's where
15:15
a lot of the things like human
15:17
labeling and expert human judgment and data
15:19
generated from that kind of comes into
15:21
the picture. WorkOS
15:30
is a modern identity platform built for
15:32
B2B SaaS. It provides seamless
15:35
APIs for authentication, user identity, and
15:37
complex enterprise features like SSO and
15:39
skin provisioning. It's a drop in
15:41
replacement for Auth0 and supports up
15:44
to 1 million monthly active users
15:46
for free. It's perfect for B2B
15:48
SaaS companies frustrated with high costs,
15:50
opaque pricing, and lack of enterprise
15:52
capabilities supported by legacy auth vendors.
15:54
The APIs are flexible and easy
15:57
to use designed to provide an
15:59
effortless experience, from your first
16:01
user all the way to your largest
16:03
enterprise customer. Today, hundreds of high-growth scale-ups
16:05
are already powered by WorkOS, including
16:08
ones you probably know like
16:10
Vercel, Webflow, and Loom. Check
16:12
out workos.com/SED to learn more.
16:21
So a lot of times labeling sort
16:24
of falls into like basically
16:26
giving a categorization for something. So like
16:28
if you take the autonomous vehicle example,
16:30
like maybe I have I
16:32
don't know, footage of an accident and it got
16:34
labeled as an accident and essentially use
16:37
that as a way to train the autonomous
16:39
vehicle to maybe avoid accidents or those types
16:41
of situations. So maybe
16:43
in that example what labeling would mean
16:45
is, okay, for sure, there's labeling
16:48
for specific objects
16:51
and parts of what you know what driving on the
16:53
road would look like. So here's
16:56
the road, here's the kind of pedestrian sidewalk,
16:58
here's the tree, here's other cars, here's the
17:00
truck, etc. There's a lot of
17:02
that kind of labeling and then there's like labeling
17:04
for specific events or scenarios, right? Which is kind
17:06
of very much like the kind of thing that you're
17:08
highlighting. So I want to start
17:10
to talk a little bit about some of the
17:13
stuff that you're doing over at Refuel. So a
17:15
lot of companies are like sitting on like mountains
17:17
of data that they don't really know how
17:19
to use. It's unstructured, maybe it's
17:21
encrypted, stuck in, you know, S3 buckets
17:23
somewhere. And there's data lakes and
17:25
warehouses and so forth, but those take
17:28
a lot of initial work and maintenance to
17:30
actually get up and drive value from them.
17:32
So how do you get a
17:34
mountain of data into a form that's immediately useful
17:37
without engineering and manual work and what
17:39
are some of the things that Refuel is doing
17:41
to assist with that workflow? So
17:44
at a high level, like Refuel is a
17:47
platform to help enterprises, teams label,
17:49
clean, enrich their data at scale with the
17:52
power of LLMs, right? And
17:54
so we can think of working
17:56
with Refuel as a three-step process
17:58
where you point
18:01
us to where your data sits. It could
18:03
be a database, it could be a
18:06
data lake, it could be a set
18:08
of objects sitting in S3. Typically, this
18:10
data is unstructured, it's kind of just
18:12
coming from either some production system that's
18:14
logging the data there, or it's some
18:17
dump of data that you're getting from
18:19
the external source. And this
18:21
is typically the starting point for most teams
18:23
that want to use leverage refuel in some
18:25
way. And the first step there is, define
18:28
the thing that you want to do in
18:30
natural language. It could be something
18:32
as simple as, classify
18:34
the sentiment in this piece of
18:36
text into one of these three
18:38
categories, but it could be arbitrarily
18:40
complex, right? Imagine you have a
18:43
large taxonomy of hundreds of different
18:45
classes, and you want to make
18:47
a determination for, yeah, there's three
18:49
layers of the taxonomy to first
18:51
categorize this input into layer one,
18:54
then depending on what that answer looks
18:56
like, maybe do something conditionally downstream. But
18:58
broadly think of this as very much
19:01
guidelines that you would describe for
19:04
a domain expert, or
19:06
like for a human reviewer, if they were
19:08
going ahead and labeling this data, right? That
19:10
is what it feels like. Just like define the rules.
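For a sense of what such natural-language guidelines can look like, here is a hypothetical example for the sentiment task mentioned earlier; the wording and categories are illustrative, not taken from Refuel's interface:

```python
# Hypothetical labeling guidelines, written the way you might brief a human annotator.
guidelines = """
Task: classify the sentiment of a product review.
Labels: positive, negative, neutral.
- positive: the reviewer is satisfied or recommends the product.
- negative: the reviewer reports a problem or advises against buying.
- neutral: purely factual statements, questions, or mixed feedback with no clear lean.
If the review covers several products, label the sentiment toward the main one.
"""
```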
19:13
How do you define those rules? Broadly, like just
19:15
natural language. Within the product, like
19:17
we have an interface where it looks, feels
19:20
very much like you're writing guidelines for a
19:22
human reviewer, or like for a
19:24
human annotator to write it. In fact, the annotator
19:26
really happens to be an LLM that we've trained,
19:28
that we've customized for these kinds of
19:30
tasks. But that's what it starts
19:33
with, right? Which is, so the assumption is that
19:35
you as a user, as a domain expert have
19:37
a very good idea of like, what do you
19:39
want to do with this data? And so like
19:41
just help us kind of codify some of that
19:44
expertise in the form of a set of
19:46
guidelines. And then we'll take
19:48
these guidelines, we'll take the data
19:50
that you pointed us to, and
19:53
the tooling on our end will start running
19:56
the labeling job and produce a set of
19:58
initial outputs, right? Along with
20:00
this, like we'll do a few things
20:02
like, okay, flag things that are potentially
20:05
low confidence, we'll flag things where
20:07
the input is maybe weird
20:09
or outlier or noisy in some way.
20:12
So essentially like looking at bubbling up things
20:14
that would be good for you to review
20:16
and provide feedback on. And
20:18
once you give us like this initial round of
20:20
feedback, we use it in
20:23
real time to improve the model's predictions
20:25
for the remainder of the data, right?
20:27
So almost think of this active labeling
20:29
type approach where you define a set
20:31
of guidelines, you produce a set of
20:33
results, you give us feedback, you correct
20:35
potentially some of them after your review
20:37
of the low confidence labels, and you
20:39
potentially correct ones where the LLM
20:41
might've made a mistake. We collect all
20:43
that feedback and then label the next
20:45
batch of data. And typically
20:48
like what you've seen is within 30,
20:50
45 minutes of interacting with the system, we
20:53
can get most teams, most use
20:55
cases to a place where the result
20:57
is at parity, potentially better
21:00
than human annotators, right? And in some
21:02
sense, like this is the first part
21:04
of the kind of workflow where teams
21:06
see a lot of value where, traditionally,
21:08
if you're doing this
21:11
with hiring, maintaining training,
21:13
like a team of human reviewers,
21:15
that process of defining the initial
21:17
guidelines, getting them to label it,
21:19
reviewing that work, sharing
21:21
some mistakes and getting, you know, like those
21:23
iterations typically tend to be on the order
21:26
of days to weeks. Whereas, you know, given
21:28
that this is many, many times faster, you can get
21:30
that process done in something like an hour.
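A rough sketch of that active-labeling loop; label_with_llm and get_human_feedback are placeholders for whatever model call and review step sit behind them, so this is the shape of the workflow rather than Refuel's actual implementation:

```python
def active_labeling(records, guidelines, label_with_llm, get_human_feedback,
                    threshold=0.8, batch_size=100):
    """Label data in batches, routing low-confidence items to a human reviewer."""
    examples = []                                    # human-corrected (text, label) pairs
    labeled = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # The LLM labels the batch using the guidelines plus feedback so far.
        results = [label_with_llm(r, guidelines, examples) for r in batch]
        uncertain = [(r, res) for r, res in zip(batch, results)
                     if res["confidence"] < threshold]
        corrections = get_human_feedback(uncertain)  # review only the flagged items
        examples.extend(corrections)                 # corrections become few-shot examples
        labeled.extend(results)
    return labeled, examples
```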
21:33
That's the first part. And then like, depending
21:35
on kind of exactly what the final goal
21:37
is, typically like most teams can do one
21:40
or the other, which is sometimes like
21:42
this task that they've built, they'll
21:45
want to deploy that and just
21:47
start using it online in some
21:49
fashion and start collecting, you know,
21:51
telemetry and usage data. And
21:53
then at some frequency, people want to review
21:55
this data within the platform, again
21:58
provide feedback. And this becomes
22:00
this data engine where data is
22:02
being collected in real time. Some
22:04
frequency you're reviewing it, you're providing
22:06
it feedback. And we're using all
22:08
of this to improve the model's
22:11
output on an ongoing basis. And
22:13
then there's like another set of
22:15
use cases where just deploying this task
22:17
with like a fairly big, what ultimately
22:19
is like a fairly big model, multiple
22:21
billions of parameters that has some implications
22:23
in terms of just what we can
22:25
do there in terms of
22:27
latency, supporting throughput and so on. So
22:30
if that is not something that is
22:32
feasible for many applications, then teams would
22:34
want to distill that, probably all of
22:36
that knowledge into like a much smaller,
22:38
task-specific model. And that's where a
22:40
lot of the kind of fine tuning
22:42
comes in as well. RudderStack
22:56
is the warehouse native customer
22:58
data platform. With RudderStack,
23:00
you can collect data from every source,
23:02
unified in your data warehouse or data
23:04
lake to create a customer 360 and
23:07
deliver it to every team and every tool
23:09
for activation. RudderStack provides
23:11
tools to help you guarantee data quality at
23:13
the source, ensure compliance across
23:16
the data lifecycle and create model
23:18
ready data for AI and ML
23:20
teams. With RudderStack,
23:22
you can spend less time on
23:24
low value work and more
23:26
time driving better business outcomes. Visit
23:29
rudderstack.com/SED to learn more. As
23:42
a user, what is the output of this process
23:45
that I'm getting? And then how do I know
23:47
when I'm done? How do I know essentially
23:50
how good the resulting output is? So
23:52
the output at the end of this process
23:54
is transformed and labeled
23:56
data that's ready to be used, that's
23:58
ready to be fed to some
24:00
downstream application that you have in mind as
24:02
a user. It could be for
24:05
training downstream models. It could
24:07
be for powering a set of product
24:09
features in whatever product that you're building.
24:12
In some cases, it could be a lot
24:14
of our users, customers are data providers in
24:16
some ways, where the data that
24:19
they clean and enrich using
24:21
the Refuel platform is valuable just as
24:23
a product offering for them. How
24:26
do I know it's any good? It's broadly
24:28
like this area of LLM evals, as
24:30
it's called colloquially in the LLM
24:33
ecosystem. I'd say it's a fairly
24:36
active area of discussion, debate,
24:38
development. There's
24:40
a few reasons for that. It's a fairly
24:42
new area, I think we're just beginning to
24:44
learn how
24:49
do we evaluate LLMs. But
24:52
at the core of it, it does have to be,
24:54
at least nowadays, it has to be some
24:57
comparison in some fashion to what
24:59
the expected output is. In
25:02
some cases, humans should be the judge at the end of
25:04
it. And there is
25:06
the question of what data do you evaluate
25:08
it on? What set of metrics do you
25:10
use? Because there is
25:12
a set of broad LLM
25:15
benchmarks that are publicly
25:17
used. And we think
25:19
that those are often not the most
25:21
helpful when it comes to evaluating how
25:24
good is this model and the data
25:26
that it did produce for my specific
25:28
task and for my specific use case.
25:31
There are a bunch of different LLM
25:33
leaderboards, benchmarks that are
25:35
used publicly. And those are good, I think,
25:38
for if you want to get a very
25:40
high level, low granularity view,
25:42
I would say, distinguishing
25:44
the good candidate models from the not
25:46
so good ones, which is
25:48
fine. I think it's a good first filter. But
25:51
then often, either the set
25:53
of metrics that these measure are not
25:55
super well aligned with your specific task
25:57
or they're not discriminative enough. right? Like
25:59
more, like, okay, what does the performance
26:02
difference of like one, one and a
26:04
half percent mean on the specific benchmark,
26:06
like for my use case? So yeah,
26:08
we're big fans of tasks,
26:10
like what we think of as task specific
26:12
evaluations, where depending
26:14
on the kind of task that, that
26:17
you're asking the LLM to do, is it
26:19
a classification task, extraction? Is it a free
26:21
form generation? Is it? Yeah, like, what is
26:23
the expected output would depend on
26:25
that task. And then like, there's a question of like,
26:28
what is the set of things that we should measure
26:30
there? Typically, like, yeah, think of this as something
26:33
that measures quality, and then
26:35
something that measures how faithful
26:37
is the model, like, basically, like, what
26:39
is the likelihood that it's hallucinating? So
26:42
those are like the two kind of important considerations for
26:44
teams that are using Refuel.
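As a minimal example of the kind of task-specific evaluation he describes, one might compare predictions for a classification task against a human-labeled golden set, plus a crude check that the model sticks to the allowed labels (a very rough proxy for faithfulness); real setups are considerably richer:

```python
def evaluate(predictions, golden_labels, allowed_labels):
    """Task-specific eval: label accuracy plus a crude 'stayed on task' check."""
    assert len(predictions) == len(golden_labels)
    correct = sum(p == g for p, g in zip(predictions, golden_labels))
    on_task = sum(p in allowed_labels for p in predictions)  # no invented labels
    n = len(predictions)
    return {"accuracy": correct / n, "valid_label_rate": on_task / n}

preds  = ["positive", "neutral", "banana", "negative"]
golden = ["positive", "negative", "neutral", "negative"]
print(evaluate(preds, golden, {"positive", "negative", "neutral"}))
# {'accuracy': 0.5, 'valid_label_rate': 0.75}
```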
26:46
Yeah. And even hallucinations are sort of
26:48
context dependent, right? Like if you're
26:51
writing a story, maybe a hallucination is fine.
26:53
But if I'm trying to pull Mark
26:55
Twain quotes, then maybe it's not
26:58
okay. And also, like, I mean, I think you're kind
27:00
of getting into this, like, there's a lot of nuance
27:02
in terms of like, how do you actually measure quality?
27:04
Because a lot of it
27:06
is task specific. And some
27:09
of it is probably dependent on like, you
27:11
actually need like human feedback in terms of,
27:13
does this meet the quality bar for the
27:16
thing that I'm trying to accomplish essentially? Yeah,
27:18
absolutely. So why can't I just,
27:20
you know, take my data and
27:23
use something like OpenAI's API
27:25
directly to do something like this? It's
27:28
a great question. To be honest, like,
27:30
a lot of users, customers
27:32
that come to us do start there. And
27:34
I would say it's a very fun place
27:36
to start. To some extent, it's a testament
27:38
to like, just how good these models
27:40
are out of the box, right? And how
27:43
easy they are to use, just literally a
27:45
sign up at an API call away. So
27:47
it's a really great place to really understand
27:49
like, hey, is this LLM within
27:51
the realm of kind of potential solutions candidates for
27:53
like the use case that I have right there,
27:56
if I want to prototype something real
27:58
quick, that's often a really great start.
28:01
I think when you spend some time
28:04
with these systems, typically we're seeing that
28:06
users run into one
28:08
or more of the following challenges, right?
28:10
Which is there's a challenge of
28:13
output quality. Okay, yes,
28:15
open AI, I think probably all of
28:17
these kind of state-of-the-art but closed behind
28:19
API-only kind of models have more or
28:21
less like the same strengths and weaknesses,
28:23
right? They're very good generalists
28:26
at thousands of possible tasks,
28:28
but they aren't specialists at
28:30
that one specific task or like
28:32
a few specific tasks that you care about. And
28:35
so they're great to, you know, go from zero
28:37
to 75, 80, 85% accuracy, but then how do
28:39
you get it to 95, 96, 98% accuracy and
28:46
reliably that you actually need to put this
28:48
in front of your users, right? Or like
28:51
to actually plug it into production. So
28:53
that's one challenge that we've seen. The other
28:55
challenge is scale and throughput. GPT-4,
28:57
Claude 3, especially like some of the
29:00
most powerful Claude 3 models, they
29:02
ultimately like they are multiple
29:04
tens of billions of parameters, even with
29:07
like mixture of experts type architecture and
29:09
so on, where yeah, there's this, I
29:12
mean, no getting around the fact that, you
29:14
know, it costs a certain amount of money
29:16
to, you know, to run them. And it
29:18
has some implications in terms of latency and
29:20
like the scale of throughput that they can
29:22
support. And then like, yeah,
29:25
I mean, one other kind of challenge, or
29:27
consideration, that we've seen is just
29:29
around privacy and security, especially for some domains. So
29:32
yeah, these are like a few kind of
29:34
observation that we've seen in terms of like,
29:36
okay, it's a great start there, but then
29:39
oftentimes it's not enough just to do that.
29:41
And you need like a set of layers on
29:43
top of these kind of core LLM APIs. Yeah,
29:46
so I would think also beyond just
29:49
the sort of tuning of the LLM
29:51
for this specific task, you also have
29:53
like the workflow support. Like, exactly. Yeah,
29:55
like sure, I can go to chat
29:57
GPT and even had it like help
30:00
me write code rather than using a
30:02
coding code pilot, but it's more effort
30:04
through the browser, like it's a less
30:07
integrated experience, and it's not really designed
30:09
for that specific workflow of in this
30:11
example, writing code, or in your example,
30:13
I think the workflow is probably even
30:15
more complicated where there's more tuning
30:18
feedback loop, and ultimately I need
30:20
to produce some sort of asset
30:23
that I could actually use for going and then
30:25
fine tuning my model or doing whatever it is
30:27
I need to do with it. Yeah, exactly, exactly,
30:30
as you're saying, Sean. So we think of
30:32
the core infrastructure in three layers. There's
30:34
the core kind of base LLM, that's
30:37
where its interface is pretty simple,
30:39
but it's very, very powerful at what it
30:42
does, right? It's input prompt, output tokens. Then
30:45
there's the data management layer on
30:47
top of it, right? That is
30:49
actually doing this collection of
30:51
feedback, indexing it, sampling
30:53
from it in real time for things
30:56
like few-shot learning. It's
30:58
doing the job of maintaining this data set that
31:00
is the evaluation data set, and
31:03
a lot of this, there's the integrations into
31:05
a bunch of external stores and so on.
31:07
So there is that layer. And then there's
31:09
the core product of the workflow layer on
31:11
top of it, right? Which is how users
31:14
mostly interact with this, which is that
31:16
is where you define the task, you
31:18
see these, yeah, you can iterate on
31:20
guidelines, you can provide feedback, you can
31:22
understand which predictions changed from one version
31:24
of the prompt to the next one.
31:26
And so a lot of this kind
31:28
of just data tooling that has catered
31:30
and tailored towards the kinds of use
31:32
cases that people want to use Refuel for.
31:36
What can you share about how the
31:39
LLM part of the infrastructure works?
31:41
Like how did you, were you
31:43
basically fine tuning a more foundational
31:45
model to be specifically
31:47
built for data cleaning and labeling? Or like
31:49
how does some of that stuff work? Yeah,
31:52
totally. I'll say there's
31:55
two components there. So
31:57
we have our own LLM. And
32:00
we'll share some details about that. But at the end of
32:02
the day, we do support any of the state
32:04
of the art LLMs that people
32:07
might want to use, explore, try out,
32:09
including OpenAI, Claude,
32:11
Gemini, and so on.
32:14
That said, like, yes, what we've found
32:16
what we've seen is that none of
32:18
these models provide state
32:21
of the art performance when it comes to this
32:23
very specific set of tasks, right? Like around data
32:25
labeling and cleaning, which is why, I mean, we
32:27
have to basically go out and build our own
32:29
model to do this well. And to do this
32:32
also, there's like the quality consideration,
32:34
but there's also the consideration of, you know,
32:36
how do we get it to scale and
32:38
how do we build something that is like
32:40
that can then be customized further for specific
32:42
customers and use cases. So
32:45
yeah, this is ours, we call it Refuel LLM.
32:47
That is something that we built and released
32:49
a few months ago. We're training actually a
32:51
new version of that right now. So we'll
32:53
see, maybe by the time this episode comes
32:55
out, it might already be released or might
32:57
be on the verge of releasing. I don't
32:59
know. But yeah, that's part of building this
33:01
out. Like we don't start model pre-training from
33:03
scratch. We do start with a powerful base
33:06
model. Think of a Llama 2 or a
33:08
mixture of experts type architecture, but
33:10
then we do extensive instruction tuning on top
33:12
of it. So the
33:14
kinds of data sets that we've collected
33:16
amount to something like, in
33:18
the previous iteration, this was, I think, about 2,500
33:21
different tasks, data sets
33:24
that are very much in this kind of domain of
33:27
labeling, but the problem areas are
33:29
quite varied. So there are data sets from
33:31
public internet, from law,
33:33
from finance, from e-commerce, credit cards,
33:35
et cetera. Most of it is
33:38
publicly available. So like, yeah, we just had to go
33:40
out and license that data so that we can use
33:42
it. But yeah, that becomes sort of
33:44
like the raw kind of basis. And then
33:46
of course there's some amount of curation cleaning,
33:49
some amount of labeling that we do internally
33:51
as well to create this data asset that
33:53
we then use to tune like this base
33:55
model and then have it be purpose-built for
33:58
labeling, enrichment, and cleaning type tasks. What's
34:01
your toolchain behind the scenes in
34:03
order to go from creating a
34:06
new version of the model to
34:08
actually pushing it to production? What
34:10
is that, MLOps toolchain? Are
34:13
you using a combination of existing stuff
34:15
or have you had to build some
34:17
stuff to support the actual productionization and
34:19
pushing these models to production and using
34:22
them? Yeah, that's a great question.
34:25
I think the answer is a little
34:27
bit different for two workloads. There's
34:30
a workflow for training and
34:32
building Refuel LLM, which
34:34
is not a daily
34:36
or weekly type activity. It's at
34:38
least with a size and
34:40
scale that we're at, we'll probably do that once
34:42
every few months. Then
34:45
there's the customer-specific fine-tuning workflow that
34:47
is very much something that is
34:50
part of our product where customers
34:52
will, using the data that they
34:54
have collected within the platform and
34:57
label, they'll want to use all of
34:59
that to further customize any of the base
35:01
models that they're using, including the Refuel
35:03
LLM. That is a lot more
35:05
frequent. I think
35:07
the answer is a little bit different for both of these. Probably
35:10
the latter is the one that's probably more relevant
35:12
here. For that, we do
35:14
rely a lot on open source kind
35:17
of tools to do this. Like
35:19
we use Transformers from Hugging Face as
35:21
their base library to be able to
35:23
train these models along
35:26
with Accelerate, DeepSpeed, and
35:28
FSDP. Our training
35:30
infrastructure is, we use a
35:32
combination of training GPU providers, but
35:34
all of our core infrastructure is
35:36
on AWS. If
35:39
GPUs are available there, we'll train there. If
35:41
not, there's a few other different providers that
35:43
we use as well. And
35:45
then in terms of actually serving these models,
35:48
We use TGI, like text
35:50
generation inference engine, which is an
35:52
open source project again by Hugging Face. We
35:55
leverage that quite extensively. And
35:57
then there's a bunch of other tools for things like monitoring
36:00
training runs with tools like Weights & Biases.
36:02
And yeah, the thing that we have
36:04
had to build custom ourselves
36:06
is everything to do with the evaluation
36:09
because yeah, like it's the one
36:11
that is very
36:14
task specific. And yeah, I mean, there really
36:16
isn't something that, at least we've found, in
36:18
the current ecosystem that you can plug in
36:20
as it is directly. And that said, like
36:22
it is one of the more important problems
36:24
to solve for on behalf of our customers.
36:27
So yeah, that's something that we have to
36:29
do a little bit. So
36:31
the customer specific, you know, modeling that
36:33
you're doing, is there some like, you
36:36
know, versioning of those models as well? Yes,
36:39
absolutely. So any specific application
36:41
for which your customer might want to
36:44
fine tune a model, think of
36:46
like a lineage of models that
36:48
we maintain. And these are snapshots
36:50
and they're yeah, like with the
36:52
snapshot, there's some understanding of exactly
36:54
what data went into training these,
36:57
what are the kind of performance and
36:59
And, as a user, you can decide
37:01
to, yeah, roll back, delete, switch
37:03
to older snapshots, et cetera. Okay.
37:05
And then sort of like outside of
37:07
exactly what you're doing, but just based
37:09
on your experience and maybe some of
37:11
the customers that you're working with, like
37:14
what are some of the trends you're seeing
37:16
in terms of how companies are building with
37:18
LLMs? Like are most companies starting to
37:20
use multiple models, private models, are they
37:22
sticking primarily with like public models, like a
37:24
GPT? Yeah, that's a great
37:26
question. I'll say
37:29
like, we'll probably see two broad categories
37:31
of use cases emerge that are
37:33
powered by LLMs. There's the generative
37:36
and maybe I'm not sure if this is probably
37:38
there's better terminology for this, but I think of
37:40
it as the generative use cases and the predictive
37:43
use cases. Generative use cases are,
37:45
you know, the classic kind of
37:47
co-pilot for X kinds of use case, right,
37:49
where there is typically a human in the
37:51
loop. These are ultimately meant
37:53
for human consumption and you know, it's for
37:55
augmenting knowledge work, right? It's one way to
37:58
put it. Think of, you know. Coding
38:00
copilots, think of writing copilots, things
38:03
where it's traditionally the
38:05
domain of knowledge work, and we're
38:07
supercharging this by having this very
38:10
powerful assistant. The predictive
38:12
use cases are almost, in
38:14
some sense, these are problems that are
38:16
a lot more closer and relevant to
38:18
Refuel, but these are typically completely
38:20
automated. And almost like
38:23
these used to be done
38:25
traditionally either with large armies
38:27
of human operators or some
38:29
system of rules, like rule
38:31
engines, or even traditional ML
38:33
models. And we're seeing all of
38:35
that converge to basically just be
38:37
built on this new substrate of LLMs. And
38:40
these, compared to generative use cases,
38:42
These tend to be fairly high volume
38:45
and they're definitely completely automated. They're
38:47
for, yeah, it's fairly infeasible for, if there
38:49
were to be a human in the loop,
38:51
for example, for reviewing every single prediction that
38:53
comes out of the model that is
38:56
powering that kind of use case. So
38:58
this is maybe one cut on kind of what
39:00
we're seeing as a trend and
39:02
how companies are leveraging LLMs. In
39:05
terms of like the how of it, yeah, you
39:08
started quite gradually. I think we're moving to a
39:10
multi-model world where there's
39:13
going to be always some
39:15
powerful kind of frontier models. Think
39:17
of the GPT-4, Cloud 3, Gemini's of
39:19
the world. And then at the same
39:21
time, there is this very rich and
39:24
vibrant ecosystem of open source models that
39:27
are still very good out of the box.
39:29
But really like the value addition, I think
39:31
of, is like that there are a lot
39:33
more customizable and you have a lot more
39:35
control over that, right? So yeah, you'll typically
39:37
start with something that is very powerful and
39:39
then build on top of it for again,
39:41
like leveraging your own enterprise and proprietary
39:43
data. And yeah, of course, that buys you
39:45
things like control and potentially better cost
39:47
and so on. Isn't that a little
39:50
bit though, like running like
39:52
installed software on-prem versus using like
39:54
a managed service on the cloud?
39:57
I think so. So yes, in some
39:59
ways for sure, at least on the axis of control. I think that
40:01
is like a very good way to look at it. I
40:03
do think that is, at least with where
40:05
the models are currently at, I
40:08
do think the kind of customizability piece
40:10
is quite important to get quality right
40:12
in most cases, where maybe
40:14
we don't quite see that
40:16
much need for customizability, let's say, you know, for
40:19
like traditional software, right? Where the only axis kind
40:21
of to think about is like, hey, am I
40:23
okay with like a SaaS hosted somewhere else kind
40:25
of use of the software versus do I need
40:27
it to be on my premises? So
40:30
maybe we're, you know, a few years out,
40:32
like when these models get super super powerful,
40:34
like we might see the need to
40:36
do that less, but at least today things tend to be,
40:38
I mean, most of these models do
40:40
need some form of customizability. Now,
40:43
I'm not saying fine-tuning is the only way to do it, but it
40:45
does tend to be a very powerful way at least today. There's
40:48
of course other options around like what
40:50
is thought of as retrieval augmented generation,
40:52
where we're not modifying
40:55
any model parameters, but rather we're just focusing
40:57
on supplying it the right context that
41:00
it can reason over.
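A bare-bones sketch of that retrieval-augmented pattern; embed and ask_llm stand in for whatever embedding model and LLM are being called, so treat it as the shape of the approach rather than a working recipe:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    # Cosine similarity between the query and every document embedding.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    top = np.argsort(sims)[::-1][:k]
    return [docs[i] for i in top]

def answer(question, docs, doc_vecs, embed, ask_llm):
    context = "\n".join(retrieve(embed(question), doc_vecs, docs))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return ask_llm(prompt)   # no model parameters change; we only supply context
```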
41:02
Yeah, I think that makes a lot of sense. It's sort of like a byproduct of
41:04
the immaturity of the market. Like I guess
41:07
it would be similar to even in
41:09
the sort of managed service versus like,
41:11
I need to get in and tweak
41:13
parameters. Yeah. So in my experience, it's like
41:15
companies that are operating at a certain
41:18
level of scale, like let's say you're
41:20
running a Postgres database, then running that
41:22
on AWS RDS is probably going to
41:24
get you to a certain like level of scale.
41:26
But at some point when you reach beyond that,
41:28
then you have to do something that's a little
41:31
bit more custom where you can actually get in
41:33
like customize it. So you can do like, you
41:35
know, horizontal sharding or maybe even use something like
41:37
Postgres extensions to extend the database for your specific
41:40
use case or whatever. But you don't need to
41:42
do that from day one in sort of the
41:44
database world today, because databases has been around for
41:46
50 years. So they've done a lot of work
41:49
to make it work for most people, but LLMs
41:51
haven't been around that long. They've only been around
41:53
for like less than 10 years. So
41:55
there's a lot more work to be
41:57
done to get them to a place where they
41:59
just kind of work out of the box for
42:02
most people. Yeah, yeah, exactly. That's exactly how I
42:04
would put it. So with
42:06
LLMs, they're basically kind of slurping
42:08
up a lot of human-created content
42:10
for training material. But
42:12
now, LLMs are capable of generating
42:14
like a tremendous amount of content.
42:16
So at some point, the LLM-generated
42:19
content that exists on the internet
42:21
is gonna dwarf the amount of
42:23
human-generated content. So I was
42:25
curious about your thoughts on this. Is this
42:27
feedback loop where AI is trained mostly on AI-generated
42:29
content going to be a problem at some
42:31
point? Yeah, it's such a great
42:33
question. So, okay, maybe let me answer the
42:37
more kind of, a slightly more constrained version
42:39
of that question, which is, is
42:41
there value in LLM-generated or
42:43
like synthetic data broadly to improve
42:45
model performance? And
42:47
yeah, this is kind of one of
42:49
these very active discussion debate areas within
42:51
the LLM community, I would say, but
42:53
we've definitely seen good signs of this.
42:56
And maybe to share a couple of examples,
42:58
like, so there's this paper that came out,
43:00
I believe last year or the year before,
43:02
called Textbooks Are All You Need. And
43:05
then it's kind of, you know, play on attention
43:07
is all you need kind of paper from
43:09
a few years back. But basically it made
43:11
the case that like, high quality training data
43:13
is important, and if you prompt it correctly,
43:15
you can get LLMs to generate this high
43:17
quality data. And of course it needs some
43:19
amount of curation and post-processing downstream
43:21
of it, things like removal of
43:24
duplicates, removing things that are very likely
43:26
to be hallucinations. So there's of course
43:28
like some amount of expert kind
43:31
of curation involved downstream, but this problem
43:33
of kind of, like is there some
43:35
value in generated synthetic data? I
43:37
think it's like, I would say like the answer
43:39
to that is more like yes. I
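A sketch of that generate-then-curate pattern; generate_with_llm is a placeholder for whichever model is being prompted, and the filters are deliberately simplistic stand-ins for the expert curation he mentions:

```python
def synthesize_examples(topics, generate_with_llm, per_topic=50):
    """Generate candidate training examples with an LLM, then curate them."""
    candidates = []
    for topic in topics:
        prompt = f"Write {per_topic} short, factual Q&A pairs about {topic}."
        candidates.extend(generate_with_llm(prompt))

    seen, curated = set(), []
    for example in candidates:
        key = example["question"].strip().lower()
        if key in seen:                          # drop duplicates
            continue
        if len(example["answer"].split()) < 3:   # crude quality filter
            continue
        seen.add(key)
        curated.append(example)
    return curated                               # still needs expert review downstream
```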
43:41
And yeah, there have been subsequently a
43:43
few other papers kind of, you know, that just
43:46
research efforts that broadly kind of point to this
43:48
direction, like the Self-Instruct
43:50
paper, where the takeaway
43:52
was very similar,
43:55
and like, yeah, a few other things as well. Even if
43:57
that's the case, it's a bit hard to know exactly what
43:59
is the impact of. this at scale, which
44:01
is kind of where the original question was, like,
44:03
which is, okay, let's, you know, let's play this
44:05
outside 10 years in the future, or maybe in
44:07
two years in the future, where, you know, just
44:10
creating content, like, used
44:12
to be a lot more friction, a lot
44:14
more effort, and needed a lot more kind
44:16
of just creativity and like human hours. And
44:18
yeah, now that's kind of, you know, that
44:21
can be multiplied by a factor of 10,
44:23
100,000. And what does
44:25
that do? It's, I
44:27
wish I knew the answer to that question affirmatively.
44:29
I'm not sure. I think
44:31
what's more likely is that just the ways
44:33
in which we collect and curate
44:36
this data, prepare this for subsequent
44:38
kind of, you know, future trainings will need
44:41
to evolve and adapt to account for this.
44:43
Because yeah, like the distribution of data, the
44:45
kind of properties that it has, the kind
44:47
of strengths and weaknesses that it has, like,
44:49
is going to be different compared to human
44:51
generated data. And like, we do see that,
44:54
I mean, you know, probably, yeah, it was
44:56
like this fun study I recently came across
44:58
that kind of tried to look at, like
45:00
that tried to study what percentage of peer
45:02
reviews tend to be like, chat GPD generated
45:05
at these kind of academic conferences. And, you
45:07
know, it was like, quite evident, like,
45:09
by just the distribution of tokens, for example,
45:11
in the peer reviews, that, okay, there's this
45:13
massive spike since like last year or so.
45:15
But yeah, I mean, probably, I think what
45:17
this will mean in practical terms, at least
45:19
that's my kind of assumption is that we'll
45:21
just evolve how we collect and
45:24
kind of parse curate the data
45:26
so that ultimately it is, it does still
45:28
end up being useful for a lot of
45:30
training. Yeah, I think like at
45:32
the moment, anyway, like based on my own experience,
45:34
like the thing that you're talking about with the
45:37
peer reviews, like, if you're like a heavy chat
45:39
GPT user, I think you can see certain patterns
45:41
in there. Yeah, exactly. The structure of
45:43
a paragraph, of a sentence, certain words come
45:45
up more frequently than probably like a human would
45:47
write. So there are signals at
45:49
least today, and who knows, like in a
45:52
few years from now, the models get better,
45:54
will the variance level in terms of the
45:56
output get better. But today, there's definitely patterns
45:58
that are recognizable as like an
46:00
LLM generated piece of content. Yeah,
46:02
yeah, absolutely. So as we start to
46:05
wrap up, what's next for Refuel and
46:07
is there anything else you'd like to share? Yeah, absolutely.
46:09
I mean, first of all team, we're eight people now.
46:11
And I mean, since the start of the year, we've
46:14
had something like a close to
46:16
like a thousand X growth in terms of like just
46:18
the volume of data that we're processing on
46:20
a monthly basis. So yeah, just like
46:23
a lot of the team's efforts
46:25
today, at least on product and infrastructure
46:27
side are focused on scaling stability and
46:29
just ensuring that
46:31
we can manage some of
46:33
this growth and our users
46:36
and customers don't have to kind of bear the
46:38
brunt of it, which occasionally happens. And sorry folks
46:40
for that, but yeah, we're doing the best we
46:42
can. So I think that's part of
46:45
it. And we see that kind of just being
46:47
an important area where the team invest in for
46:49
like the next three to six months. So
46:51
there's a lot of research happening
46:53
in the field of improving LLM
46:55
output quality, reliability, and kind of
46:57
training efficiency. Things like low rank
46:59
adapters, for example, that make training
47:01
a lot more parameter efficient. There's
47:04
things around like reduced precision inference
47:07
that basically where you can very
47:09
aggressively quantize like the model weight
47:11
and still get to a good
47:14
output.
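To make the low-rank adapter idea concrete, a toy forward pass: the large weight matrix W stays frozen and only the two small matrices A and B would be trained, which is where the parameter savings come from. This is a sketch of the idea, not any particular library's implementation:

```python
import numpy as np

d, r = 1024, 8                      # model dimension vs. adapter rank (r << d)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))         # frozen pretrained weight: d*d = ~1M parameters
A = rng.normal(size=(r, d)) * 0.01  # trainable low-rank factors:
B = np.zeros((d, r))                # 2*d*r = ~16K parameters, ~64x fewer

def lora_forward(x):
    # Output = frozen path + low-rank update path (B @ A acts like a small delta on W).
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
print(lora_forward(x).shape)        # (1024,)
```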
47:16
So yeah, what we can learn and incorporate
47:18
into our product and infrastructure is one
47:20
of the kind of ongoing area of
47:23
investment for us. And yeah,
47:25
then there is kind of training future
47:27
and better versions of our own LLM
47:29
that really powers a lot of our
47:31
product use cases. These are
47:33
like a few areas where I foresee the
47:36
team investing in the product and infrastructure. On
47:38
the infrastructure side, are there unique
47:40
like scaling challenges or maybe scale
47:43
challenges that get introduced earlier due
47:45
to the nature of doing work
47:47
with these AI models? Like
47:49
earlier compared to like building. Yeah, if you were doing,
47:51
you know, pretty much anything else like a B2B, like
47:54
the standard non-AI based like
47:57
application, or is it sort of just business
47:59
as usual, like, we need to scale our
48:01
infrastructure. We're going to need more servers. We need
48:03
to run in more regions or something like that.
48:06
Yeah, I mean, okay. So I would say it
48:08
is ultimately like the same kinds of challenges,
48:10
which is like, yeah, just around resources and
48:13
managing resources to like match with throughput and
48:15
so on. But I think
48:17
just given how kind of how
48:19
young the kind of, you know, probably the ecosystem
48:21
is, it does tend to be
48:23
harder. And you do have to face it
48:26
a lot earlier. Because, like,
48:28
for example, like cloud providers have fairly good,
48:30
I think managed offerings for a lot of
48:32
different software and infrastructure things.
48:34
Like a database, like Kafka. You need a
48:39
queue solution, okay, you just use
48:41
Kinesis. You need a massively
48:43
scaling key value store, you just use Dynamo. Of
48:46
course, like if you're actually innovating as
48:48
a business in those areas, of course, it would make sense to
48:50
like not use those, at least with
48:52
like when it comes to supporting LLM training
48:54
and inference at scale, like unfortunately, there just
48:57
isn't too many that are good out
48:59
of the box solutions. There's tools that
49:01
we can rely on leverage, which we
49:03
do, but yeah, beyond that, it's yeah,
49:06
like things like, okay, how do
49:08
you even benchmark LLM throughput, right? It's like,
49:10
it's not a very trivial question, I would
49:12
say, because there is kind of, yes, like
49:15
there's things at the level of requests, but
49:17
then you have to account for like, okay,
49:19
what are each of these types of requests,
49:21
like in, you know, in terms of input
49:23
and output tokens. Input tokens are much
49:26
easier to scale. Output tokens are very
49:28
hard, you know, almost like the latency
49:30
increases linearly as a function of like the
49:32
length of the output.
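A back-of-the-envelope way to see that linear relationship; the per-token costs below are made-up placeholders rather than measurements of any particular model:

```python
def estimated_latency_ms(input_tokens, output_tokens,
                         prefill_ms_per_token=0.2, decode_ms_per_token=20.0):
    # Input tokens are processed in parallel (cheap prefill); output tokens are
    # generated one at a time, so latency grows linearly with output length.
    return input_tokens * prefill_ms_per_token + output_tokens * decode_ms_per_token

print(estimated_latency_ms(1000, 10))   # ~400 ms
print(estimated_latency_ms(1000, 500))  # ~10200 ms: output length dominates
```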
49:35
And then, yeah, anyway, that's an example of, okay, these
49:37
are some of the best practices for how to
49:39
do this and how to do this well are
49:42
still being figured out and written. And so,
49:44
that said, it's also, you know,
49:47
part of the fun, I guess. Yeah, absolutely.
49:49
Well, Nihit, thanks so much for
49:51
being here. This was a really interesting conversation
49:54
and I'm excited to see what Refuel continues
49:56
to develop and come out with. Certainly, yeah.
49:58
Thanks so much, Sean. Great chatting with you
50:00
and see you around. Yes, cheers. All right,
50:02
thank you.