Data synthesis for SOTA LLMs

Released Tuesday, 6th February 2024

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
0:06

Welcome to Practical AI. If

0:09

you work in artificial intelligence, aspire

0:11

to, or are curious how AI-related

0:14

tech is changing the world, this

0:16

is the show for you. We

0:19

just dropped Dance Party, our

0:21

third full-length album on changelog

0:23

beats. Buy it on Bandcamp

0:25

and iTunes, or stream it on Spotify,

0:28

Apple Music, and the rest. Link in

0:30

the show notes. Thank you to our

0:32

partners at fly.io. Launch

0:34

your app close to your users. Find

0:36

out how at fly.io. Welcome

0:44

to another episode of Practical AI.

0:46

This is Daniel Whitenack. I am

0:48

the CEO and founder at Prediction

0:51

Guard. I'm joined as always

0:53

by my co-host, Chris Benson, who is

0:55

a tech strategist at Lockheed Martin. How

0:58

are you doing, Chris? Doing great today. It was nice seeing

1:00

you a few days ago in person. In

1:02

the flesh. In the flesh. Yeah, that

1:04

was great. I think you posted a

1:07

picture on LinkedIn, so if

1:09

anybody doesn't know what we look like and has

1:11

some crazy reason to want to know, there's a

1:13

smiling mug of us on Daniel's profile.

1:16

Yes. Yes. The

1:19

reason we met is I was on a client

1:22

visit on site and we were

1:24

prototyping out some stuff like chat

1:26

over your docs and natural language

1:28

to SQL stuff and all sorts

1:31

of things with Prediction Guard. One

1:34

of the models that we were using was

1:36

from Nous Research. That

1:38

works out great because we

1:40

have Karan Malhotra here who

1:42

is from Nous Research, co-founder

1:45

and researcher there. Welcome.

1:48

Glad to have you, Karan. Hey, all. Thanks

1:50

for having me. I'm extremely excited to chat with you

1:52

guys. Yeah, like I said, I'm

1:55

a huge... Well, this is

1:57

our first time meeting, but I feel like we're

1:59

already friends. because I've had

2:01

so much of my own benefit

2:03

and interaction in working with models

2:05

from Nous Research, a lot of

2:07

amazing models that you posted on

2:09

Hugging Face and research that you're

2:12

doing. I'm wondering if you could

2:14

just give us a little bit

2:16

of a background about Nous

2:18

specifically and kind of

2:21

how you came together as

2:23

researchers and started. Because to me,

2:26

from the sidelines, it seemed like, oh, all of

2:28

a sudden there's these amazing models on Hugging

2:30

Face and I don't know who these people are,

2:32

these Nous Research people, but they're amazing. So

2:35

give us a little bit of the

2:37

backstory there. Absolutely. Yeah. So just as

2:39

a general overview, we are one part

2:42

like open source research organization. We put

2:44

these models out for free. We

2:46

put a lot of research out for free, some data

2:48

sets so people can build on top of these open

2:50

models. On the other

2:52

hand, we're very recently a company

2:55

as well, a C Corp. So

2:57

we've been working pretty hard after

2:59

getting some seed funding on building

3:02

together some exciting stuff I won't go

3:04

into on during the overview point, but

3:07

we're continuing to do our open source

3:09

research and development and release of models

3:11

indefinitely. The way we started

3:14

is very interesting. And it would be

3:16

pretty out of nowhere to the outside

3:18

for sure. It was it was extremely

3:20

fast for us. We're a

3:22

collective of people who have been playing

3:24

around in the open source language

3:26

model space for a while, ranging

3:28

from, like, the GPT-2 release to the Llama

3:30

release to like the first Transformers paper,

3:33

we've got people from various

3:35

eras of gen AI of when they

3:37

came in. And for myself, it was

3:40

GPT-2. I stumbled upon a

3:42

colab notebook and started fine

3:44

tuning, made some Edgar Allan Poe and

3:46

Lovecraft tunes. I've

3:48

done the same. That's awesome. And we

3:50

just got pulled into this

3:53

world of look at these next

3:55

token predictors that are just mashing

3:57

together the most wonderful and amazing

3:59

stories. that slowly turn into

4:01

a deeper and deeper dive of, well,

4:03

how can I use this for learning

4:05

information? How can I learn to use

4:08

this for production and automation? It's evolved

4:10

over time. For us, we started

4:13

off just working with

4:15

different open source collectives actually.

4:17

Once OpenAI released GPT-3 and

4:19

had closed sourced it, we

4:21

were used to open source GPT-2. We're like, oh man, what are

4:24

we going to do? How are we

4:26

going to continue to play with a level

4:28

of customization and interactivity that we had with

4:30

GPT-2? Then Eleuther had

4:35

released GPT-J 6B, the

4:35

KoboldAI community, this community of people

4:37

who tune models and inference models, started

4:39

to pop up, I think, around 2020,

4:41

2021 in the face of this. A

4:47

lot of us started to have places to

4:49

centralize and play with these models. We

4:51

got to contribute and learn how to

4:53

become better, open source AI developers, etc.

4:57

Eventually, there was a need for

5:00

more concrete organizations to do this

5:02

focused work on the creation of

5:04

these models. We were

5:07

stuck with OK architectures for a

5:09

while, like Pythia. But thanks

5:11

to Meta, we wouldn't be

5:13

here without Meta. I'll say that

5:16

first and foremost. The great Llama.

5:18

Yeah. Prior to Llama, everyone's like,

5:20

oh, Facebook evil, my

5:22

data, etc. Here

5:24

we are. They are like the

5:26

shepherds of this new era of the open

5:28

source AI movement. When Llama

5:30

came out, there was a paper that

5:33

came out called Alpaca by Stanford Lab.

5:36

This was about distilling data

5:38

from bigger models like

5:40

GPT-3, ChatGPT, GPT-4, and being

5:42

able to train smaller

5:45

models on that distilled synthetic data,

5:48

something they called the instruction data. The

5:51

Alpaca format really opened up the

5:53

playing field for everybody to start making

5:55

these instruct style models, these actual four

5:58

prod use style models. So

6:01

there was an idea I had in my

6:03

head of well the alpaca guys are using

6:05

only GPT 3.5 outputs. What

6:08

if I only generated GPT 4 outputs? It'll

6:10

be a little expensive, but you'll

6:12

probably get a better model out of it than

6:14

alpaca. At the same time that I

6:16

was looking at this, there was a

6:18

guy on Twitter named Teknium who had just

6:21

started putting together his own synthetic data set

6:23

based off alpaca and the GPT 4 only

6:25

as well. So I was

6:27

working with a group at the time called

6:29

Open Assistant under LAION. They're

6:32

a really big nonprofit. And

6:34

while I was working on that, we had

6:36

some GPUs. They were cool with us using

6:38

towards the development of new models. So I

6:40

reached out to Teknium and said, hey,

6:42

I have a little bit of compute. You

6:44

have GPT 4 data in the same format.

6:46

I have GPT 4 data in the same

6:49

format. Let's train a model. So

6:51

we trained a model called GPT4-x-

6:53

Vicuna. This model was built on

6:56

the Vicuna fine-tune. We fine-tuned a

6:58

fine-tune, basically. The Vicuna model was

7:00

an Alpaca-style fine-tune and we tried our

7:02

data set on top of it. It

7:05

was good. It was okay. Then we

7:07

thought, you know, we'll probably get a better result

7:09

if we just train on the base llama model.

7:12

And the resulting model was the

7:14

very first Hermes model. Gotcha.

7:17

The OG. The OG. And that's kind

7:19

of how it started to come together

7:22

was we both had

7:24

a data thesis on use GPT 4

7:26

only and follow alpaca. And

7:28

we trained on Llama and we got Hermes. And

7:31

we didn't know what benchmarks were. We didn't

7:33

know anything about any of this

7:35

stuff. We just made a model. And

7:38

it got a ton of attention. We

7:40

put it out under this name, Nous Research. Nous

7:43

comes from the Greek word for intellect.

7:45

We thought it was a good name

7:47

for an AI company. But

7:50

it was just a place for, you know, fun

7:52

projects and fine tunes and stuff. It was just

7:54

a name we were using for our collaboration. And

7:57

people started swarming and asking, you know, what's

7:59

your name, Nous Research? Like, what's this sudden

8:02

like mystical like open source

8:04

organization that like put out this like best

8:06

model and we're like, yeah, best model like

8:08

we just you know, we just tried something.

8:11

It was it was really organic. And

8:13

it got to the point that people started telling

8:16

us, you know, you must have trained on the

8:18

benchmarks, like these are doing too well. And we

8:20

were like, what's benchmarks? We're not really like

8:24

coming from an academic place as much

8:26

as from like a enthusiast that became

8:28

so committed that it became our life, right?

8:31

It became our day to day. Yeah. So from

8:33

there, people started to ask us,

8:35

can I join Nous Research? Now,

8:37

there wasn't a Nous Research to join.

8:40

Just two guys, right? What

8:42

ended up happening was we formed a

8:44

private discord server. And we thought there's

8:46

a lot of people who range

8:48

from somebody who's like

8:50

a 16, 17 year old savant on

8:52

Twitter, hasn't even been to

8:54

college yet, insane at transformer stuff, to

8:57

mid 30s, you

9:00

know, working a really, really good FAANG-

9:02

esque job, and just wants to

9:04

really create and let loose. That was another class

9:06

of volunteer. And then you have, you know, older

9:09

gentleman who has already exited a company or

9:11

something who has just been playing with code

9:13

for a while and wants to jump in

9:15

and hang out. So we ended up being

9:17

this really eclectic group, you know, we don't

9:19

know what your name is, we don't know

9:21

what your race is, we don't know your

9:23

gender or anything. It's just Discord profile picture,

9:25

Twitter profile picture, right? So we

9:28

came together, grew to about like 40

9:30

people all working together on various

9:33

different projects like Hermes tunes, data

9:35

synthesis, the capybara series, context length

9:37

extension, etc. And just from

9:41

this kind of interaction between Twitter and discord and

9:43

bringing people in that we thought were cool, we ended

9:46

up becoming what people will call open

9:48

source research org. Yeah,

9:51

that you sort of stumbled into

9:53

creating this amazing research

9:55

organization, which is ruling the world,

9:58

which is It's

10:01

what OpenAI might have been. Oh, well,

10:03

yeah. It's really sweet.

10:05

Thank you, guys. Yeah. And

10:07

I love it. It's so cool to

10:09

hear that story and that background. And

10:12

I see, like, in my own sort

10:14

of little snapshots here and there, like,

10:16

connecting that in my mind over the

10:18

past couple of years as I've seen

10:20

you all post different models and that

10:23

sort of thing. This is something, you

10:25

know, we've definitely touched on on the

10:27

show before, but some of our listeners

10:29

might not kind of fully

10:31

grasp when you say this sort

10:33

of, like, synthetic data sets that

10:35

you were focused on in this

10:37

alpaca format. Could you kind of

10:39

explain a little bit, like, we've

10:41

talked a lot about fine tuning

10:43

and, you know, preference tuning and

10:45

RLHF and different things, but what

10:47

does it specifically mean that, like,

10:49

you would take synthetic data?

10:51

What does that mean in your case?

10:54

And like, why does that result

10:57

in something good in fine tuning and

10:59

open model? People might think, oh, this

11:01

is synthetic data. Why should I expect

11:03

it to, like, be any good? So

11:05

could you kind of help explain that subject

11:07

a little bit? Yeah, absolutely.

11:10

So, I mean, out of context, synthetic is

11:13

like as meaningless as, like, artificial, right? It

11:15

could be data is data. But

11:17

in this case, it's referring to a particular

11:19

class of data that's been generated by another

11:22

language model or another AI, another

11:24

diffusion model, etc., that can actually

11:26

be used to further train models. Now, you might

11:28

say, why would you want to do something like

11:30

that? How is it helpful? What

11:32

was important to us is we were all GPU poor,

11:35

right? We were all running on laptops or maybe a

11:37

3090, maybe a 4090. As

11:40

individuals, we don't have data centers. So

11:43

training or even tuning, like, a large

11:45

model in the early days, like, 70

11:47

billion parameters, something like that, was just

11:49

unfeasible for us. And knowing

11:51

that GPT-3 is like something like 175 billion parameters

11:53

and 3.5 and 4 can only

11:57

go up from there, the

11:59

question became how can we make these

12:01

small 7 billion parameter

12:03

models even compete with

12:05

these massive ones? These

12:08

ones that I want to run offline, these ones that

12:10

I might want to run on an edge device, on

12:12

a phone, on a drone, etc. How can

12:14

I make them even useful? So

12:16

there's two things to talk about here. One

12:18

is synthetic data and the other is distillation.

12:22

Synthetic data is just referring to any

12:24

kind of data that's created by a

12:27

model in this case. The

12:29

reason that's useful is in

12:31

particular distillation. So if I told

12:33

you to go

12:36

study comp sci for 10

12:38

years, for example, and put in that massive

12:41

time investment and really focus on general

12:43

programming, and then I told you

12:45

now it's time for you to learn about AI and

12:47

transformers and stuff and put you through all the math

12:50

prerequisites, etc. You're going to come

12:52

out with a really strong foundation of

12:54

how to do the work, but the problem

12:56

is you've put in a massive time investment.

12:58

Now, if I take that guy who spent

13:01

10 years doing engineering, another

13:03

five years doing AI, and I ask

13:05

him, hey, can you teach somebody like

13:08

just really important, like compressed tidbits that

13:10

will help them just get up and

13:12

running to do the work? That's data

13:15

distillation, right? That's knowledge distillation. So you

13:17

look at these big models, like

13:19

a cloud or a 70B model or GPT-4,

13:21

and you can see like, they're amazing. They're

13:24

brilliant at everything. They have a bunch of

13:26

high-quality data they're trained on, and they have

13:28

a bunch of low-quality data

13:30

they're trained on that they

13:32

can interact with and express

13:34

in a high-quality form. So

13:36

instead of me having to read a massive

13:39

10 pager for why some

13:42

chemical reaction or some like tax-based process, whatever

13:44

you want it to be, like, instead of

13:46

reading a massive document on that and then

13:49

feeding that to a language model, we

13:51

can just have that really smart model

13:53

that already understands it really well, compress

13:55

that information into an instruction

13:59

or into a context. conversation until like

14:01

two sentences, three sentences, five sentences,

14:03

like half a page. And

14:06

we can just train a much smaller model

14:08

on that compressed information.

14:12

And it will learn the compressed

14:14

information, you know, to the degree

14:16

that a language model learned something, you know,

14:18

not perfectly. But because of that, what

14:21

the alpaca guys did was they generated

14:23

a bunch of seed tasks from GPT

14:25

3.5 various different domains

14:27

and topics and created these kind

14:29

of compressed instructions with the instruction

14:31

and input question from the user

14:33

and then an answer. So

14:36

the instruction could be like, given the following

14:38

math equation, explain step by step why

14:40

this is the answer. And then

14:43

the input is the equation, which is your

14:45

question. And then the output is the compressed

14:48

answer. So all of that we can

14:50

take as one sample in the data

14:52

set, and we can make hundreds of

14:54

thousands or millions of samples like that

14:56

of various different domains and various different

14:58

tasks. So the alpaca guys did this, less

15:01

than 100k examples, I believe, and they

15:03

trained the llama models on

15:06

these, and they found massive

15:08

boosts to performance that this

15:10

distilled information like a human

15:12

successfully compresses and transfers over.
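To make the format concrete, here is a minimal sketch of what one Alpaca-style sample and its rendered prompt might look like; the instruction/input/output fields follow the format described above, while the example content and the exact header wording are illustrative rather than the original Alpaca release verbatim.

```python
# A minimal sketch of one Alpaca-style instruction sample and the prompt
# template it gets rendered into. Field names follow the Alpaca format
# described above; the example content and header wording are hypothetical.
sample = {
    "instruction": "Given the following math equation, explain step by step why this is the answer.",
    "input": "2x + 6 = 10",
    "output": "Subtract 6 from both sides to get 2x = 4, then divide by 2, so x = 2.",
}

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

print(ALPACA_TEMPLATE.format(**sample))  # one supervised fine-tuning example
```

A dataset in this style is just hundreds of thousands of such samples across different domains and tasks.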

15:15

So when I saw that, and then independently when

15:17

Teknium saw that, and then independently when many others

15:19

saw that we were like, this

15:21

is so intuitive. This is exactly

15:24

how I've learned anything by just

15:26

going on discord and Twitter and bothering people to give

15:28

me the compressed bit of how I do something. We

15:31

should try doing this with even higher quality models than

15:33

3.5. So we

15:36

created, I can't remember the exact

15:38

number at the moment, but at least 50,000, maybe

15:42

100,000 examples originally for Hermes one

15:44

like this just using GPT 4.
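A rough sketch of that kind of GPT-4 distillation loop is below: prompt a stronger teacher model with seed instructions and save its answers as training samples. The OpenAI client calls are just one way to do it, the seed task and file name are hypothetical, and this is not the exact pipeline used for Hermes.

```python
# Rough sketch of teacher-model distillation: ask a stronger model to answer
# seed instructions and store the answers as training samples. The seed task
# and output path are placeholders; assumes OPENAI_API_KEY is set.
import json
from openai import OpenAI

client = OpenAI()
seed_tasks = [
    {"instruction": "Explain step by step how to solve the equation.", "input": "2x + 6 = 10"},
]

with open("synthetic_instructions.jsonl", "w") as f:
    for task in seed_tasks:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": f"{task['instruction']}\n\n{task['input']}"}],
        )
        task["output"] = response.choices[0].message.content
        f.write(json.dumps(task) + "\n")  # one distilled sample per line
```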

15:47

And then we trained on that

15:49

and ended up getting performance that

15:52

was extremely, extremely like massive boost

15:54

compared to the other models that

15:56

were not trained using this kind

15:58

of method. So without

16:01

these giants that have

16:03

already established themselves in the space, we wouldn't

16:05

be here. Like without open AI, without meta,

16:07

like we literally wouldn't have the model and

16:10

the data to do the kind of work

16:12

that we did to make Hermes. What

16:14

it allowed for us is like for

16:17

local models to finally be like comprehensible

16:19

and for us to finally have like

16:21

offline capabilities to kind of take the

16:23

good stuff from something like GPT four

16:26

or something else and make

16:28

it uncensored. So it still has all

16:30

this understanding of all these

16:32

topics, but it doesn't have all that

16:34

RLHF inside it necessarily that safety-izes

16:37

it, so that when people

16:39

utilize the model, it has all this intelligence, but

16:41

it has more freedom of thought to kind

16:43

of converse with you on topics that open

16:45

AI may reject. Gotcha. One of the things

16:47

I was curious about as you were going

16:49

through that was a few episodes

16:51

back, Daniel and I were kind of talking

16:53

about the effect of model licensing, you know,

16:55

on the community and the different

16:58

kind of licensing concerns that were coming out

17:00

from whether it be, you know, meta open

17:02

AI, you name the organization, is that ever

17:04

a challenge for you since you're kind of

17:06

using those to get started in terms of the

17:09

inputs, is that been a

17:11

concern or do you anticipate it

17:13

being a concern? I think that

17:15

of course, like generally like the

17:17

US international regulation on this stuff

17:19

is evolving. The conversation is evolving

17:21

very much. So naturally there's like,

17:23

you have to keep it top of mind. You

17:26

have to think about these kinds of things. But

17:28

thankfully, because all of our model releases are like

17:30

open source and we don't profit from them. Like

17:32

if somebody goes off and creates a product using

17:34

our model, you know, good for

17:37

them, but we don't necessarily take on

17:39

that liability or that worry of saying,

17:41

Hey, like we're going to sell you

17:43

this model that was created with GPT-4

17:45

outputs. We actually actively try

17:47

to stay away from doing that. But

17:50

because the data distillation paradigm is so

17:52

effective, you know, if a model comes

17:55

out that's better than GPT-4, and it's open

17:57

source and I can use it locally. And

18:00

in their TOS, it says, you know, you

18:02

can use this to make a commercial model,

18:04

that we can apply the same techniques that

18:07

we've been preparing and researching and understanding from

18:09

these close models and use it there. So

18:11

right now, like, we don't stand to or

18:13

try to or have any plans

18:16

to profit from using any of these outputs.

18:19

We're not about that, because we want to

18:21

be careful and respectful of these model creators.

18:23

But that and these companies, but that being

18:25

said, we're learning all these techniques

18:27

and developing all these techniques that will be

18:30

useful for when that time comes and for

18:32

when that's available, especially with the advent of

18:34

something like Mistral. If we

18:36

do distillation from a Mistral model like Mistral

18:38

medium or something like that, that's

18:41

completely, from my understanding, you know,

18:43

barring their TOS saying otherwise, but

18:45

I believe it doesn't. It's

18:47

completely okay in that situation

18:49

for us to create models like this

18:51

that can be used commercially, etc. Regarding

18:54

the TOS stuff, though, like, as

18:57

much as we err on the side of caution, I

19:00

find it hard to see

19:02

a company enforce their TOS

19:05

when these larger models

19:07

are likely trained on

19:11

not all copyright free stuff.

19:13

Like, I find it

19:15

hard pressed to believe that these

19:17

closed source companies, their models are,

19:19

you know, totally copyright free and

19:21

totally copyright clean. So if

19:24

some other company that was feeling a little

19:26

more rambunctious than ourselves was to say,

19:28

you know, we're going to commercially release some

19:30

of this, I imagine it'd

19:33

be difficult for them to

19:35

be come after without the other group

19:37

opening their books. And there's actually

19:39

pretty interesting interaction that happened regarding

19:41

this between Google and open

19:44

AI, if you guys are familiar. So

19:47

yeah, I saw this interesting picture the other

19:49

day, it was like the interesting web of

19:51

AI, and it was like how Microsoft,

19:54

Google, open AI, like, it's like on

19:56

one side, there's the ones and it

19:58

shows how they're connected. to

20:00

the other ones is like

20:02

this visualization and like how

20:05

many of them overlap in

20:07

these strange ways between like,

20:09

whether it's together or Mistral

20:11

or meta, Google, Microsoft, OpenAI

20:13

is sort of very interesting

20:16

web of connections that probably

20:18

makes some of these things rather difficult.

20:20

Leave it for the lawyers to sort

20:22

out. Yeah. Yeah, that's the

20:24

thing is like, we can look at an

20:27

example, right? Like you hear that phrase like

20:29

good artists copy, great artists steal, right? Like, so

20:31

the data distillers, we're copying, right? Like, we're

20:34

just distilling this information, like we're trying to

20:36

like, make our models more like those. And

20:38

we don't really plan to commercialize, we're just

20:40

doing it for free for everyone. But the

20:43

great artists are, you know, Google, you know,

20:45

like, you look at Bard, and

20:47

it tells you, you know, I was made by OpenAI.

20:49

Now, it's fine for our open source models to say, I

20:51

was made by OpenAI, because we're very transparent that this

20:53

is trained on GPT outputs. But when

20:55

Bard violates the TOS with a paid

20:57

product, bold,

21:00

yeah, that sounds like I was trained by OpenAI, right?

21:03

You'd think that OpenAI would come

21:05

after this multi billion dollar company,

21:07

like immediately, right? Instead, first

21:09

you see

21:11

Google deny it, then you see

21:13

a tweet from Sam Altman, which was something

21:15

along the lines of, I'm paraphrasing,

21:17

something like, I'm not

21:19

mad that they trained on our outputs. I'm

21:21

mad that they lied about it. And

21:24

I'm sitting there like, okay, you're mad

21:26

about this. But, like, aren't

21:26

you going to pursue the legal action in

21:29

your terms of service? No, no, because everyone

21:31

would have to open their books up. That

21:35

being said, I don't condone the

21:38

commercial use of that kind of stuff.

21:40

Like releasing, like making a paid model

21:43

from GPT-4 outputs. Like, I wouldn't advise

21:45

anyone sell a model made with them, just

21:47

because like, you know, we want to

21:49

respect people's like TOS and stuff. They worked

21:51

hard and spent billions to make this stuff

21:53

for hundreds of millions, however much they spent.

21:56

But there is certainly room for

21:58

hypothesis. in

22:00

that realm of the large

22:03

courts. So that's my thoughts

22:05

on the licensing stuff. And that's definitely my

22:08

own individual thoughts. Like we're

22:10

a pretty decentralized collective at Nous.

22:12

So you'll find people with all

22:14

sorts of opinions all over the

22:16

place. And as a company, we

22:18

don't hold any view whatsoever on

22:20

that. Yeah, I'm wondering, maybe this

22:22

gets a little bit to the

22:24

distributed nature of this, but I

22:26

know that there's sort of various

22:28

collections of what the Nous

22:31

Research Group has done over

22:33

time. You mentioned Hermes, but

22:35

then there's these other kind

22:37

of categories of things too,

22:39

like the yarn models, Capybara,

22:41

Puffin, Obsidian, just looking over

22:43

the hugging face now. I'm wondering if you

22:45

could just give us, like from your perspective,

22:47

a little bit of a map of

22:50

these different things and like how

22:52

people might categorize the different collections

22:55

of what Nous has done. I

22:57

definitely want to talk about like the future

22:59

things and ongoing things as well, but

23:02

as it stands now, what are

23:04

the kind of major categories of

23:06

what the collective has invested in

23:09

their time in over time? Certainly,

23:11

certainly. So within the stuff

23:13

that's viewable on Hugging Face at least, we've

23:15

got the Hermes series of which,

23:18

like I told you guys, the initial

23:20

story of how it went down. But

23:22

from there, Teknium kept going. I haven't

23:24

personally had any interaction with the Hermes

23:26

model since the initial. From there, Tech

23:28

just continued to create more and more

23:30

synthetic data, collect from more and more

23:32

sources, use more and more open data

23:35

sets. And he's just got the, I

23:37

guess, award-winning data thesis. The guy

23:40

really knows how to go about

23:42

curating and synthesizing good data. So

23:45

Teknium, it's his baby, the

23:47

Hermes project. So everything you've seen since is

23:49

really his work and anyone who's kind of

23:51

collaborated with him. But, almost, like,

23:54

you can't call anything a solo project,

23:56

because of the open data sets we use too.

23:59

Everything is built on the shoulders of giants and

24:01

the shoulders of each other as little people. But, uh,

24:03

tech really has helmed the Hermes initiative

24:06

so far. I think that's our most

24:08

popular model series and he released the

24:10

open Hermes as well, because we

24:12

had some data in the original Hermes that

24:14

we never released publicly and, uh, we wanted

24:17

to make that kind of an option for

24:19

everybody. So that's Hermes. It still

24:22

follows the same kind of philosophy of

24:24

synthetic data. And it now uses the

24:26

ChatML format instead of the

24:28

Alpaca format, which is what we kind of upgraded to.
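For reference, here is a small illustration of the two prompt formats being contrasted: Alpaca-style headers versus ChatML's turn markers. The example instruction and system message are made up, and exact header wording varies between implementations.

```python
# Rough illustration of the two prompt formats mentioned: the older Alpaca
# style versus ChatML, which wraps each turn in <|im_start|>/<|im_end|> tokens.
# The instruction text and system message here are illustrative.
alpaca_prompt = (
    "### Instruction:\nSummarize the following text.\n\n"
    "### Input:\nLlamas are members of the camelid family...\n\n"
    "### Response:\n"
)

chatml_prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nSummarize the following text: Llamas are members of the camelid family...<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```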

24:31

Then you've got a capybara and puffin, which

24:33

are both done by a volunteer and, uh,

24:36

you know, OG member LDJ. We

24:38

may be familiar with Luigi Danielle Jr. So

24:41

the capybara series was, uh,

24:43

using an amplify instruct method,

24:45

this novel method that, uh,

24:48

LDJ had worked on alongside another

24:50

one of our researchers, Jay. So

24:53

LDJ and Jay can get confusing, but,

24:55

uh, uh, the two of them worked

24:57

on the Capybara series, created the dataset,

24:59

trained the models, and then puffin was,

25:01

uh, the idea of using handpicked

25:03

smaller samples from some of our

25:05

larger datasets to make sleek datasets for

25:08

an easy tune and see how

25:10

that works kind of, uh, in the

25:12

spirit of the Lima paper, where

25:15

they just used a few examples to get really

25:17

good results. Those are really

25:19

the popular tunes using synthetic data

25:21

for, like, general use. YaRN

25:24

is this novel context length extension

25:26

method at the time of creation

25:28

by emozilla, also known as

25:30

Jeffrey Quesnelle, and Bowen Peng,

25:32

also known as bloc97, alongside,

25:36

uh, Enrico Shippole and,

25:38

uh, EleutherAI. So what

25:41

happened there was these guys were already looking

25:43

into context, like the extension for a while.

25:45

And, uh, when we kind of

25:47

came under the noose banner to do the work,

25:50

uh, it opened up a little bit

25:52

of resources from compute sponsorships. It opened

25:54

up a more centralized place for them

25:56

to be able to do that collaboration.

25:59

I had no hand in the

26:02

YaRN models whatsoever. And that's the exciting

26:04

thing is everyone really gets to

26:06

work in their own spheres, in their own kind

26:09

of autonomous circles. And then we just check

26:11

in and see, you know, how's the research going?

26:13

How's it coming along? Because we really work with

26:15

people that we heavily believe in and we believe

26:17

in their idea. So if

26:20

we don't already have an idea, we're kind of

26:22

just saying, you know, please freely create because

26:25

we brought you in because what you will

26:27

freely create will push forth our agenda anyway.

26:29

So I think those are our big model

26:32

releases and series that we have available. Outside

26:35

of that, we have a bunch of stuff

26:37

on our GitHub as well. Stuff

26:39

that's being worked on, stuff that hasn't necessarily come

26:41

out yet. There's a lot of that. So

26:45

I got a question for you as a follow up. It's

26:47

pretty fascinating, the story that you've been telling

26:50

us here, because of that kind of organic,

26:52

you know, creation of the organization or collective.

26:56

And I'm wondering as you've done that and you

26:58

kind of went through and talked about the different

27:00

model groups, and kind of talked about, you know,

27:02

the owners or spiritual owners, if you will, of

27:04

each of those families, how do

27:06

the different members of the collective interact

27:09

to kind of share? Like how do

27:11

you each push each other along or

27:13

share information or give ideas so

27:16

that cross-family efforts can kind of benefit

27:18

from the overall collective? And as you

27:20

said now, a C corp and you

27:23

guys are more organized at this point.

27:25

So what kind of culture has developed

27:27

around those communications and learnings? Yeah,

27:30

absolutely. I mean, when it started, it was

27:32

just like a small discord, maybe like 10

27:34

people. From there, like we kind of

27:36

created more channels as people wanted to work on

27:38

more things. And we had

27:40

initially split up into like three,

27:43

four different topics or sectors that people

27:45

could assign themselves to. One

27:47

being data synthesis, of course, so we can

27:49

kind of find new novel methods and formats

27:51

for distillation and the creation of synthetic data.

27:54

One being training: people who are just really good at training, hyperparam stuff, people who will come up with new architectures and new techniques. Another being agents: a group of people who want to actually try to build tools and do autonomous work with this stuff.

28:09

And then we have this one category that, at the time, was a prediction for the future, which was the simulation side. We were very interested in kind of bringing this stuff into simulation, into Unity, and seeing how all these things came together.

28:22

And it's interesting, because the training built on the data synthesis, the agents build on the training, and then the sim would build on the agents, was kind of the idea. Everybody needed to work together, because all those things are so intrinsically connected, but people would have specializations and areas where they wanted to work. We didn't end up doing a lot on the sim side of things. Now, recently, there's a lot more interest, because we have a lot more, you know, capability generally, as the AI community does, you know. But as we've grown... we went to forty people, it was fine. Now we're going to, like, five dozen, to scores... it's a little... we'll be there. So what we do is we

29:02

kind of tier people in. You come into the Discord, you get to see maybe two channels, and then we'll give people a developer role. We don't really let people select their own roles, because we want to naturally kind of sort through people we know and let them through. And even as we do open source research, a lot of it is unreleased, and we want to make sure that it's kind of protected before release. So we

29:23

create this developer role so people can then see, like, way more channels of just general development, a lot of conversation. And from there, as we see, you know, contributors who have started to do more work, or show more passion towards contributing to Nous in a particular field, or who have some reputation or some portfolio in a particular field, then we'll assign them one of

29:46

those roles. And that will open up the family of channels relating to those roles and our current projects surrounding that role. So data synthesis projects, agent projects, training projects, etc. So we kind of just tier it out so people can track. And people who have been around for a while, or people we consider fellows or part of the core, those can usually see pretty much everything.

30:08

So they're pretty effective in serving as coordinators for the cross communication between these different channels and groups. And even if someone has a particular role, or some tier, or a particular role it's supposed to be a part of, like, it's still Discord and we're still very chill, so, like, people will still work on, like, various different overlaps instead of just one single thing.

30:44

By now you know that artificial intelligence is revolutionizing the way we produce information, changing society, culture, politics, the economy. But it's also created a world of AI-generated content, including deepfakes. So how can we tell what's real online? Read Write Own: Building the Next Era of the Internet, a new book from entrepreneur and investor Chris Dixon, explores one possible solution to the internet's authenticity problem: blockchains. From AI that tracks its source material to a new era of programs that compensate rather than cannibalize creators, Read Write Own is a call to action for a more open, transparent, and democratic internet. One that opens the black box of AI, tracks the origins of what we see online, and much more. It's time to reimagine and use new technologies to build the internet we want, not the one we inherited. Order your copy of Read Write Own today, or go to readwriteown.com and learn more.

31:51

And I

31:57

have a selfish question.

32:00

This is one of

32:02

the advantages of doing a podcast: we

32:04

get to talk to all the amazing

32:06

people doing amazing things and learn from

32:09

them. But I'm wondering as a person

32:11

who is also trying to fine-tune

32:13

some models, either just for my

32:15

own enjoyment and learning, but also

32:17

fine tuning models for specific tasks and

32:20

in a specific ah customer use cases

32:22

and that sort of thing, and there's a

32:24

lot of people out there, I think

32:26

many of our listeners, who are thinking

32:28

like that. Since you, being part of

32:31

this collective have worked for you know,

32:33

since the sort of dawn of

32:35

these, you know, the proliferation of

32:37

fine-tunes from Llama, et cetera.

32:39

And as you've seen all that as

32:41

you're doing more and more fine-tunes,

32:43

now as you're looking towards the future,

32:46

do you have any kind

32:48

of good advice or things

32:50

to keep in mind for all those,

32:53

like, fine-tuners out there that are

32:55

thinking about grabbing something off of hugging

32:57

face, creating their own versions of these

32:59

models. Maybe they have their own ideas

33:02

about a specific take on on a

33:04

model. Any general tips that you found

33:06

to be really useful over time or

33:08

like pitfalls that you'd like to highlight.

33:11

yeah, I mean, I can, I can

33:13

try to think of a few off

33:15

the top of my head. I'll say that

33:18

hyperparameters are really important. And,

33:20

ah, it's important to try to get that right.

33:23

It's going to vary from model to model, but

33:25

a lot of the time, some people think

33:27

hyperparams, like, don't really matter as much

33:29

to obsess over, and some people think

33:31

it's the secret sauce as well. So I'd

33:34

say, like, try to do a lot of

33:36

research into good hyperparams, a good

33:38

learning rate. I'd also say, like,

33:40

I could be totally wrong about this, as

33:43

I'm not the trainer of Hermes today or

33:45

a lot of these models, but something I

33:47

personally believe in a lot is, like, ignore,

33:49

like, people telling you to only train for,

33:51

like, X amount of time. If you're

33:53

not overfitting, like, just keep... like, if you

33:56

can, if you have the compute, keep training,

33:58

and keep training it for

34:00

more tokens, more epochs. That's something I

34:02

heavily believe in. In terms

34:04

of trainers to use, there's a lot

34:06

of people who make their own scripts

34:08

for specialty stuff. And there's, of

34:11

course, you can just use Hugging Face. The

34:14

library we use is called

34:16

Axolotl, A-X-O-L-O-T-L, like the animal.

34:20

It's by Caseus, Wing Lian,

34:22

of the OpenAccess AI Collective. We

34:24

think Axolotl is probably the best

34:26

general purpose trainer for LoRa's, Q-LoRa's,

34:28

Finetunes, et cetera. Any

34:31

open source repository has bugs and stuff

34:33

you're going to have to work out.

34:36

But it's, in my opinion,

34:38

probably the easiest and most effective

34:40

trainer to use for pretty much

34:42

any model architecture available right now.

34:45

So I definitely point everybody towards

34:47

Axolotl. Awesome. Yeah, that's

34:49

super useful. We'll share some links in

34:52

our show notes as well. So people

34:54

make sure and check that stuff out.

34:56

Another interesting question, as

35:00

you see, I

35:02

think we saw these waves

35:04

of models that came out

35:06

maybe around synthetic data, Finetunes,

35:08

or other types of Finetunes.

35:10

I see this interesting thing

35:12

happening over the past, however

35:17

many months, not that long in the

35:19

scheme of things, but in the AI

35:21

world, maybe a while, where we're now.

35:24

There's a lot of interesting approaches,

35:26

more so than just Finetunes, but

35:28

mixture of experts and merging, and

35:31

of course, multimodal stuff coming out. Now

35:33

I see Nous kind of dabbling in that.

35:35

You don't have to answer for the whole

35:37

collective. But as there's so many of these

35:40

things coming out and different approaches, what

35:42

are some of the things within

35:44

that? It doesn't have to be one

35:46

of those. But what are some of the things on

35:48

your mind moving forward or

35:51

on Nous's mind more

35:53

generally? Sure. I'll try to

35:55

go from simple to

35:57

complex on the kind of stuff. I

36:00

think that definitely just like straight

36:02

up instruction tuning is great. There's

36:05

other ways to tune, like the Evol-

36:07

Instruct method. I would advise

36:09

people to try to create new instruction

36:12

methodologies that allow us to make even

36:14

better formatted data. People don't

36:16

spend enough time trying to create new instruct

36:18

formats. And we've definitely been

36:20

swamped with not doing that as well. So

36:22

I think towards the general community, it's a

36:24

really easy place to get started. You don't

36:26

need to really know how to code so

36:29

much as think about how a human might

36:31

more effectively phrase something or format something

36:34

and kind of remix from there. I think that's

36:36

like probably the easiest place to start. Then

36:39

there's a model merging, right? Model merging

36:41

is great. You can just like take

36:43

two models and Frankenstein them together to question

36:45

mark results. You know, you got to

36:47

just try and see what happens and feel it out.
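As a concrete example of the simplest kind of merge, here is a minimal sketch that linearly averages the weights of two fine-tunes sharing the same architecture; dedicated tools like mergekit do much more, and the model names and interpolation weight here are placeholders.

```python
# A minimal sketch of "Frankenstein-ing" two models together by linearly
# averaging their weights. Assumes both checkpoints share the same
# architecture and state_dict keys; the model names are placeholders.
import torch
from transformers import AutoModelForCausalLM

model_a = AutoModelForCausalLM.from_pretrained("org/fine-tune-a")
model_b = AutoModelForCausalLM.from_pretrained("org/fine-tune-b")

alpha = 0.5  # interpolation weight between the two parents
state_b = model_b.state_dict()
merged_state = {
    name: alpha * param + (1 - alpha) * state_b[name]
    for name, param in model_a.state_dict().items()
}
model_a.load_state_dict(merged_state)
model_a.save_pretrained("merged-model")  # then "try and see what happens"
```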

36:50

Then from there, I would say

36:52

there's stuff like DPO. There's

36:55

RLHF, DPO, these kind of rewards

36:57

things that can let you like

37:00

enable rejections or create censorship or

37:02

put some kind of general concept

37:04

or attitude towards the model. We

37:07

found that to be pretty effective with the

37:10

latest Nous Hermes Mixtral DPO. It

37:12

seems like people really like it and prefer it over just

37:14

the SFT. So

37:17

that's another thing that I'd heavily recommend. From

37:19

there, we get a little more

37:22

complex. We have some reward model

37:24

stuff we're working on that I won't speak to just

37:26

yet outside of saying we're working on it that we

37:28

think is going to be like pretty big for reasoning

37:30

boosts. Of course, there's techniques like

37:32

chain of thought and tree of thought for like

37:34

multi-step prompting. Creating

37:37

data sets even out of that for

37:39

any of these purposes that are already mentioned is

37:41

going to be really effective. Now

37:44

to stuff that maybe not everybody can actually

37:46

a lot of people would already be able

37:48

to do this. Here's like something that we

37:50

like to call, over at Nous, activations hacking,

37:53

where you're kind of messing with the

37:55

way that a model I'm trying to think

37:57

about how to say this in like the

37:59

most layman's terms. Like, you're trying to

38:01

mess with how a model like generally vibes

38:04

about something so

38:06

rather than just doing a system prompt or something

38:08

like that you can actually like

38:10

change the model vectors to kind

38:12

of be like more political about something

38:14

less political about something more terse more

38:16

specific. Again, that's far more effective

38:19

control over a model than a system

38:21

prompt it's basically like a system prompt

38:23

that like tells it to embody certain

38:25

characteristics but it's not something you can

38:27

really jailbreak or get around as

38:30

far as my testing has shown, certainly not

38:32

as easily as a system prompt like

38:35

we have no problem jailbreaking even the

38:37

most censored closed models today like

38:39

it can be done by anybody with

38:41

the right words right but um this activation

38:44

stuff it really creates a bit more of

38:46

a robustness and fidelity to the concepts that

38:48

you're trying to tell it to embody.
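One common way this kind of activation steering is implemented is by adding a fixed steering vector to a layer's hidden states at inference time; the sketch below uses a forward hook for that, but it is a generic illustration, not necessarily Nous's exact method, and the model, layer index, and steering vector are placeholders.

```python
# Generic sketch of activation steering: add a fixed "steering vector" to one
# decoder layer's hidden states at inference time via a forward hook. Model
# name, layer index, and the random vector are placeholders, not a real recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 15
steering_vector = torch.randn(model.config.hidden_size) * 0.05  # placeholder direction

def add_steering(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] + steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
ids = tok("The weather today is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()  # detach the hook when done
```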

38:51

There's a few more I'm trying to think of that would

38:53

be useful for people. One

38:56

thing is soft prompting. It's not really

38:58

around anymore it used to be pretty

39:00

big during the GPT-J, like, pre-Llama

39:02

days when the KoboldAI

39:04

guys really pioneered the use of it in

39:06

the open source community but a

39:08

soft prompt basically takes, like, a massive prompt

39:10

and compresses it down to like way

39:12

less tokens so you can give your

39:14

model like a huge a huge

39:17

system prompt or huge amount of information

39:19

and use like way less tokens so

39:22

soft prompting is cool. It's not gonna be

39:24

too difficult to like update it for like

39:27

llama, Mistral, like today's architectures it's just like

39:29

nobody has really done it that I've seen

39:32

so you know to the community if

39:34

you guys do that please share that's

39:39

actually much easier than the activation

39:41

stuff, I think.
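For what it's worth, one existing implementation of the general soft-prompt idea for current architectures is prompt tuning in the PEFT library, where a handful of trainable virtual-token embeddings stand in for a long literal prompt; the sketch below assumes that library, and the model choice and token count are placeholders.

```python
# Small sketch of the soft prompt idea via PEFT prompt tuning: a few trainable
# "virtual token" embeddings get prepended to every input in place of a long
# literal prompt. Model and hyperparameters are placeholders.
from peft import PromptTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=32,  # the "compressed" prompt length
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the 32 virtual token embeddings train
```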

39:43

And then finally, probably the hardest unsolved thing is, like,

39:46

sampling methods like today

39:48

we use like top K top P

39:51

like, you know, nucleus sampling or

39:53

whatever like there's better ways to

39:55

pick tokens for sure there's better ways

39:57

to judge the value of tokens for

39:59

sure.
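For context, here is a minimal sketch of the standard top-k / top-p (nucleus) selection step being referenced: keep the k most likely tokens, then the smallest set whose cumulative probability stays under p, and sample from what remains; the thresholds are illustrative.

```python
# Minimal sketch of standard top-k / top-p (nucleus) sampling over one logits
# vector: filter to the k most likely tokens, then to the nucleus whose
# cumulative probability stays under p, renormalize, and sample.
import torch

def sample_next_token(logits: torch.Tensor, top_k: int = 50, top_p: float = 0.9) -> int:
    probs = torch.softmax(logits, dim=-1)
    probs, idx = torch.sort(probs, descending=True)
    probs, idx = probs[:top_k], idx[:top_k]            # top-k filter
    keep = torch.cumsum(probs, dim=-1) - probs < top_p  # top-p (nucleus) filter
    probs, idx = probs[keep], idx[keep]
    probs = probs / probs.sum()                         # renormalize
    return idx[torch.multinomial(probs, 1)].item()

# Example with a toy vocabulary of 10 tokens
next_id = sample_next_token(torch.randn(10))
```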

40:01

Everyone has been too kind of concerned with the higher levels to

40:03

go that low and do whatever

40:06

the magic math is that I can't do that

40:08

would, you know,

40:10

enable some steering and

40:12

some even beyond

40:14

steering like alternative sampling paradigms.

40:17

And I think that would probably

40:19

bring the biggest change in transformation

40:21

to literally all models regardless of

40:24

the tune regardless of the architecture

40:26

etc., if it gets pulled

40:28

off. So really looking forward to something like

40:30

that happening in the space. That

40:32

was a lot of really good advice that you have

40:34

there. I was sitting there trying to take notes while

40:37

you're talking through it and everything going wait, but

40:39

he said that too and he said that too.

40:41

There's a really good answer there. Thank

40:44

you for that. As we're starting to wind

40:46

up here, I wanted to ask

40:48

you, I know, as we're recording this, it

40:50

looks like it was just over three weeks ago,

40:53

about four weeks ago when we release this

40:55

episode. You guys announced your

40:58

$5.2 million seed financing

41:01

round. So congratulations on

41:03

that. That was pretty amazing. Thank you.

41:06

And I'm kind of wondering, so like you've

41:08

kind of started with this kind of fairy

41:10

tale story of kind of organically building from

41:12

the ground up, you know

41:14

yourself, you connect with somebody else,

41:17

a few other people join, you

41:19

get to thousands of people contributing,

41:21

you find and really producing amazing

41:23

work. And then

41:25

you're incorporating and now you got

41:27

the seed round coming. Where

41:29

does that lead you? It's kind of a sky's

41:32

the limit kind of scenario it seems, you know,

41:34

that now that you're kind of launching

41:36

and, you know, on that, you know, as

41:38

a corporation, as you said, where can

41:40

you go from here? What do you anticipate

41:42

over the next couple of years or

41:45

even several years out? You know, what's the vision?

41:47

What do you want to achieve? You've come a

41:49

long way so far. What's next? AGI.

41:51

No, I'm just kidding. I'd believe

41:55

you if you said it, actually. I

41:58

mean, like, you know, someone will do it. But

42:00

then you'll distill the knowledge.

42:03

Then we'll distill and then you'll run

42:05

the API on your neural link, on

42:07

your contact lens or something. But

42:12

for us, there's a huge focus

42:14

on locality. There's a huge focus on offline.

42:16

There's a huge focus on take the power

42:18

back, run the model yourself, do everything at

42:20

home. That's big for us.

42:22

And at the same time, of course, we believe in scale.

42:24

But there's this idea that there's so much unsolved at

42:26

the small model size. Why don't we

42:29

do that before we go to a trillion

42:31

params? Because we can scale those realizations. But

42:34

for us, there's certainly a transformation

42:36

and change in attitude and pressures

42:38

from going from pure open source

42:41

volunteer to as well having this

42:43

more corporate branch created as well.

42:46

But that being said, it's been pretty

42:49

consistent, our ethos and our motivation for

42:51

why we do this. And

42:54

like you said, it really was organic in

42:56

the sense that we're a product of the

42:58

times, we're a product of the atmosphere of

43:00

the community. People have said

43:02

nice things like you guys are setting the trend. And

43:04

it's not really true so much as the truth is

43:06

like, we are one of many embodiments

43:09

of the sentiment that the community has and

43:11

that the world has, we think. There's

43:14

more than one Nous Research in this world. There's

43:16

Alignment Lab, there's Pygmalion, there's Kobold, there's people who

43:18

have been around before us, people who will come

43:20

along the way, people who have already formed

43:22

since we have. And

43:24

there's lots of people who have kind

43:27

of embodied the Nous Research ethos. And

43:29

it's not really just our ethos as

43:31

much as the overall community's ethos. They're

43:33

people who have come before us, people who

43:36

will come along the way, who do

43:38

very, very similar style of work

43:41

as us, this kind of open work. And

43:43

I think that's got everything to do with the fact

43:45

that like, this is what the

43:47

people want. We're just the everyman, just

43:50

like everybody else. We're not like billionaires

43:52

or super like all ex-Facebook

43:55

or anything like that. We're

43:57

just a bunch of people who really,

44:00

really care about this who want to

44:02

see everyone have access to

44:04

language models, everyone be able to automate

44:06

their lives, everyone be able to push

44:09

their understanding of any topic to the

44:11

next level. And our

44:13

work as we become an organization that's

44:16

looking to be a company

44:18

and create revenue, etc. We

44:21

won't let it hamper or hinder

44:23

any of the open source work

44:25

we do. In fact, we want

44:27

it to empower all of that

44:29

work because we believe that the

44:31

tools and the developments and services

44:33

that we will be providing as

44:35

a corporation will only serve to

44:37

better feed the entire open source

44:39

community. We're not really looking to

44:42

suddenly make like a closed Hermes

44:44

or something like that. We're more

44:46

looking to create tools and do

44:48

research that makes your open Hermes

44:50

far more effective, far better and, you know,

44:53

good enough that you may want to pay

44:55

for that tool. It

44:58

sounds like something I would pay for. That's for sure.

45:02

Yeah, it's super inspiring. I

45:04

really appreciate you taking

45:06

time, Karan, to talk with us. I've

45:08

thoroughly enjoyed this because I am such

45:10

a fan of everything you all are

45:12

doing and the community that you've built.

45:14

So thank you for staying true to

45:16

that culture and what you're doing. And

45:18

I'm really looking forward to seeing what

45:21

happens in the future and where things

45:23

head. And I hope that we can

45:25

talk again and have Nous back on

45:27

the show. And in a year when,

45:29

of course, everything will be different in

45:31

the world, and I'm sure you'll still

45:33

be doing interesting things. So yeah, you're

45:35

always welcome back on the show. Thank

45:37

you so much. It's been a pleasure

45:39

to chat with you guys. Thanks for

45:41

being so candid. I'm glad

45:43

we were able to kind of push our message forth

45:45

more and thanks for the validation you and the community

45:47

have given us to keep doing this great work. All

45:50

right. Thanks. We'll talk soon. See ya. That

46:00

is Practical AI for this week, thanks for listening.

46:03

Subscribe now, if you haven't yet,

46:05

head to practicalai.fm for all the

46:07

ways. And don't forget to check

46:09

out our fresh changelog beats. The

46:12

dance party album is on Spotify, Apple Music,

46:14

and the rest. There's a link in the

46:16

show notes for ya. Thanks

46:18

once again to our partners at fly.io,

46:21

to our beat freakin' residents, Breakmaster Cylinder,

46:23

and to you for listening. That's all

46:25

for now, we'll talk to you again

46:27

next time.
