Supervise the Process of AI Research — with Jungwon Byun and Andreas Stuhlmüller of Elicit

Released Thursday, 11th April 2024

Episode Transcript

0:04

Hey everyone, welcome

0:06

to the Latent Space podcast. This

0:08

is Alessio, Partner and CTO in Residence

0:10

at Decibel Partners, and I'm joined by my

0:13

co-host, swyx, founder of Smol.ai. Hey,

0:15

and today we are back in the

0:17

studio with Andreas and Jungwon from Elicit.

0:19

Welcome. Thanks guys. It's

0:21

great to be here. Yeah. So

0:23

I'll introduce you separately, but also, you know,

0:26

we'd love to learn a little bit more

0:28

about you personally. So Andreas, it looks like

0:30

you started Elicit first, Jungwon joined later. That's

0:32

right. For all intents and purposes, the Elicit

0:34

and also the Ought that existed before then

0:36

were very different from what

0:38

I started. So I think it's like fair

0:41

to say that you co-founded it. Got it.

0:43

And Jungwon, you're a co-founder and COO of Elicit

0:45

now. Yeah, that's right. So there's a little bit

0:48

of a history to this. I'm not super aware

0:50

of like the sort of journey. I was

0:52

aware of Ought and Elicit as sort of

0:54

a nonprofit type situation. And recently you turned

0:56

into like a B Corp. Public benefit corporation.

0:58

So yeah, maybe if you want, you could

1:01

take us through that journey of finding the

1:03

problem. You know, obviously you're working

1:05

together now. So like, how do you get

1:07

together to decide to leave your startup career

1:09

to join him? Yeah, it's

1:11

truly a very long journey. I guess truly it kind

1:13

of started in Germany when I was born. So

1:17

even as a kid, I was always interested in AI.

1:19

Like I kind of went to the library. There

1:21

were books about how to write programs in QBasic.

1:23

Like some of them talked about how

1:26

to implement chatbots — ELIZA, I guess. To be

1:28

clear, he grew up in like a

1:30

tiny village on the outskirts of Munich called

1:32

Dinkelscherben, where it's like a

1:34

very, very idyllic German village. Yeah,

1:36

important to the story. So basically the main thing

1:38

is I've kind of always been thinking about AI

1:40

my entire life and been thinking about, well, at

1:42

some point this is going to be a huge

1:45

deal. It's going to be transformative. How can I

1:47

work on it? And I was

1:49

thinking about it from when I was a

1:51

teenager after high school, did a year where

1:53

I started a startup with the intention to

1:56

become rich. And then once I'm rich, I

1:58

can affect the trajectory of AI. I

2:00

did not become rich, decided to go back to

2:02

college and study cognitive science there,

2:04

which was the closest thing I could find

2:06

at the time to AI. In the last

2:08

year of college, moved to the US to

2:10

do a PhD at MIT, working

2:12

on probabilistic programming, kind of new programming languages

2:15

for AI, because it kind of seemed

2:17

like the existing languages were not great

2:19

at expressing world models and learning world

2:21

models, doing Bayesian inference. I was always

2:23

thinking about, well, ultimately the goal is to

2:25

actually build tools that help people reason

2:27

more clearly, ask and answer better questions and

2:29

make better decisions. But for a long

2:31

time it seemed like the technology to put

2:34

reasoning in machines just wasn't there. Eventually,

2:37

at the end of my postdoc at

2:39

Stanford, was thinking about, well, what to

2:41

do? I think the standard path is

2:43

you become an academic and do research.

2:45

But it's really hard to actually build

2:47

interesting tools as an academic. You can't

2:49

really hire great engineers. Everything

2:51

is kind of on a paper-to-paper timeline. And

2:54

so I was like, well, maybe I should start a

2:56

startup, pursue that for a little bit. But it seemed

2:58

like it was too early, because you could have tried

3:00

to do an AI startup, but probably would not have

3:02

been the kind of AI startup we're seeing now. So

3:06

then decided to just start a nonprofit research lab

3:08

that's going to do research for a while until we

3:10

better figure out how to do

3:12

thinking in machines. And that was

3:14

Ought. And then over time, it

3:16

became clear how to actually build actual tools

3:19

for reasoning. And only over

3:21

time, we developed a better way to... I'll

3:24

let you fill in some of the details here. Yeah. So

3:27

I guess my story maybe starts around 2015. I

3:29

kind of wanted to be a founder for a long time.

3:31

And I wanted to work on an idea that stood the

3:34

test of time for me, like an idea that stuck with

3:36

me for a long time. And starting

3:38

in 2015, actually, originally, I became interested in

3:40

AI-based tools from the perspective of mental health.

3:42

So there are a bunch of people around

3:44

me who were really struggling. One really close

3:47

friend in particular was really struggling with mental

3:49

health and didn't have any support. And it

3:51

didn't feel like there was anything before kind

3:53

of like getting hospitalized that could just help

3:55

her. So luckily, she came and stayed with

3:58

me for a while and we were just able to talk through

4:00

some things. But it seemed like, you know, lots

4:02

of people might not have that resource, and something

4:04

maybe AI-enabled could be much more

4:07

scalable. I didn't feel ready to start a company

4:09

then, in 2015, and I also didn't feel

4:11

like the technology was ready. So then I went

4:13

into fintech and, like, kind of learned how

4:16

to do the tech thing. And then in

4:18

2019 I felt like it was time for me

4:20

to just jump in and build something on

4:22

my own. I really wanted to create and at

4:25

the time I looked around at tech and felt

4:27

not super inspired by the options. I just

4:29

didn't want to have a tech career ladder, like,

4:31

I didn't want to climb the career ladder. There were two

4:33

kind of interesting technologies. At the time there was

4:36

AI and there was crypto, and I was like,

4:38

well, the AI people seemed like a little bit

4:40

more nice. Like

4:42

slightly more trustworthy. Both

4:44

super exciting, but I threw my bet

4:46

in on the AI side, and then

4:48

I got connected to Andreas, and actually

4:50

the way he was thinking about pursuing

4:52

the research agenda at Ought was really compatible

4:54

with what I had envisioned for an

4:56

ideal AI product: an AI that helps

4:58

kind of take really complex thinking,

5:00

overwhelming thoughts, and break them down into small

5:02

pieces. And then this admission that

5:04

we need AI to help us

5:07

figure out what we ought to do

5:09

was really inspiring. Yeah.

5:11

I think it was clear that

5:13

we were building the most powerful optimizers

5:15

of our time. But as a society we

5:17

hadn't figured out how to direct that

5:19

optimization potential. And if you

5:22

direct tremendous amounts of optimization potential at

5:24

the wrong thing, that's really disastrous. So

5:26

the goal of Ought was to make

5:28

Sure that if we build the most transformative

5:30

technology of our lifetime, it can be used

5:33

for something really impactful, like good reasoning —

5:35

not just generating ads or, you know,

5:37

ad tech and marketing — it sounds like we

5:39

ought to do more than generate ads.

5:41

And also, if the AI systems

5:43

get to be superintelligent enough

5:46

that they are doing really complex reasoning, that

5:48

we can trust them, that they are aligned

5:50

with us and we have ways of evaluating that

5:52

they're doing the right thing. So that's what we did.

5:54

We did a lot of experiments. You

5:56

know, like Andreas said, before these models

5:59

really took off, a lot of

6:01

the issues you were seeing were more in

6:03

reinforcement learning. But we saw a future where

6:05

AI would be able to do more

6:07

kind of logical reasoning. Not just kind of

6:09

extrapolate from numerical trends. We actually kind

6:12

of set up experiments with people, where kind of

6:14

people stood in as superintelligent systems, and

6:16

we effectively gave them context windows so they

6:18

would have, like, a bunch of text,

6:20

and one person would get less text than

6:23

another; one person would get all the text. And

6:25

the person with less text would have to

6:27

evaluate the work of the person who could

6:29

read much more. So, in a world where —

6:31

basically simulating, like, 2018 and

6:34

2019 — a world where an AI system could

6:36

read significantly more than you and you as

6:38

the person who couldn't read that much, how do you

6:40

evaluate the work of the AI? Yes, of course.
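A minimal sketch of that evaluation setup, with models standing in for the human participants — the prompts, model name, and file path are illustrative, not Elicit's actual experiment code:

```python
# One "worker" reads the full document; an "evaluator" only sees an excerpt
# and still has to judge the worker's answer.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # arbitrary choice for the sketch
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

full_text = open("paper.txt").read()   # the worker with the big context window
excerpt = full_text[:2000]             # the evaluator's much smaller window

question = "Does creatine improve cognition in healthy adults?"

answer = ask(f"Read the following paper and answer: {question}\n\n{full_text}")

verdict = ask(
    "You can only see an excerpt of the paper. Judge whether the answer below "
    f"is supported, and say what you would need to check.\n\nExcerpt:\n{excerpt}\n\n"
    f"Question: {question}\nAnswer: {answer}"
)
print(verdict)
```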

6:43

Yeah, so that was a lot of the work we

6:45

did, and from that we kind of iterated

6:47

on the idea of breaking complex tasks down

6:49

into smaller tasks — complex tasks like open-

6:51

ended reasoning, logical reasoning — into smaller tasks so

6:54

that it's easier to train AI systems on

6:56

them and also so that it's easier to

6:58

evaluate the work of the AI system when

7:00

it's done. And then we also kind of

7:02

early on pioneered this idea of the importance of supervising

7:05

the process of AI systems, not just the

7:07

outcomes. A big part of how Elicit is

7:09

built is that we are very intentional about not just

7:11

throwing a ton of data into a model

7:13

and training it, and it's like, cool, here's

7:16

scientific output — but that's not at

7:18

all what we do. Our approach is very

7:20

much like what are the steps that an

7:22

expert human does or what is like an

7:25

ideal process, as granularly as possible. Let's

7:27

break that down and then train AI systems to

7:29

perform each of those steps very robustly. When

7:31

you train like that from the start, after

7:34

the fact, it's much easier to evaluate, much

7:36

easier to troubleshoot at each point where something breaks down.
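As an illustration of this process-first approach, here is a minimal sketch of a decomposed pipeline where each step can be evaluated on its own; the step names, prompts, and model choice are hypothetical, not Elicit's internal design:

```python
# Explicit steps (screen -> extract -> summarize) instead of one end-to-end call,
# so each step's output can be checked and improved independently.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def screen(question: str, abstract: str) -> str:
    return ask(f"Question: {question}\nAbstract: {abstract}\nIs this paper relevant? Answer yes/no with one reason.")

def extract(abstract: str) -> str:
    return ask(f"From this abstract, list the population, intervention, and outcome:\n{abstract}")

def summarize(question: str, extraction: str) -> str:
    return ask(f"In one sentence, answer '{question}' using only these extracted fields:\n{extraction}")

def answer(question: str, abstracts: list[str]) -> list[dict]:
    results = []
    for abstract in abstracts:
        relevance = screen(question, abstract)       # step 1: checkable on its own
        if not relevance.lower().startswith("yes"):
            continue
        fields = extract(abstract)                   # step 2: checkable on its own
        results.append({
            "relevance": relevance,
            "fields": fields,
            "summary": summarize(question, fields),  # step 3: grounded in step 2
        })
    return results
```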

7:38

So yeah, we worked on

7:40

those experiments for a while and then at

7:42

the start of 2021 decided to

7:44

build a product. Do you mind if — before you

7:46

go into more modern

7:48

Elicit stuff — I just wanted to ask,

7:50

because I think a lot of people are

7:52

in this position. You were back in

7:54

2018, 2019, where you chose a partner

7:57

to work with, right? And you got to

7:59

know him. Yeah, yeah. You were just kind of

8:01

cold introduced. Yep. A lot of people are cold

8:03

introduced. Mm-hmm. I think cold introduced to tons

8:05

of people and I never work with them. I assume you

8:07

had a lot of other options, right? Like how do you

8:09

advise people to make those choices? We were not totally cold

8:11

introduced, so one of our closest friends introduced us.

8:14

And then Andreas had written a lot on the

8:16

Ought website, a lot of blog posts, a lot

8:18

of publications, and I just read it and I

8:20

was like, wow, this sounds like my writing.

8:22

Okay. And even other people, some of my closest

8:24

friends I asked for advice from, they were like, oh,

8:26

this sounds like your writing. But

8:29

I think I also had some kind of like

8:31

things I was looking for. I wanted someone with

8:33

a complementary skill set. I wanted someone who was very

8:35

values aligned. And yeah, that

8:37

was all a good fit. We also did

8:39

a pretty lengthy mutual evaluation process where

8:41

we had a Google doc where we had

8:43

all kinds of questions for each other. And

8:46

I think it ended up being like 50 pages

8:48

or so of like various like

8:50

questions and back and forth. Was it the

8:53

YC list? There's some lists going around for

8:55

co-founder questions. No, we just made our own

8:57

questions. I guess it's probably

8:59

related in that you ask yourself what are the values

9:01

you care about? How would you approach various positions and

9:03

things like that? I shared like all of my

9:05

past performance reviews. Yeah? Yeah.

9:08

And he never had any so. No. Okay.

9:11

Okay. Okay. All right. Yeah,

9:14

sorry. I just had to, a lot of people are

9:16

going through that phase and you kind of skipped over it. I was

9:18

like, no, no, no, there's like interesting story. Yeah. Before

9:21

we jump into what Elicit is

9:23

today, the history is a bit counterintuitive. So

9:25

you start with figuring out, oh, if we

9:27

had a super powerful model, how will we

9:29

align it? How will you use it? But

9:31

then you were actually like, well, let's just build the

9:33

product so that people can actually leverage it. And

9:36

I think there are a lot of folks today

9:38

that are now back to where you were maybe

9:40

five years ago. They're like, oh, what if this

9:42

happens, rather than focusing on actually building something

9:44

useful with it? What clicked for you

9:46

to, like, move into Elicit, and then we can

9:48

cover that story too. I think in many ways the approach

9:50

is still the same because the way we are building

9:52

Elicit is not: let's train

9:55

a foundation model to do more stuff. It's

9:57

like, let's build a scaffolding such that we

9:59

can apply powerful models as they come.

10:01

I think what is different now is that we

10:03

actually have, like, some of the models to plug in.

10:05

But if in 2017 we

10:07

had had the models, we would have run the

10:09

same experiments we did run with humans with

10:11

the models instead. And so in many

10:14

ways our philosophy has always been: think

10:16

out to the future, what models can

10:18

exist in one, two years or longer

10:20

and how can we make it so

10:22

that they can actually be deployed in

10:24

kind of transparent, controllable ways. On the

10:26

motivational side, we both are kind of

10:29

product people at heart. The research was

10:31

really important, and it didn't make

10:33

sense to build a product at that time. But

10:35

at the end of the day the thing that

10:38

always motivated us is imagining a world where high-

10:40

quality reasoning is really abundant. And AI

10:42

is the technology that's gonna get us there

10:44

and there's a way to guide that technology

10:46

with research, but we can have a more

10:48

direct effect through products. Because with research, it's

10:50

published research and someone else has to implement

10:52

that into a product; the product is

10:54

like a more direct path and we wanted

10:56

to concretely have an impact on people's lives.

10:58

Yeah. I think that's it, and personally

11:00

the motivation was we wanted to build.

11:03

Yep. And just to recap as well,

11:05

the models you were using back then were like —

11:08

I don't know, was it like BERT-type

11:10

stuff, or T5, or your own

11:12

models? What were you working with? I

11:14

guess, to be clear, at the very beginning

11:16

we had humans do the work, and then

11:18

I think the first models that kind of

11:21

made sense were GPT-2 and

11:23

like the early generative models. We

11:25

do also use, like, T5-based

11:27

models even now. You started with GPT-2? Yeah.

11:29

Cool. I'm kind of curious about,

11:31

how do you start so early? You know,

11:33

like, now it's obvious where to start,

11:35

but back then it wasn't. I made fun

11:37

of it a lot. I was like, why

11:39

are you talking to this? I dunno.

11:41

I think GPT-2, like, clearly can't do

11:43

anything and I was like Andreas, you're wasting

11:46

your time, this language model is a toy. It's like, yeah,

11:48

that's right. So

11:50

what's the history of what Elicit actually does as

11:52

a product? You recently announced that after four months

11:54

you hit a million in revenue, and you have

11:57

a lot of people using it and getting a lot

11:59

out of it. But it was initially

12:01

structured data extraction from papers. Then

12:03

you had concept grouping and to

12:05

date maybe a more full stack

12:08

research enabler, paper understander platform.

12:10

What's the definitive definition of what Elicit

12:12

is and how did you get here?

12:14

Yeah, we say Elicit is an AI research assistant.

12:17

I think it will continue to evolve. That's part of

12:19

why we're so excited about building in research, because

12:21

there's just so much space. I think the current

12:23

phase we're in right now, we talk about it

12:25

as really trying to make Elicit the

12:28

best place to understand what is known.

12:30

It's all a lot about literature summarization.

12:32

There's a ton of information that the

12:34

world already knows. It's really hard to

12:36

navigate, hard to make it relevant. A

12:38

lot of it is around document discovery

12:40

and processing and analysis. I really want

12:42

to import some of the incredible productivity

12:45

improvements we've seen in software engineering and

12:47

data science and into research. It's like,

12:49

how can we make researchers like data

12:51

scientists of text? That's why we're launching

12:53

this new set of features called Notebooks.

12:56

It's very much inspired by computational notebooks

12:58

like Jupyter Notebooks, Deep Note, or

13:00

Colab because they're so powerful and

13:02

so flexible. Ultimately, when people are

13:04

trying to get to

13:06

an answer or understand insight, they're

13:08

manipulating evidence and information. Today, that's

13:11

all packaged in PDFs, which are

13:13

super brittle. With language models,

13:15

we can decompose these PDFs into their

13:17

underlying claims and evidence and insights and

13:19

then let researchers mash them

13:21

up together, remix them, and analyze them together.
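A rough sketch of what decomposing a paper into claims and evidence could look like — the JSON schema, prompt, and model here are invented for illustration:

```python
# Ask a model to pull out claims plus verbatim supporting quotes so they can be
# pooled, grouped, or filtered across papers later.
import json
from openai import OpenAI

client = OpenAI()

def decompose(paper_text: str) -> list[dict]:
    prompt = (
        "Extract the main claims from the text below. Return a JSON list of objects "
        'with keys "claim" and "evidence", where "evidence" is a verbatim quote.\n\n'
        + paper_text
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # In practice you would validate or repair the JSON before trusting it.
    return json.loads(resp.choices[0].message.content)
```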

13:24

I would say quite simply, overall, Elicit

13:26

is an AI research assistant. Right

13:28

now, we're focused on text-based workflows,

13:30

but long-term, really want to go

13:32

further and further into reasoning and

13:34

decision making. When you say AI research

13:36

assistant, this is meta:

13:39

researchers use Elicit as a research assistant.

13:41

It's not a generic, general-purpose research or

13:43

anything type of tool, or it could

13:46

be, but what are people using it

13:48

for today? Yeah. Specifically, in

13:50

science, a lot of people use human research

13:52

assistants to do things. You tell your grad

13:55

student, hey, here are a couple of papers.

13:57

Can you look at all of these and see

14:00

which of these have sufficiently large populations and

14:02

actually study the disease that I'm interested in

14:04

and then write out what are the experiments

14:06

they did, what are the interventions they

14:08

did, what are the outcomes and organize that for

14:11

me. And the first phase of

14:13

understanding what is known really focuses on automating

14:15

that workflow. Because a lot of that work

14:17

is pretty rote work. I think it's not

14:19

the kind of thing that we need humans

14:21

to do, language models can do it. And

14:23

then if language models can do it, you

14:26

can obviously scale it up much more than

14:28

a grad student or undergrad research assistant would

14:30

be able to do. Yeah, the use cases

14:32

are pretty broad. So we do have a

14:34

very large percent of our users are just

14:36

using it personally or for a mix of

14:38

personal and professional things. People who care a

14:40

lot about health or biohacking or parents who

14:42

have a child with a kind of rare

14:44

disease and want to understand the literature directly.

14:46

So there is an individual kind of consumer

14:48

use case. We're most focused

14:50

on the power users, though that's where

14:53

we're really excited to build. So Elicit

14:55

was very much inspired by this workflow

14:57

in literature called systematic reviews or meta

14:59

analysis, which is basically the human

15:02

state of the art for summarizing scientific

15:04

literature. It typically involves like

15:06

five people working together for over a year.

15:08

And they kind of first start by trying

15:10

to find the maximally comprehensive set of papers

15:13

possible. So it's like 10,000 papers. And

15:16

they kind of systematically narrow that down to like

15:18

hundreds or 50, and extract

15:20

key details from every single paper. Usually

15:22

have two people doing it and like

15:24

a third person reviewing it. So it's

15:26

like an incredibly laborious, time consuming process,

15:28

but you see it in every single

15:31

domain. So in science, in machine learning,

15:33

in policy, because it's so structured and designed

15:35

to be reproducible, it's really amenable to automation.

15:37

So it's kind of the workflow that we

15:39

want to automate first. And then you make

15:41

that accessible for any question and make kind

15:44

of these really robust living summaries of science.

15:46

So yeah, that's one of the workflows that

15:48

we're starting with. Our previous guest, Mike Conover,

15:50

he's building a new company called BrightWave, which

15:53

is AI research assistant for financial research. How

15:55

do you see the future of these tools? Like

15:58

does everything converge to, like, a god researcher assistant,

16:00

or is every domain going to have its

16:02

own thing? I think that's a

16:04

good and mostly open question. I

16:07

do think there are some differences

16:09

across domains. For example, some research

16:11

is more quantitative data analysis and

16:14

other research is more high-level cross-domain

16:16

thinking. And we definitely

16:18

want to contribute to the broad general

16:20

reasoning type space. If researchers are making

16:22

discoveries, often it's like, hey, this thing

16:24

in biology is actually analogous to

16:27

these equations in economics or something. And

16:29

that's just fundamentally a thing where you

16:31

need to reason across domains. At least

16:33

within research, I think there will be

16:35

one best platform more or less for

16:37

this type of generalist research. I think

16:40

there may still be some particular tools

16:42

for genomics, particular types of modules of

16:44

genes and proteins and whatnot. But

16:47

for a lot of the high-level reasoning that humans do,

16:49

I think that is more of a winner-take-all type

16:51

thing. I wanted to ask

16:53

a little bit deeper about the workflow that

16:55

you mentioned. I like that phrase. I see

16:57

that in your UI now, but that's as

17:00

it is today. And I think you were about to

17:02

tell us about how it was in 2021 and how

17:04

it maybe progressed. How has this workflow evolved over time?

17:07

Yeah, so the very first version of Elicit actually

17:09

wasn't even a research assistant. It was a forecasting

17:11

assistant. So we set out and we were thinking

17:13

about what are some of the most impactful types

17:15

of reasoning that if we could scale up, AI

17:17

would really transform the world. We actually

17:19

started with literature review, but we're like, oh,

17:22

so many people are going to build literature

17:24

review tools. So let's not start there. So then

17:26

we focus on geopolitical forecasting. So I don't

17:28

know if you're familiar with like Manifold or... Manifold

17:30

Market. Yeah, that kind of stuff before

17:32

Manifold. Yeah, yeah. We're not predicting relationships.

17:34

We're predicting like, is China going to

17:36

invade Taiwan? Markets for everything.

17:39

Yeah. That's been a relationship.

17:41

Yeah, fair. Yeah, it's true. And

17:43

then we worked on that for a while.

17:45

And then after GPT-3 came out, I think

17:47

by that time we realized that originally we

17:49

were trying to help people convert their beliefs

17:51

into probability distributions. And so take fuzzy beliefs,

17:53

but like model them more concretely. And then

17:55

after a few months of iterating on that,

17:57

just realize, oh, the thing that's... blocking

18:00

people from making interesting predictions about important

18:02

events in the world is less kind

18:04

of on the probabilistic side and much

18:06

more on the research side. And

18:09

so that kind of combined with the very

18:11

generalist capabilities of GPT-3 prompted us

18:13

to make a more general research assistant.

18:15

Then we spent a few months iterating

18:17

on what even is a research assistant.

18:19

So we would embed with different researchers.

18:22

We built data labeling workflows in the

18:24

beginning kind of right off the bat.

18:26

We built ways to find experts in

18:28

a field and ways to ask good

18:30

research questions. So we just kind of iterated

18:32

through a lot of workflows. No one else

18:34

was really building at this time, and it

18:36

was very quick to just do some prompt

18:38

engineering and see what is a task that

18:41

is at the intersection of what's technologically capable

18:43

and important for researchers. And we

18:45

had a very nondescript landing page. It said

18:47

nothing. But somehow people were signing up. And

18:50

we had the sign-in form that was like, why are you

18:52

here? And everyone was like, I need help with literature review.

18:54

And we're like, literature review, that sounds so hard. I don't

18:56

even know what that means. We're like, we don't want to

18:58

work on it. But then eventually we're like, OK, everyone is

19:00

saying literature review. It's overwhelmingly what people want. And in all

19:02

domains? Not just, like, medicine or physics?

19:04

All domains. Yeah. And we

19:06

also kind of personally knew literature review was hard. And if you

19:08

look at the graph for academic literature being published, every

19:11

single one that you guys know this in machine learning,

19:13

it's like, I've been to the right

19:15

superhuman amounts of papers. So we're like, all right, let's

19:17

just try it. I was really nervous. But Andreas was

19:19

like, this is kind of like the right problem space

19:21

to jump into even if we don't know what we're

19:23

doing. So my take was

19:25

like, fine, this feels really scary. But let's

19:28

just launch a feature every single week and double

19:30

our user numbers every month. And if we can

19:32

do that, we'll fail fast and we will find

19:34

something. I was worried about like getting lost in

19:37

the kind of academic white space. So

19:39

the very first version was actually a weekend prototype that

19:41

Andreas made. Do you want to explain how that worked?

19:44

I mostly remember that it was really bad.

19:46

So the thing I remember is you

19:48

entered a question and it would give

19:50

you back a list of claims. So

19:53

Your question could be, I don't know, how does

19:55

creatine affect cognition and would give you back some

19:57

claims that are to some extent based on papers.

20:00

But they were often irrelevant — the papers were

20:02

often irrelevant — and so we ended up

20:04

printing out a bunch of examples of

20:06

results and putting them up on the wall

20:08

so that we would feel the constant

20:10

shame of having such a bad product and

20:12

would be incentivized to make it better.

20:14

And I think over time it has gotten a lot

20:16

better. But I think that first version was

20:18

like really very bad. It would also give,

20:20

like, a natural-language summary of an abstract,

20:22

kind of a one-sentence summary, which we

20:24

still have. And then as we learned more about

20:27

this systematic review workflow, we slowly started expanding the capability

20:29

so that you could extract a lot more data

20:31

from the papers and do more with that. And

20:33

were you using, like, embeddings and co-

20:36

sine similarity, the closest documents, for retrieval,

20:38

or was it keyword-based? I

20:41

think the very first version didn't even have

20:43

its own search engine. I think the very

20:45

first version probably used the Semantic Scholar

20:47

API or something similar, and

20:49

only later we discovered that that API is not

20:52

very semantic, sadly. But

20:54

our own search infrastructure since then has helped

20:56

a lot.
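For reference, a bare-bones version of the embeddings-plus-cosine-similarity retrieval being asked about here (not Elicit's search stack); the embedding model is an arbitrary choice:

```python
# Embed the query and the abstracts, then rank abstracts by cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k(query: str, abstracts: list[str], k: int = 5) -> list[str]:
    vectors = embed([query] + abstracts)
    q, docs = vectors[0], vectors[1:]
    # cosine similarity = dot product of L2-normalized vectors
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ q
    order = np.argsort(-scores)[:k]
    return [abstracts[i] for i in order]
```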

20:58

And we're going to go into more recent product stuff, but, you

21:01

know, I see you as the more

21:03

startup-oriented business

21:05

person, and Andreas as more ideologically interested

21:07

in research. So I'm curious,

21:09

what was the market sizing you were thinking of?

21:11

Like, as you're here saying that

21:13

we have to double every month and I'm

21:15

like, I don't know how you made those

21:17

conclusions, especially since you were

21:19

also a nonprofit at the time.

21:22

I mean, market size wise, I felt like

21:24

in this space where so much

21:26

was changing and it was very unclear

21:28

what of today was actually gonna be

21:30

true tomorrow. We just like really rested

21:33

a lot on very very simple fundamental

21:35

principles. It's like: if you

21:37

can understand the truth, that is very economically

21:39

valuable. If you, like, know

21:41

the truth — yeah, that's enough. Yeah,

21:44

researchers are the key to many breakthroughs that are

21:46

very commercially valuable. Because my version of

21:48

it is: students are poor and they don't

21:50

pay for anything, right? But that's obviously

21:52

not how it turned out. Not to cast doubts, but you had

21:54

some more market insight than me to have believed

21:56

it. But we can skip that. Yeah, we did encounter, I

21:59

guess, talking to VCs for our seed

22:01

round. A lot of VCs were like,

22:03

you know, researchers, they don't have any

22:05

money. Why don't you build legal assistance?

22:08

I think in some short-sighted way, maybe that's true, but

22:10

I think in the long run, R&D is

22:13

such a big space of the economy.

22:15

I think if you can substantially improve

22:17

how quickly people find new

22:19

discoveries or avoid controlled trials that don't go

22:21

anywhere, I think that's just a huge amount

22:23

of money. And there are a lot of

22:25

questions, obviously, about between here and there, but

22:27

I think as long as the fundamental principle

22:30

is there, we were okay with that, and

22:32

I guess we found some investors who also

22:34

were. Yeah, congrats. I mean, I'm sure we

22:36

can cover the sort of flip later. I

22:38

think you were about to start us on

22:40

like GPT-3 and how like that changed things

22:42

for you. It's funny, like I guess every

22:44

major GPT version, you have like some big

22:46

insight. Yeah, yeah.

22:49

I mean, what do you think? I

22:52

think it's a little bit less true

22:54

for us than for others because we

22:56

always believe that there will basically be

22:58

human-level machine work. And so

23:00

it is definitely true that in

23:03

practice for your product, as new models come out,

23:05

your product starts working better, you can add some

23:07

features that you couldn't add before. But

23:10

I don't think we really ever had

23:12

the moment where we're like, oh,

23:15

wow, that is super unanticipated. We

23:17

need to do something entirely different now

23:19

from what was on the roadmap. I

23:21

think GPT-3 was a big change because

23:23

it kind of said, oh, now is

23:25

the time that we can use AI

23:27

to build these tools. And then GPT-4

23:29

was maybe a little bit more of

23:31

an extension of GPT-3. GPT-3 over GPT-2

23:33

was like qualitative level shift. And then

23:35

GPT-4 was like, okay, great. Now it's

23:37

like more accurate, we're more accurate on

23:40

these things we can answer harder questions, but the shape of

23:42

the product had already taken shape by that time. I

23:44

kind of want to ask you about this sort of pivot that you

23:46

made, but I guess that was just a way

23:48

to sell what you were doing, which is

23:50

you're adding extra features on grouping by concepts.

23:52

The GPT-4 pivot, quote unquote pivot that you

23:55

made. Oh, yeah, yeah, exactly. Right, right,

23:57

right. Yeah, when we launched this workflow,

23:59

GPT-4 was available basically when

24:01

Elicit was at a place where we were very happy with it.

24:03

So given a table of papers,

24:06

you can extract data across all the papers,

24:08

but you kind of wanna take the analysis

24:10

a step further. Sometimes what you'd care about

24:12

is not having a list of papers but

24:15

a list of arguments, a list of effects,

24:17

a list of interventions, a list of techniques. And

24:19

so that's one of the things we're working

24:21

on: now that you've extracted this information

24:24

in a more structured way, can you pivot

24:26

or group by whatever information you

24:28

extracted to get more insight — insight that is

24:30

still supported by the academic literature. That was

24:32

the big revolutionary thought with GPT-3. I think

24:34

I'm just very impressed by how

24:36

focused your ideas have been around the

24:39

workflows, and I think that's why

24:41

you're not as reliant on, like, the

24:43

LLMs improving, because it's just

24:45

about improving the workflow that you

24:47

have recommended to people. Today we might call

24:49

it agents, I don't know, but

24:51

you're not relying on the LLM to drive

24:53

it; it relies on: this is the

24:55

way that Elicit does research and what

24:57

we think is most effective, based on

24:59

talking to users. The problem space is

25:01

still huge. Like, if it's like this big,

25:03

we are all still operating at this tiny

25:05

little bit of it. So I

25:07

think about that a lot in the context

25:09

of people who are like, oh, what

25:12

happens when GPT-5 comes out? When

25:14

GPT-5 comes out, there's still, like, all

25:16

of this other space that you can go

25:18

into. And so being really obsessed with a

25:20

problem which is very very big has helped

25:22

us like stay robust and just kind of

25:24

directly incorporate the improvements as they keep

25:26

coming. You guys —

25:28

I think you told us about a project, basically —

25:30

how much did costs become a

25:32

concern as you're working more and more

25:34

with OpenAI? How did you manage that

25:36

relationship? Let me tell them about Charlie. Agency

25:39

and integrity, entirely. He is a

25:41

special character, Charlie. When we found him, he

25:43

had just finished his freshman year at

25:45

the University of Warwick. He had heard about

25:47

us on some discord and then he applied

25:49

and we were like, wow, who is this

25:51

freshman? And we just saw that he

25:53

had done so many incredible side projects and

25:55

we were actually on a team retreat in

25:57

Barcelona, visiting our head of engineering at the

25:59

time, and everyone was raving about this wunderkind, like, this

26:01

kid. And then on our take-home project, he had done

26:03

like the best of anyone to that point. And so

26:06

we were just like so excited to hire him. So

26:08

we hired him as an intern and then we're like,

26:10

Charlie, what if you just dropped out of school? And

26:13

so then we convinced him to take a year

26:15

off. And he's just incredibly productive. And I think

26:17

the thing you're referring to is at the start

26:20

of 2023, Anthropic kind of launched their constitutional AI

26:22

paper. And within a few days,

26:24

I think four days, he had basically implemented

26:26

that in production. And then we had it

26:28

in-app a week or so after that. And

26:30

he has since kind of contributed to major improvements

26:32

like cutting costs down to a tenth of what

26:35

they were really large scale. But yeah, you can

26:37

talk about the technical stuff. Yeah, on

26:39

the constitutional AI project, this was for abstract

26:41

summarization, where in illicit, if you run a

26:44

query, it'll return papers to you. And then

26:46

it will summarize each paper with respect to

26:48

your query for you on the fly. And

26:50

that's a really important part of illicit because

26:53

it does it so much. Like if you

26:55

run a few searches, it'll have done it

26:57

a few hundred times for you. And so

26:59

we cared a lot about this, both being

27:02

like fast, cheap, and also very low on

27:04

hallucination. I think if illicit hallucinates something about

27:06

the abstract, that's really not good. And so

27:08

what Charlie did in that project

27:10

was create a constitution that expressed what

27:12

are the attributes of a good summary: everything

27:15

in the summary is reflected in the

27:17

actual abstract, and it's like

27:19

very concise, etc, etc. And then used

27:23

RLHF with a model that

27:25

was trained on the constitution

27:27

to basically fine tune a better

27:29

summarizer on an open-source model.
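A rough sketch of the constitutional-AI flavor of that pipeline — draft, critique against a small "constitution", revise — where the revised outputs could then become fine-tuning data for a smaller summarizer; the constitution text, prompts, and models are invented for illustration, not Elicit's actual setup:

```python
# Draft a query-focused abstract summary, critique it against a constitution,
# and revise. (draft, revision) pairs can later be used for fine-tuning.
from openai import OpenAI

client = OpenAI()

CONSTITUTION = [
    "Every statement in the summary must be supported by the abstract.",
    "The summary must directly address the user's query.",
    "The summary must be a single concise sentence.",
]

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def constitutional_summary(query: str, abstract: str) -> str:
    draft = ask(f"Summarize this abstract with respect to the query '{query}':\n{abstract}")
    critique = ask(
        "Critique the summary against these principles:\n- "
        + "\n- ".join(CONSTITUTION)
        + f"\n\nAbstract:\n{abstract}\n\nSummary:\n{draft}"
    )
    return ask(
        f"Rewrite the summary to fix the critique.\n\nSummary:\n{draft}\n\n"
        f"Critique:\n{critique}\n\nAbstract:\n{abstract}"
    )
```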

27:32

Yeah, I think that might still be in use. Yeah,

27:34

yeah, definitely. Yeah, I think at the time,

27:36

the models hadn't been trained at all to

27:38

be faithful to a text. So they were

27:41

just generating so then when you ask them a

27:43

question, they tried too hard to answer the question

27:45

and didn't try hard enough to answer the question

27:47

given the text, or answer what the text said

27:49

about the question. So we had to basically teach

27:52

the models to do that specific task. How

27:54

do you monitor the ongoing performance

27:56

of your models? Not to

27:59

get too LLM-opsy, but you are one of

28:01

the larger, more well-known operations doing NLP at

28:03

scale. I guess effectively, you have to monitor

28:05

these things, and nobody has a good answer

28:07

that I talk to. Yeah, I don't think

28:09

we have a good answer yet. I

28:13

think the answers are actually a little

28:15

bit clearer on the just basic robustness

28:17

side of where you can import ideas

28:19

from software engineering and

28:21

normal DevOps. You're like, well, you need

28:23

to monitor latencies and response times

28:26

and uptime and whatnot. I think the harder question

28:28

is performance, and

28:30

then things like hallucination rate, where I

28:33

think there the really important thing

28:35

is training time. So we care

28:37

a lot about having our own

28:39

internal benchmarks for model development

28:42

that reflect the distribution of user

28:44

queries so that we can know

28:46

ahead of time how well

28:48

is the model gonna perform on different

28:50

types of tasks. So the tasks being

28:52

summarization, question answering, given a paper, ranking,

28:54

and for each of those, we wanna

28:56

know what the distribution of things the

28:58

model is gonna see so that we

29:00

can have well calibrated predictions on

29:03

how well the model is gonna do in

29:05

production. And I think, yeah, there's some chance

29:07

that there's distribution shift and actually the things

29:10

users enter are gonna be different, but I

29:12

think that's much less important than getting the

29:14

kind of training right and having very high

29:16

quality, well-vetted data sets at training time. I

29:19

think we also end up effectively monitoring by trying to

29:21

evaluate new models as they come out. And so that

29:23

kind of prompts us to go through our eval suite

29:25

every couple of months. And so every time a new

29:27

model comes out, we have to see how is this performing

29:30

relative to production and what we currently have.
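A minimal sketch of that kind of internal benchmark: hand-vetted cases per task type, weighted by how often each task shows up in real usage, so a new model can be scored ahead of time. The file layout, weights, and scoring rule are placeholders, not Elicit's actual eval suite:

```python
import json

TASK_WEIGHTS = {"summarization": 0.5, "qa_given_paper": 0.3, "ranking": 0.2}

def load_cases(task: str) -> list[dict]:
    # each case: {"input": ..., "expected": ...}, curated by hand
    with open(f"evals/{task}.jsonl") as f:
        return [json.loads(line) for line in f]

def score(task: str, case: dict, model) -> float:
    output = model(task, case["input"])
    return 1.0 if output.strip() == case["expected"].strip() else 0.0  # naive exact match

def benchmark(model) -> float:
    total = 0.0
    for task, weight in TASK_WEIGHTS.items():
        cases = load_cases(task)
        task_score = sum(score(task, c, model) for c in cases) / len(cases)
        total += weight * task_score
    return total  # weighted estimate of how the model should do in production
```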

29:32

Yeah, I mean, since we're on this topic,

29:34

any new models have really caught your eye

29:36

this year? Like Claude came out of the

29:38

mud. Yeah, I think Claude is pretty, I

29:41

think the team's pretty excited about Claude. Yeah,

29:43

specifically, Claude Haiku is a good point on

29:45

the kind of Pareto frontier. It's

29:47

neither the cheapest model, nor is it

29:49

the most accurate,

29:51

most high quality model, but it's just

29:53

a really good trade off between cost

29:56

and accuracy. You apparently have to 10

29:58

shot it to make it good. I tried

30:00

using Haiku for summarization, but zero-shot

30:02

was not great. Then they were

30:04

like, it's a skill issue, you have to

30:06

try harder. Interesting. I think GPT-4

30:08

unlocked tables for us, processing data

30:10

from tables, which was huge. GPT

30:13

4 Vision. Yeah, did you try

30:15

Fuyu? I guess you can't use Fuyu for

30:17

that, because it's non-commercial. That's the Adept model. Yeah, we

30:19

haven't tried that one. Yeah, but

30:21

Claude is multimodal as well. I think

30:23

the interesting insight that we got from talking to David

30:25

Luan, who was CEO of Adept, was that multimodality

30:28

has effectively two different flavors. One

30:30

is we recognize images from a

30:32

camera in the outside natural world.

30:35

And actually, the more important multimodality

30:38

for knowledge work is screenshots, and

30:40

PDFs and charts and

30:42

graphs. So we need a new term for

30:44

that kind of multimodality. But is the claim

30:46

that current models are good at one or

30:48

the other? Yeah, they're over-indexed, because of the

30:51

history of computer vision is COCO. So

30:53

now we're like, oh, actually, screens

30:56

are more important. OCR and

30:58

writing. You mentioned a lot of closed model

31:00

lab stuff, and then you also have this

31:02

open source model fine-tuning stuff. What is your

31:04

workload now between closed and open? It's

31:07

a good question, I think. Half and half?

31:09

It's a... Is that even a relevant question,

31:11

or is that a nonsensical question? It depends a

31:14

little bit on how you index, whether you

31:16

index by computer cost or number of queries.

31:18

I'd say in terms of number of queries,

31:21

it's maybe similar. In terms of cost and

31:23

compute, I think the closed models make up

31:25

more of the budget since the main cases

31:27

where you wanna use closed models are cases

31:30

where they're just smarter, where there

31:32

are no existing open source models that are quite

31:34

smart enough. Yeah. We

31:37

have a lot of interesting open questions

31:39

to go in, but just to wrap

31:41

the UX evolution, now you have the

31:44

notebooks. We talked a lot about how

31:46

chatbots are not the final frontier. How

31:49

did you decide to get into notebooks,

31:51

which is a very iterative, kind of

31:53

like interactive interface and maybe learnings from

31:55

that? Yeah, this is actually our fourth

31:57

time trying to make this work. I

32:00

think the first one was probably in early 2021. I

32:04

think because we've always been obsessed with this

32:06

idea of task decomposition and like branching, we

32:08

always wanted a tool that could be kind

32:10

of unbounded where you could keep going, could

32:12

do a lot of branching where you could

32:14

kind of apply language model operations

32:16

or computations on other tasks. So in

32:19

2021, we had this thing called composite

32:21

tasks where you could use GPT-3 to

32:23

brainstorm a bunch of research questions and

32:26

then take each research question and decompose

32:28

those further into sub questions. And

32:30

this kind of, again, that like task decomposition

32:32

tree type thing was always very exciting to

32:34

us. But that was like, it didn't work

32:37

and it was kind of overwhelming. Then at

32:39

the end of 2022, I think we tried again and

32:41

at that point we were thinking, okay, we've done a

32:43

lot with this literature review thing. We

32:45

also want to start helping with kind of adjacent

32:47

domains and different workflows. Like we want to help

32:49

more with machine learning. What does

32:51

that look like? And as we were thinking

32:53

about it, we're like, well, there are so

32:55

many research workflows. How do we not just build

32:57

three new workflows into Elicit but make Elicit

33:00

really generic to lots of workflows? What

33:02

is like a generic composable system with

33:04

nice abstractions that can like scale to

33:06

all these workflows? So we like iterated

33:08

on that a bunch and then didn't

33:10

quite narrow the problem space enough or

33:12

like get to what we wanted. And

33:14

then I think it was at the beginning

33:17

of 2023 where we're like, wow, computational notebooks

33:19

kind of enable this where they have a

33:21

lot of flexibility, but kind of

33:23

robust primitives such that you can extend the workflow and

33:25

it's not limited. It's not like you ask a

33:27

query, you get an answer, you're done. You can just

33:29

constantly keep building on top of that. And each

33:32

little step seems like a really good unit of

33:34

work for the language model. And also it was

33:36

just like really helpful to have a bit

33:38

more pre-existing work to emulate. Yeah, that's kind

33:40

of how we ended up at Computational Notebooks

33:43

for Elicit. Maybe one thing that's worth making

33:45

explicit is the difference between Computational Notebooks and

33:47

chat because on the surface they seem pretty

33:49

similar. It's kind of this iterative interaction where

33:52

you add stuff. In both cases you

33:54

have a back and forth between you enter stuff and then you

33:56

get some output and then you enter stuff. The

33:59

difference in my mind is, with notebooks, you can

34:01

define a process. So in data science,

34:03

you can be like, here's my data

34:05

analysis process that takes in a CSV

34:07

and then does some extraction and then

34:09

generates a figure at the end. And

34:12

you can prototype it using a small CSV, and

34:14

then you can run it over a much larger

34:16

CSV later. And similarly, the vision

34:19

for notebooks, in our case, is to not

34:21

make it this one-off chat interaction, but to

34:23

allow you to then say, if you start

34:26

and first you're like, OK, let me just

34:28

analyze a few papers and see, do I

34:30

get to the correct conclusions for those few

34:32

papers? Can I then later go back and

34:35

say, now let me run this over 10,000

34:37

papers now that

34:39

I've debugged the process using a few papers? And

34:42

that's an interaction that doesn't fit quite as

34:44

well into the chat framework, because that's more

34:46

for kind of quick back and forth interaction.
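A small sketch of that prototype-then-scale notebook pattern, with trivial stand-in step implementations (not Elicit's Notebooks feature itself):

```python
# Define the process once as small steps, eyeball it on a few items,
# then rerun the identical process over the full set.
def screen(question: str, paper: str) -> bool:
    return question.split()[0].lower() in paper.lower()    # stand-in relevance check

def extract(paper: str) -> dict:
    return {"first_sentence": paper.split(".")[0]}          # stand-in extraction

def process(question: str, paper: str) -> dict:
    relevant = screen(question, paper)
    return {"relevant": relevant, "fields": extract(paper) if relevant else None}

papers = ["Creatine improves memory in adults.", "Soil acidity and crop yield."]
question = "creatine and cognition"

# Prototype "cell": check the outputs on a handful of papers and debug the steps.
preview = [process(question, p) for p in papers[:5]]

# Scale-up "cell": once the steps look right, run the same process over 10,000 papers.
results = [process(question, p) for p in papers]
```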

34:49

Do you think in notebooks it's kind of

34:51

like structured, editable chain of thought,

34:53

basically, step by step? Is that kind of

34:55

where you see this going? And then are

34:57

people going to reuse notebooks as like templates?

35:00

And maybe in traditional notebooks, it's like cookbooks,

35:02

right? You share a cookbook. You can start

35:04

from there. It's similar in Elicit. Yeah,

35:07

that's exactly right. So that's our hope that

35:09

people will build templates, share them with other

35:11

people. I think chain of thought is

35:13

maybe still like kind of one level lower

35:15

on the abstraction hierarchy than we would

35:17

think of notebooks. I think we'll probably

35:19

want to think about more semantic pieces,

35:21

like a building block is more like

35:23

a paper search or an extraction or

35:26

a list of concepts. And

35:28

then the model's detailed reasoning will

35:30

probably often be one level down. You always

35:32

want to be able to see it, but

35:34

you don't always want it to be front

35:36

and center. Yeah. What's the difference between a

35:38

notebook and an agent? Since everybody always asks

35:40

me, what's an agent? Like, how do you

35:42

think about where the line is? Yeah,

35:45

it's an interesting question. In the notebook

35:47

world, I would generally think

35:49

of the human as the agent in the

35:51

first iteration. So you have the notebook, and

35:53

the human kind of adds little action steps.

35:56

And then the next point on this kind

35:58

of progress gradient is, OK. okay, now you

36:00

can use language models to predict which action would you take

36:02

as a human. And at some point you're probably gonna be

36:04

very good at this. You'll be like, okay, in some cases

36:06

I can with 99.9% accuracy

36:08

predict what you do. And then you might

36:10

as well just execute it, like why wait for the human? And

36:13

eventually as you get better, that will just

36:15

look more and more like agents taking actions

36:18

as opposed to you doing the thing. I

36:20

think templates are a specific case of this

36:22

where you're like, okay, well, there's just particular

36:24

sequences of actions that you often wanna chunk

36:26

and have available as primitives, just like in

36:29

normal programming. And you can

36:31

view them as action sequences of agents

36:33

or you can view them as more

36:35

normal programming language abstraction thing. And I

36:37

think those are two valid views. Yeah.

36:40

How do you see this change as, like

36:42

you said, the models get better and you

36:44

need less and less human actual interfacing with

36:47

the model, you just get the results. Like

36:49

how does the UX and the way people

36:51

perceive it change? Yeah, I think this

36:53

kind of interaction paradigm for evaluation is not

36:55

really something the internet has encountered yet because

36:57

up to now the internet has all been

36:59

about getting data and work from people. But

37:01

so increasingly, I really want kind of evaluation

37:04

both from an interface perspective and from like

37:06

a technical perspective or operation perspective to be

37:08

a super power for illicit because I think

37:10

over time models will do more and more

37:12

of the work and people will have

37:14

to do more and more of the evaluation. So

37:16

I think, yeah, in terms of the interface, some

37:18

of the things we have today, for every kind

37:20

of language model generation, there's some citation back and

37:22

we kind of try to highlight the ground

37:24

truth in the paper that is most relevant

37:26

to whatever Elicit said,

37:29

and make it super easy so that you can click on it and quickly see

37:31

in context and validate whether

37:33

the text actually supports the answer that Elicit gave.

37:36

So I think we'd probably want to scale things up like that, like

37:39

the ability to kind of spot check the models work

37:41

super quickly, scale up interfaces like that. And-

37:45

Who would spot check it — the user? Yeah, to start, it would be the user.

37:48

One of the other things we do is also kind of flag the

37:51

model's uncertainty. So we have models report out, how

37:53

confident are you that this was the sample size?

37:55

If the model's not sure, we throw a flag. And so

37:58

the user knows to prioritize checking that. So

38:00

again, we can kind of scale that up. So when the

38:02

model's like, well, I searched this on Google, not sure if

38:05

that was the right thing, I have an uncertainty flag, and

38:07

the user can go and be like, okay, that was actually

38:09

the right thing to do or not. I've tried

38:11

to do uncertainty readings from models.

38:13

I don't know if you have this live,

38:15

but you do. Cause I just didn't find

38:18

them reliable because they just hallucinated their own

38:20

uncertainty. I would love to base it on

38:22

logprobs or something more native within the model

38:24

rather than generated. But okay,

38:27

it sounds like they're calibrated for you.

38:30

We found it to be pretty calibrated. It varies on the

38:32

model. I think in some cases, we also

38:34

use the different models for the uncertainty estimates than

38:36

for the question answering. So one model would say,

38:38

here's my chain of thought, here's my answer, and

38:41

then a different type of model. Let's say the

38:43

first model is LAMA, and

38:45

let's say the second model is GPT-3.5. And

38:48

then the second model just looks over the

38:50

results and like, okay, how confident are you

38:52

in this? And I think sometimes using a

38:54

different model can be better than using the

38:56

same model.
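A small sketch of that two-model setup — one model answers with its reasoning, a different model only grades confidence, and low confidence raises a flag for the user. The models, prompts, and threshold are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def answer_with_flag(question: str, passage: str) -> dict:
    answer = ask("gpt-4o", f"Using only this passage, answer '{question}'. Show your reasoning.\n\n{passage}")
    grade = ask(
        "gpt-4o-mini",
        "On a scale of 0 to 1, how confident are you that this answer is supported "
        f"by the passage? Reply with just a number.\n\nPassage:\n{passage}\n\nAnswer:\n{answer}",
    )
    try:
        confidence = float(grade.strip())
    except ValueError:
        confidence = 0.0
    # The flag tells the user which answers to prioritize checking.
    return {"answer": answer, "confidence": confidence, "flag": confidence < 0.7}
```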

38:58

On top of your models evaluating models, obviously you

39:01

can do that all day long. What's your

39:03

budget? Because your queries fan out a lot,

39:05

and then you have models, evaluate models. One

39:08

person typing in a question can lead to

39:10

a thousand calls. It depends on the

39:12

project. So if the project

39:14

is basically a systematic review that otherwise

39:16

human research assistants would do, then the

39:18

project is basically a human equivalent spend.

39:20

And this spend can get quite large

39:22

for those projects. I don't know, let's

39:24

say $100,000. So

39:27

in those cases, you're happier to spend compute

39:29

than in the kind of shallow search case

39:31

where someone just enters a question because, I

39:34

don't know, maybe you like it. I heard

39:36

about Creatine, what's it about? Probably

39:38

don't want to spend a lot of compute on

39:40

that. This sort of being able to invest more

39:42

or less compute into getting more or less accurate

39:45

answers is I think one of the core things

39:47

we care about, and that I think

39:49

is currently undervalued in the AI space. I

39:51

think currently you can choose which model you

39:53

want, and you can sometimes, I don't know,

39:55

you'll tip it and it'll try harder, or

39:57

you can try various things to get it

39:59

to work. harder but you don't have great

40:01

ways of converting willingness to spend into better

40:03

answers and we really want to build a

40:05

product that has this sort of unbounded

40:07

flavor where like if you care about it

40:10

a lot you should be able to get

40:12

really high quality answers really double checked in

40:14

every way. And you have credit-based pricing,

40:16

so unlike most products it's not a fixed

40:18

monthly fee. Exactly. So like some of

40:21

the higher costs are tiered so for

40:23

most casual users they'll just get the

40:25

abstract summary, which is from kind of an

40:27

open-source model. Then you

40:29

can add more columns which have more extractions and

40:32

these uncertainty features and then you can also add

40:34

the same columns in high-accuracy mode, which also parses

40:36

the tables, so we kind of stack the complexity

40:38

on the cost. You know, the fun thing you

40:40

can do with a credit system which is data for

40:42

data basically you can give people more credits if they

40:45

give data back to you. Yeah. I don't

40:47

know if you've already done that. We've thought about something like this

40:49

it's like if you don't have money but

40:51

you have time yes how do you exchange

40:53

that? Yeah. I think it's interesting

40:55

we haven't quite operationalized it and then you know there's been

40:57

some kind of like adverse selection like you know for example

40:59

it would be really valuable to get feedback on our model

41:01

so maybe if you were willing to give more robust feedback

41:04

on our results we could give you credits or something like

41:06

that but then there's kind of this will

41:08

people take it seriously. You want the good people. Exactly.

41:10

Can you tell who are the good people? Not

41:13

right now but yeah maybe at the point where we can

41:15

we can offer it. The complexity

41:17

of questions asked you know if it's

41:19

higher complexity these are the people. Yeah.

41:21

If you make a lot of typos

41:23

in your queries you're not gonna get

41:25

off. Negative

41:28

social credit. It's very topical right

41:30

now to think about the threat of long

41:32

context windows. All these models

41:34

that we're talking about these days all like a million

41:36

token plus. Is that relevant for you? Can

41:39

you make use of that? Is that just prohibitively

41:41

expensive because you're just paying for all those tokens

41:43

or you're just doing RAG? It's definitely

41:45

relevant and when we think about search as

41:47

many people do we think about kind of

41:49

a staged pipeline of retrieval where first you

41:52

use a semantic search database with embeddings to get, like,

41:54

in our case, maybe 400 or so of the

41:56

most relevant papers, and then you still

41:58

need to rank those. And I

42:00

think at that point it becomes pretty

42:03

interesting to use larger models. So specifically

42:05

in the past I think a lot

42:07

of ranking was kind of per item

42:09

ranking where you would score each individual

42:11

item, maybe using increasingly expensive scoring methods,

42:14

and then rank based on the scores. But I

42:16

think list-wise re-ranking where you have a model that

42:19

can see all the elements is a lot more

42:21

powerful. Because often you can only really tell how

42:23

good a thing is in comparison to other things.

42:26

And what thing should come first

42:28

really depends on, well, what other things are

42:31

available, maybe you even care about diversity in

42:33

your results, you don't want to show 10

42:35

very similar papers as the first 10 results.
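A rough sketch of what list-wise re-ranking with a long-context model could look like, as opposed to scoring each paper independently; the model name and prompt format are assumptions for illustration, not Elicit's actual pipeline.

```python
# Sketch: list-wise re-ranking. Show the model all candidates at once and ask for
# an ordering, rather than scoring each item in isolation. Names are illustrative.
from openai import OpenAI

client = OpenAI()

def listwise_rerank(query: str, papers: list[dict], top_k: int = 10) -> list[dict]:
    # Each paper is assumed to be {"id": ..., "title": ..., "abstract": ...}.
    listing = "\n\n".join(
        f"[{i}] {p['title']}\n{p['abstract']}" for i, p in enumerate(papers)
    )
    prompt = (
        f"Query: {query}\n\nCandidate papers:\n{listing}\n\n"
        f"Rank the {top_k} most relevant papers, preferring a diverse set over near-duplicates. "
        "Reply with the bracketed indices in order, comma-separated."
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # stand-in for a long-context model that can see all candidates
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    # Parse indices like "[3], [0], [12]"; ignore anything that isn't a valid index.
    order = [int(tok.strip(" []")) for tok in reply.split(",") if tok.strip(" []").isdigit()]
    return [papers[i] for i in order[:top_k] if i < len(papers)]
```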

42:37

So I think the long context models are

42:39

quite interesting there. And especially for

42:41

our case where we care more about power users

42:43

who are perhaps a little bit more willing to

42:46

wait a little bit longer to get higher quality

42:48

results relative to people who just quickly check out

42:50

things because why not. And I think being able

42:52

to spend more on longer context is quite valuable.

42:55

I think one thing the longer context models

42:57

changed for us is maybe a focus from

43:00

breaking down tasks to breaking down

43:02

the evaluation. So before,

43:04

if we wanted to answer a question

43:06

from the full text of a paper, we had

43:08

to figure out how to chunk it and find

43:10

the relevant chunk and then answer based on that

43:12

chunk. And the nice thing was then you know

43:14

kind of which chunk the model used to answer

43:16

the question. So if you want to

43:18

help the user check it, yeah, you can be like,

43:21

well this was the chunk that the model got. And

43:23

now if you put the whole text of the paper in,

43:25

you have to kind of find the chunk like more

43:27

retroactively basically. And so you need kind of like a

43:29

different set of abilities and obviously like different

43:31

technology to figure out. You still want to

43:33

point the user to the supporting quotes in

43:35

the text, but then the interaction is a little

43:37

different. You'd, like, scan through and find some ROUGE

43:39

score before.
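A toy sketch of that older quote-finding step, scoring candidate sentences against the model's answer with a ROUGE-1-style unigram recall; this is purely illustrative, not how Elicit locates supporting quotes.

```python
# Toy sketch: find the sentence in a paper most likely to support a model's answer
# by ROUGE-1-style unigram recall. Purely illustrative.
import re

def rouge1_recall(reference: str, candidate: str) -> float:
    ref_tokens = re.findall(r"\w+", reference.lower())
    cand_tokens = set(re.findall(r"\w+", candidate.lower()))
    if not ref_tokens:
        return 0.0
    overlap = sum(1 for t in ref_tokens if t in cand_tokens)
    return overlap / len(ref_tokens)

def find_supporting_quote(answer: str, full_text: str) -> str:
    # Naive sentence split; a real pipeline would use a proper sentence segmenter.
    sentences = re.split(r"(?<=[.!?])\s+", full_text)
    return max(sentences, key=lambda s: rouge1_recall(answer, s))
```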

43:42

I think there's an interesting space of almost research

43:44

problems here because you would ideally make

43:46

causal claims like if this hadn't been

43:48

in the text, the model wouldn't have

43:50

said this thing. And maybe

43:52

you can do expensive approximations to that where like

43:54

I don't know you just throw a chunk of

43:56

the paper and re-answer and see what happens. But

43:59

hopefully there are. better ways of doing

44:01

that where you just get that kind

44:03

of counterfactual information for free from the

44:05

model.
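One way to approximate that counterfactual is leave-one-chunk-out: drop a chunk, re-answer, and check whether the answer changes. Here is a sketch, where `answer(question, chunks)` is a hypothetical stand-in for whatever model call you use; it is the expensive approximation described above, not the free counterfactual.

```python
# Sketch: leave-one-chunk-out attribution. Re-answer without each chunk and see
# which removals change the answer. `answer` is a hypothetical model-call stand-in.
from typing import Callable

def chunk_influence(
    question: str,
    chunks: list[str],
    answer: Callable[[str, list[str]], str],
) -> dict[int, bool]:
    baseline = answer(question, chunks)
    changed = {}
    for i in range(len(chunks)):
        ablated = chunks[:i] + chunks[i + 1:]
        # Exact string comparison is crude; an LLM judge or semantic match would be more robust.
        changed[i] = answer(question, ablated) != baseline
    return changed
```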

44:07

Do you think at all about the cost of maintaining RAG versus

44:10

just putting more tokens in the window? I

44:12

think in software development a lot of times

44:14

people buy developer productivity things so that we

44:17

don't have to worry about it. Context

44:19

window is kind of the same right? You have to

44:21

maintain chunking and like RAG retrieval and like re-ranking and

44:23

all of this versus I just shove everything into the

44:26

context and like it costs a little more but at

44:28

least I don't have to do all of that. Is

44:30

that something you thought about? I think we still

44:33

like hit up against context limits enough that it's not

44:35

really a question of do we still want to keep this RAG around;

44:37

it's like we do still need it for the scale

44:39

of the work that we're doing. Yeah. And I

44:41

think there are different kinds of maintainability.

44:43

In one sense I think you're right

44:45

that throw everything into the context window

44:47

thing is easier to maintain because you

44:49

just can swap out a model. In

44:52

another sense if things go wrong it's

44:54

harder to debug where like if you

44:56

know here's the process that we go

44:58

through to go from 200 million

45:00

papers to an answer and there are like

45:02

little steps and you understand okay this is

45:04

the step that finds the relevant paragraph or

45:06

whatever it may be you'll know which step

45:08

breaks if the answers are bad. Whereas if

45:10

it's just like a new model version came

45:12

out and now it suddenly doesn't find your

45:15

needle in a haystack anymore then you're like

45:17

okay what can you do? You're kind of

45:19

at a loss. Yeah. Let's

45:21

talk a bit about yeah needle in a haystack and

45:23

like maybe the opposite of it which is like hard

45:25

grounding I don't know if that's like the best thing

45:27

to think about it but I was using one of

45:29

these chat-with-your-documents features and I

45:32

put the AMD MI300 specs and the

45:34

new Blackwell chips from NVIDIA and

45:36

I was asking questions like, does the

45:38

AMD chip support NVLink and the response

45:40

was like oh it doesn't say in

45:42

the specs, but if you ask GPT-4

45:45

without the docs it would tell you

45:47

no, because NVLink is an NVIDIA technology.

45:49

That's your NV. Yeah. It just says

45:51

NVLink. How do you

45:53

think about that, having the context sometimes suppress

45:55

the knowledge that the model has? It really depends

45:57

on the task, because I think sometimes it is

46:00

exactly what you want. So imagine you're a

46:02

researcher, you're writing the background section of your

46:04

paper and you're trying to describe what these

46:06

other papers say. You really don't want extra

46:08

information to be introduced there. In other cases

46:10

where you're just trying to figure out the

46:12

truth and you're giving the documents because you

46:14

think they will help the model figure out

46:16

what the truth is, I think you do want,

46:18

if the model has a hunch that there might

46:20

be something that's not in the paper, you do

46:22

want to surface that. I think ideally

46:24

you still don't want the model to just tell

46:26

you. Probably the ideal thing looks

46:28

a bit more like agent control

46:30

where the model can issue a

46:33

query that then is

46:35

intended to surface documents that substantiate its hunch.

46:37

That may be a reasonable middle ground between

46:39

model just telling you and model being fully

46:42

limited to the papers you give it. Yeah,

46:45

I would say they're just kind of different tasks

46:47

right now, and the task that Elicit is mostly

46:49

focused on is what do these papers say. But

46:51

there's another task which is like just give

46:53

me the best possible answer and that give me

46:55

the best possible answer sometimes depends on what do

46:58

these papers say but it can also depend on

47:00

other stuff that's not in the papers. So

47:02

ideally we can do both and then kind of do

47:04

this overall task for you more going forward.

47:08

We have seen a lot of details but

47:10

just to zoom back out a little bit,

47:12

what are maybe the most underrated features of

47:14

Elicit, and what is one thing

47:16

that maybe the users surprised you the most by

47:18

using it? I think the most powerful feature

47:20

of Elicit is the ability to extract:

47:23

add columns to this table which effectively

47:25

extracts data from all of your papers

47:27

at once. It's well used but

47:29

there are kind of many different extensions of

47:31

that that I think users are still discovering.

47:33

So one is we let you give a

47:36

description of the column, we let you give

47:38

instructions for a column, we let you create

47:40

custom columns. So we have like 30 plus

47:42

predefined fields that users can extract like what

47:44

were the methods, what were the main findings,

47:46

how many people were studied and we actually

47:48

show you basically the prompts that we're using

47:50

to extract that from our predefined fields and then

47:52

you can fork this and you can say, oh actually

47:55

I don't care about the population of people, I only

47:57

care about the population of rats, like you can change

47:59

the instruction. So I think users are still

48:01

kind of discovering that there's both this

48:03

predefined, easy to use default, but that

48:05

they can extend it to be much

48:07

more specific to them, and then they

48:09

can also ask custom questions. One

48:12

use case of that is you can start to create

48:14

different column types that you might not expect. So

48:16

rather than just creating generative answers like

48:18

a description of the methodology, you can

48:20

say classify the methodology into a prospective

48:22

study, a retrospective study, or a case

48:25

study, and then you can filter based

48:27

on that. It's like all using the

48:29

same technology and the interface, but it

48:31

unlocks different workflows. So I think

48:33

that the ability to ask custom questions,

48:36

give instructions, and specifically use that to

48:38

create different types of columns like classification

48:40

columns is still pretty underrated.
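As an illustration of a classification-style column, the sketch below constrains the extraction to a fixed label set so results can be filtered; the prompt wording and model name are made up, not Elicit's predefined field.

```python
# Sketch: a "classification column". Same extraction machinery, but the instruction
# constrains the output to a fixed label set so results can be filtered on.
# Prompt wording and model name are illustrative only.
from openai import OpenAI

client = OpenAI()

LABELS = ["prospective study", "retrospective study", "case study"]

def classify_methodology(paper_text: str) -> str:
    prompt = (
        "Classify the study design of the paper below. "
        f"Answer with exactly one of: {', '.join(LABELS)}.\n\n{paper_text}"
    )
    label = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip().lower()
    return label if label in LABELS else "unclear"
```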

48:43

In terms of use case, I spoke

48:45

to someone who works in medical

48:47

affairs at a genomic sequencing company

48:49

recently. So the doctors kind

48:51

of order these genomic tests, these sequencing

48:53

tests, to kind of identify if a

48:55

patient has a particular disease. This company

48:58

helps them process it, and this person

49:00

basically interacts with all the doctors, and

49:02

if the doctors have any questions. My understanding

49:04

is that medical affairs is kind of like customer

49:06

support or customer success in pharma. So this person

49:08

like talks to doctors all day long, and one

49:10

of the things they started using Elicit for

49:13

is like putting the results of their tests as

49:15

the query. Like this test showed,

49:18

you know, this percentage presence of this and

49:20

40% that and whatever, you know,

49:22

what genes are present here or what in

49:24

the sample. And getting kind of a list

49:26

of academic papers that would support their findings

49:29

and using this to help doctors interpret their

49:31

tests. So we talked about, okay,

49:33

cool, like if we built, he's pretty interested in

49:36

doing a survey of infectious

49:38

disease specialists and getting them

49:40

to evaluate, you know, having them write up

49:42

their answers, comparing them to Elicit's

49:44

answers, trying to see if Elicit can start

49:46

being used to interpret the results of these

49:49

diagnostic tests because the way they ship these

49:51

tests to doctors is they report on a

49:53

really wide array of things. He

49:56

was saying that at a large well-resourced

49:58

hospital, like a city hospital, there might

50:00

be a team of infectious disease specialists who

50:02

can help interpret these results. But

50:04

at under-resourced hospitals or more rural hospitals, the

50:06

primary care physician can't interpret the test results.

50:09

Then they can't order it, they can't use

50:11

it, they can't help their patients with it.

50:13

So thinking about an evidence-backed way of interpreting

50:15

these tests is definitely kind of an extension

50:17

of the product that I hadn't considered before.

50:19

But yeah, the idea of using that to

50:22

bring more access to physicians in all different

50:24

parts of the country and helping them interpret

50:26

complicated science is pretty cool. We

50:28

had Kanjun from Imbue on the podcast

50:31

and we talked about better allocating scientific

50:33

resources. How do you think about

50:35

these use cases and maybe how Elicit can

50:37

help drive more research? And do you see

50:39

a world in which maybe

50:42

the models actually do some of the

50:44

research before suggesting us? Yeah, I think

50:46

that's very close to what we care

50:48

about. Our product values are systematic,

50:50

transparent, and unbounded. And I think

50:53

to make research especially more systematic and

50:55

unbounded, I think is basically the thing

50:57

that's at stake here. So for example, I was

51:00

recently talking to people in longevity and I

51:02

think there isn't really one field of longevity,

51:04

there are kind of different scientific subdomains that

51:07

are surfacing various things that are related to

51:09

longevity. And I think if you could more

51:11

systematically say, look, here are all the different

51:13

interventions we could do and here's

51:15

the expected ROI of these experiments, here's

51:18

like the evidence so far that supports

51:20

those being either likely to surface

51:22

new information or not, here's the cost of

51:24

these experiments. I think you could be so

51:26

much more systematic than scientists today. I'd guess

51:29

in like 10, 20 years we'll look back

51:31

and it will be incredible how unsystematic science

51:33

was back in the day. Our view is

51:35

kind of to have models catch up

51:37

to expert humans today, start with kind of

51:39

novice humans and then increasingly expert humans. But

51:41

we really want the models to earn their

51:43

right to the expertise. So that's why we

51:46

do things in this very step-by-step way, that's

51:48

why we don't just like throw a bunch

51:50

of data and apply a bunch of compute

51:52

and hope we get good results. But obviously

51:54

at some point you hope that once it's

51:56

kind of earned its stripes it can surpass

51:58

human researchers. But I think that's where making

52:00

sure that the models' processes are really

52:02

explicit and transparent and that it's really

52:05

easy to evaluate is important because if

52:07

it does surpass human understanding, people will

52:09

still need to be able to audit

52:11

its work somehow or spot check its

52:13

work somehow to be able to

52:15

reliably trust it and use it. So yeah, that's

52:17

kind of why the process-based approach is really important.

52:20

And on the question of will models do their

52:22

own research, I think one

52:24

feature that most models currently don't have that

52:26

will need to be better there is

52:28

better world models. I think currently models

52:30

are just not great at representing what's

52:32

going on in a particular situation or

52:35

domain in a way that allows them

52:37

to come to interesting, surprising conclusions. I

52:39

think they're very good at coming to

52:41

conclusions that are nearby to conclusions that

52:43

people have come to. They're not as

52:46

good at kind of reasoning and making

52:48

surprising connections maybe. And so having deeper

52:50

models of, let's see, what are the

52:52

underlying structures of different domains, how they're

52:54

related or not related, I think will be

52:56

an important ingredient for models actually being able

52:58

to make novel contributions. On the topic of

53:01

hiring more expert humans, you've hired some very

53:03

expert humans. My friend Maggie Appleton

53:05

joined you guys I think maybe a year

53:07

ago-ish. In fact, I think you're doing an

53:09

offsite and we're actually organizing our big AI-UX

53:11

meetup around whenever she's in San Francisco. How

53:13

big is the team? How have you sort

53:16

of transitioned your company into this sort of

53:18

PBC, and what's the plan for the

53:20

future? Yeah, we're 12 people now. About

53:22

half of us are in the Bay Area and

53:25

then distributed across US and Europe. A

53:27

mix of mostly kind of roles in engineering and

53:29

product. Yeah, and I think that the transition to

53:31

PBC was really not that

53:33

eventful because I think we were already,

53:35

even as a nonprofit, we were already

53:38

shipping every week. So very much operating as

53:40

a product. Very much as a starting point. And

53:42

then I would say the kind of PBC component was

53:44

to very explicitly say that we have a mission that

53:46

we care a lot about. There are a lot of

53:48

ways to make money. We think our mission will make

53:51

us a lot of money, but we are going to

53:53

be opinionated about how we make money. We're going to

53:55

take the version of making a lot of money that's

53:57

in line with our mission. But it's all very convergent.

54:00

It is not going to make any money if

54:02

it's a bad product, if it doesn't actually help

54:04

you discover truth and do research more rigorously. So

54:07

I think for us, the kind of mission

54:09

and the success of the company are very

54:11

intertwined. We're hoping to grow the team quite

54:13

a lot this year. Probably some of our

54:15

highest priority roles are in engineering, but also

54:17

opening up roles more in design and product

54:20

marketing, go-to-market. Yeah, do you want to talk

54:22

about the roles? Yeah, broadly we're

54:24

just looking for senior software engineers and

54:26

don't need any particular AI expertise. A

54:28

lot of it is just how do

54:31

you build good orchestration for complex tasks?

54:33

So we talked earlier about these sort

54:35

of notebooks, scaling up task orchestration, and

54:38

I think a lot of this looks more like

54:40

traditional software engineering than it does look like machine

54:42

learning research. And I think the people who are

54:44

really good at building good abstractions,

54:47

building applications that can kind of

54:49

survive even if some of their

54:51

pieces break, like making reliable components

54:53

out of unreliable pieces, I think those are the

54:55

people we're looking for. No, that's exactly what I

54:58

used to do. Have you

55:00

explored the existing orchestration frameworks?

55:02

Temporal, Airflow, Dagster, Prefect?

55:04

We've looked into them a little bit.

55:06

I think we have some specific requirements

55:08

around being able to stream work back

55:10

very quickly to our users. Those could

55:12

definitely be relevant. Okay, well, you're hiring.

55:14

I'm sure we'll plug all the links.

55:16

Thank you so much for coming. Any

55:18

parting words? Any words of wisdom? Models

55:21

you live by? I think it's a really important time

55:23

for humanity, so I hope everyone listening

55:25

to this podcast can think hard

55:27

about exactly how they want to

55:29

participate in this story. There's

55:32

so much to build, and we can be

55:34

really intentional about what we align ourselves with.

55:37

There are a lot of applications that are going to

55:39

be really good for the world and a lot of

55:41

applications that are not. And so, yeah, I hope people

55:43

can take that seriously and kind of seize the moment.

55:45

Yeah, I love how intentional you guys have been. Thank you

55:47

for sharing that story. Thank you. Thank

55:57

you.
