Episode Transcript
0:04
Hey everyone, welcome
0:06
to the Latent Space podcast. This
0:08
is Alessio, Partner and CTO in Residence
0:10
at Decibel Partners, and I'm joined by my
0:13
co-host, swyx, founder of Smol.ai. Hey,
0:15
and today we are back in the
0:17
studio with Andreas and Jungwon from Elicit.
0:19
Welcome. Thanks guys. It's
0:21
great to be here. Yeah. So
0:23
I'll introduce you separately, but also, you know,
0:26
we'd love to learn a little bit more
0:28
about you personally. So Andreas, it looks like
0:30
you started Elicit first, Jungwon joined later. That's
0:32
right. For all intents and purposes, the Elicit
0:34
and also the Ought that existed before then
0:36
were very different from what
0:38
I started. So I think it's fair
0:41
to say that she co-founded it. Got it.
0:43
And Jungwon, you're a co-founder and COO of Elicit
0:45
now. Yeah, that's right. So there's a little bit
0:48
of a history to this. I'm not super aware
0:50
of like the sort of journey. I was
0:52
aware of Ought and Elicit as sort of
0:54
a nonprofit type situation. And recently you turned
0:56
into like a B Corp. Public benefit corporation.
0:58
So yeah, maybe if you want, you could
1:01
take us through that journey of finding the
1:03
problem. You know, obviously you're working
1:05
together now. So like, how do you get
1:07
together to decide to leave your startup career
1:09
to join him? Yeah, it's
1:11
truly a very long journey. I guess truly it kind
1:13
of started in Germany when I was born. So
1:17
even as a kid, I was always interested in AI.
1:19
Like I kind of went to the library. There
1:21
were books about how to write programs in QBasic.
1:23
Like some of them talked about how
1:26
to implement chatbots. ELIZA, I guess. To be
1:28
clear, he grew up in like a
1:30
tiny village on the outskirts of Munich called
1:32
Dinkelscherben, where it's like a
1:34
very, very idyllic German village. Yeah,
1:36
important to the story. So basically the main thing
1:38
is I've kind of always been thinking about AI
1:40
my entire life and been thinking about, well, at
1:42
some point this is going to be a huge
1:45
deal. It's going to be transformative. How can I
1:47
work on it? And I was
1:49
thinking about it from when I was a
1:51
teenager after high school, did a year where
1:53
I started a startup with the intention to
1:56
become rich. And then once I'm rich, I
1:58
can affect the trajectory of AI.
2:00
Did not become rich, decided to go back to
2:02
college and study cognitive science there,
2:04
which was the closest thing I could find
2:06
at the time to AI. In the last
2:08
year of college, moved to the US to
2:10
do a PhD at MIT, working
2:12
on probabilistic programming languages
2:15
for AI, because it kind of seemed
2:17
like the existing languages were not great
2:19
at expressing world models and learning world
2:21
models through Bayesian inference. I was always
2:23
thinking about, well, ultimately the goal is to
2:25
actually build tools that help people reason
2:27
more clearly, ask and answer better questions and
2:29
make better decisions. But for a long
2:31
time it seemed like the technology to put
2:34
reasoning in machines just wasn't there. Initially,
2:37
at the end of my postdoc at
2:39
Stanford, was thinking about, well, what to
2:41
do? I think the standard path is
2:43
you become an academic and do research.
2:45
But it's really hard to actually build
2:47
interesting tools as an academic. You can't
2:49
really hire great engineers. Everything
2:51
is kind of on a paper-to-paper timeline. And
2:54
so I was like, well, maybe I should start a
2:56
startup, pursue that for a little bit. But it seemed
2:58
like it was too early, because you could have tried
3:00
to do an AI startup, but probably would not have
3:02
been the kind of AI startup we're seeing now. So
3:06
then decided to just start a nonprofit research lab
3:08
that's going to do research for a while until we
3:10
better figure out how to do
3:12
thinking in machines. And that was
3:14
Ought. And then over time, it
3:16
became clear how to actually build actual tools
3:19
for reasoning. And only over
3:21
time, we developed a better way to... I'll
3:24
let you fill in some of the details here. Yeah. So
3:27
I guess my story maybe starts around 2015. I
3:29
kind of wanted to be a founder for a long time.
3:31
And I wanted to work on an idea that stood the
3:34
test of time for me, like an idea that stuck with
3:36
me for a long time. And starting
3:38
in 2015, actually, originally, I became interested in
3:40
AI-based tools from the perspective of mental health.
3:42
So there are a bunch of people around
3:44
me who were really struggling. One really close
3:47
friend in particular was really struggling with mental
3:49
health and didn't have any support. And it
3:51
didn't feel like there was anything before kind
3:53
of like getting hospitalized that could just help
3:55
her. So luckily, she came and stayed with
3:58
me for a while and we were just able to work through
4:00
some things. But it seemed like, you know, lots
4:02
of people might not have that resource, and something
4:04
that was AI-enabled could be much more
4:07
scalable. I didn't feel ready to start a company
4:09
then, in 2015, and I also didn't feel
4:11
like the technology was ready. So then I went
4:13
into fintech and kind of learned how
4:16
to do the tech thing, and in 2019
4:18
I felt like it was time for me
4:20
to just jump in and build something on
4:22
my own. I really wanted to create, and at
4:25
the time I looked around at tech and felt
4:27
not super inspired by the options. I just
4:29
didn't want to have a tech career ladder, like,
4:31
I didn't want to climb the career ladder. There were
4:33
two kinds of interesting technologies at the time: there was
4:36
AI and there was crypto, and I was like,
4:38
well, the AI people seem like a little bit
4:40
more nice. Like,
4:42
slightly more trustworthy. Both
4:44
super exciting, but I threw my bet
4:46
in on the AI side. And then
4:48
I got connected to Andreas, and actually
4:50
the way he was thinking about pursuing
4:52
the research agenda at Ought was really compatible
4:54
with what I had envisioned for an
4:56
ideal AI product: something that helps
4:58
kind of take really complex thinking,
5:00
overwhelming thoughts, and break them down into small
5:02
pieces. And then this mission that
5:04
we need AI to help us
5:07
figure out what we ought to do
5:09
was really inspiring. Yeah.
5:11
I think it was clear that
5:13
we were building the most powerful optimizers
5:15
of our time, but as a society we
5:17
hadn't figured out how to direct that
5:19
optimization potential. And if you
5:22
direct tremendous amounts of optimization potential at
5:24
the wrong thing, that's really disastrous. So
5:26
the goal of Ought was to make
5:28
sure that if we build the most transformative
5:30
technology of our lifetime, it can be used
5:33
for something really impactful, like good reasoning,
5:35
not just generating ads. My background
5:37
is in marketing, so I knew I
5:39
wanted to do more than churn out ads.
5:41
And also, if the AI systems
5:43
get to be super intelligent enough
5:46
that they're doing really complex reasoning, that
5:48
we can trust them, that they are aligned
5:50
with us, and that we have ways of evaluating that
5:52
they're doing the right thing. So that's what Ought did.
5:54
We did a lot of experiments. You
5:56
know, like Andreas said, this was before the models
5:59
really worked. A lot of
6:01
the issues you were seeing were more in
6:03
reinforcement learning, but we saw a future where
6:05
AI would be able to do more
6:07
kind of logical reasoning, not just kind of
6:09
extrapolate from numerical trends. We actually kind
6:12
of set up experiments with people, where
6:14
people stood in as super intelligent systems, and
6:16
we effectively gave them context windows. So they
6:18
would have to read a bunch of text,
6:20
and one person would get less text and
6:23
one person would get all the text. And
6:25
the person with less text would have to
6:27
evaluate the work of the person who could
6:29
read much more. So, in a world,
6:31
basically simulating, like in 2018 and
6:34
2019, a world where an AI system could
6:36
read significantly more than you, how would you, as
6:38
the person who couldn't read that much,
6:40
evaluate the work of the AI? Yes, of course.
6:43
Yeah, so that was a lot of the work we
6:45
did, and from that we kind of iterated
6:47
on the idea of breaking complex tasks down
6:49
into smaller tasks, complex tasks like open-
6:51
ended reasoning and logical reasoning, so
6:54
that it's easier to train AI systems on
6:56
them, and also so that it's easier to
6:58
evaluate the work of the AI system when
7:00
it's done. And then we also kind of
7:02
early on pioneered this idea of the importance of supervising
7:05
the process of AI systems, not just the
7:07
outcomes. That's a big part of how Elicit is
7:09
built: we are very intentional about not just
7:11
throwing a ton of data into a model
7:13
and training it and then being like, cool, here's
7:16
your scientific output. That's not at
7:18
all what we do. Our approach is very
7:20
much like, what are the steps that an
7:22
expert human does, or what is an
7:25
ideal process, as granularly as possible. Let's
7:27
break that down and then train AI systems to
7:29
perform each of those steps very robustly. When
7:31
you train like that from the start, after
7:34
the fact it's much easier to evaluate, much
7:36
easier to troubleshoot at each point, like, where
7:38
did something break down.
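To make that decomposition idea concrete, here is a minimal sketch; the step functions are illustrative stand-ins, not Elicit's actual components:

```python
# Minimal sketch of task decomposition: an open-ended question is broken
# into fixed, granular steps, each of which can be trained and evaluated
# separately instead of relying on one opaque end-to-end model.
def find_papers(question: str) -> list[str]:
    # Step 1: retrieval (stand-in).
    return ["paper A", "paper B"]

def extract_finding(paper: str, question: str) -> str:
    # Step 2: per-paper extraction (stand-in).
    return f"finding from {paper} relevant to '{question}'"

def synthesize(findings: list[str]) -> str:
    # Step 3: combine per-paper findings into an answer (stand-in).
    return "; ".join(findings)

def answer(question: str) -> str:
    papers = find_papers(question)
    findings = [extract_finding(p, question) for p in papers]
    # Every intermediate output is inspectable, so a bad answer can be
    # traced to the step that broke, as described above.
    return synthesize(findings)

print(answer("How does creatine affect cognition?"))
```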
7:40
So yeah, we worked on those experiments for a while, and then at
7:42
the start of 2021 decided to
7:44
build a product. Let me dig in there, because you're
7:46
about to go into more modern-day
7:48
Ought and Elicit, and I just wanted to ask,
7:50
because I think a lot of people are
7:52
in the spot where you were back in
7:54
2018, 2019, where you chose a partner
7:57
to work with. Yeah. Right? And you didn't
7:59
know him. Yeah, yeah. You were just kind of
8:01
cold introduced. Yep. A lot of people are cold
8:03
introduced. Mm-hmm. I've been cold introduced to tons
8:05
of people and never worked with them. I assume you
8:07
had a lot of other options, right? Like how do you
8:09
advise people to make those choices? We were not totally cold
8:11
introduced, so one of our closest friends introduced us.
8:14
And then Andreas had written a lot on the
8:16
OTT website, a lot of blog posts, a lot
8:18
of publications, and I just read it and I
8:20
was like, wow, this sounds like my writing.
8:22
Okay. And even other people, some of my closest
8:24
friends I asked for advice from, they were like, oh,
8:26
this sounds like your writing. But
8:29
I think I also had some kind of like
8:31
things I was looking for. I wanted someone with
8:33
a complementary skill set. I wanted someone who was very
8:35
values aligned. And yeah, that
8:37
was all a good fit. We also did
8:39
a pretty lengthy mutual evaluation process where
8:41
we had a Google doc where we had
8:43
all kinds of questions for each other. And
8:46
I think it ended up being something like 50 pages
8:48
or so of like various like
8:50
questions and back and forth. Was it the
8:53
YC list? There's some lists going around for
8:55
co-founder questions. No, we just made our own
8:57
questions. I guess it's probably
8:59
related in that you ask yourself what are the values
9:01
you care about? How would you approach various situations and
9:03
things like that? I shared like all of my
9:05
past performance reviews. Yeah? Yeah.
9:08
And he'd never had any, so. No. Okay.
9:11
Okay. Okay. All right. Yeah,
9:14
sorry. I just had to, a lot of people are
9:16
going through that phase and you kind of skipped over it. I was
9:18
like, no, no, no, there's an interesting story there. Yeah. Before
9:21
we jump into what Elicit is
9:23
today, the history is a bit counterintuitive. So
9:25
you start with figuring out, oh, if we
9:27
had a super powerful model, how would we
9:29
align it? How would you use it? But
9:31
then you were actually like, well, let's just build the
9:33
product so that people can actually leverage it. And
9:36
I think there are a lot of folks today
9:38
that are now back to where you were maybe
9:40
five years ago. They're like, oh, what if this happens,
9:42
rather than focusing on actually building something
9:44
useful with it. What clicked for you
9:46
to move into Elicit, and then we can
9:48
cover that story too. I think in many ways the approach
9:50
is still the same because the way we are building
9:52
Elicit is not, let's train
9:55
a foundation model to do more stuff. It's
9:57
like, let's build scaffolding such that we
9:59
can deploy powerful models and guide them.
10:01
I think what's different now is that we
10:03
actually have some of the models to plug in.
10:05
But if in 2017 we
10:07
had had the models, we would have run the
10:09
same experiments we ran with humans, just
10:11
with the models. And so in many
10:14
ways our philosophy has always been: think
10:16
out into the future, what models could
10:18
exist in one, two years or longer,
10:20
and how can we make it so
10:22
that they can actually be deployed in
10:24
kind of transparent, controllable ways. Same
10:26
on the motivation: we both are kind of
10:29
product people at heart. The research was
10:31
really important, and it didn't make
10:33
sense to build a product at that time. But
10:35
at the end of the day, the thing that
10:38
always motivated us is imagining a world where high-
10:40
quality reasoning is really abundant. AI
10:42
is the technology that's going to get us there,
10:44
and there's a way to guide that technology
10:46
with research, but we can have a more
10:48
direct effect through products. Because with research, you
10:50
publish research and someone else has to implement
10:52
that into a product for it to have an impact. A product is
10:54
like a more direct path, and we wanted
10:56
to concretely have an impact on people's lives.
10:58
Yeah. I think both at Ought and personally,
11:00
the motivation was: we wanted to build a product.
11:03
Yep. And just to recap as well,
11:05
the models you were using back then were, like,
11:08
I don't know, like the BERT-type
11:10
stuff, or T5, or your own
11:12
models? We were talking about that. I
11:14
guess to be clear, at the very beginning
11:16
we had humans do the work, and then
11:18
I think the first models that kind of
11:21
made sense were GPT-2 and T-NLG
11:23
and the early transformer models like that.
11:25
We do also use T5-based
11:27
models even now. Started with GPT-2, yeah.
11:29
Cool. I'm kind of curious,
11:31
how do you start so early? You know,
11:33
like, now it's obvious where to start,
11:35
but back then it wasn't even clear.
11:37
I made fun of it a lot. I was like, why
11:39
are you talking to this thing? I dunno.
11:41
I think GPT-2, like, clearly can't do
11:43
anything, and I was like, Andreas, you're wasting
11:46
your time, this language stuff is a toy. But obviously
11:48
he was right. So
11:50
what's the history of what Elicit actually does as
11:52
a product? You recently announced that after four months
11:54
you hit a million in revenue, and you have
11:57
a lot of people using it, a lot
11:59
of growth. But it was initially
12:01
structured data extraction from papers, then
12:03
you had concept grouping, and to
12:05
date maybe a more full-stack
12:08
research enabler, paper-understander platform.
12:10
What's the definitive definition of what Elicit
12:12
is and how did you get here?
12:14
Yeah, we say Elicit is an AI research assistant.
12:17
I think it will continue to evolve. That's part of
12:19
why we're so excited about building in research, because
12:21
there's just so much space. I think the current
12:23
phase we're in right now, we talk about it
12:25
as really trying to make Elicit the
12:28
best place to understand what is known.
12:30
It's a lot about literature summarization.
12:32
There's a ton of information that the
12:34
world already knows. It's really hard to
12:36
navigate, hard to make it relevant. A
12:38
lot of it is around document discovery
12:40
and processing and analysis. I really want
12:42
to import some of the incredible productivity
12:45
improvements we've seen in software engineering and
12:47
data science into research. It's like,
12:49
how can we make researchers like data
12:51
scientists of text? That's why we're launching
12:53
this new set of features called Notebooks.
12:56
It's very much inspired by computational notebooks
12:58
like Jupyter Notebooks, Deepnote, or
13:00
Colab because they're so powerful and
13:02
so flexible. Ultimately, when people are
13:04
trying to get to
13:06
an answer or understand an insight, they're
13:08
manipulating evidence and information. Today, that's
13:11
all packaged in PDFs, which are
13:13
super brittle. With language models,
13:15
we can decompose these PDFs into their
13:17
underlying claims and evidence and insights and
13:19
then let researchers mash them
13:21
up together, remix them, and analyze them together. I
13:24
would say quite simply, overall, Elicit
13:26
is an AI research assistant. Right
13:28
now, we're focused on text-based workflows,
13:30
but long-term, really want to go
13:32
further and further into reasoning and
13:34
decision making. When you say AI research
13:36
assistant, this is meta-research:
13:39
researchers use Elicit as a research assistant.
13:41
It's not a generic LLM research or
13:43
anything type of tool, or it could
13:46
be, but what are people using it
13:48
for today? Yeah. Specifically, in
13:50
science, a lot of people use human research
13:52
assistants to do things. You tell your grad
13:55
student, hey, here are a couple of papers.
13:57
Can you look at all of these? See
14:00
which of these have sufficiently large populations and
14:02
actually study the disease that I'm interested in
14:04
and then write out what are the experiments
14:06
they did, what are the interventions they
14:08
did, what are the outcomes and organize that for
14:11
me. And the first phase of
14:13
understanding what is known really focuses on automating
14:15
that workflow. Because a lot of that work
14:17
is pretty rote work. I think it's not
14:19
the kind of thing that we need humans
14:21
to do, language models can do it. And
14:23
then if language models can do it, you
14:26
can obviously scale it up much more than
14:28
a grad student or undergrad research assistant would
14:30
be able to do. Yeah, the use cases
14:32
are pretty broad. So we do have a
14:34
very large percent of our users are just
14:36
using it personally or for a mix of
14:38
personal and professional things. People who care a
14:40
lot about health or biohacking or parents who
14:42
have a child with a kind of rare
14:44
disease and want to understand the literature directly.
14:46
So there is an individual kind of consumer
14:48
use case. We're most focused
14:50
on the power users, though that's where
14:53
we're really excited to build. So Elicit
14:55
was very much inspired by this workflow
14:57
in literature called systematic reviews or meta
14:59
analysis, which is basically the human
15:02
state of the art for summarizing scientific
15:04
literature. It typically involves like
15:06
five people working together for over a year.
15:08
And they kind of first start by trying
15:10
to find the maximally comprehensive set of papers
15:13
possible. So it's like 10,000 papers. And
15:16
they kind of systematically narrow that down to like
15:18
hundreds or 50, and then extract
15:20
key details from every single paper. Usually
15:22
have two people doing it and like
15:24
a third person reviewing it. So it's
15:26
like an incredibly laborious, time consuming process,
15:28
but you see it in every single
15:31
domain. So in science, in machine learning,
15:33
in policy, because it's so structured and designed
15:35
to be reproducible, it's really amenable to automation.
15:37
So it's kind of the workflow that we
15:39
want to automate first. And then you make
15:41
that accessible for any question and make kind
15:44
of these really robust living summaries of science.
15:46
So yeah, that's one of the workflows that
15:48
we're starting with. Our previous guest, Mike Conover,
15:50
he's building a new company called BrightWave, which
15:53
is AI research assistant for financial research. How
15:55
do you see the future of these tools? Like
15:58
does everything converge to like a god researcher assistant,
16:00
or is every domain going to have its
16:02
own thing? I think that's a
16:04
good and mostly open question. I
16:07
do think there are some differences
16:09
across domains. For example, some research
16:11
is more quantitative data analysis and
16:14
other research is more high-level cross-domain
16:16
thinking. And we definitely
16:18
want to contribute to the broad general
16:20
reasoning type space. If researchers are making
16:22
discoveries, often it's like, hey, this thing
16:24
in biology is actually analogous to
16:27
these equations in economics or something. And
16:29
that's just fundamentally a thing where you
16:31
need to reason across domains. At least
16:33
within research, I think there will be
16:35
one best platform more or less for
16:37
this type of generalist research. I think
16:40
there may still be some particular tools
16:42
for genomics, particular types of modules of
16:44
genes and proteins and whatnot. But
16:47
for a lot of the high-level reasoning that humans do,
16:49
I think that is more of a winner-take-all
16:51
thing. I wanted to ask
16:53
a little bit deeper about the workflow that
16:55
you mentioned. I like that phrase. I see
16:57
that in your UI now, but that's as
17:00
it is today. And I think you were about to
17:02
tell us about how it was in 2021 and how
17:04
it maybe progressed. How has this workflow evolved over time?
17:07
Yeah, so the very first version of Elicit actually
17:09
wasn't even a research assistant. It was a forecasting
17:11
assistant. So we set out and we were thinking
17:13
about what are some of the most impactful types
17:15
of reasoning that if we could scale up, AI
17:17
would really transform the world. We actually
17:19
started with literature review, but we're like, oh,
17:22
so many people are going to build literature
17:24
review tools. So let's not start there. So then
17:26
we focus on geopolitical forecasting. So I don't
17:28
know if you're familiar with like Manifold or... Manifold
17:30
Markets. Yeah, that kind of stuff, before
17:32
Manifold. Yeah, yeah. Not predicting relationships,
17:34
we're predicting, like, is China going to
17:36
invade Taiwan? Markets for everything.
17:39
Yeah. That's basically a relationship.
17:41
Yeah, fair. Yeah, it's true. And
17:43
then we worked on that for a while.
17:45
And then after GPT-3 came out, I think
17:47
by that time we realized that originally we
17:49
were trying to help people convert their beliefs
17:51
into probability distributions. And so take fuzzy beliefs,
17:53
but like model them more concretely. And then
17:55
after a few months of iterating on that,
17:57
just realized, oh, the thing that's blocking
18:00
people from making interesting predictions about important
18:02
events in the world is less kind
18:04
of on the probabilistic side and much
18:06
more on the research side. And
18:09
so that kind of combined with the very
18:11
generalist capabilities of GPT-3 prompted us
18:13
to make a more general research assistant.
18:15
Then we spent a few months iterating
18:17
on what even is a research assistant.
18:19
So we would embed with different researchers.
18:22
We built data labeling workflows in the
18:24
beginning kind of right off the bat.
18:26
We built ways to find experts in
18:28
a field and ways to ask good
18:30
research questions. So we just kind of iterated
18:32
through a lot of workflows. No one else
18:34
was really building at this time, and it
18:36
was very quick to just do some prompt
18:38
engineering and see what is a task that
18:41
is at the intersection of what's technologically capable
18:43
and important for researchers. And we
18:45
had a very nondescript landing page. It said
18:47
nothing. But somehow people were signing up. And
18:50
we had the sign-in form that was like, why are you
18:52
here? And everyone was like, I need help with literature review.
18:54
And we're like, literature review, that sounds so hard. I don't
18:56
even know what that means. We're like, we don't want to
18:58
work on it. But then eventually we're like, OK, everyone is
19:00
saying literature review. It's overwhelmingly what people want. And all
19:02
domains? Not like medicine or physics?
19:04
All domains. Yeah. And we
19:06
also kind of personally knew literature review was hard. And if you
19:08
look at the graph of academic literature being published every
19:11
single year, and you guys know this in machine learning,
19:13
it's up and to the right,
19:15
superhuman amounts of papers. So we're like, all right, let's
19:17
just try it. I was really nervous. But Andreas was
19:19
like, this is kind of like the right problem space
19:21
to jump into even if we don't know what we're
19:23
doing. So my take was
19:25
like, fine, this feels really scary. But let's
19:28
just launch a feature every single week and double
19:30
our user numbers every month. And if we can
19:32
do that, we'll fail fast and we will find
19:34
something. I was worried about like getting lost in
19:37
the kind of academic white space. So
19:39
the very first version was actually a weekend prototype that
19:41
Andreas made. Do you want to explain how that worked?
19:44
I mostly remember that it was really bad.
19:46
So the thing I remember is you
19:48
entered a question and it would give
19:50
you back a list of claims. So
19:53
Your question could be, I don't know, how does
19:55
creatine affect cognition, and it would give you back some
19:57
claims that are to some extent based on papers.
20:00
But they were often irrelevant. The papers were
20:02
often irrelevant, and so we ended up
20:04
printing out a bunch of examples of
20:06
results and putting them up on the wall
20:08
so that we would feel the constant
20:10
shame of having such a bad product and
20:12
would be incentivized to make it better.
20:14
And I think over time it has gotten a lot
20:16
better, but the initial version was
20:18
really very bad. Then we added
20:20
a natural language summary of each abstract, like
20:22
kind of a one-sentence summary, which people
20:24
liked. And then as we learned more about
20:27
this systematic review work, we slowly started expanding the capabilities
20:29
so that you could extract a lot more data
20:31
from the papers and do more with that. And
20:33
were you using, like, embeddings and cosine
20:36
similarity scores for retrieval,
20:38
or was it keyword-based? I
20:41
think the very first version didn't even have
20:43
its own search engine. I think the very
20:45
first version probably used the Semantic Scholar
20:47
API or something similar, and
20:49
only later we discovered that that API is not
20:52
very semantic. So we have since built our own search,
20:54
and that has helped
20:56
a lot.
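As a rough sketch, semantic search with embeddings and cosine similarity looks something like this (the model name and toy corpus are illustrative, not Elicit's actual stack):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence-embedding model works; this one is just a common default.
model = SentenceTransformer("all-MiniLM-L6-v2")

abstracts = [
    "Creatine supplementation improves short-term memory in adults.",
    "A survey of reinforcement learning for robotic control.",
    "Effects of caffeine on sustained attention during sleep deprivation.",
]

# Embed the corpus once; normalized vectors make dot product == cosine.
doc_vecs = model.encode(abstracts, normalize_embeddings=True)

def search(query: str, k: int = 2) -> list[tuple[str, float]]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q            # cosine similarity per document
    top = np.argsort(-scores)[:k]    # indices of the k best matches
    return [(abstracts[i], float(scores[i])) for i in top]

print(search("how does creatine affect cognition?"))
```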
20:58
And we're going to go into more recent product stuff, but, you
21:01
know, you seem like the more
21:03
startup-oriented business
21:05
person, and Andreas seems more ideologically interested
21:07
in research. I'm curious, because a VC would ask,
21:09
what's the market sizing? You were thinking,
21:11
as you're here saying, that
21:13
you have to double every month, and I'm
21:15
like, I don't know how you reach those
21:17
conclusions from that stage, and it
21:19
was also a nonprofit at the time.
21:22
I mean, market size wise, I felt like
21:24
in this space, where so much
21:26
was changing and it was very unclear
21:28
what of today was actually going to be
21:30
true tomorrow, we just really rested
21:33
a lot on very, very simple, fundamental
21:35
principles. It's like: if you
21:37
can understand the truth, that is very economically
21:39
valuable. If you, like, know
21:41
the truth. Yeah. And
21:44
researchers are the key to many breakthroughs that are
21:46
very commercially valuable. Because my version of
21:48
it is, students are poor and they don't
21:50
pay for anything, right? But that's obviously
21:52
not how it went. Not to cast doubts, but you had
21:54
more market insight than me, to have believed
21:56
that. Or you skipped that. Yeah, we did encounter, I
21:59
guess talking to VCs for our seed
22:01
round. A lot of VCs were like,
22:03
you know, researchers, they don't have any
22:05
money. Why don't you build legal assistants?
22:08
I think in some short-sighted way, maybe that's true, but
22:10
I think in the long run, R&D is
22:13
such a big space of the economy.
22:15
I think if you can substantially improve
22:17
how quickly people find new
22:19
discoveries or avoid controlled trials that don't go
22:21
anywhere, I think that's just a huge amount
22:23
of money. And there are a lot of
22:25
questions, obviously, about between here and there, but
22:27
I think as long as the fundamental principle
22:30
is there, we were okay with that, and
22:32
I guess we found some investors who also
22:34
were. Yeah, congrats. I mean, I'm sure we
22:36
can cover the sort of flip later. I
22:38
think you were about to start us on
22:40
like GPT-3 and how like that changed things
22:42
for you. It's funny, like I guess every
22:44
major GPT version, you have like some big
22:46
insight. Yeah, yeah.
22:49
I mean, what do you think? I
22:52
think it's a little bit less true
22:54
for us than for others because we
22:56
always believe that there will basically be
22:58
human-level machine work. And so
23:00
it is definitely true that in
23:03
practice for your product, as new models come out,
23:05
your product starts working better, you can add some
23:07
features that you couldn't add before. But
23:10
I don't think we really ever had
23:12
the moment where we're like, oh,
23:15
wow, that is super unanticipated. We
23:17
need to do something entirely different now
23:19
from what was on the roadmap. I
23:21
think GPT-3 was a big change because
23:23
it kind of said, oh, now is
23:25
the time that we can use AI
23:27
to build these tools. And then GPT-4
23:29
was maybe a little bit more of
23:31
an extension of GPT-3. GPT-3 over GPT-2
23:33
was like qualitative level shift. And then
23:35
GPT-4 was like, okay, great. Now it's
23:37
like more accurate, we're more accurate on
23:40
these things, we can answer harder questions. But the shape of
23:42
the product had already been set by that time. I
23:44
kind of want to ask you about this sort of pivot that you
23:46
made, but I guess that was just a way
23:48
to sell what you were doing, which is
23:50
you're adding extra features on grouping by concepts.
23:52
The GPT-4 pivot, quote unquote pivot that you
23:55
made. Oh, yeah, yeah, exactly. Right, right,
23:57
right. Yeah, when we launched this workflow
23:59
now, GPT-4 was available, basically.
24:01
Elicit was at a place where we were very happy
24:03
with it. So given a table of papers,
24:06
you can extract data across all the tables,
24:08
but you kind of wanna take the analysis
24:10
a step further. Sometimes what you'd care about
24:12
is not having a list of papers, but
24:15
a list of arguments, a list of effects,
24:17
a list of interventions, a list of techniques. And
24:19
so that's one of the things we're working
24:21
on: now that you've extracted this information
24:24
in a more structured way, can you pivot
24:26
or group by whatever information you
24:28
extracted, to have more insight? But information
24:30
still supported by the academic literature. So those weren't
24:32
big revolutionary GPT-4 thoughts. I think
24:34
I'm just very impressed by how
24:36
crisp your ideas are around
24:39
workflows, and I think that's why
24:41
you're not as reliant on, like, the
24:43
LLM improving, because it's just
24:45
about improving the workflow that you
24:47
recommend to people. Today we might call
24:49
it agents, I don't know, but
24:51
you're not relying on the LLM to drive
24:53
it. It relies on, this is the
24:55
way that Elicit does research, and what
24:57
we think is most effective, and teaching
24:59
that to users. The problem space is
25:01
still huge. Like, if it's this big,
25:03
we are all still operating at this tiny
25:05
little bit of it. So I
25:07
think about that a lot in the context
25:09
of most people being like, what happens to your
25:12
moat when GPT-5 comes out?
25:14
GPT-5 comes out, and there's still all
25:16
of this other space that you can go
25:18
into. And so I think being really obsessed with a
25:20
problem, which is very, very big, has helped
25:22
us, like, stay robust and just kind of
25:24
directly incorporate model improvements and keep going.
25:26
And there's a fascinating one, you guys, was it
25:28
Charlie? You tell us, but basically,
25:30
how much did costs become a
25:32
concern as you were working more and more
25:34
with OpenAI, and how did you manage that
25:36
relationship? Let me tell you who Charlie is. He has energy, agency,
25:39
and integrity, entirely. He is a
25:41
special character, Charlie. When we found him,
25:43
he had just finished his freshman year at
25:45
the University of Warwick. He had heard about
25:47
us on some Discord, and then he applied,
25:49
and we were like, wow, who is this
25:51
freshman? We just saw that he
25:53
had done so many incredible side projects, and
25:55
we were actually on a team retreat in
25:57
Barcelona, visiting our head of engineering at the
25:59
time, all chattering about this wunderkind, like, this
26:01
kid. And then on our take-home project, he had done
26:03
like the best of anyone to that point. And so
26:06
we were just like so excited to hire him. So
26:08
we hired him as an intern and then we're like,
26:10
Charlie, what if you just dropped out of school? And
26:13
so then we convinced him to take a year
26:15
off. And he's just incredibly productive. And I think
26:17
the thing you're referring to is at the start
26:20
of 2023, Anthropic kind of launched their constitutional AI
26:22
paper. And within a few days,
26:24
I think four days, he had basically implemented
26:26
that in production. And then we had it
26:28
in-app a week or so after that. And
26:30
he has since kind of contributed to major improvements
26:32
like cutting costs down to a tenth of what
26:35
they were, at really large scale. But yeah, you can
26:37
talk about the technical stuff. Yeah, on
26:39
the constitutional AI project, this was for abstract
26:41
summarization, where in Elicit, if you run a
26:44
query, it'll return papers to you. And then
26:46
it will summarize each paper with respect to
26:48
your query for you on the fly. And
26:50
that's a really important part of Elicit because
26:53
it does it so much. Like if you
26:55
run a few searches, it'll have done it
26:57
a few hundred times for you. And so
26:59
we cared a lot about this, both being
27:02
like fast, cheap, and also very low on
27:04
hallucination. I think if Elicit hallucinates something about
27:06
the abstract, that's really not good. And so
27:08
what Charlie did in that project
27:10
was create a constitution that expressed what
27:12
are the attributes of a good summary. Everything
27:15
in the summary is reflected in the
27:17
actual abstract, and it's like
27:19
very concise, etc, etc. And then used
27:23
RLHF with a model that
27:25
was trained on the constitution
27:27
to basically fine tune a better
27:29
summarizer on an open source model. Yeah,
27:32
I think that might still be in use. Yeah,
27:34
yeah, definitely. Yeah, I think at the time,
27:36
the models hadn't been trained at all to
27:38
be faithful to a text. So they were
27:41
just generating. So then when you asked them a
27:43
question, they tried too hard to answer the question
27:45
and didn't try hard enough to answer the question
27:47
given the text, or answer what the text said
27:49
about the question. So we had to basically teach
27:52
the models to do that specific task.
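As a rough sketch of that critique-and-revise idea (an OpenAI-style API and model name stand in here; the project described above fine-tuned an open source model rather than calling a closed API at runtime):

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

# Explicit principles a good query-focused abstract summary must satisfy.
CONSTITUTION = [
    "Every claim in the summary must be supported by the abstract.",
    "The summary must be a single concise sentence.",
    "The summary must address the user's query, not just the paper.",
]

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative stand-in model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def summarize(abstract: str, query: str) -> str:
    draft = chat(
        f"Query: {query}\nAbstract: {abstract}\n"
        "Summarize the abstract with respect to the query in one sentence."
    )
    # Critique the draft against each principle and revise on violations.
    for principle in CONSTITUTION:
        critique = chat(
            f"Principle: {principle}\nAbstract: {abstract}\nSummary: {draft}\n"
            "Does the summary violate the principle? Reply OK if not, else explain."
        )
        if not critique.strip().upper().startswith("OK"):
            draft = chat(
                f"Abstract: {abstract}\nSummary: {draft}\nCritique: {critique}\n"
                "Rewrite the summary to fix the critique, staying faithful to the abstract."
            )
    return draft
```

Pairs of drafts and constitution-guided revisions collected this way are the kind of training data you could then use to fine-tune a smaller open source summarizer.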
27:54
How do you monitor the ongoing performance
27:56
of your models? Not to
27:59
get too LLM-opsy, but you are one of
28:01
the larger, more well-known operations doing NLP at
28:03
scale. I guess effectively, you have to monitor
28:05
these things, and nobody I talk to has a
28:07
good answer. Yeah, I don't think
28:09
we have a good answer yet. I
28:13
think the answers are actually a little
28:15
bit clearer on the just basic robustness
28:17
side of where you can import ideas
28:19
from software engineering and
28:21
normal DevOps. You're like, well, you need
28:23
to monitor latencies and response times
28:26
and uptime and whatnot. I think what you probably
28:28
mean is performance more in the sense of quality, and
28:30
then things like hallucination rate, where I
28:33
think there the really important thing
28:35
is training time. So we care
28:37
a lot about having our own
28:39
internal benchmarks for model development
28:42
that reflect the distribution of user
28:44
queries so that we can know
28:46
ahead of time how well
28:48
is the model gonna perform on different
28:50
types of tasks. So the tasks being
28:52
summarization, question answering, given a paper, ranking,
28:54
and for each of those, we wanna
28:56
know what the distribution of things the
28:58
model is gonna see so that we
29:00
can have well calibrated predictions on
29:03
how well the model is gonna do in
29:05
production. And I think, yeah, there's some chance
29:07
that there's distribution shift and actually the things
29:10
users enter are gonna be different, but I
29:12
think that's much less important than getting the
29:14
kind of training right and having very high
29:16
quality, well-vetted data sets at training time. I
29:19
think we also end up effectively monitoring by trying to
29:21
evaluate new models as they come out. And so that
29:23
kind of prompts us to go through our eval suite
29:25
every couple of months. And so every time a new
29:27
model comes out, we have to see how is this performing
29:30
relative to production and what we currently have.
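That internal-benchmark loop could look roughly like this sketch (the task names, file format, and toy scorer are placeholders, not Elicit's actual eval suite):

```python
import json

def load_eval_set(path: str) -> list[dict]:
    # One JSON case per line: {"task": ..., "input": ..., "expected": ...}
    with open(path) as f:
        return [json.loads(line) for line in f]

def score(task: str, output: str, expected: str) -> float:
    # Toy exact-match metric; in practice each task (summarization, QA,
    # ranking) would have its own carefully vetted scorer.
    return float(output.strip() == expected.strip())

def evaluate(model_fn, cases: list[dict]) -> dict[str, float]:
    per_task: dict[str, list[float]] = {}
    for case in cases:
        output = model_fn(case["task"], case["input"])
        per_task.setdefault(case["task"], []).append(
            score(case["task"], output, case["expected"])
        )
    # Average score per task, so regressions are visible per workflow.
    return {task: sum(s) / len(s) for task, s in per_task.items()}

# Whenever a new model comes out, run both and compare per-task scores:
# cases = load_eval_set("user_query_benchmark.jsonl")
# print(evaluate(production_model, cases))
# print(evaluate(candidate_model, cases))
```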
29:32
Yeah, I mean, since we're on this topic,
29:34
any new models have really caught your eye
29:36
this year? Like Claude came out of the
29:38
mud. Yeah, I think Claude is pretty, I
29:41
think the team's pretty excited about Claude. Yeah,
29:43
specifically, Claude Haiku is a good point on
29:45
the kind of Pareto frontier. It's
29:47
neither the cheapest model, nor is it
29:49
the most accurate,
29:51
most high quality model, but it's just
29:53
a really good trade off between cost
29:56
and accuracy. You apparently have to 10
29:58
shot it to make it good. I tried
30:00
using Haiku for summarization, but zero-shot
30:02
was not great. Then they were
30:04
like, it's a skill issue, you have to
30:06
try harder. Interesting. I think GPT-4
30:08
unlocked tables for us, processing data
30:10
from tables, which was huge. GPT-4
30:13
Vision. Yeah. Did you try
30:15
Fuyu? I guess you can't try Fuyu,
30:17
because it's non-commercial. That's the Adept model. Yeah, we
30:19
haven't tried that one. Yeah, but
30:21
Claude is multimodal as well. I think
30:23
the interesting insight that we got from talking to David
30:25
Luan, who was CEO of Adept, was that multimodality
30:28
has effectively two different flavors. One
30:30
is we recognize images from a
30:32
camera in the outside natural world.
30:35
And actually, the more important multimodality
30:38
for knowledge work is screenshots, and
30:40
PDFs and charts and
30:42
graphs. So we need a new term for
30:44
that kind of multimodality. But is the claim
30:46
that current models are good at one or
30:48
the other? Yeah, they're over-indexed, because of the
30:51
history of computer vision is COCO. So
30:53
now we're like, oh, actually, screens
30:56
are more important. OCR and
30:58
writing. You mentioned a lot of closed model
31:00
lab stuff, and then you also have this
31:02
open source model fine-tuning stuff. What is your
31:04
workload now between closed and open? It's
31:07
a good question, I think. Half and half?
31:09
It's a... Is that even a relevant question,
31:11
or is that a nonsensical question? It depends a
31:14
little bit on how you index, whether you
31:16
index by compute cost or number of queries.
31:18
I'd say in terms of number of queries,
31:21
it's maybe similar. In terms of cost and
31:23
compute, I think the closed models make up
31:25
more of the budget since the main cases
31:27
where you wanna use closed models are cases
31:30
where they're just smarter, where there
31:32
are no existing open source models that are quite
31:34
smart enough. Yeah. We
31:37
have a lot of interesting open-ended questions
31:39
to go in, but just to wrap
31:41
the UX evolution, now you have the
31:44
notebooks. We talked a lot about how
31:46
chatbots are not the final frontier. How
31:49
did you decide to get into notebooks,
31:51
which is a very iterative, kind of
31:53
like interactive interface and maybe learnings from
31:55
that? Yeah, this is actually our fourth
31:57
time trying to make this work. I
32:00
think the first one was probably in early 2021. I
32:04
think because we've always been obsessed with this
32:06
idea of task decomposition and like branching, we
32:08
always wanted a tool that could be kind
32:10
of unbounded where you could keep going, could
32:12
do a lot of branching where you could
32:14
kind of apply language model operations
32:16
or computations on other tasks. So in
32:19
2021, we had this thing called composite
32:21
tasks where you could use GPT-3 to
32:23
brainstorm a bunch of research questions and
32:26
then take each research question and decompose
32:28
those further into sub questions. And
32:30
this kind of, again, that like task decomposition
32:32
tree type thing was always very exciting to
32:34
us. But that was like, it didn't work
32:37
and it was kind of overwhelming. Then at
32:39
the end of 2022, I think we tried again and
32:41
at that point we were thinking, okay, we've done a
32:43
lot with this literature review thing. We
32:45
also want to start helping with kind of adjacent
32:47
domains and different workflows. Like we want to help
32:49
more with machine learning. What does
32:51
that look like? And as we were thinking
32:53
about it, we're like, well, there are so
32:55
many research workflows. How do we not just build
32:57
three new workflows into Elicit, but make Elicit
33:00
really generic to lots of workflows? What
33:02
is like a generic composable system with
33:04
nice abstractions that can like scale to
33:06
all these workflows? So we like iterated
33:08
on that a bunch and then didn't
33:10
quite narrow the problem space enough or
33:12
like get to what we wanted. And
33:14
then I think it was at the beginning
33:17
of 2023 where we're like, wow, computational notebooks
33:19
kind of enable this where they have a
33:21
lot of flexibility, but kind of
33:23
robust primitives such that you can extend the workflow and
33:25
it's not limited. It's not like you ask a
33:27
query, you get an answer, you're done. You can just
33:29
constantly keep building on top of that. And each
33:32
little step seems like a really good unit of
33:34
work for the language model. And also it was
33:36
just like really helpful to have a bit
33:38
more pre-existing work to emulate. Yeah, that's kind
33:40
of how we ended up at Computational Notebooks
33:43
for Elicit. Maybe one thing that's worth making
33:45
explicit is the difference between Computational Notebooks and
33:47
chat because on the surface they seem pretty
33:49
similar. It's kind of this iterative interaction where
33:52
you add stuff. In both cases you
33:54
have a back and forth between you enter stuff and then you
33:56
get some output, and then you enter stuff. The difference in
33:59
our minds is, with notebooks you can
34:01
define a process. So in data science,
34:03
you can be like, here's my data
34:05
analysis process that takes in a CSV
34:07
and then does some extraction and then
34:09
generates a figure at the end. And
34:12
you can prototype it using a small CSV, and
34:14
then you can run it over a much larger
34:16
CSV later. And similarly, the vision
34:19
for notebooks, in our case, is to not
34:21
make it this one-off chat interaction, but to
34:23
allow you to then say, if you start
34:26
and first you're like, OK, let me just
34:28
analyze a few papers and see, do I
34:30
get to the correct conclusions for those few
34:32
papers? Can I then later go back and
34:35
say, now let me run this over 10,000
34:37
papers now that
34:39
I've debugged the process using a few papers? And
34:42
that's an interaction that doesn't fit quite as
34:44
well into the chat framework, because that's more
34:46
for kind of quick back and forth interaction.
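The "debug the process on a few items, then rerun it unchanged at scale" pattern is the key difference. A minimal sketch, where the extraction step is a hypothetical stand-in for a model call:

```python
from typing import Callable

def run_pipeline(papers: list[str],
                 extract: Callable[[str], list[str]]) -> dict[str, list[str]]:
    # The defined process: apply the same extraction step to every paper.
    return {paper: extract(paper) for paper in papers}

def extract_claims(paper_text: str) -> list[str]:
    # Hypothetical stand-in for a language model call that pulls claims.
    return [line for line in paper_text.splitlines() if line.startswith("Claim:")]

# Prototype on a handful of papers and check the conclusions by hand...
sample = [
    "Claim: creatine aids memory\nMethods: randomized trial",
    "Claim: caffeine aids attention\nMethods: crossover study",
]
print(run_pipeline(sample, extract_claims))

# ...then run the exact same process over the full corpus unchanged:
# full_results = run_pipeline(all_10000_papers, extract_claims)
```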
34:49
Do you think in notebooks it's kind of
34:51
like structured, editable chain of thought,
34:53
basically, step by step? Is that kind of
34:55
where you see this going? And then are
34:57
people going to reuse notebooks as like templates?
35:00
And maybe in traditional notebooks, it's like cookbooks,
35:02
right? You share a cookbook. You can start
35:04
from there. Is it similar in Elicit? Yeah,
35:07
that's exactly right. So that's our hope that
35:09
people will build templates, share them with other
35:11
people. I think chain of thought is
35:13
maybe still like kind of one level lower
35:15
on the abstraction hierarchy than we would
35:17
think of notebooks. I think we'll probably
35:19
want to think about more semantic pieces,
35:21
like a building block is more like
35:23
a paper search or an extraction or
35:26
a list of concepts. And
35:28
then the model's detailed reasoning will
35:30
probably often be one level down. You always
35:32
want to be able to see it, but
35:34
you don't always want it to be front
35:36
and center. Yeah. What's the difference between a
35:38
notebook and an agent? Since everybody always asks
35:40
me, what's an agent? Like, how do you
35:42
think about where the line is? Yeah,
35:45
it's an interesting question. In the notebook
35:47
world, I would generally think
35:49
of the human as the agent in the
35:51
first iteration. So you have the notebook, and
35:53
the human kind of adds little action steps.
35:56
And then the next point on this kind
35:58
of progress gradient is, okay, now you
36:00
can use language models to predict which action would you take
36:02
as a human. And at some point you're probably gonna be
36:04
very good at this. You'll be like, okay, in some cases
36:06
I can with 99.9% accuracy
36:08
predict what you do. And then you might
36:10
as well just execute it, like why wait for the human? And
36:13
eventually as you get better, that will just
36:15
look more and more like agents taking actions
36:18
as opposed to you doing the thing. I
36:20
think templates are a specific case of this
36:22
where you're like, okay, well, there's just particular
36:24
sequences of actions that you often wanna chunk
36:26
and have available as primitives, just like in
36:29
normal programming. And you can
36:31
view them as action sequences of agents
36:33
or you can view them as more
36:35
normal programming language abstraction thing. And I
36:37
think those are two valid views. Yeah.
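That gradient from human-as-agent to model-as-agent can be pictured as a confidence threshold on a next-action predictor; everything in this sketch is a hypothetical stand-in:

```python
def predict_next_action(history: list[str]) -> tuple[str, float]:
    # Hypothetical stand-in for a model predicting the user's next step.
    if history and history[-1] == "paper_search":
        return "extract_key_details", 0.999
    return "paper_search", 0.60

def next_step(history: list[str], threshold: float = 0.99) -> str:
    action, confidence = predict_next_action(history)
    if confidence >= threshold:
        # Agent-like: the prediction is reliable enough to just execute.
        return action
    # Human-as-agent: surface the suggestion and wait for a decision.
    return f"suggest:{action}"

print(next_step([]))                # low confidence -> suggest only
print(next_step(["paper_search"]))  # high confidence -> auto-execute
```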
36:40
How do you see this change as, like
36:42
you said, the models get better and you
36:44
need less and less human actual interfacing with
36:47
the model, you just get the results. Like
36:49
how does the UX and the way people
36:51
perceive it change? Yeah, I think this
36:53
kind of interaction paradigm for evaluation is not
36:55
really something the internet has encountered yet because
36:57
up to now the internet has all been
36:59
about getting data and work from people. But
37:01
so increasingly, I really want kind of evaluation
37:04
both from an interface perspective and from like
37:06
a technical perspective or operation perspective to be
37:08
a super power for Elicit because I think
37:10
over time models will do more and more
37:12
of the work and people will have
37:14
to do more and more of the evaluation. So
37:16
I think, yeah, in terms of the interface, some
37:18
of the things we have today, for every kind
37:20
of language model generation, there's some citation back and
37:22
we kind of try to highlight the ground
37:24
truth in the paper that is most relevant
37:26
to whatever Elicit said,
37:29
and make it super easy so that you can click on it and quickly see
37:31
in context and validate whether
37:33
the text actually supports the answer that Elicit gave.
37:36
So I think we'd probably want to scale things up like that, like
37:39
the ability to kind of spot check the models work
37:41
super quickly, scale up interfaces like that. And-
37:45
Who would spot check the user? Yeah, to start, it would be the user.
37:48
One of the other things we do is also kind of flag the
37:51
model's uncertainty. So we have models report out, how
37:53
confident are you that this was the sample size?
37:55
If the model's not sure, we throw a flag. And so
37:58
the user knows to prioritize checking that. So
38:00
again, we can kind of scale that up. So when the
38:02
model's like, well, I searched this on Google, not sure if
38:05
that was the right thing, I have an uncertainty flag, and
38:07
the user can go and be like, okay, that was actually
38:09
the right thing to do or not. I've tried
38:11
to do uncertainty readings from models.
38:13
I don't know if you have this live,
38:15
but you do. Cause I just didn't find
38:18
them reliable because they just hallucinated their own
38:20
uncertainty. I would love to base it on
38:22
logprobs or something more native within the model
38:24
rather than generated. But okay,
38:27
it sounds like they scale properly for you.
38:30
We found it to be pretty calibrated. It varies on the
38:32
model. I think in some cases, we also
38:34
use the different models for the uncertainty estimates than
38:36
for the question answering. So one model would say,
38:38
here's my chain of thought, here's my answer, and
38:41
then a different type of model. Let's say the
38:43
first model is Llama, and
38:45
let's say the second model is GPT-3.5. And
38:48
then the second model just looks over the
38:50
results and like, okay, how confident are you
38:52
in this? And I think sometimes using a
38:54
different model can be better than using the
38:56
same model.
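A bare-bones version of that two-model setup; the chat helper and canned replies are stand-ins for real model calls:

```python
def chat(model: str, prompt: str) -> str:
    # Stand-in for a real completion call; wire up actual clients here.
    canned = {"answerer": "The sample size was 120.", "verifier": "0.85"}
    return canned[model]

def answer(question: str, context: str) -> str:
    return chat("answerer",
                f"Context: {context}\nQuestion: {question}\n"
                "Think step by step, then answer.")

def confidence(question: str, context: str, proposed: str) -> float:
    reply = chat("verifier",
                 f"Context: {context}\nQuestion: {question}\n"
                 f"Proposed answer: {proposed}\n"
                 "How confident are you that the answer is supported by the "
                 "context? Reply with a number between 0 and 1 only.")
    return float(reply.strip())

def answer_with_flag(question: str, context: str, threshold: float = 0.7):
    proposed = answer(question, context)
    flagged = confidence(question, context, proposed) < threshold
    return proposed, flagged  # flagged answers get prioritized for review

print(answer_with_flag("What was the sample size?", "...n=120 adults..."))
```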
38:58
On top of your models evaluating models, obviously you
39:01
can do that all day long. What's your
39:03
budget? Because your queries fan out a lot,
39:05
and then you have models evaluating models. One
39:08
person typing in a question can lead to
39:10
a thousand calls. It depends on the
39:12
project. So if the project
39:14
is basically a systematic review that otherwise
39:16
human research assistants would do, then the
39:18
project is basically a human equivalent spend.
39:20
And this spend can get quite large
39:22
for those projects. I don't know, let's
39:24
say $100,000. So
39:27
in those cases, you're happier to spend compute
39:29
than in the kind of shallow search case
39:31
where someone just enters a question because, I
39:34
don't know, maybe you're like, I heard
39:36
about creatine, what's it about? Probably
39:38
don't want to spend a lot of compute on
39:40
that. This sort of being able to invest more
39:42
or less compute into getting more or less accurate
39:45
answers is I think one of the core things
39:47
we care about, and that I think
39:49
is currently undervalued in the AI space. I
39:51
think currently you can choose which model you
39:53
want, and you can sometimes, I don't know,
39:55
you'll tip it and it'll try harder, or
39:57
you can try various things to get it
39:59
to work harder. But you don't have great
40:01
ways of converting willingness to spend into better
40:03
answers and we really want to build a
40:05
product that has this sort of unbounded
40:07
flavor where like if you care about it
40:10
a lot you should be able to get
40:12
really high quality answers really double checked in
40:14
every way. And you have credit-based pricing,
40:16
so unlike most products it's not a fixed
40:18
monthly fee. Exactly. So like some of
40:21
the higher costs are tiered so for
40:23
most casual users they'll just get the
40:25
abstract summary which is kind of an
40:27
open source model then you
40:29
can add more columns which have more extractions and
40:32
these uncertainty features and then you can also add
40:34
the same columns in high-accuracy mode, which also parses
40:36
the table so we kind of stack the complexity
40:38
on the columns. You know the fun thing you
40:40
can do with a credit system which is data for
40:42
data basically you can give people more credits if they
40:45
give data back to you. Yeah. I don't
40:47
know if you've already done that. We've thought about something like this
40:49
it's like if you don't have money but
40:51
you have time yes how do you exchange
40:53
that? Yeah. I think it's interesting
40:55
we haven't quite operationalized it and then you know there's been
40:57
some kind of like adverse selection like you know for example
40:59
it would be really valuable to get feedback on our model
41:01
so maybe if you were willing to give more robust feedback
41:04
on our results we could give you credits or something like
41:06
that but then there's kind of this will
41:08
people take it seriously. You want the good people. Exactly.
41:10
Can you tell who are the good people? Not
41:13
right now but yeah maybe at the point where we can
41:15
we can offer it. The complexity
41:17
of questions asked you know if it's
41:19
higher complexity these are the people. Yeah.
41:21
If you make a lot of typos
41:23
in your queries you're not gonna get
41:25
off. Negative
41:28
social credit. It's very topical right
41:30
now to think about the threat of long
41:32
context windows. All these models
41:34
that we're talking about these days all like a million
41:36
token plus. Is that relevant for you? Can
41:39
you make use of that? Is that just prohibitively
41:41
expensive because you're just paying for all those tokens
41:43
or you're just doing RAG? It's definitely
41:45
relevant and when we think about search as
41:47
many people do we think about kind of
41:49
a staged pipeline of retrieval where first you
41:52
use a semantic search database with embeddings to get,
41:54
in our case, maybe the 400 or so
41:56
most relevant papers, and then you still
41:58
need to rank those. And I
42:00
think at that point it becomes pretty
42:03
interesting to use larger models. So specifically
42:05
in the past I think a lot
42:07
of ranking was kind of per item
42:09
ranking where you would score each individual
42:11
item, maybe using increasingly expensive scoring methods,
42:14
and then rank based on the scores. But I
42:16
think list-wise re-ranking where you have a model that
42:19
can see all the elements is a lot more
42:21
powerful. Because often you can only really tell how
42:23
good a thing is in comparison to other things.
42:26
And what thing should come first, it
42:28
really depends. Like, well, what other things are
42:31
available, maybe you even care about diversity in
42:33
your results, you don't want to show 10
42:35
very similar papers as the first 10 results.
42:37
So I think the long context models are
42:39
quite interesting there. And especially for
42:41
our case where we care more about power users
42:43
who are perhaps a little bit more willing to
42:46
wait a little bit longer to get higher quality
42:48
results relative to people who just quickly check out
42:50
things because why not. And I think being able
42:52
to spend more on longer context is quite valuable.
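List-wise re-ranking with a long-context model might look like this sketch, where the chat call stands in for any long-context model API:

```python
def chat(prompt: str) -> str:
    # Stand-in for a long-context model call; returns an index ordering.
    return "2, 0, 1"

def rerank(query: str, candidates: list[str]) -> list[str]:
    # Show the model all candidates at once so it can judge them relative
    # to each other (and favor diversity), rather than scoring one by one.
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(candidates))
    prompt = (
        f"Query: {query}\n\nCandidate papers:\n{numbered}\n\n"
        "Order the candidates from most to least relevant, avoiding "
        "near-duplicates near the top. Reply with comma-separated indices."
    )
    order = [int(i) for i in chat(prompt).split(",")]
    return [candidates[i] for i in order]

papers = ["creatine and memory", "creatine and memory (replication)",
          "creatine dosing safety"]
print(rerank("creatine and cognition", papers))
```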
42:55
I think one thing the longer context models
42:57
changed for us is maybe a focus from
43:00
breaking down tasks to breaking down
43:02
the evaluation. So before,
43:04
if we wanted to answer a question
43:06
from the full text of a paper, we had
43:08
to figure out how to chunk it and find
43:10
the relevant chunk and then answer based on that
43:12
chunk. And the nice thing was then you know
43:14
kind of which chunk the model used to answer
43:16
the question. So if you want to
43:18
help the user check it, yeah, you can be like,
43:21
well this was the chunk that the model got. And
43:23
now if you put in the whole text of the paper,
43:25
you have to kind of find the chunk like more
43:27
retroactively basically. And so you need kind of like a
43:29
different set of abilities and obviously like different
43:31
technology, to figure that out. You still want to
43:33
point the user to the supporting quotes in
43:35
the text, but then the interaction is a little
43:37
different. You basically scan through and find it with some ROUGE
43:39
score, like before.
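As a hedged illustration of that retroactive matching, one crude approach is to slide a window over the full text and keep the span with the highest lexical overlap with the answer, a ROUGE-style recall score; this is a sketch, not the product's actual method.

```python
# An illustrative way to retroactively locate a supporting quote: slide a
# window over the full text and keep the span with the highest lexical
# overlap with the answer (a crude ROUGE-style recall score).
def find_supporting_span(answer, full_text, window=60):
    answer_toks = set(answer.lower().split())
    words = full_text.split()
    best_span, best_score = "", 0.0
    for start in range(0, max(1, len(words) - window + 1), max(1, window // 2)):
        span_words = words[start:start + window]
        overlap = len(answer_toks & set(w.lower() for w in span_words))
        score = overlap / max(1, len(answer_toks))  # recall of answer tokens
        if score > best_score:
            best_span, best_score = " ".join(span_words), score
    return best_span
```

I think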
43:42
there's an interesting space of almost research
43:44
problems here because you would ideally make
43:46
causal claims like if this hadn't been
43:48
in the text, the model wouldn't have
43:50
said this thing. And maybe
43:52
you can do expensive approximations to that where like
43:54
I don't know, you just throw out a chunk of
43:56
the paper and re-answer and see what happens. But
43:59
hopefully there are better ways of doing
44:01
that where you just get that kind
44:03
of counterfactual information for free from the
44:05
model.
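A minimal sketch of that expensive approximation, assuming some `answer` function that stands in for an LLM question-answering call:

```python
# A sketch of the "expensive approximation" just described: hold out one
# chunk at a time, re-answer, and see whether the claim survives. `answer`
# stands in for any LLM question-answering call; it is assumed here.
def leave_one_out_attribution(question, chunks, claim, answer):
    """Return the chunks whose removal makes the model stop asserting
    the claim -- crude counterfactual evidence for what it relied on."""
    influential = []
    for i in range(len(chunks)):
        context = "\n\n".join(chunks[:i] + chunks[i + 1:])
        reply = answer(question, context)
        if claim.lower() not in reply.lower():  # claim no longer asserted
            influential.append(chunks[i])
    return influential
```

Do you think at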
44:07
all about the cost of maintaining RAG versus
44:10
just putting more tokens in the window? I
44:12
think in software development a lot of times
44:14
people buy developer productivity things so that we
44:17
don't have to worry about it. Context
44:19
window is kind of the same right? You have to
44:21
maintain chunking and like RAG retrieval and like re-ranking and
44:23
all of this versus I just shove everything into the
44:26
context and like it costs a little more but at
44:28
least I don't have to do all of that. Is
44:30
that something you thought about? I think we still
44:33
hit up against context limits enough that it's not
44:35
really a question of do we still want to keep this RAG around;
44:37
we do still need it for the scale
44:39
of the work that we're doing. Yeah. And I
44:41
think there are different kinds of maintainability.
44:43
In one sense I think you're right
44:45
that throw everything into the context window
44:47
thing is easier to maintain because you
44:49
can just swap out a model. In
44:52
another sense if things go wrong it's
44:54
harder to debug where like if you
44:56
know here's the process that we go
44:58
through to go from 200 million
45:00
papers to an answer and there are like
45:02
little steps and you understand okay this is
45:04
the step that finds the relevant paragraph or
45:06
whatever it may be you'll know which step
45:08
breaks if the answers are bad. Whereas if
45:10
it's just like a new model version came
45:12
out and now it suddenly doesn't find your
45:15
needle in a haystack anymore then you're like
45:17
okay what can you do? You're kind of
45:19
at a loss. Yeah.
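As a rough sketch of what that step-by-step debuggability can look like in code (the stage names below are hypothetical, not Elicit's actual pipeline):

```python
# A minimal sketch of the debuggability point: an explicit, staged pipeline
# whose intermediate outputs are recorded, so a bad answer can be traced to
# the step that produced it. The stage names below are hypothetical.
def run_pipeline(question, steps):
    """steps is a list of (name, fn) pairs; each fn transforms the value."""
    trace, value = {}, question
    for name, fn in steps:
        value = fn(value)
        trace[name] = value  # inspect this when answers look wrong
    return {"answer": value, "trace": trace}

# e.g. steps = [("semantic_search", search),   # 200M papers -> ~400 candidates
#               ("rerank", rerank),            # order the candidates
#               ("find_paragraphs", locate),   # pull relevant passages
#               ("answer", synthesize)]        # final generation
```

Let's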
45:21
talk a bit about yeah needle in a haystack and
45:23
like maybe the opposite of it which is like hard
45:25
grounding I don't know if that's like the best thing
45:27
to think about it, but I was using one of
45:29
these chat-with-your-documents features, and I
45:32
put in the AMD MI300 specs and the
45:34
new Blackwell chips from NVIDIA, and
45:36
I was asking questions like, does the
45:38
AMD chip support NVLink? And the response
45:40
was like, oh, it doesn't say in
45:42
the specs. But if you ask GPT-4
45:45
without the docs, it would tell you
45:47
no, because NVLink is an NVIDIA technology.
45:49
That's the NV. Yeah. It just says
45:51
NVLink. How do you
45:53
think about that, having the context sometimes suppress
45:55
the knowledge that the model has? It really depends
45:57
on the task, because I think sometimes it is
46:00
exactly what you want. So imagine you're a
46:02
researcher, you're writing the background section of your
46:04
paper and you're trying to describe what these
46:06
other papers say. You really don't want extra
46:08
information to be introduced there. In other cases
46:10
where you're just trying to figure out the
46:12
truth and you're giving the documents because you
46:14
think they will help the model figure out
46:16
what the truth is, I think you do want,
46:18
if the model has a hunch that there might
46:20
be something that's not in the paper, you do
46:22
want to surface that. I think ideally
46:24
you still don't want the model to just tell
46:26
you. Probably the ideal thing looks
46:28
a bit more like agent control
46:30
where the model can issue a
46:33
query that then is
46:35
intended to surface documents that substantiate its hunch.
46:37
That may be a reasonable middle ground between
46:39
model just telling you and model being fully
46:42
limited to the papers you give it. Yeah,
46:45
I would say they're just kind of different tasks
46:47
right now, and the task that Elicit is mostly
46:49
focused on is what do these papers say. But
46:51
there's another task which is like just give
46:53
me the best possible answer and that give me
46:55
the best possible answer sometimes depends on what do
46:58
these papers say but it can also depend on
47:00
other stuff that's not in the papers. So
47:02
ideally we can do both and then kind of do
47:04
this overall task for you more and more going forward.
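A hedged sketch of that agentic middle ground, where the model answers only from the given papers but can flag a hunch as a search query rather than asserting it; `complete` and `search_papers` are assumed placeholders, not real product APIs:

```python
# A sketch of the middle ground: the model answers only from the given
# documents, but may flag a hunch as a search query instead of asserting
# it. `complete` and `search_papers` are assumed placeholders.
import json

def answer_with_hunches(question, docs, complete, search_papers):
    prompt = (
        "Documents:\n" + "\n\n".join(docs) + f"\n\nQuestion: {question}\n"
        "Answer only from the documents. If you suspect something relevant "
        'they do not say, also return a line: HUNCH: {"query": "..."}'
    )
    reply = complete(prompt)
    evidence = []
    for line in reply.splitlines():
        if line.startswith("HUNCH:"):
            try:
                query = json.loads(line[len("HUNCH:"):]).get("query")
            except ValueError:
                continue
            if query:
                evidence = search_papers(query)  # substantiate, don't assert
    return reply, evidence
```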
47:08
We have seen a lot of details but
47:10
just to zoom back out a little bit,
47:12
what are maybe the most underrated features of
47:14
Elicit, and what is one thing
47:16
where maybe users surprised you the most by
47:18
how they use it? I think the most powerful feature
47:20
of Elicit is the ability to extract:
47:23
add columns to this table which effectively
47:25
extracts data from all of your papers
47:27
at once. It's well used but
47:29
there are kind of many different extensions of
47:31
that that I think users are still discovering.
47:33
So one is we let you give a
47:36
description of the column, we let you give
47:38
instructions for a column, we let you create
47:40
custom columns. So we have like 30 plus
47:42
predefined fields that users can extract like what
47:44
were the methods, what were the main findings,
47:46
how many people were studied and we actually
47:48
show you basically the prompts that we're using
47:50
to extract that from our predefined fields and then
47:52
you can fork this and you can say, oh actually
47:55
I don't care about the population of people, I only
47:57
care about the population of rats, like you can change
47:59
the instruction. So I think users are still
48:01
kind of discovering that there's both this
48:03
predefined, easy to use default, but that
48:05
they can extend it to be much
48:07
more specific to them, and then they
48:09
can also ask custom questions. One
48:12
use case of that is you can start to create
48:14
different column types that you might not expect. So
48:16
rather than just creating generative answers like
48:18
a description of the methodology, you can
48:20
say classify the methodology into a prospective
48:22
study, a retrospective study, or a case
48:25
study, and then you can filter based
48:27
on that. It's like all using the
48:29
same technology and the interface, but it
48:31
unlocks different workflows. So I think
48:33
that the ability to ask custom questions,
48:36
give instructions, and specifically use that to
48:38
create different types of columns like classification
48:40
columns is still pretty underrated.
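As an illustration of those two column flavors (this is a sketch, not Elicit's real schema): a free-text extraction column whose instructions you can fork, and a classification column with fixed labels you can later filter on.

```python
# An illustrative sketch (not Elicit's real schema) of the two column
# flavors: a free-text extraction column whose prompt you can fork, and
# a classification column with fixed labels you can filter on.
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    instructions: str                           # the forkable prompt
    labels: list = field(default_factory=list)  # empty = generative answer

population = Column(
    name="Population studied",
    instructions="How many rats (not people) were studied, and of what strain?",
)

study_design = Column(
    name="Study design",
    instructions="Classify the methodology into exactly one of the labels.",
    labels=["prospective study", "retrospective study", "case study"],
)

def column_prompt(col, paper_text):
    constraint = f" Answer with one of: {', '.join(col.labels)}." if col.labels else ""
    return f"{col.instructions}{constraint}\n\nPaper:\n{paper_text}"
```

In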
48:43
terms of use case, I spoke
48:45
to someone who works in medical
48:47
affairs at a genomic sequencing company
48:49
recently. So the doctors kind
48:51
of order these genomic tests, these sequencing
48:53
tests, to kind of identify if a
48:55
patient has a particular disease. This company
48:58
helps them process it, and this person
49:00
basically interacts with all the doctors, and
49:02
if the doctors have any questions. My understanding
49:04
is that medical affairs is kind of like customer
49:06
support or customer success in pharma. So this person
49:08
like talks to doctors all day long, and one
49:10
of the things they started using Elicit for
49:13
is like putting the results of their tests as
49:15
the query. Like this test showed,
49:18
you know, this percentage presence of this and
49:20
40% that and whatever, you know,
49:22
what genes are present here or what's in
49:24
the sample. And getting kind of a list
49:26
of academic papers that would support their findings
49:29
and using this to help doctors interpret their
49:31
tests. So we talked about, okay,
49:33
cool, what if we built this? He's pretty interested in
49:36
doing a survey of infectious
49:38
disease specialists and getting them
49:40
to evaluate, you know, having them write up
49:42
their answers, comparing it to Elicit's
49:44
answers, trying to see can Elicit start
49:46
being used to interpret the results of these
49:49
diagnostic tests because the way they ship these
49:51
tests to doctors is they report on a
49:53
really wide array of things. He
49:56
was saying that at a large well-resourced
49:58
hospital, like a city hospital, there might
50:00
be a team of infectious disease specialists who
50:02
can help interpret these results. But
50:04
at under-resourced hospitals or more rural hospitals, the
50:06
primary care physician can't interpret the test results.
50:09
Then they can't order it, they can't use
50:11
it, they can't help their patients with it.
50:13
So thinking about an evidence-backed way of interpreting
50:15
these tests is definitely kind of an extension
50:17
of the product that I hadn't considered before.
50:19
But yeah, the idea of using that to
50:22
bring more access to physicians in all different
50:24
parts of the country and helping them interpret
50:26
complicated science is pretty cool. We
50:28
had Kanjun from Imbue on the podcast
50:31
and we talked about better allocating scientific
50:33
resources. How do you think about
50:35
these use cases and maybe how Elicit can
50:37
help drive more research? And do you see
50:39
a world in which maybe
50:42
the models actually do some of the
50:44
research before suggesting it to us? Yeah, I think
50:46
that's very close to what we care
50:48
about. Our product values are systematic,
50:50
transparent, and unbounded. And I think
50:53
to make research especially more systematic and
50:55
unbounded, I think is basically the thing
50:57
that's at stake here. So for example, I was
51:00
recently talking to people in longevity and I
51:02
think there isn't really one field of longevity,
51:04
there are kind of different scientific subdomains that
51:07
are surfacing various things that are related to
51:09
longevity. And I think if you could more
51:11
systematically say, look, here are all the different
51:13
interventions we could do and here's
51:15
the expected ROI of these experiments, here's
51:18
like the evidence so far that supports
51:20
those being either likely to surface
51:22
new information or not, here's the cost of
51:24
these experiments. I think you could be so
51:26
much more systematic than scientists today. I'd guess
51:29
in like 10, 20 years we'll look back
51:31
and it will be incredible how unsystematic science
51:33
was back in the day. Our view is
51:35
to kind of have models catch up
51:37
to expert humans: today, start with kind of
51:39
novice humans and then increasingly expert humans. But
51:41
we really want the models to earn their
51:43
right to the expertise. So that's why we
51:46
do things in this very step-by-step way, that's
51:48
why we don't just like throw a bunch
51:50
of data and apply a bunch of compute
51:52
and hope we get good results. But obviously
51:54
at some point you hope that once it's
51:56
kind of earned its stripes it can surpass
51:58
human researchers. But I think that's where making
52:00
sure that the models processes are really
52:02
explicit and transparent and that it's really
52:05
easy to evaluate is important because if
52:07
it does surpass human understanding, people will
52:09
still need to be able to audit
52:11
its work somehow or spot check its
52:13
work somehow to be able to
52:15
reliably trust it and use it. So yeah, that's
52:17
kind of why the process-based approach is really important.
52:20
And on the question of will models do their
52:22
own research, I think one
52:24
feature that most models currently don't have, and that
52:26
will need to get better there, is
52:28
better world models. I think currently models
52:30
are just not great at representing what's
52:32
going on in a particular situation or
52:35
domain in a way that allows them
52:37
to come to interesting, surprising conclusions. I
52:39
think they're very good at coming to
52:41
conclusions that are nearby to conclusions that
52:43
people have come to. They're not as
52:46
good at kind of reasoning and making
52:48
surprising connections maybe. And so having deeper
52:50
models of, let's see, what are the
52:52
underlying structures of different domains, how they're
52:54
related or not related, I think will be
52:56
an important ingredient for models actually being able
52:58
to make novel contributions. On the topic of
53:01
hiring more expert humans, you've hired some very
53:03
expert humans. My friend Maggie Appleton
53:05
joined you guys I think maybe a year
53:07
ago-ish. In fact, I think you're doing an
53:09
offsite and we're actually organizing our big AI-UX
53:11
meetup around whenever she's in San Francisco. How
53:13
big is the team? How have you sort
53:16
of transitioned your company into this sort of
53:18
PBC and sort of the plan for the
53:20
future? Yeah, we're 12 people now. About
53:22
half of us are in the Bay Area and
53:25
then distributed across US and Europe. A
53:27
mix of mostly kind of roles in engineering and
53:29
product. Yeah, and I think that the transition to
53:31
PBC was really not that
53:33
eventful because I think we were already,
53:35
even as a nonprofit, we were already
53:38
shipping every week. So very much operating as
53:40
a product company, very much as a startup already. And
53:42
then I would say the kind of PBC component was
53:44
to very explicitly say that we have a mission that
53:46
we care a lot about. There are a lot of
53:48
ways to make money. We think our mission will make
53:51
us a lot of money, but we are going to
53:53
be opinionated about how we make money. We're going to
53:55
take the version of making a lot of money that's
53:57
in line with our mission. But it's all very convergent:
54:00
Elicit is not going to make any money if
54:02
it's a bad product, if it doesn't actually help
54:04
you discover truth and do research more rigorously. So
54:07
I think for us, the kind of mission
54:09
and the success of the company are very
54:11
intertwined. We're hoping to grow the team quite
54:13
a lot this year. Probably some of our
54:15
highest priority roles are in engineering, but also
54:17
opening up roles more in design and product
54:20
marketing, go-to-market. Yeah, do you want to talk
54:22
about the roles? Yeah, broadly we're
54:24
just looking for senior software engineers and
54:26
don't need any particular AI expertise. A
54:28
lot of it is just how do
54:31
you build good orchestration for complex tasks?
54:33
So we talked earlier about these sort
54:35
of notebooks, scaling up task orchestration, and
54:38
I think a lot of this looks more like
54:40
traditional software engineering than it does look like machine
54:42
learning research. And I think the people who are
54:44
really good at building good abstractions,
54:47
building applications that can kind of
54:49
survive even if some of their
54:51
pieces break, like making reliable components
54:53
out of unreliable pieces, I think those are the
54:55
people we're looking for.
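One hedged sketch of what "reliable components out of unreliable pieces" can mean in practice: wrap a flaky step, say an LLM call, with validation, retries with backoff, and a fallback, so the surrounding application survives when one piece breaks.

```python
# A sketch of "reliable components out of unreliable pieces": wrap a flaky
# step (say, an LLM call) with validation, retries with backoff, and a
# fallback, so the application survives when one piece breaks.
import time

def reliable(step, validate, fallback, retries=3, backoff=1.0):
    def wrapped(*args, **kwargs):
        for attempt in range(retries):
            try:
                result = step(*args, **kwargs)
                if validate(result):
                    return result
            except Exception:
                pass  # treat errors like invalid output and retry
            time.sleep(backoff * 2 ** attempt)
        return fallback(*args, **kwargs)  # degrade gracefully, don't crash
    return wrapped
```

No, that's exactly what I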
54:58
used to do. Have you
55:00
explored the existing orchestration frameworks?
55:02
Temporal, Airflow, Dagster, Prefect?
55:04
We've looked into them a little bit.
55:06
I think we have some specific requirements
55:08
around being able to stream work back
55:10
very quickly to our users. Those could
55:12
definitely be relevant. Okay, well, you're hiring.
55:14
I'm sure we'll plug all the links.
55:16
Thank you so much for coming. Any
55:18
parting words? Any words of wisdom? Mottos
55:21
you live by? I think it's a really important time
55:23
for humanity, so I hope everyone listening
55:25
to this podcast can think hard
55:27
about exactly how they want to
55:29
participate in this story. There's
55:32
so much to build, and we can be
55:34
really intentional about what we align ourselves with.
55:37
There are a lot of applications that are going to
55:39
be really good for the world and a lot of
55:41
applications that are not. And so, yeah, I hope people
55:43
can take that seriously and kind of seize the moment.
55:45
Yeah, I love how intentional you guys have been. Thank you
55:47
for sharing that story. Thank you. Thank
55:57
you.