Episode Transcript
0:13
Hello, and welcome to Podcast.__init__.
0:15
The podcast about Python and the people who make
0:17
it great. When
0:18
you're ready to launch your next app or want to try a
0:20
project you hear about on the show, you'll need somewhere to
0:22
deploy it. So check out our friends over at Linode.
0:24
With their managed Kubernetes platform,
0:27
it's easy to get started with the next generation
0:29
of deployment and scaling powered by the battle
0:31
tested Linode platform, including simple
0:33
pricing, node balancers, forty gigabit
0:36
networking, and dedicated CPU and
0:38
GPU instances. And now you
0:40
can launch a managed MySQL, Postgres,
0:42
or MongoDB cluster in minutes to keep
0:44
your critical data safe with automated backups
0:46
and failover. Go to python podcast
0:48
dot com slash linode today to get a
0:50
one hundred dollar credit to try out their new database
0:53
service, and don't forget to thank them for the continued
0:55
support of this show. Your
0:56
host, as usual, is Tobias Macey, and
0:59
this month, I'm running a series about python's use
1:01
in machine learning. If you enjoy this
1:03
episode, you can explore further on my new show,
1:05
The Machine Learning Podcast, which helps you go
1:07
from idea to production with machine learning.
1:10
To find out more, you can go to the machine
1:12
learning podcast dot com. Your
1:13
host is Tobias Macey, and today I'm
1:16
interviewing Travis Addair about Predibase,
1:18
a low code platform for building ML models
1:20
in a declarative format. Travis, can you
1:22
start by introducing yourself? Thanks
1:24
for having me on today, Tobias. So
1:26
I'm Travis. I'm the CTO of Predibase.
1:29
Predibase is a low code
1:31
platform designed to make machine
1:33
learning more accessible and
1:35
more useful to the enterprise. Before
1:39
that, I was a tech lead manager
1:41
for Uber's machine learning platform, leading the
1:43
team focused on deep learning training, one
1:46
of the lead maintainers on the Horovod
1:48
open source project
1:50
and also a core contributor to
1:52
the Ludwig project as well, which is one of
1:54
the foundational technologies for
1:56
Predibase. And
1:58
do you remember how you first got started working
2:00
in the area of machine learning? Absolutely.
2:01
Yeah. So it goes
2:03
back a bit to twenty eleven
2:06
or so. I was working for Lawrence
2:08
Livermore National Lab
2:10
on processing about ten terabytes
2:12
of seismic data, and our goal
2:15
was to try to do some
2:17
analysis of it, to detect weapons testing,
2:21
interestingly enough. But
2:21
what we found was that
2:23
over fifty
2:25
percent of the data or so was noise, and
2:27
we had no good way to detect it. So I
2:29
started pulling out some of my undergrad AI
2:32
textbooks, started implementing a
2:34
support vector machine and ran it on
2:37
top of the cluster.
2:38
and got just really excited about the whole
2:40
thing, published an article in the Computers
2:42
and Geosciences journal,
2:44
and decided to go to grad
2:46
school and get my master's in machine
2:48
learning and become
2:49
more active in the industry.
2:51
And, yeah, that's all I'll say.
2:53
And now that has
2:55
brought you where you are today with the
2:58
business and the project. And I'm wondering
3:00
if you can describe a bit more about what it is
3:02
that you're building there and some of the story behind
3:05
how you decided that this was a problem
3:07
space that you wanted to spend your time and focus
3:09
on. When
3:10
I was at Uber, I actually started off
3:12
as a machine learning engineer, so
3:14
working on kind of the vertical problems
3:16
of ML.
3:17
And what I found was that
3:20
there were a lot of things I wanted to do. Like, I
3:22
wanted to, you know, try deep learning
3:24
and try training on large data sets
3:26
and multi GPU and all these sorts of things.
3:29
but
3:29
there just wasn't a lot of good tooling available
3:31
at the time to do that. There was TensorFlow
3:34
and if we
3:37
wanted to run TensorFlow scalably, there
3:39
were, like, a whole lot of hoops you had to
3:39
jump through. And
3:40
then integrating that with something like Spark
3:42
that we were using for data processing was
3:44
just, like, forget
3:45
about it.
3:46
So what I realized was that if I wanted
3:48
to solve ML problems, I really needed
3:50
to start with the ML tooling and
3:52
infrastructure. And so I joined the
3:55
Michelangelo deep learning team.
3:57
And while I was there working
3:59
on this horizontal
3:59
problem, you know,
4:01
we worked with a lot of customers that
4:03
had very similar patterns that emerged
4:06
to mine where
4:08
we
4:08
often found that there was this struggle
4:10
to get something that was productionizable,
4:13
like, at scale. Right? And there is
4:15
also a repeating pattern
4:17
of there
4:18
just being, frankly, more ML problems
4:21
than there was ML expertise at the company.
4:23
And so from a kind of horizontal platform
4:26
standpoint of Uber AI,
4:28
tasked with figuring out, you know, how can
4:30
we help get models to production
4:32
faster, we kind of
4:33
realized that there was a need for better
4:36
abstractions, better infrastructure
4:38
more generally. And
4:39
so the tool Ludwig
4:42
that my co-founder, Piero, put together,
4:44
ended
4:44
up being a perfect encapsulation of
4:46
that vision of being able to say,
4:48
you know, let's let researchers build
4:51
state of the art models and put them
4:53
as components into this framework. And
4:56
then that gives the vertical teams in the
4:58
company this very easy declarative
5:00
interface to just kind of swap
5:02
in and out different components for their
5:04
data without having to rewrite,
5:06
you know, huge swaths of python
5:08
code every time. And we
5:10
realized that this was like a very successful pattern
5:12
for Uber. And at the same time, you know,
5:15
Ludwig became open source, and we saw
5:17
that it resonated very strongly with the community
5:19
and realized that there was like a very real
5:21
need for this kind of better
5:23
abstraction layer, if you will, in
5:25
the industry,
5:26
and decided to form Predibase with the
5:28
intent of pushing the
5:30
state of the art forward in terms of
5:33
what kinds of tools data science
5:35
and machine learning teams have available
5:37
to them, to make them more productive and
5:39
kind of decrease the time to value of machine
5:41
learning in the enterprise. In
5:43
terms of the audience that you're
5:45
focused on, particularly given
5:47
that you're very early in your journey
5:49
of being a new company and starting
5:51
to work with some of the first
5:53
sets of customers, I'm curious
5:55
how you think about the
5:57
priorities that you're trying to support
5:59
and how that
5:59
influences the areas of focus
6:02
that you're investing in and the user
6:04
experience and feature development that you're prioritizing?
6:07
Yeah. So we like to say that we
6:10
don't expect at Predibase that we're gonna be
6:12
your first machine learning model that you've
6:14
ever trained. Right? So oftentimes, we're
6:16
coming into organizations that
6:18
have a lot of machine learning problems that they
6:20
wanna solve. Maybe they have some
6:22
kind of horizontal team that's focused
6:24
on trying to build out a platform for
6:27
doing machine learning. And maybe
6:29
they've tried using some AutoML tools
6:31
in the past and have struggled
6:33
them into production. And so what
6:35
we identify is, you know, we
6:37
see these teams that have struggled in this way
6:39
And the value proposition that we
6:41
wanna bring to them is to say, you know, if
6:43
you've used, like, you know,
6:45
some traditional ML systems in the past,
6:47
like Spark and MLlib or what have you, and
6:49
you're struggling to kind of up
6:51
level to deep learning and more state
6:53
of the art techniques in the industry, we
6:56
provide, like, a platform that gives you those
6:58
capabilities and a form factor that's, like,
7:00
much more familiar to you. And if you're
7:02
struggling to kind of keep up with the amount
7:04
of problems that the organization has, like,
7:06
maybe you have teams of engineers
7:08
that have ML problems that maybe are
7:10
not the top priority for the whole company,
7:13
but a very important priority for that
7:15
team. Predibase provides a
7:17
platform that allows
7:19
those teams to be unblocked and allows
7:21
everyone in the organization to collaborate
7:23
together
7:24
towards building these solutions. And
7:26
so this focus on you
7:29
know, collaboration and kind of
7:31
mixed modality, like, you know, very
7:33
broad set of tasks that people might wanna solve.
7:35
Those are very core focuses for
7:37
us when we look at companies that we
7:39
wanna partner with at this stage.
7:41
One
7:41
of the interesting things that you mentioned is
7:44
that you're working with companies who have
7:46
a lot of machine learning problems
7:48
to solve. And I'm wondering if you can
7:50
talk to what that really
7:52
means? Like, how you can identify that a
7:54
problem that you have is a machine learning
7:56
problem or whether machine learning is the
7:58
right approach to being
7:59
able to provide value and
8:02
utility for a given
8:04
objective that you're trying to achieve?
8:05
I
8:06
would say that it comes in a few
8:08
different flavors. Like on the one hand, you
8:10
have kind of traditional
8:13
data warehouse type systems
8:15
that have tables that have historical
8:17
data or transactional data And
8:19
so very often, the story there is
8:21
people wanna do some kind of predictive analytics.
8:24
Right? So we know who
8:25
churned last month, you know, we wanna
8:27
predict who's gonna churn next month. So
8:29
that kind of forward looking predictive
8:32
capability is, like, one,
8:34
like, type
8:34
of problem that we see a lot with companies
8:37
that
8:37
fits very nicely into machine learning.
8:39
So you have a lot of data in your
8:41
database. You wanna be able to, you
8:43
know, predict or, like, make forward
8:45
looking statements about that data.
8:47
That's one area where we saw it.
8:49
But beyond that kind of structured
8:51
problem, there's also this question of
8:53
unstructured data as well. Right? And so you have a
8:55
lot of companies that have text
8:57
data or image data or audio data
9:00
sitting around that they've collected, and they
9:02
don't really know what to do with it. And it's
9:04
maybe not so much a question of, you know,
9:06
I have data that says what customers
9:08
submitted in support tickets in the past, I
9:10
want to predict what support tickets are going to say in the
9:12
future. It's not anything like that, but it's more
9:14
about just understanding semantically
9:16
what's going on in the data and then how
9:18
that can be used to better
9:20
inform the predictive forward looking models that
9:22
we want to build on
9:23
that more transactional data.
9:25
So this idea of unlocking the power
9:27
of unstructured data is another really core one. And
9:29
so one of the things I think is
9:31
very unique about Predibase and Ludwig is
9:33
the ability that we can kind of bridge
9:35
this gap between structured and unstructured
9:37
data. So the platform is very flexible.
9:40
It's data oriented in such a way that
9:42
if you have some transactional
9:44
tabular data and you also have some
9:46
unstructured image or audio or
9:48
text data, those can be combined
9:50
together into a unified machine
9:52
learning model in a very simple and straightforward
9:55
way. And so we can unblock
9:57
organizations that have all this disparate
9:59
data and they
9:59
want to drive value from it, but they haven't
10:02
figured out how in an effective way.
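The unified-model idea described above can be made concrete with a minimal sketch of the kind of declarative config Ludwig uses to combine structured (tabular) and unstructured (text, image) inputs in one model. The feature names and schema here are invented for illustration, not taken from the episode or the actual Ludwig documentation.

```python
# A minimal sketch of a declarative config mixing tabular and unstructured
# inputs. All feature names below are hypothetical.
config = {
    "input_features": [
        {"name": "age", "type": "number"},                # tabular column
        {"name": "plan_tier", "type": "category"},        # tabular column
        {"name": "support_ticket_text", "type": "text"},  # unstructured text
        {"name": "profile_photo", "type": "image"},       # unstructured image
    ],
    "output_features": [
        {"name": "churned", "type": "binary"},            # prediction target
    ],
}

# Every input is declared the same way regardless of modality, so mixing
# data types is a config change rather than a code rewrite.
modalities = sorted({f["type"] for f in config["input_features"]})
print(modalities)  # ['category', 'image', 'number', 'text']
```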
10:04
That's
10:04
where we think there's a lot of power for machine
10:06
learning, particularly in Predibase to kind of
10:08
slot in. Yeah. One
10:09
of the ways that I've seen those
10:11
different applications categorized is
10:13
the difference between predictive
10:15
analytics, which is the first category
10:17
that you mentioned versus prescriptive
10:19
analytics of this is what you should
10:21
do and then descriptive analytics
10:23
to say, I just wanna understand what
10:25
this is trying to tell me.
10:27
Right. Right. Absolutely. I think that
10:29
descriptive component is
10:30
one that not a lot of people have tapped
10:33
into. In
10:33
terms of the way
10:36
that you have formulated the product
10:38
that you're building at Predibase, you're positioning
10:40
it as declarative ML. And
10:43
you've mentioned earlier that organizations may
10:45
have had experience trying to use
10:47
the category of tools
10:49
called AutoML. And I'm wondering if you can
10:51
just talk to the differences in
10:54
the nomenclature as far
10:56
as what that really means and
10:58
how the sort of expectations are
11:01
different between an AutoML
11:03
category of tool and a declarative
11:05
ML category of tool. Absolutely. Yeah.
11:07
I think this is like a very
11:09
key differentiator between how
11:11
we're thinking about the problem and how a lot of
11:13
other companies out there are. So
11:15
the way that we think about it is that at a
11:17
high level, there are very similar capabilities
11:19
in terms of being able to
11:21
put ML in the hands of non experts at
11:23
kind of the starting point. But we believe that
11:26
declarative ML provides a
11:28
more principled and flexible
11:30
path forward. So whereas a lot
11:32
of AutoML solutions, I think, in
11:34
its current state, AutoML often
11:36
becomes this kind of kitchen
11:38
sink style approach where the
11:40
system will throw everything it can at the
11:42
problem that it can think of. And
11:44
if something works, then great. You
11:46
know, you kinda take that baton and run
11:48
with it. If something doesn't work,
11:50
then there's not really a lot of options
11:52
that you have in terms of what's going to happen
11:55
next, like how someone who is maybe
11:57
a domain expert or a data expert
11:59
can come in
11:59
and kind of help out with, you know,
12:02
unblocking things
12:03
Where we think declarative ML provides
12:05
a difference here is
12:07
that because it gives you this
12:09
very complete specification,
12:12
you can start at something very high level and say,
12:14
you know, I just want to predict this target
12:16
given these input variables. And
12:18
you can get a baseline from there.
12:21
But
12:21
it's not the end of the story. And so because it's
12:23
very explicit about saying, here's everything that
12:25
the system did, and you can
12:27
modify and customize any
12:29
aspect of this down to individual,
12:32
you know, layers of a neural network.
12:34
Right? It allows people to
12:36
then iterate on these systems, on
12:38
these implementations over time.
12:40
and kind of build towards
12:43
a working solution in a more
12:45
principled way. So for example,
12:47
they can say, oh, well, I initially
12:49
tried building this model with this set of
12:51
parameters. And then for V2, I
12:53
swapped out this model architecture and
12:55
I changed the learning rate from this
12:57
to this. And it gives you that audit
12:59
trail of being able to say, here are all the things I
13:01
tried, here's what changed, and
13:03
here's the effect that that had on
13:05
model performance. And so we
13:07
think that this is also very
13:09
powerful for enabling collaboration.
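The audit trail described above follows directly from the configs being complete specifications: the delta between any two model versions can be computed mechanically. The sketch below is illustrative only; the parameter names and values are invented, and this is not Predibase's actual implementation.

```python
# Two hypothetical versions of a declarative model config.
v1 = {"encoder": "parallel_cnn", "learning_rate": 0.001, "epochs": 10}
v2 = {"encoder": "bert", "learning_rate": 0.0001, "epochs": 10}

def config_diff(old, new):
    """Return {param: (old_value, new_value)} for every parameter that changed."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}

# The diff is exactly the "here's what changed between tries" record.
print(config_diff(v1, v2))
# {'encoder': ('parallel_cnn', 'bert'), 'learning_rate': (0.001, 0.0001)}
```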
13:12
So if, you know, someone who is an
13:14
engineer maybe wants to train
13:15
their first model in
13:16
Predibase can do that without having to know a
13:19
lot of details about the
13:21
how of what's going on under the hood.
13:23
But if they get a solution and then they want
13:25
to maybe have a more expert data
13:27
scientist, take a look and say, you know, what do
13:29
you think we should try in order
13:31
to get better performance? They can take a
13:33
look at the config and say, oh, well, you
13:35
know, I see that you're using this parameter here.
13:37
Let's maybe try swapping that out,
13:39
see what happens. And so
13:42
it gives you that ability to make
13:44
these incremental changes in a very simple
13:46
way that if you were kind of
13:48
going down to just pure low level
13:50
tools like PyTorch or TensorFlow, it'd
13:52
be much more difficult to do that because
13:54
you'd be having to ship over entire
13:57
Jupyter notebooks and Python
13:59
libraries and, you know, what's the
14:01
execution environment for all this?
14:03
It's much more difficult for someone to just
14:05
quickly take a look and kinda provide
14:07
feedback and kind of next steps.
14:09
And we also believe that this
14:11
ties in very nicely to our
14:13
version of AutoML, which we
14:15
would call iterative ML,
14:17
where we see it being
14:19
much more of a conversation that you're having
14:21
with the system where you try
14:23
something out, the system can
14:25
propose some new things to
14:27
modify in the specification for
14:29
the model. You can choose to either accept
14:31
or reject any of those things, train for
14:33
a little bit more and then use the results of
14:36
that previous run to
14:38
then inform what you're gonna try next. So it
14:40
becomes a very human-in-the-loop back
14:42
and
14:42
forth process
14:43
that progresses in
14:45
a way that we think is much more like how
14:47
traditional software development is done, right, where you
14:49
have a Git repo, make some
14:51
code commits. And over time, you
14:53
can see the code evolve and change to
14:55
kind of better conform to the end
14:57
state, as opposed to just trying
14:59
to get it all out there at once and, you
15:01
know, not have any way of knowing, you
15:04
know, what was the history? What was the
15:06
effect of every change that we did?
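The accept/reject loop described above can be sketched as a toy function. The names here are hypothetical, not the Predibase API: the system proposes config changes, the user accepts or rejects each one, and the accepted changes produce the next config to train.

```python
# Toy sketch of one step of the iterative, conversational loop.
def next_config(config, proposals, accepted):
    """Apply only the accepted proposals to produce the next iteration."""
    updated = dict(config)
    for param, value in proposals.items():
        if param in accepted:
            updated[param] = value
    return updated

config = {"learning_rate": 0.001, "batch_size": 128}
proposals = {"learning_rate": 0.0005, "batch_size": 256}  # system suggestions
# The user accepts the learning-rate change but rejects the batch-size one.
cfg_v2 = next_config(config, proposals, accepted={"learning_rate"})
print(cfg_v2)  # {'learning_rate': 0.0005, 'batch_size': 128}
```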
15:08
In this
15:08
category of declarative ML,
15:11
another company that I've seen using
15:13
that terminology is Continual,
15:15
which is based on being able
15:17
to build machine
15:19
learning pipelines on top of your data
15:21
warehouse so that you can just treat
15:23
your machine learning workflow as SQL,
15:25
effectively. I'm wondering if you
15:27
can just characterize the
15:29
relative strengths and use cases of what you're
15:31
building at Predibase versus what they're
15:33
building at continual? Yeah.
15:34
Absolutely. So I guess the first thing I'd say
15:37
is that, you know, in general, we're very glad
15:39
to see other companies kind
15:41
of validate the idea behind
15:43
declarative ML. You know, from following
15:45
the work that Tristan and the folks at
15:47
Continual have been doing, it's always
15:49
been very nice to see that they've referenced our
15:51
work on Ludwig and Overton. So
15:53
one of our co-founders, Chris Ré,
15:55
had a company called Lattice that
15:57
had a product called Overton that was acquired
15:59
by Apple, which was another early
16:01
declarative ML system. And so I think in
16:03
general, there's, like, a really good shared
16:05
vision of kind of moving the conversation
16:07
forward about better abstractions in ML.
16:09
So I think there's definitely an element
16:11
of, you
16:11
know, rising tides, you know, raising all
16:14
ships to it. Where
16:15
I think their differences would be
16:17
they definitely are very linked into the
16:19
kind of modern data stack operation
16:21
side of things. I think
16:23
that their value proposition resonates very
16:25
nicely with people who are
16:27
active dbt users, for example. That's
16:29
a big part of kind of how they're approaching
16:31
the problem, which I think is a totally valid
16:33
way to think about it. For us,
16:35
we definitely think that we can do
16:38
a lot not just on the operation
16:40
side, but on the model development
16:42
side as well. So with
16:44
Ludwig, we provide a framework
16:46
that is also pushing forward the
16:48
state of the art of what ML
16:50
models can do. Right? And so that's
16:52
a big part of the story for
16:54
us is trying to figure out how do
16:56
we help users get good
16:58
models in the first place and do it in a
17:00
way that is very low barriers
17:02
to entry, but very high performance
17:04
and high ceiling. Right? We also believe
17:06
the operations component is a big part of
17:08
that, but it's not the only part. I think there's still
17:10
a lot of work that needs to be done on just
17:13
getting to a good model that you wanna put in
17:15
production. And so
17:16
that's where I think Ludwig
17:18
is a bit different from what some of the other tools
17:20
out there provide, and that's where
17:22
we're
17:22
also tackling that aspect of the problem.
17:24
And as far
17:25
as the implementation and
17:28
architecture of what you've built
17:30
at Predibase, can you talk to the
17:32
overall system design and the ways
17:34
that you've thought about the
17:36
architecture of how to
17:38
approach this problem of making
17:40
declarative ML accessible
17:42
and easy to operate so
17:44
that teams who don't necessarily
17:46
want to invest in building
17:48
the entire ML ops stack
17:50
can be able to pick it up and run with
17:52
it and start to gain value very
17:54
quickly. Predibase
17:54
is built as a multi
17:57
cloud native platform. We're built
17:59
on top of Kubernetes.
18:01
And so, you know, we have deployments
18:03
that run on AWS, Azure, GCP,
18:05
and also on some on premise
18:07
Kubernetes clusters. So we believe that that's
18:09
like a very core part to make
18:11
it flexible so that wherever your data
18:13
happens to live, you know, we can
18:16
push the compute down to be as close
18:18
to that data as possible to
18:20
minimize the latency and minimize
18:22
the egress costs and all of those things.
18:24
So that's
18:24
a very core part of how it's architected.
18:27
We
18:27
also have a separation between
18:29
the control plane and data plane of
18:32
the system. Our data plane
18:34
is built on top of the Ludwig
18:36
and Ray
18:36
open source work that we've done.
18:39
So we use Ray for doing the
18:40
distributed aspect of, you know,
18:43
scaling to large datasets and
18:45
parallelizing the work. We
18:46
use Horovod for doing distributed data
18:49
parallel training. And then
18:50
we also have a serving layer
18:52
as well that's built on top of that.
18:54
And then
18:55
we also have a separate control
18:57
plane that provides a serverless
18:59
abstraction layer on top of this data
19:01
plane so that from the user
19:04
perspective, they don't need to be as concerned
19:06
about provisioning Ray
19:08
clusters that run and, like, how to
19:10
right size it for the workload and
19:12
whether I wanna use this GPU
19:14
or that GPU. So that's a
19:16
big aspect of what we provide on
19:18
the infrastructure side is this kind
19:20
of intelligent provisioning and life
19:22
cycle management of the compute
19:24
resources and making sure that these
19:26
long running training and prediction
19:28
workloads can be processed end to
19:30
end in an efficient way. And then,
19:32
of course, there's a whole another serving stack
19:34
as well that we're building out that's built
19:36
on top of NVIDIA
19:37
Triton, and we'll have a lot, hopefully,
19:39
to say in terms of our work
19:41
there, with some blog posts coming out in the
19:44
future. But that's
19:45
something that we're also looking to push into the open
19:47
source to some extent as well as some
19:49
of the
19:49
serving capabilities for Ludwig
19:52
that we're bringing to
19:53
the enterprise.
19:54
As you started to go down the
19:57
path of starting to build out this
19:59
platform and explore the
20:01
capabilities that you wanted to offer, I'm
20:03
wondering how the initial design
20:05
and ideas and vision around where
20:07
you wanted to end up have
20:10
shifted and evolved and some of
20:12
the directions that you have
20:14
moved in order to be able to
20:16
accommodate some of the early feedback that you've gotten as
20:18
you work with design partners and
20:20
to some of the overall evolution of
20:22
the platform as you started to dig
20:24
deeper into this space? That's
20:26
a great
20:26
question. So definitely, we had a certain
20:28
set of assumptions coming in about
20:30
what the market was looking for and kind of
20:33
what level the user wanted to think about the
20:35
problem. Right? Whether this was a
20:37
production problem first and foremost for
20:39
them, a
20:39
research problem, or somewhere in
20:42
between. Right? And so
20:43
we definitely had a very early focus
20:45
on thinking a lot about the analyst
20:48
use case, and
20:48
how, you know, there are people who have data
20:50
but don't
20:51
have a background in ML who want to
20:53
be, you know, up leveled with that
20:56
capability. And so we thought a lot you
20:58
know, making it kind of operations
21:00
and production oriented to begin with. What
21:02
we found in kind of the early working with early
21:04
customers is that there's still a lot
21:06
of interest in, you know, the model development
21:09
aspect. And, you know, you're never going to, at
21:11
least with how
21:12
AI and machine learning work today,
21:14
get that perfect model every
21:16
time, like, without any kind of
21:18
manual intervention or kind of domain expertise.
21:21
And
21:21
so we definitely, from a very early
21:23
on working with customers realize that
21:25
having a teaching
21:26
element of the platform was very
21:29
important as well: explaining,
21:31
you know, here
21:31
are the options you have available to you
21:34
in this declarative specification. Here's how
21:36
you
21:36
should think about using one option
21:38
versus the other and what the appropriate ranges
21:41
are, and how you should go about doing model development in
21:43
terms of starting with, you know, a really
21:45
complex
21:45
model or baseline,
21:48
understanding
21:48
how the model
21:50
performed kind of in a post hoc way and
21:52
saying, you
21:52
know, these were the features that
21:55
contributed the most to the model's
21:57
predictions, or these were labels that
21:58
maybe, you know, were
21:59
imbalanced in some way, and so there are
22:02
other corrections we need to make. So
22:04
definitely
22:04
that aspect of iteration
22:07
and instilling kind of machine learning
22:09
best practices is something
22:11
that has been a
22:13
learned experience. So, you know, one of the
22:15
most recent additions we made to the
22:17
platform just before coming
22:18
out of stealth was investing really
22:20
heavily in a Python SDK
22:24
similar to what the Ludwig Python SDK
22:26
does, but provides it with some more
22:28
enterprise features that
22:30
really make it well integrated into
22:32
a data science stack where you're able
22:35
to experiment
22:36
with data, experiment with models,
22:38
iterate, as opposed to just going straight
22:40
toward the production model
22:42
from day one. Right? So that was definitely
22:44
something we learned earlier on
22:46
in the process of working with customers.
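The experiment-first workflow a Python SDK enables can be sketched very roughly. The class and method names below are invented for illustration; they are not the actual Predibase SDK API.

```python
# Purely illustrative stub of an experiment-first SDK workflow: try several
# configs, keep a history, and only promote a model to production later.
class ExperimentSession:
    def __init__(self, dataset):
        self.dataset = dataset
        self.history = []  # every config tried, preserved as an audit trail

    def train(self, config):
        # A real SDK would launch remote training here; this stub just
        # records the attempt so the iteration history can be inspected.
        self.history.append(config)
        return {"config": config}

session = ExperimentSession(dataset="churn_table")  # hypothetical dataset name
session.train({"target": "churned", "learning_rate": 0.001})
session.train({"target": "churned", "learning_rate": 0.0001})
print(len(session.history))  # 2
```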
22:48
And it's
22:48
also interesting to think about which
22:52
areas of the overall machine
22:54
learning problem and life
22:56
cycle you're looking to be
22:58
able to facilitate because
23:00
there are, you know, boundless
23:03
capabilities where, you know, there's the
23:05
experiment tracking and model
23:07
tracking. There's the model
23:09
monitoring to be able to understand, you know,
23:11
what concept drift is happening once it's
23:13
in production. You know, there's the
23:15
kind of pipelining. So
23:17
there are definitely a number of different
23:20
areas that you could try to focus on, a number of
23:22
different directions you could try to push
23:24
into. I'm curious given the fact that you were
23:26
starting with Ludwig as
23:28
the kind of core building
23:30
block, how that helped to
23:32
shape your overall consideration
23:34
of the appropriate scope for what you're
23:36
trying to build and how you were thinking about
23:39
what are the maybe boundaries
23:41
and interfaces that you
23:43
want to incorporate to be able
23:45
to let Predibase fit
23:47
into the overall workflow
23:49
and life cycle of machine learning
23:51
in an organization, while being able
23:54
to be very opinionated and
23:56
drive the conversation around the
23:58
areas that you wanted to own. It's
24:00
a really good question, I think, because this,
24:02
I think,
24:02
is at the core of the problem of
24:05
startups in the space trying to define
24:07
their category. Right? Because
24:09
I think when you look at the space in
24:11
general, what you see is that there are
24:13
a lot of really good tools that are
24:15
what you might call point solutions that
24:17
are solving one aspect of the problem,
24:19
whether it's
24:20
explainability, model training,
24:22
model serving, what have you. Right?
24:24
But,
24:24
really, what organizations need is
24:26
something that is end to
24:29
end. Right? That is actually a platform
24:31
that is fundamentally delivering
24:33
business value, not just, you know,
24:35
model training or something like that. So the
24:37
way that we think about it broadly
24:40
is that we want
24:42
to be able to provide a
24:44
story that goes from data to
24:46
deployment for the users. So
24:48
connecting data from your
24:50
data warehouse or database,
24:51
providing best in class model
24:54
training at scale with the
24:56
serverless infrastructure and
24:57
then providing a really clean and simple
25:00
path to deployment that can be
25:02
either a REST API for
25:04
low latency real time prediction, or
25:07
PQL, a SQL-like language
25:09
that provides batch prediction capabilities
25:11
to the user. And starting
25:13
with that core vision of, like, this
25:15
is the journey for the user. There are, as you
25:18
said, a lot of other aspects to it as
25:20
well, like, you know, model explainability, data
25:23
preparation, and data quality and data
25:25
versioning, model monitoring, model
25:27
drift detection. And the way we're thinking
25:29
about these things today is pretty
25:31
similar to how we think about them actually on the
25:33
open source side with Ludwig, where we
25:35
want to try to be as
25:37
integrated with the community as possible.
25:39
Right? So in Ludwig, we have integrations
25:41
with Comet ML, Weights and
25:43
Biases, WhyLogs, ML
25:46
flow, and others that we're
25:47
working on. And
25:48
so these tools, you know, provide,
25:51
like, different capabilities of different
25:53
parts of the process. Right? Like, experiment
25:55
tracking or
25:56
model monitoring.
25:58
But the way we wanna think about it
25:59
is if you already have a
26:02
tool that you like for these
26:04
problems. Like, we don't want to have to say Predibase
26:06
is a rip
26:07
and replace solution for you. Right?
26:09
We wanna be well integrated. So
26:11
if you
26:11
wanna use weights and biases or
26:14
Comet, you know, you
26:14
just give us an API key and we'll log
26:16
things there, and then have a nice way to
26:18
link back and forth between the two. or
26:20
if you're using WhyLogs slash Why
26:23
Labs for doing model
26:25
monitoring, you know, we're thinking
26:26
about ways that we can integrate there
26:28
to do automated model retraining based
26:31
on triggers that come
26:33
from
26:33
WhyLabs. So that's the way we're thinking
26:35
about the problem today is let's integrate as
26:37
much as possible in the parts of the platform
26:39
where we don't feel we're providing
26:41
strong differentiated value or that we
26:43
could provide, like, a best in class
26:45
value proposition on,
26:46
while still telling the user, like,
26:49
hey. If you are starting from scratch,
26:51
right, and you don't have an ML
26:53
platform today, Predibase isn't a point
26:55
solution. It's something that we'll give you end to
26:57
end from the
26:58
data to a deployed model that can
27:00
start delivering value. And then you can layer on
27:02
more
27:02
tooling on top of that, you know, as we
27:05
see fit. And
27:06
so in terms of that workflow,
27:08
in the case where somebody is greenfielding.
27:10
They say, I want to adopt ML. This
27:12
is my first foray into that. I'm going
27:14
to use Predibase to be able to experiment with
27:17
how can I take this data that I have and turn it into
27:19
something useful that I can do with it. Just wondering
27:21
if you can just talk through that kind of
27:23
end to end workflow of starting with
27:25
the data and ending with I have
27:27
a model running in production and I'm doing something
27:29
with it. In the tool,
27:31
there are different ways that users can do it.
27:33
We do have a web UI that people
27:35
can use to do all
27:37
the actions. Everything that you
27:38
can do in the platform can also be done
27:40
through our SQL-like language, PQL,
27:43
as well as the Python SDK.
27:45
So we
27:45
have many different views depending on the
27:48
persona that's using the platform that do the
27:50
same thing. But
27:50
regardless of which entry point you choose
27:53
to use, the steps are largely the
27:55
same as
27:55
you first start with the data.
27:58
So if you have your data in
27:59
Snowflake or s three or big query
28:02
or whatever, you
28:02
just give us some credentials, point
28:05
us to what table or what bucket
28:07
you're interested in working with,
28:08
and then we can start with any data
28:11
that is structured in some kind of
28:13
table like form. Right?
28:15
So that can be an actual database
28:17
table. That
28:17
can also be a Parquet
28:20
file, CSV, anything like that. And then,
28:22
you know, maybe a question that comes after
28:24
that is, what if I wanna use unstructured
28:26
data like images or audio?
28:29
So
28:29
the way we think about that today is that
28:31
give us the URLs to
28:33
those images or those audio files as
28:35
columns in your tabular data, and
28:37
then we can pick those up and join all that
28:39
together into, like, a single flat
28:42
tabular view for training. Right? Once you've
28:44
pointed us to the data that you wanna
28:46
work with, we automatically do
28:48
all the metadata extraction and
28:50
schema extraction from the data for you.
28:52
So we know, you know, what
28:53
data types the data is and
28:56
that sort of thing. And then you can
28:57
start creating models. So you, you know,
28:59
go into the model builder UI or
29:01
use the SDK to build a model in
29:03
a way that's very similar to how you do it in Ludwig. And
29:06
all you need to specify
29:06
to get started is just the target or
29:09
targets that you wanna predict,
29:11
since we support multitask
29:13
learning.
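The declarative setup described here mirrors Ludwig's config format, where naming the input fields and the target is enough to start training. A minimal sketch, with made-up column names:

```yaml
# Minimal Ludwig-style declarative config (column names are illustrative).
# Only the inputs and the target(s) need to be named; preprocessing,
# architecture, and training details all fall back to defaults.
input_features:
  - name: review_text
    type: text
  - name: account_age_days
    type: number
output_features:
  - name: churned        # the target to predict
    type: binary
```

Multitask learning, as mentioned, would just mean listing more than one entry under `output_features`.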
29:13
From there, you can customize any aspects
29:16
of the training through either
29:17
layering on full
29:19
kind of like hyperparameter optimization,
29:22
AutoML suggested, like,
29:24
configurations for everything, or start
29:26
with
29:26
just a very simple baseline. Either
29:29
way, you can go, like, any level of
29:31
the extremes. Right? Or any level of customization
29:34
in between, and
29:34
then start training a model. Once you
29:37
start training a model, it will be sent to one
29:39
of the what we call engines. So that's one
29:41
of our serverless clusters that
29:43
does the computation for you. That
29:45
lives, you know, wherever your data happens to
29:47
live, in the same kind of
29:49
region. Right? Model will get trained.
29:51
And from there, you can start using
29:53
PQL or the Python SDK to
29:55
validate it. We also provide a full set of visualizations for
29:57
the user to explore in terms
29:59
of,
29:59
you know,
30:00
understanding the explainability of, like, feature
30:03
importance. We also have
30:05
calibration plots, all sorts of
30:06
other things like confusion matrices,
30:09
etcetera, that you can dig into. And
30:11
From there, you can either iterate on the model,
30:13
continue to develop in kind of
30:15
an incremental way with a fully kind
30:17
of versioned process and lineage,
30:20
And
30:20
then once you're happy with the model, there's kind of a one click
30:22
deployment that we have where you
30:24
can deploy it to a rest end point and
30:26
then start curling it with, you
30:28
know, JSON objects as you
30:30
see fit. And then if
30:31
you'd like to, you know, retire the model
30:33
or replace it with a new model version, it's
30:35
a similar one click kind of deployment process
30:38
as well. And
30:39
then there's, of course, ways that you can automate this as well
30:41
to do retraining
30:43
as well as do validation
30:45
to determine when you want
30:47
to trigger redeployment.
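The validation-gated redeployment just described can be sketched as a simple metric comparison in plain Python; the function name and thresholds here are illustrative, not the actual Predibase API:

```python
def should_redeploy(candidate_metric: float,
                    incumbent_metric: float,
                    higher_is_better: bool = True,
                    min_improvement: float = 0.0) -> bool:
    """Gate redeployment: only promote the retrained model if it beats
    the currently deployed one on the held-out test set."""
    if higher_is_better:
        return candidate_metric > incumbent_metric + min_improvement
    return candidate_metric < incumbent_metric - min_improvement

# e.g. accuracy on the held-out set
print(should_redeploy(0.91, 0.88))   # new model wins -> True
print(should_redeploy(0.86, 0.88))   # regression -> False
# e.g. loss, where lower is better
print(should_redeploy(0.12, 0.15, higher_is_better=False))  # True
```

An automated retraining pipeline would run this check after each scheduled retrain and only swap the REST endpoint over when the gate passes.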
30:49
Right? So if you say, I have a
30:51
held out test data set, I only want to redeploy
30:53
when the, you know, new model does
30:55
better than the old model on that data set,
30:57
you know, that's
30:58
something that you can configure with
31:00
the platform as well. And
31:01
at a high level, that's the journey that
31:03
we see as being the core flow that
31:05
the user wants to go through. So connect data, train model,
31:08
deploy. And then there's a lot in
31:09
between, of course, that kind of fills in the gaps,
31:11
but that's fundamentally what the
31:14
platform provides. And you
31:15
mentioned the PQL dialect a
31:17
couple of times, and I noticed
31:19
that when I was going through some of
31:21
the blog posts and some
31:23
of the early material that you have about what you're building at Predibase.
31:26
So I'm wondering if you can speak
31:28
to some of the motivation
31:30
behind creating this new dialect
31:33
and this new, I guess, language you could call it, and
31:35
some of the ways that you think about the
31:37
semantic and syntactic design
31:40
of it. Yeah. So
31:40
with PQL, we think it's a
31:42
very natural way to extend the
31:45
declarative idea because I think it plays
31:47
very nicely into the
31:49
way that we think about ML systems today compared
31:52
to databases a few decades
31:54
back, right, where you
31:55
had lower level languages,
31:57
like COBOL, that people would be writing,
31:59
would
31:59
be interacting with databases, and
32:02
then SQL comes along and provides this
32:04
very nice declarative way
32:06
of expressing all sorts of complex
32:08
data analysis that you might wanna do.
32:11
And we
32:11
see PQL as being the natural extension
32:13
of that idea to the ML domain.
32:16
where, since you already have this declarative
32:19
specification that provides a very tight
32:21
semantic link between
32:23
your data fields, the fields of your
32:25
data set, and
32:26
the fields that are the inputs and outputs of your model
32:28
and then everything that happens in between.
32:30
PQL
32:30
provides a very natural way
32:33
to express you
32:35
know, the model prediction request that
32:37
you might wanna do. So what I think is
32:39
very powerful about PQL is that you
32:41
can do something as complex as
32:43
a batch prediction over, you know,
32:45
a ten terabyte data set
32:48
using
32:48
some model that you wanna write out
32:50
to a downstream table. Normally,
32:52
you would end up writing like an ETL job
32:54
in Spark to do something like this, but
32:56
that's just a one-line PQL query, which
32:58
would be predict
33:00
target, given, select
33:02
star from data or whatever. Right? And you
33:04
can, of course, then do all
33:06
sorts of more complex things from there
33:08
in terms of joining
33:09
tables across different data sets, filtering them,
33:12
doing sliced analysis, you have ways of
33:14
doing what we call hypothetical
33:16
queries
33:16
that are kinda similar to
33:18
what you might do for real time prediction where
33:20
you want to
33:21
take like an entirely new data point
33:23
and then express it as a query that then can
33:25
be predicted on. And so I think,
33:27
you know, certainly one powerful use case
33:29
of PQL is this idea of a
33:32
more efficient way of doing batch prediction that
33:34
fits in nicely with other tools
33:36
that do ELT like
33:38
dbt is a really good example there
33:40
where we already have a dbt integration
33:42
that we've written, that some of our users are
33:44
using. And so if you want to
33:45
be able to express your
33:48
prediction pipeline as
33:50
SQL, essentially. Like, PQL provides a
33:52
very natural way to do that. But
33:54
we
33:54
also think that PQL is a very
33:56
powerful enabler of putting
33:59
ML
33:59
inference in front of, like, letting people interact with
34:02
and understand the model: stakeholders
34:04
who might not today
34:05
ever really interact with an ML
34:07
model. Right? So anyone now who
34:10
understands SQL can start
34:12
making predictions start to play around
34:14
with understanding, like, what do I have to change in
34:15
this input to make the model predict something else? It's
34:17
a very fun kind of just interactive process
34:19
that users can go
34:22
through. And
34:22
these sorts of people careers are a very nice sharing
34:25
point as well. So if the data scientist has a model
34:27
that they've trained and the
34:29
analyst wants to play
34:31
around
34:31
with it or wants to see the result of
34:34
some prediction on some slice of
34:36
data, you just need
34:36
to share that PQL query with them
34:39
and say, hey, go run this and, you know, let
34:41
me know how it goes instead of having
34:43
to ship whole notebooks and
34:45
Python files and whatever.
34:47
So
34:48
that's kind of where we see the value of PQL: as batch prediction,
34:50
as well as for kind of
34:52
pipelining and doing, you know, ELT-type
34:56
workloads, as
34:56
well as this kind of shareability of making ML inference
34:58
more accessible to the broader organization.
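Based on Travis's spoken description, a PQL batch prediction reads roughly like the sketch below; the table and column names are made up, and the exact grammar should be checked against the Predibase docs:

```sql
-- Hypothetical PQL: batch-predict a target over an existing table,
-- the declarative equivalent of a bespoke Spark ETL job.
PREDICT churned
GIVEN SELECT * FROM customer_events;

-- A "hypothetical query": prediction on an entirely new data point,
-- expressed inline, similar to a real-time prediction request.
PREDICT churned
GIVEN account_age_days = 30, review_text = 'great service';
```

The first form is what a data scientist might hand to an analyst to run over a slice of data; the second is the interactive what-do-I-change-to-flip-the-prediction style of exploration he mentions.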
35:01
Noting the
35:02
PQL acronym, my
35:06
initial thought when I first read that in the blog post was, oh, obviously,
35:09
the P stands for Predibase, but
35:11
it stands for predictive. And
35:14
so I'm wondering if you can talk to what your overall vision
35:16
is for this syntax and
35:18
if you intend for it to
35:21
be something that is maybe
35:24
adopted outside of Predibase as
35:26
kind of a general standard for this
35:28
means of interacting with machine
35:30
learning and just some of the overall vision
35:33
there. So the interesting
35:33
thing there is that the name PQL
35:35
actually predated the name Predibase.
35:37
So we had the
35:39
name PQL in mind before we came up with Predibase.
35:42
But to
35:42
your point, I definitely do
35:45
see PQL being something larger
35:47
than Predibase in a lot of
35:49
ways. Like, we want to see
35:51
more folks in the industry adopt it.
35:53
And so the vision for PQL is
35:55
that Now you do
35:55
have a lot of the BI tools today
35:58
that have very tight integration with
35:59
SQL, and we'd like to
36:01
be able to see you
36:03
know, very nice integration with PQL in a lot
36:05
of these tools in the future as well, you
36:07
know, thinking about how we
36:09
can make it a standard that the larger community
36:12
embraces, I think there is a lot of value
36:13
to that. And so we do work
36:15
very closely with some
36:17
companies in the BI space
36:19
through the Linux Foundation, actually, where
36:22
Ludwig and Horovod, the projects that we
36:24
maintain are hosted. And
36:25
so we do have a collaboration there with
36:27
the AI plus BI committee that
36:29
is working on exactly
36:31
this problem of integrating,
36:32
you know, machine learning prediction into
36:34
BI systems. And
36:36
that's where I think things can
36:38
go, if the
36:40
standard ends up becoming well adopted in the
36:42
future. And another interesting
36:44
element of the overall
36:46
ML space is the question of
36:49
collaboration. You mentioned PQL allows you
36:51
to say, I've got this model. I wanna pass
36:53
it off to this analyst to be able to play with
36:55
it and experiment with its
36:58
capabilities. Maybe provide some feedback on ways that I should tweak it
37:00
to, you know, make it more powerful
37:02
for a certain use case. And I'm just
37:04
curious how you think about that
37:06
collaboration aspect
37:08
of Predibase and how you've designed the platform to
37:10
be able to be kind
37:12
of idiomatic and recognizable for
37:15
different roles and stakeholders across
37:18
the organization who are interacting with
37:20
the different capabilities of the model and
37:22
the overall workflow? So
37:23
I do think that collaboration is
37:26
very core to what we're doing because we
37:28
see this as being a tool, not just
37:30
for an
37:30
individual data scientist or engineer,
37:32
but a tool for an organization. Right?
37:35
And so
37:36
we do have different metaphors
37:38
that we think about that relate
37:40
to different stakeholders that you can
37:42
see kind of visions of each of them in
37:44
the platform. So for folks
37:46
who are more on the analytics side,
37:48
we do
37:48
have a query editor built into
37:50
the UI that lets you just start writing
37:53
PQL or even ordinary SQL
37:55
queries, the
37:55
parser is expressive enough to kind of
37:58
support both in the editor
37:59
and kind of
37:59
playing around with things as you would if
38:02
you're using Superset or some other,
38:04
like, BI slash analytics tool.
38:06
But we also then for the kind of data
38:08
scientists and the
38:10
engineer personas, we
38:10
have a lot of tools that kind of adhere to
38:12
more of like a Github style workflow where, you know,
38:15
folks will be able to
38:17
kind of incrementally update models in a way that is,
38:20
you know, versioned and so you can kind of
38:22
diff between different models
38:24
and then have this ability to do
38:26
experiments in separate branches. And then once you're happy with how the experiment
38:28
is doing, saying, oh, this experiment is
38:30
now doing better than what is currently in
38:34
production, let's,
38:34
you know, merge this back into the main similar to how
38:37
you would in Git. And then, you know,
38:39
that kind of becomes a concept very similar to
38:41
a pull request where people
38:44
comment and kind of say, hey, I don't agree with this particular parameter
38:46
choice. Can we maybe revisit this?
38:48
So there are different
38:50
ways that we've thought about
38:52
making it more
38:54
approachable to people, by having those call
38:57
outs
38:57
that harken back to things that they are
38:59
already familiar with. but
39:01
at
39:01
the same time giving them something that's net new. Right? Like,
39:04
I think the problem with just, you know, using
39:06
Git for ML today is that
39:08
Git doesn't provide a story
39:10
about the non source code artifacts. Right? So you need to
39:12
use external tools for that. And
39:14
things are not super tightly integrated,
39:15
and that's where something like
39:17
Predibase slots in to
39:19
fill that gap for that
39:21
particular persona. Right? So it's
39:23
in large part about providing the
39:25
right metaphors for the right persona.
39:27
Digging
39:29
into the Ludwig aspect of what you're building, as you
39:31
mentioned, it's an open source project, it
39:33
predates the business, you have
39:35
used it as sort
39:37
of the core building block of what you're
39:40
providing. I'm curious if you can talk to
39:42
some of the ways that you're thinking
39:44
about the governance of the open
39:46
source project. and how you identify
39:48
which pieces of the engineering that you're doing
39:50
on and around Ludwig are part
39:54
of the business and which parts belong with the open source
39:56
project. And along with that,
39:58
some of the ways that your work on Predibase
39:59
has fed back into the
40:02
Ludwig project. From
40:03
the governance standpoint, we have
40:05
been making a concerted effort to
40:07
get more folks involved and, you know,
40:10
we hold regular monthly meetings with
40:12
the community to talk
40:13
about the roadmap, get buy-in from
40:15
different people about what features
40:17
are important. So right now, one thing that
40:19
we've been working on on the open source
40:21
side because a lot
40:21
of other companies have been interested is
40:24
working on a model hub that
40:26
provides, you know, some ability to
40:28
share different trained Ludwig models
40:30
and configurations.
40:31
So that's something that's
40:31
definitely been a community driven effort to
40:34
date. And then I would say that
40:36
in terms of how we see the relationship
40:38
between what's Predibase and what's
40:40
Ludwig, we
40:41
do have a very substantial part of
40:43
the engineering team that works almost
40:45
exclusively on the open source. And so it's
40:47
very important to us that we not
40:50
just take Ludwig as something that
40:52
we consume downstream, but that we
40:54
also actively, you know,
40:56
are investing back into Ludwig and
40:58
making it better in a way that will make
41:00
Predibase better by default.
41:02
Right? But a really good example there is
41:04
on the work we've done on
41:06
scalability recently. We have some customers that we've worked with who are
41:08
training on larger datasets,
41:10
terabyte plus. And so we've
41:12
had to
41:13
think a lot about you
41:16
know, what the bottlenecks are. And Ludwig, you know, is a
41:18
very complex system in a lot of ways. We
41:20
deal with every type of modality of data
41:22
at the same time potentially. Right?
41:25
and we need to have an efficient way to pipeline
41:28
it all in terms of both the data
41:30
processing and the model training and
41:32
the prediction. And
41:32
so we've invested quite a lot in building that out
41:35
specifically to improve Predibase. But
41:37
the nice thing is all of
41:39
those features that ultimately become
41:42
part of Ludwig, because that's where
41:44
the
41:44
core of those capabilities lives. I'd
41:46
say
41:46
two other big features that are
41:49
coming to Ludwig, being driven by requirements on the Predibase
41:51
side. One would be the
41:53
improved
41:53
AutoML capabilities that we've been
41:56
investing in. So this would be
41:58
kind of suggesting
41:59
configurations and suggesting hyperparameter search
42:02
ranges based on
42:03
the data, based on
42:06
past trials and trainings and things like that. And then the other
42:08
is on the serving side.
42:09
One thing we definitely found on the Predibase
42:10
side is that there's a
42:13
very strong need to
42:15
make sure that the serving environment
42:17
is isolated and doesn't
42:20
have tons of
42:22
external dependencies that
42:22
blow up the deployments and, you know, add to
42:24
your overhead. Since moving from TensorFlow to PyTorch
42:26
last year, we've invested quite
42:30
a lot in
42:30
building out a TorchScript layer for doing serving, which
42:33
allows us to strip out all
42:35
of the
42:35
Python dependencies on Ludwig
42:37
at serving time.
42:40
and
42:40
provide a very low latency, end-to-end servable
42:42
that does not only model inference,
42:44
but also does the preprocessing, so
42:47
the data transformation, as
42:49
well as
42:49
the post processing. And this is
42:51
something that, you know, was very important for what we're doing
42:54
in Predibase, but we've made all of that open
42:56
source as part of Ludwig as well. So now
42:58
the community can take advantage of
43:00
it as well.
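The shape of that end-to-end servable, preprocessing, model inference, and postprocessing bundled into one low-latency artifact, can be illustrated in plain Python. This is just a toy sketch of the idea, not Ludwig's actual TorchScript module; all names and the "model" itself are made up:

```python
class Servable:
    """Bundle preprocessing, inference, and postprocessing into a single
    artifact, so the serving environment needs no extra dependencies."""

    def __init__(self, vocab, weights, labels):
        self.vocab = vocab        # preprocessing state baked in at export
        self.weights = weights    # toy "model": one weight per token id
        self.labels = labels      # postprocessing: bool -> label name

    def preprocess(self, text):
        # data transformation: raw text -> token ids
        return [self.vocab[t] for t in text.split() if t in self.vocab]

    def predict(self, token_ids):
        # toy inference: score is the sum of per-token weights
        return sum(self.weights[i] for i in token_ids)

    def postprocess(self, score):
        # map the raw score back to a human-readable label
        return self.labels[score > 0]

    def __call__(self, text):
        return self.postprocess(self.predict(self.preprocess(text)))

s = Servable(vocab={"good": 0, "bad": 1},
             weights=[1.0, -1.0],
             labels={True: "positive", False: "negative"})
print(s("good good bad"))  # -> positive (score 1.0)
```

The point of the real TorchScript export is that the three stages travel together as one compiled artifact, so the REST endpoint can accept raw JSON fields rather than pre-tensorized inputs.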
43:01
In terms of the
43:03
early applications of the Predibase
43:05
platform that you're building and how you've
43:07
been working with some of your early design partners. I'm
43:09
wondering what are some of the most interesting or innovative or
43:12
unexpected ways that you've seen the
43:14
platform used? Yeah. There have
43:15
definitely been some ones that
43:18
surprised us. So we
43:20
definitely
43:20
expected that tabular data
43:22
was
43:22
going to be a very important use case
43:24
for folks. And so we invested a lot in making sure that we had state of
43:27
the art architectures and capabilities on
43:29
tabular data. And so that
43:31
turned out to be true.
43:33
But we've also found that there are quite a lot
43:35
of really interesting unstructured
43:37
data sets that people have been
43:39
working with as well where
43:40
they're trying to predict, you know, anomalies in, like,
43:42
image data, very large image data sets
43:45
or doing kind of a
43:46
really interesting mixed modality training
43:50
with like text and tabular. We've also found that
43:52
there
43:52
are a lot of situations where users
43:55
wanna do machine learning training without a
43:57
lot of labeled data. And
43:59
that's I think
43:59
particularly interesting one because
44:02
it's been leading
44:02
us to invest a lot more heavily
44:04
in building out self supervised learning
44:07
capabilities into Ludwig. And
44:09
so, you know, one thing that we're working
44:11
on actively right now is
44:13
building out a really sleek pretraining
44:15
API for Ludwig so
44:17
that you can without
44:18
needing to specify, you know, a target column or anything like that,
44:20
do some initial training to learn a good
44:22
representation of the data that
44:24
you can
44:25
then apply downstream to a
44:27
lot of different task. And so that's one that has definitely
44:29
been informed by what we've been
44:31
seeing from
44:32
customers as being a very critical
44:35
need for them, and that's
44:36
now informing a lot of the product right now. In your own
44:38
experience
44:38
of going from working
44:41
at Uber and helping
44:43
to solve the problems that they have for machine
44:46
learning and producing these
44:48
useful open source projects
44:50
that have been available to
44:52
the community, and then turning that into building a business around
44:54
those capabilities and from the lessons
44:56
that you learned in Uber. I'm
44:58
wondering if you can just talk to some of
45:00
the most interesting or
45:02
unexpected or challenging lessons that you've learned in
45:04
the process of building Predibase?
45:06
Yeah.
45:06
So I would say that
45:09
there
45:09
have been some really interesting problems
45:12
that mirror a lot of problems
45:14
that we encountered at
45:16
Uber. So I
45:16
think that when you look back at my time at Uber, the
45:18
story there was
45:19
that I was very keen
45:22
on unification
45:24
of infrastructure. And
45:26
so one of the things that I was really heavily pushing
45:28
towards the end of my time there was on
45:30
moving away from
45:31
a Spark and
45:34
then plus random bespoke,
45:36
like training architecture built on top of
45:38
Horovod and some other things towards
45:40
using Ray as a unified infrastructure
45:42
layer. And
45:43
so that very heavily informs the
45:46
direction that we took with Predibase
45:48
in terms of building out
45:50
our training
45:52
system as this single
45:54
compute cluster that is
45:56
capable
45:56
of doing the preprocessing,
45:59
the training, batch
46:00
prediction, kind of the whole thing end to end. That's worked out really well.
46:02
And then, you know, when we were starting to build
46:04
Predibase, we had to take
46:07
this kind of data
46:08
plane that came from all these years of working at Uber and, like, all the
46:10
lessons that we learned along the way to think about,
46:13
okay, now how are we gonna
46:14
make this into
46:16
a
46:17
truly serverless enterprise experience. Right? And
46:19
so we did a lot
46:20
in terms of the early days of,
46:22
like, building out the control plane layer.
46:26
I
46:26
think there were quite a lot of lessons
46:28
we learned along the way about how you should think about coupling
46:30
in these sorts of big
46:33
complex distributed systems, where,
46:36
you know, we had interface boundaries between
46:39
the control plane and the data plane
46:41
that were not particularly
46:43
well defined,
46:44
you know, there was a lot of tight coupling. So sometimes failures
46:47
would occur and certain things
46:49
that should not
46:49
have failed would fail
46:51
because there was too much
46:53
coupling in there. And what we've
46:55
done over time is rearchitect
46:57
the platform to be much better
46:59
isolated so that we use more
47:02
kind of event driven architecture,
47:04
so more message brokers and things like that that
47:06
kind of makes things very clearly
47:08
separated. And
47:08
that's been a very big learning in
47:10
building an enterprise platform, you know: how important
47:12
it is to really define the service
47:14
boundaries well between the different
47:16
points in the system. And overall, you
47:19
know, we found that reliability, robustness,
47:21
stability. These have been like
47:23
concerns that when you start building
47:25
the company, you don't initially think, oh, yeah. These are
47:27
gonna be the top things that I'm gonna put on the road map. Right? But
47:29
now that's definitely, like,
47:30
top of mind for us at all times is
47:33
how do we build the
47:34
platform in a way where we
47:36
account for as
47:36
many things going wrong as possible and have
47:38
a story around making sure that at the
47:40
end of the day,
47:41
the user gets a very
47:43
clean and a very responsive experience,
47:46
right, that doesn't fail in some weird
47:48
unexpected way. Because of the
47:49
fact that
47:50
you are running a large and
47:53
scalable and multi cloud
47:55
system with a lot of distributed
47:57
systems going on. I'm curious how
47:59
you have approached ensuring
48:02
that as you iterate on the product, you're
48:04
able to very quickly get feedback
48:06
as to
48:08
whether a change has caused a regression in terms of your,
48:11
you know, ability to quickly recover
48:13
or being able to
48:16
identify potential issues with fault tolerance and just how you're
48:18
able to think about managing
48:20
forward progress and iterative development
48:22
on the platform well and ensuring
48:25
that you maintain those principles of stability and
48:28
scalability and fault tolerance?
48:30
Yeah. That has
48:30
been, I'd say, one of the more
48:33
difficult challenges to solve. I have to
48:35
say that we're still figuring out the right
48:37
way to think about some of these
48:39
things. But we've definitely invested quite
48:41
a lot in both ensuring, like, the
48:43
benchmarking on the Ludwig side. And so there's
48:46
an active project from one of our
48:48
employees working on building
48:49
out an entire benchmarking
48:51
pipeline for Ludwig so that every
48:54
time a change happens, we can,
48:56
you
48:56
know, validate it against different workloads
48:58
and make sure that model
49:00
performance is good. GPU utilization is good.
49:02
Memory utilization is good. That's sort of all
49:04
the kind of metrics that we care about.
49:07
for the workload are there at that level so
49:09
that we know that, okay,
49:11
it's not a change in Ludwig
49:13
that is causing failures to spike or something like
49:15
that after this change is made. So
49:17
that's I think the first aspect we have to
49:19
get right is, like, making sure that the open
49:22
source is very stable
49:24
and meets the requirements that we
49:26
have set. And
49:26
then from there, we have quite a lot that we've done
49:28
on the platform side in terms of building out continuous
49:32
integration and different tiers
49:34
of deployments for
49:36
the whole system to make sure that it's all well tested before
49:38
we
49:38
do a release. So we do
49:40
have
49:41
a regular release cadence that we have
49:43
set up with our customers. every
49:45
change we make goes into a live
49:48
staging environment that we test out
49:50
internally, goes through a full battery of
49:52
integration tests that
49:54
actually run on a Kubernetes cluster with, you know, live
49:56
compute resources and make sure that all
49:58
the different models that we, you know,
49:59
regularly test out are working
50:02
correctly and
50:04
not failing in any unexpected ways. And then we've also
50:06
invested a lot
50:06
on the observability side as well
50:09
in terms of making
50:10
sure
50:11
that we know, okay:
50:12
So if this workload used to take a minute to run and now
50:14
it takes five minutes, you know, what's the
50:16
part of the
50:17
system that's suddenly taking longer? Like,
50:19
what's the part of the system
50:21
that's suddenly taking more memory? Right? Be able to see what
50:23
that trend line looks like and what
50:25
the inflection point
50:28
was. Right? And
50:29
so that's been a big area of focus for us lately
50:31
because it's just very important for
50:33
us
50:33
to ensure that as
50:36
we get more and more people contributing to code and more
50:38
and more moving parts that we
50:40
identify as
50:41
quickly as possible, like
50:43
when something changes, and then can go
50:45
back and address it.
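That trend-line question, at what point did a workload's runtime or memory curve inflect, can be toy-modeled with a simple running-average heuristic; this is an illustration of the idea, not their actual observability tooling:

```python
def find_inflection(durations, ratio=1.5):
    """Return the index where a metric (e.g. workload runtime in seconds)
    first jumps past `ratio` times its running average, or None if the
    trend is stable."""
    total = 0.0
    for i, d in enumerate(durations):
        if i > 0 and d > ratio * (total / i):
            return i  # regression introduced at this point
        total += d
    return None

# runtimes per release: steady around 60s, then a 5x regression
history = [60, 62, 58, 61, 300, 310]
print(find_inflection(history))  # -> 4
```

Anchoring each data point to a commit or release is what turns the detected index into an actionable "go back and address it" signal.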
50:48
Right? So having every single
50:50
commit go through a full CI process has
50:52
been very critical to
50:54
that, and I
50:54
think we have pretty good policies in place where, you know, we make sure
50:56
that
50:56
we don't commit anything to the main
50:58
line if, you know, the tests aren't in
51:01
a good state, and we
51:03
always make sure that we prioritize
51:05
stability and bug fixes above
51:07
new feature development. So all of those
51:09
best practices, I think, are very key to
51:11
getting it right, but it's
51:12
still something we're learning as we go. And so
51:14
for
51:14
individuals or organizations that are
51:17
looking to be able to
51:20
accelerate the rate at which they're able to experiment with
51:22
and adopt machine learning to
51:24
address some of the organizational
51:26
and product problems that they're
51:29
trying to all four. What are the cases where is the
51:31
wrong choice? I mean, that's, I think,
51:33
a very valid question. And I
51:35
think there are definitely
51:37
times
51:37
when it might not be the
51:39
right choice for your organization. When
51:41
we
51:41
think about, like, where
51:43
the market segments are, you know, you can
51:45
kind of think of it as four quadrants, I guess. Right? Maybe two
51:47
axes, to make sense of it. Like, on the one
51:49
hand, you have organizations
51:51
that have low data
51:54
versus
51:54
organizations that have high data. And then on the other
51:56
axis, you have organizations that
51:59
have
51:59
high
51:59
ML experience and low ML
52:02
experience. Right?
52:04
And
52:04
so definitely, you know, the bread and butter customer for us would
52:06
be a company that's very high in terms
52:09
of, like, data volume, and quantity,
52:12
but not as high in terms of, you know, having a big sophisticated
52:14
ML team. You can certainly have an
52:16
ML team, but, you know, I
52:18
wouldn't want to necessarily say that, like,
52:22
Google Research should be a target customer of ours.
52:24
Right?
52:24
And then on the
52:24
flip side, you have organizations that maybe
52:27
don't have a lot of
52:30
data at all. And certainly, I think there are companies
52:32
out there that are
52:32
trying to think about ways that they can bring in
52:35
ML to companies that don't have data, but,
52:38
you know, for specialized use cases where it's like using pre
52:40
trained models and things like that. But
52:42
that's not
52:42
what we're currently looking at.
52:45
We definitely still are thinking, you know, companies that have a lot of
52:47
data and don't quite know how to get enough value out of
52:50
it. That's very core to
52:51
what we do
52:54
well. Right? And
52:55
I would also say that it's very important for a customer
52:57
of Predibase to have some variety of use
52:59
cases that they wanna solve.
53:02
It's
53:02
definitely not a prerequisite, but I would say that when you look at
53:04
the market, there are companies that only
53:07
do fraud detection or only do
53:09
computer vision or something like
53:12
that. And,
53:12
you know, I wouldn't necessarily wanna say that Predibase is
53:15
gonna beat all of them all the time
53:17
on every task. Right? So, like, what
53:19
I would say is that we
53:21
provide a very good solution
53:23
for time to value relative
53:26
to these other platforms. Right? If
53:28
you have a
53:28
good variety of different things you wanna do in the space. So certainly,
53:30
I think if you wanna do,
53:32
you know, computer vision
53:36
and NLP, from, like, a purely
53:38
cost-benefit standpoint, I think that
53:40
we have a much stronger value proposition
53:42
there than if you were to try to do
53:43
point solutions for all of these different things.
53:46
Right? So that's
53:47
the other aspect that's maybe less of a hard requirement, but still I
53:49
think an important differentiator. As
53:51
you continue to iterate on
53:52
the product and now that
53:56
you have gone out of stealth
53:58
and you're starting to accept new customers onto the platform, what are some
54:00
of the things you have planned for the near to
54:02
medium term or
54:04
any particular areas
54:06
of focus or new features that you're excited to dig into?
54:08
I'm certainly very
54:09
excited about having a full SaaS
54:11
version of the product that people
54:13
can try out. Right
54:15
now, we're in a closed beta. So,
54:17
you
54:17
know, we are certainly really
54:20
excited when people come to us and say they wanna
54:22
try it out, and we'll set up some time
54:24
to do a pilot with them. But I'm, you know, very
54:26
excited
54:26
about the possibility of having a website where people
54:28
can just log in and start using it,
54:30
you know, without any commitment. Right?
54:33
And so that's something
54:34
that we're definitely working on right now
54:36
and thinking about how we can put that in people's
54:38
hands. From a product
54:39
standpoint, there's also a
54:42
lot that we're thinking about now. So I mentioned the self-supervised learning
54:44
work before. There's also some work
54:46
that we're
54:46
doing in the open source community as well
54:50
around better support for custom components and kind of
54:52
user defined functions, if you will.
54:54
So, you know, with Ludwig and
54:56
Predibase, there's quite a lot of
54:58
flexibility in terms of, like, your
55:00
ability to specify, like, every
55:02
parameter of the model. But if you
55:04
wanna add new model architectures,
55:06
it's possible today, but we wanna make that experience even easier for folks
55:08
so that there's just a very lightweight interface
55:10
you implement and then you can register
55:12
that component as
55:14
just another option in your config within Predibase that other
55:17
people in the organization can use.
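The lightweight-interface-plus-registry idea described here can be sketched in a few lines of Python. This is an illustrative sketch of the general pattern only, not Ludwig's or Predibase's actual API; every name in it (ENCODER_REGISTRY, register_encoder, Encoder, build_encoder) is hypothetical.

```python
# Hypothetical sketch: a custom component implements a small interface and
# registers itself under a name, which then becomes just another option
# that a declarative config can reference.

ENCODER_REGISTRY = {}

def register_encoder(name):
    """Decorator that adds an encoder class to the registry under `name`."""
    def wrap(cls):
        ENCODER_REGISTRY[name] = cls
        return cls
    return wrap

class Encoder:
    """Lightweight interface a custom component implements."""
    def encode(self, value):
        raise NotImplementedError

@register_encoder("reverse")
class ReverseEncoder(Encoder):
    """Toy custom encoder: represents text by its reversed characters."""
    def encode(self, value):
        return value[::-1]

def build_encoder(config):
    """Resolve the `type` field of a config entry against the registry."""
    cls = ENCODER_REGISTRY[config["type"]]
    return cls()

# A config can now reference the custom component by name.
encoder = build_encoder({"type": "reverse"})
print(encoder.encode("ludwig"))  # -> "giwdul"
```

Once a component is registered, anyone else in the organization can select it by name in their own config without touching the implementation, which is the sharing model the interview describes.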
55:19
And then also
55:20
this concept of
55:22
the model hub slash model registry, I think, is one
55:24
that I'm
55:24
very excited about. It will provide benefits for
55:27
both the open source
55:29
users as well as the commercial
55:31
users where you can
55:32
do things like define canonical components
55:35
that you
55:35
wanna use in your organization. So if
55:38
there's, like, a
55:39
feature that gets used all the time in
55:41
different models. Like, I remember at Uber, we
55:43
had some features, like, related to,
55:45
like, customers related
55:48
to, like, locations, that were just used in all different types of
55:50
models. Right? So being able to
55:52
have canonical
55:52
encoders for those that are
55:54
maybe pre-trained even on, you
55:57
know, a very large data set. So there's very low
55:59
cost
55:59
to fine tuning them. I'm very
56:01
excited about building out that capability
56:04
as well. Well,
56:05
for anybody who wants to get in touch with
56:07
you and follow along with the work that you're doing, I'll
56:09
have you add your preferred contact information to
56:11
the show notes. And as a final question, I'd like to get
56:13
your perspective on what you see as being the biggest barrier
56:15
to adoption for machine learning
56:18
today. So definitely,
56:19
I think that there is
56:21
a very big barrier to adoption that comes from just
56:23
not having
56:24
good enough abstractions to
56:27
start getting value out of machine learning.
56:30
So I think the analogy that
56:31
I like to draw here is
56:33
really about software and kind
56:35
of what's enabled software
56:36
to eat the world, as the famous article in The Wall Street Journal once
56:38
said. Right? And it really comes down to
56:41
this idea of modularity and being able
56:43
to kind
56:43
of stand on
56:46
the shoulders of giants. So instead of having to reimplement every, you know,
56:48
great new idea that comes
56:49
along, like you just download a library
56:51
and use that software. I
56:53
think ML hasn't
56:54
had this abstraction before,
56:56
right? And I think that it's been a very
56:58
big inhibitor to people actually being able
57:00
to adopt it. You know, a new idea comes out
57:03
from research, but companies aren't
57:05
able
57:05
to productionize it and
57:07
actually get it
57:08
to deliver value because they're too busy trying to reinvent the wheel and reinvent
57:11
the infrastructure and figuring out how to
57:13
get data from one
57:14
place to another, and clean up their data.
57:18
So
57:19
definitely I think having better abstractions and better
57:21
canonical sources of data as well are the
57:23
two biggest barriers, in my opinion. So I think once you
57:25
get to a point
57:28
where all
57:28
the data is clean and in standard data warehouse systems and
57:30
is ready for machine learning. And
57:32
then you have very powerful abstractions like
57:35
Predibase that allow you
57:38
to take best
57:38
in class models and just run it right on this, you know,
57:40
nice, clean, canonical data source, then
57:43
you'll have a very,
57:44
very fast path to production. And
57:48
so we
57:48
definitely think we can move the needle on the modeling
57:50
side. And
57:50
I think certainly companies like
57:53
dbt, Snowflake, and others are doing
57:55
a great job on the data side. And
57:57
once these two things converge, then hopefully, we'll be able to
57:59
really
57:59
start, you know, delivering
58:00
more value. But that's definitely, I think,
58:02
where companies struggle the most today.
58:06
Alright. Well,
58:06
thank you very much for taking the time today to
58:08
join me and share the work that you've been doing at
58:10
Predibase. It's definitely a very interesting
58:14
platform and product that you're building there. So I'm excited to see
58:16
where you go from here. So thank you
58:18
again for all the time and energy that you
58:20
and your team are putting into
58:23
making it easier for organizations
58:25
to get onboarded with ML and
58:27
be able to experiment with it and gain
58:29
some of the value from its capabilities. Thank
58:31
you again for that, and I hope you enjoy the rest of
58:33
your day. Awesome.
58:34
Thank you, Tobias. I really appreciate
58:36
it, and hope you enjoy yours as
58:38
well. Thank
58:40
you for listening. Don't forget to check out
58:42
our other shows. The Data Engineering podcast,
58:44
which covers the latest on modern data
58:46
management, and the machine learning podcast, which
58:48
helps you go from idea to production with
58:51
machine learning. Visit the site at
58:53
pythonpodcast dot com to subscribe to the show, sign
58:55
up for the mailing list and read the show notes.
58:57
And if you've learned something or tried out a project from
58:59
the show, then tell us about it. Email hosts
59:02
at pythonpodcast with your
59:04
story. And to help other people
59:06
find the show, please leave a review on Apple
59:08
and tell
59:09
your friends and coworkers.