Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements may have changed.
0:11
Hello, and welcome to the Data Engineering
0:13
Podcast, the show about modern data management. Introducing
0:17
RudderStack Profiles. RudderStack
0:19
Profiles takes the SaaS guesswork and SQL
0:21
grunt work out of building complete customer
0:24
profiles so you can quickly ship actionable,
0:26
enriched data to every downstream team. You
0:29
specify the customer traits, then Profiles runs
0:32
the joins and computations for you to
0:34
create complete customer profiles. Get
0:36
all of the details and try the
0:39
new product today at dataengineeringpodcast.com/rudderstack.
0:42
You shouldn't have to throw away the database
0:44
to build with fast-changing data. You
0:46
should be able to keep the familiarity of
0:48
SQL and the proven architecture of cloud warehouses
0:51
but swap the decades-old batch computation model for
0:53
an efficient incremental engine to get complex queries
0:55
that are always up to date. With
0:58
Materialize, you can. It's the only
1:00
true SQL streaming database built from the ground up
1:02
to meet the needs of modern data products. Whether
1:06
it's real-time dashboarding and analytics, personalization
1:08
and segmentation, or automation and alerting,
1:10
Materialize gives you the ability to
1:12
work with fresh, correct, and scalable
1:14
results, all in a familiar SQL
1:16
interface. Go to dataengineeringpodcast.com/Materialize
1:19
today to get two weeks
1:21
free. And
1:25
now bringing us to the Anomstack project, you
1:27
said that some of its origin comes from
1:29
the work that you're doing at Netdata. But
1:31
I'm wondering if you can just give an
1:33
overview about what it is that you've built,
1:35
some of the story behind how it came
1:37
to be, and why you decided that you
1:40
wanted to make it as accessible and approachable
1:42
as possible. Yeah, so probably
1:44
primarily it's because I've had to build
1:46
versions of this in every job I've
1:48
been in for the last 10 years.
1:51
It's always been kind of custom every time,
1:53
and, you know,
1:55
very custom and specific to whatever infrastructure or
1:58
data stack you're using. Nowadays,
2:00
there's a lot of open source projects and tools that
2:02
we can build on. And I just felt
2:04
like the time is right now to actually save myself
2:07
from building it the next time for the next five
2:09
years, I should just build a project that I can
2:11
open source and see if I can get some contributions
2:13
around. And so the
2:15
idea there is this is focusing on smaller
2:18
teams, smaller data
2:20
operations, to give
2:22
them a simple way to just bring their
2:24
metrics and get really decent
2:27
anomaly detection out of the box, basically. And
2:29
in terms of the
2:31
term metrics, given
2:33
your background at Netdata, that
2:36
makes me think about metrics from
2:38
an operations and infrastructure standpoint about
2:41
what is the CPU load, what
2:43
is the available memory. But
2:45
the term metrics in the data ecosystem has
2:48
also become overloaded with this idea of the
2:50
semantic layer and business metrics. And what does
2:52
it mean for somebody to be a customer?
2:54
And I'm wondering if you can maybe give
2:57
some sense about how you're thinking about metrics
2:59
in the context of an AMP stack and
3:01
the ways that it can be applied. Yeah,
3:03
so actually, metric trees is another
3:06
thing I've seen recently. There's a lot of talk
3:08
around metric trees and building these relationships on the
3:10
metrics. The main goal is
3:12
simplicity. And so there is
3:14
lots of different metric concepts in the
3:16
observability space. But we're
3:19
not using that here necessarily. So the
3:21
definition of a metric, basically, is a
3:24
row on your data frame or a
3:26
row on your database in the metrics
3:28
table, where it's literally just a metric
3:30
name, timestamp, and value. And that's it.
3:33
So that's kind of the idea there is this
3:35
makes it really easy for users. That's all a
3:37
user has to produce in these three fields.
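For illustration, here is a minimal sketch of that three-column contract in pandas; the column names are illustrative rather than necessarily Anomstack's exact schema:

    import pandas as pd

    # The whole contract: one row per observation of a metric.
    metrics = pd.DataFrame(
        {
            "metric_name": ["sales", "sales", "signups"],
            "metric_timestamp": pd.to_datetime(
                ["2023-11-01 00:00", "2023-11-01 01:00", "2023-11-01 00:00"]
            ),
            "metric_value": [120.0, 135.5, 42.0],
        }
    )
    print(metrics)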
3:41
And so we're not going too fancy
3:43
in terms of complex metric definition, because
3:46
that just adds kind of a little
3:48
bit more of a ramp for people
3:50
to actually use the system. So
3:53
there's pros and cons to each, of course,
3:55
like in observability. And you have all these
3:57
concepts in tools like Prometheus: different
4:00
types of metrics and how you work dimensions in and
4:02
stuff like that. But for our case, for the Anomstack
4:04
idea, it's just keep it as simple as possible, basically,
4:06
to begin with. And that also makes
4:09
it very flexible because if you
4:12
don't necessarily have a constrained
4:14
definition of what that metric can be and
4:16
what it's supposed to mean, then that means
4:18
that everybody can map it to whatever semantic
4:21
attribute they want it to in order to
4:23
determine what are the anomalies and how does
4:25
that impact whatever it is that I'm trying
4:27
to measure. Yeah, and this is kind
4:29
of actually something that I have got on the
4:31
roadmap for the project is to extend
4:34
a little bit so that when you're defining the metric,
4:36
you also define some metadata. Obviously,
4:38
the first thing being like a metric description,
4:40
say, and because the idea there is actually
4:43
if we could do that, even if you
4:45
had a useful description, that
4:47
would help a lot more with then like the
4:50
saliency of the anomalies because an anomaly is an
4:52
anomaly, but whether it's something you care about or
4:54
not is a different question. And so if we
4:56
can get some of this metadata, like maybe things
4:59
like priority P1, P2, or whatever different tags you
5:01
want, you could
5:03
obviously then do different kinds of routing. You can
5:05
route the alerts differently. But actually, like longer term,
5:07
I'm thinking that there could be something where this
5:09
could be something that large language models
5:11
could obviously use as well. So if they had this
5:14
kind of rich metadata that they could make sense of,
5:16
that could also be useful in terms of, you
5:18
might say, oh, what are my
5:20
anomalies in sales today? And the fact that you
5:22
have all this stuff in the descriptions would make that
5:25
a lot easier. So previously with the semantic, all the
5:27
semantic stuff was good, but there's a lot of overhead
5:29
to maintain it; you have to agree
5:31
on your structure up front and implement it. Whereas
5:33
if we just allow some kind of free
5:35
texty, more higher level stuff, there's
5:37
definitely a role where I think language models could help
5:40
make sense of it as well in terms of sorting
5:42
through the metrics. Yeah. And giving you
5:44
some human level understanding about, is this
5:47
something that you actually care about? Yeah,
5:49
that's always the problem because oftentimes
5:52
systems like this, you end up with thousands of
5:54
metrics. And the idea is, we want metrics to just
5:56
be like cattle, you don't have to think about them.
5:59
They're not special. Just produce
6:01
your metrics, metrics, metrics. And then that's great because then you
6:03
have all these metrics but then the problem can be how
6:05
do you make sense of it when you maybe have a
6:08
hundred alerts a day and maybe 50
6:10
of those alerts are on metrics that, they're
6:12
nice to know but they're not that important.
6:15
And so it's things like that where if you could
6:17
have each of these alerts be like a little insight
6:19
snippet, you could actually maybe have a language model make
6:21
sense of it or ultimately longer term, if you had
6:24
a sort of a feedback loop on top of the
6:26
system like Anomstack, where you could give thumbs
6:28
up, thumbs down to sort of try and start measuring
6:30
saliency of like, okay, what do people care about more
6:32
than average that then kind of could become a whole
6:35
different layer on top of it, but that's an open
6:37
problem. I don't think anyone's really solved that yet, to
6:39
be honest. Absolutely. And also even
6:41
if some single metric is anomalous,
6:43
it maybe doesn't matter unless it's
6:46
correlated with another anomaly in
6:48
a different metric. And it's
6:50
that conjunction of anomalies across
6:52
different metrics series, or maybe even across different
6:54
service boundaries that will let you know, oh,
6:57
hey, there's actually something really wrong here. You
7:00
need to do something about it. Yeah, yeah. And that's it.
7:02
That's like another, that's something that I've seen
7:04
some people do really well. So Anodot
7:06
is another tool I've used in the past
7:08
for anomaly detection. And they do a really
7:10
good job of this where they stack all the
7:12
alerts together, so each alert becomes
7:14
like a stack of alerts. And then you can
7:16
kind of quickly, really quickly see based on the
7:19
map of like, okay, what's making up this batch
7:21
of alerts basically. That's something
7:23
that I would like to add in the future as
7:26
well actually could be really interesting. And
7:28
you mentioned that one of your objectives of building
7:30
this project and releasing it as open source is
7:32
so that you don't have to build it again
7:35
in whatever future role you have. I'm wondering if
7:37
you can just give an overview about what are
7:39
the core objectives that you have and what are
7:41
the things that you would like to see come
7:44
out of this project and some of the direction
7:46
that you'd like to see it taken in. Yeah,
7:49
so the main objective is just have
7:51
a nice, easy open source solution for
7:53
people to get good anomaly detection on.
7:55
Typically business metrics is what I have
7:58
in my head here, and
8:01
low overhead. And then, so
8:03
if you're like someone that's kind of, you
8:06
don't necessarily have to be an infrastructure engineer, just
8:08
technical enough to maybe, you bring your own SQL
8:11
to define the metrics or you can define custom
8:13
Python functions to define the metrics as well. But
8:15
the idea is like, you could be a business
8:17
analyst who can actually just bring your metrics and
8:19
then actually stand this up yourself. And
8:22
it's just a Docker container. So that's
8:24
the main idea is like, keep it
8:26
as easy as possible for like smaller
8:28
teams that either can't afford bigger
8:30
expensive SaaS solutions, or
8:33
they don't necessarily have the
8:35
time or expertise to like build their own custom
8:38
solution. They can just use a tool like this and
8:40
get decent enough anomaly detection on all your metrics
8:42
out of the box. That's the
8:44
main aim. And for people
8:46
who are interested in being able to get
8:48
these alerts and understand, okay, I've got lots
8:50
of metrics. I don't wanna have to care
8:52
about them and keep a close eye on
8:55
them. I just want something to let me
8:57
know when there are things going wrong. What
8:59
are some of the other tools or products that
9:01
they might be evaluating when they
9:04
come across Anomstack and what are the aspects
9:06
of Anomstack that might sway them in its
9:08
favor? Yeah, so there's lots
9:10
of, there's kind of, there's
9:12
a couple of different solutions here, a couple
9:14
of different approaches. There's like vendors who I've
9:16
actually used in the past. Anodot is probably
9:18
the biggest and the oldest player here. Like
9:21
they really go deep on anomaly detection across all
9:23
types of metrics. And I'm not up to date on their stuff; it was a few years ago that I used them, and they've done a
9:29
lot since as well. And so these are like services
9:31
that you pay for in an enterprise setting; they're very
9:34
expensive and there's a bit of configuration involved, but once
9:37
they're up and running, they're good. And
9:39
then there's also lots of like
9:41
newer SaaS type startups in
9:44
the kind of modern data stack space and era that we're
9:46
in. So Chaos Genius
9:48
is another one there that's actually, I've been looking at
9:50
recently that's pretty good and pretty cool. But
9:52
there's also then the other approach there. A
9:55
lot of the data warehouses now are starting to build
9:57
some of these ML features into
10:00
their stacks themselves. So like Snowflake, BigQuery,
10:03
they all actually now typically have their
10:05
own anomaly detection functions
10:07
and ML functions that you
10:09
can train models and save models just
10:11
within your SQL. That's another option as
10:14
well. If you're using a platform like this, you can always,
10:16
of course, try and... It's a little bit easier now to
10:18
try and roll your own because
10:21
you can do a lot of it now in SQL
10:23
itself. And then the other vendors, like Metaplane is actually
10:25
one I've used as well. Metaplane is pretty cool. It's
10:27
a little bit more focused on the data
10:30
engineering and data ops side of
10:32
the metrics. But you can tweak some of
10:34
these things to also cover business metrics as
10:36
well. And digging
10:39
more into that concept of the
10:41
business metrics and being able to
10:43
generate alerts and detect when there
10:45
are anomalies, I guess that's
10:47
another vague term that might
10:49
be worth digging further into is that idea
10:52
of anomalies and what makes something actually anomalous.
10:54
Is it just because it is
10:56
two standard deviations away from the mean? Is
10:58
it because there's something, some specific
11:00
rule that you have that this value
11:02
can never exceed this threshold? I'm wondering
11:05
what are some of the specific types
11:07
of anomalies that you're looking to address
11:09
and alert on and some of the
11:11
ways that people need to be thinking
11:13
about how to understand when something is
11:15
actually anomalous versus just a little bit
11:17
weird. Yeah, yeah, that's a good point. And
11:19
this is kind of, I'm a little bit obsessed with
11:21
anomaly detection, to be honest, because it's one of those
11:23
areas of machine learning and data
11:25
science that still has a kind of
11:28
art as well as science involved in
11:30
it. So there's a lot
11:33
of subjective decisions as to like, well, does this
11:35
look anomalous to you? It does to me. And
11:37
it's not as easy as just doing something like
11:42
regression or classification, where you have a simple
11:44
metric like accuracy. In anomaly detection, you don't
11:46
have any metrics like this that you can
11:49
use as a source of truth. So it's
11:51
a little bit subjective. And so
11:54
that's one of the reasons why we use good
11:57
defaults, basically. So we're using PyOD, which is
11:59
an old open source project around
12:01
anomaly detection. And basically
12:03
we have defaults there to use
12:05
like as flexible a model by
12:07
default as possible. So it's using,
12:09
you know, best practice standard sensible
12:11
things around feature pre-processing. And then
12:14
it's using like a PCA based
12:16
anomaly detection model, which is more
12:18
flexible and it'll cover more types
12:20
of anomalies. Single spikes, say, are the obvious ones that people always think of, but sometimes instead of a single spike it's a strange little squiggle that's changed recently, or an increase in trend, so the idea is to cast the net as wide as possible. And so that's why we're using PyOD with a general, you know, flexible model underneath by default.
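As a rough illustration of that kind of PyOD default (a sketch only, not Anomstack's actual training code; the feature matrix here is made up):

    import numpy as np
    from pyod.models.pca import PCA

    # Toy feature matrix: rows are time steps, columns are engineered
    # features (e.g. lagged and smoothed values of the metric).
    rng = np.random.default_rng(42)
    X_train = rng.normal(size=(500, 5))

    model = PCA()  # a flexible, general-purpose default
    model.fit(X_train)

    X_new = rng.normal(size=(10, 5))
    scores = model.decision_function(X_new)  # higher score = more anomalous
    print(scores)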
12:42
But there's also then, of course, if you're
12:44
a user, you can define your own pre-processing
12:46
functions or you can define your own
12:48
model as well. So you can, if
12:50
you wanted to, you can extend it to be like,
12:53
maybe you know, for instance, if it's say, well,
12:55
okay, this metric here is daily sales
12:58
and you actually know that there's a big impact on
13:00
whether it's the weekend or whether
13:02
it's the, you know, the weekday say, or even
13:04
time of day. So you could actually build your
13:06
own, your own pre-processing function to say, okay, I
13:09
wanna like, when it's the weekend, I want it
13:11
to be, you know, is weekend equals one. And
13:14
then when it's during the week, is weekend equals zero.
13:16
And you can then pass that through to the model
13:18
to use that as a feature.
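A minimal sketch of that kind of custom pre-processing function (the function name and DataFrame shape here are illustrative assumptions):

    import pandas as pd

    def preprocess(df: pd.DataFrame) -> pd.DataFrame:
        """Add an is_weekend flag as an extra feature for the model."""
        df = df.copy()
        df["is_weekend"] = (df["metric_timestamp"].dt.dayofweek >= 5).astype(int)
        return df

    df = pd.DataFrame(
        {
            "metric_timestamp": pd.to_datetime(["2023-11-03", "2023-11-04"]),  # Fri, Sat
            "metric_value": [100.0, 60.0],
        }
    )
    print(preprocess(df))  # is_weekend is 0 for Friday, 1 for Saturday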
13:21
So it can depend a lot on exactly how you want to do it, but the idea with, you know, the Anomstack approach is to use as general and sensible a default as possible that will cover all metrics reasonably well. And then if you want to go more complex, you can. But yeah, it can get quite subjective and complicated in terms of, you know, what is an anomaly or not.

Data
13:46
projects are notoriously complex with
13:49
multiple stakeholders to manage across varying
13:51
backgrounds and tool chains. Even simple
13:53
reports can become unwieldy to maintain.
13:56
Miro is your single pane of glass where
13:58
everyone can discover, try, and collaborate on
14:01
your organization's data. I especially
14:03
like the ability to combine your
14:05
technical diagrams with data documentation and
14:07
dependency mapping, allowing your data engineers
14:10
and data consumers to communicate seamlessly
14:12
about your projects. Find
14:14
simplicity in your most complex projects with Miro.
14:17
Your first three Miro boards are free
14:19
when you sign up today at dataengineeringpodcast.com
14:22
slash Miro. That's
14:25
three free boards
14:27
at dataengineeringpodcast.com/M-I-R-O. Digging
14:33
now into the idea of
14:35
metrics definition and identifying what are the metrics
14:37
that you should care about, what are the
14:39
metrics that are useful to be alerted on,
14:41
what are some of the
14:43
ways that data teams or operations
14:46
teams should be approaching that question
14:48
and thinking about how do I
14:50
decide what are the metrics that are actually
14:53
going to matter, what are the ones that
14:55
will give me a useful signal of something
14:57
needs to be addressed and it's going to
14:59
have some sort of business impact versus just,
15:01
hey, it might be neat to know about this
15:03
thing. Yeah, yeah. Typically,
15:07
what metrics are you reporting to your senior
15:10
management basically? Start with them. So
15:12
there's typically business metrics, and you
15:14
start with that. Typically
15:16
they're obviously headline business metrics like
15:19
users, payments, sign-ins.
15:22
Depending on your business, they're usually pretty
15:24
obvious, the main bright stars. And
15:27
then there's also technical metrics as well. So we
15:29
sometimes use a lot of technical metrics underneath
15:32
for the health of the app
15:34
itself and things like that. But
15:36
generally, it should
15:39
be obvious and if it's not obvious, then
15:41
it's probably a question for, okay, well, maybe
15:43
this isn't a metric I should
15:45
use. The way I think
15:47
about things though is that
15:49
metrics are increasing, everything has become a time
15:51
series as you have more and more data
15:54
and metrics are becoming just more and
15:56
more commonplace. So it's okay to
15:58
have lots and lots of metrics. It's just...
16:00
that you want to have like priority one
16:02
level of metrics, priority two level of metrics.
16:04
So you can kind of embrace the messiness
16:06
of like, okay, we've got loads of metrics
16:08
across all these other types of business objectives,
16:11
secondary objectives, we'll put them in a different bucket
16:14
than where you put your main kind of executive level
16:16
metrics. And they obviously
16:18
then would, they get a special route
16:20
when they go off versus
16:22
when all the other metrics go off. Because
16:24
then it's like, okay, well, if the
16:26
P1 metric alerts, I want
16:29
that to go straight into the Slack or I want that
16:31
to email me straight away. But then I also want like
16:33
all the other metrics that are like lower priority
16:35
or lower interest, maybe every now and then I
16:37
want to just open up that inbox and
16:40
browse through those kind of, read the newspaper
16:42
as such to see.
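A sketch of that kind of priority routing (the channel names and priority tag are illustrative assumptions, not an Anomstack feature as such):

    def route_alert(metric_name: str, priority: str) -> str:
        """Send P1 anomalies straight to a paging channel; batch the rest."""
        if priority == "p1":
            return "slack:#alerts-urgent"  # straight into Slack or email
        return "email:weekly-digest"  # browse these like a newspaper

    print(route_alert("sales", "p1"))  # slack:#alerts-urgent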
16:44
And that's very useful as well then because, if
16:47
you have a good anomaly detection system, it almost becomes
16:49
like a BI tool in that sense as well. And
16:51
that it's actually uncovering insights and you can quickly, it's
16:53
then more just about the UI UX, like can you
16:55
quickly scan 50 alerts and
16:59
see, oh, there's one thing there that actually might be interesting.
17:03
That's like that's gold dust if you can get that
17:05
in terms of an insight. Because otherwise you would have
17:07
had to pre-configure a dashboard, and maybe it's in some dashboard in the second tab, in the second quarter of the page. And you have
17:14
to get so lucky that your eyeball happens to land
17:16
on that chart. That's just, it's not
17:18
really a scalable approach to analytics, especially in this
17:20
day and age when there's just so much more
17:23
data. So that's the other flip side of
17:25
it as well. It's like, it's more about
17:27
sort of how you route the insights that
17:29
you get from these tools. And
17:32
before we dig too much further into
17:34
the implementation of
17:36
Anomstack, another thing that I noticed
17:38
as I was reviewing the project is that you
17:40
put in a lot of effort to make it
17:42
as easy to get up and running and get
17:44
started with and evaluate as possible, including
17:47
out of the box pipelines for
17:49
Dagster, having GitHub Codespaces available.
17:52
I forget what the other options
17:54
were, but it was just very
17:56
much a, I really want you to
17:58
use this thing. And I'm wondering, what
18:01
was the impetus for putting in all
18:03
of that effort? And what
18:05
are some of the ways that that focus
18:07
of making it easy to adopt, making it
18:09
easy to test out influenced the overall design
18:11
of the project and the ways that you
18:14
were thinking about how to architect it so
18:16
that it was easy to adopt and implement.
18:19
Yeah, so main kind of consideration there is to
18:21
try and keep it like as easy as possible
18:23
in terms of like, it's not over engineered at
18:25
all. Basically under the hood, when you look into
18:27
it, everything is like a pandas data frame that's
18:30
moving around. So I kind of wanted it, basically
18:32
build it for a version of myself maybe 10 years ago,
18:35
who was like, instead of back then, I
18:37
had to like stand up my own airflow VM and
18:40
come up with all the data engineering part
18:42
of it. If I can actually just Docker
18:44
compose up and then just focus on the
18:46
SQL and the metrics, then I'd be really
18:48
happy. And that's kind of what the aim
18:51
is here, is that you can easily run
18:53
through Docker or even serverless. Dagster Cloud
18:55
is really cool as well, the way they
18:57
have an integration on GitHub and it'll just
18:59
automatically deploy to Dagster Cloud. So you
19:01
don't even have any sort of operations. Then
19:04
you can just focus on a PR to
19:06
add new metrics or as your metrics evolve. It's
19:08
all kind of a GitOps type approach. And
19:11
the idea was there like, ideally I'd love to have,
19:13
it's still quite early on in the project. So I've
19:15
only been working on it kind of a month or
19:17
two. And the plan is kind
19:19
of to have users that actually use it, who could
19:21
also then become contributors as well, and
19:24
so lower the barrier to contribution as
19:26
much as possible as well. So that's
19:28
why we're kind of, all the concepts
19:30
are very straightforward and very simple. And
19:33
that's the idea, like is to actually have users that
19:35
can use it. And also like if they wanna make
19:37
an improvement, for sure, like, yeah, get
19:39
involved, make a PR, it'd be great, you know? So
19:41
that's the idea is to actually have users and contributors.
19:45
In terms of the implementation and
19:47
as you were defining the
19:49
scope of the project and thinking through, okay,
19:51
I want to have this open source anomaly
19:53
detection stack so that I don't have to
19:56
rebuild it over and over again. What
19:58
are the core capabilities and
20:00
constraints that you were focused on
20:02
that informed the final implementation of
20:05
what you have built so far.
20:08
Yeah, so I actually originally started
20:10
with an anomaly
20:12
detection provider in Airflow. So
20:14
we use Airflow and I
20:18
built an anomaly
20:20
detection Airflow provider package. That's
20:22
also in the Airflow registry with
20:24
the astronomer folks. And that
20:27
works. So if you're using Airflow, that's
20:30
one approach. But I was
20:32
thinking as I was doing it, I was kind of thinking, well,
20:35
this kind of depends on Airflow. And it's a
20:37
bit silly for people to have to then stand
20:39
up Airflow to do anomaly detection. So I
20:41
wanted something more standalone. And so I also
20:43
was aware, like at the time, a lot of
20:45
these data orchestration tools, there's so many options and
20:47
they're all great now. So
20:50
the approach there was actually
20:52
OK. I want to have a flexible
20:56
enough general simple orchestration
20:58
tool and then also use,
21:00
you know, PyOD to do all the ML stuff.
21:02
So it's basically putting all the ingredients together
21:04
into this little app approach that's kind of
21:07
fairly easy to stand up, fairly easy to reason about.
21:09
And that's
21:12
the main aim is to actually have as
21:14
few moving parts as possible and just
21:17
get what we need for decent enough, you
21:19
know, anomaly detection alerts into your inbox. That's
21:22
the north star. And
21:24
now as far as the actual
21:27
implementation, the architecture, wondering if you
21:29
can describe how you implemented
21:32
Anomstack and some of the ways
21:34
that you optimized for these particular
21:36
design constraints that you mentioned. Yeah,
21:40
so I had a look at
21:42
a few different orchestration platforms, basically.
21:44
And it was a good
21:46
excuse. I'd been aware of Dagster, but
21:48
I hadn't really used it that much. I'd
21:51
mostly been used to Airflow and, you
21:53
know, other things like serverless options in
21:56
GCP and AWS. And so I
21:58
had a look at Dagster and actually Dagster seems
22:00
almost perfect because, well,
22:02
they have an approach called software-defined
22:05
assets. That's a really interesting approach that they
22:07
have. But actually a step underneath that is
22:09
basically just jobs. And a
22:11
job is the core kind
22:13
of building block here. So when the user
22:15
defines their metrics as a metric batch, basically, Anomstack will just trigger six jobs, with four main ones: a job to ingest, a job to train, a job to score, and a job to alert.
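A rough sketch of that shape in Dagster (op and job names here are illustrative assumptions; in Anomstack these actually run as separately scheduled jobs, and they're chained into one job here just to show the flow):

    import pandas as pd
    from dagster import job, op

    @op
    def ingest() -> pd.DataFrame:
        # Stand-in for the user's ingest SQL or Python function.
        return pd.DataFrame(
            {
                "metric_name": ["sales"],
                "metric_timestamp": [pd.Timestamp("2023-11-01")],
                "metric_value": [120.0],
            }
        )

    @op
    def train(metrics: pd.DataFrame) -> pd.DataFrame:
        return metrics  # stand-in: fit and persist a model per metric

    @op
    def score(metrics: pd.DataFrame) -> pd.DataFrame:
        scored = metrics.copy()
        scored["anomaly_score"] = 0.1  # stand-in: score with the trained model
        return scored

    @op
    def alert(scored: pd.DataFrame) -> None:
        for _, row in scored.iterrows():
            if row["anomaly_score"] > 0.9:
                print(f"anomaly: {row['metric_name']}")

    @job
    def metric_batch_job():
        alert(score(train(ingest())))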
22:26
And so the main kind of concept here is you bring your configuration and then the tool itself will do the orchestration, and then also use, you know, PyOD for the ML stuff as well. So it's mainly putting together these recipes of different ingredients that are already out there in the ecosystem, and that's kind of what the culmination is.

From the time
22:47
that you first started building this project
22:49
to where you are now, I'm wondering
22:52
what are some of the ways that
22:54
the overall goals and implementation have evolved
22:56
and maybe some of the dead ends
22:58
that you explored and ultimately discarded. Yeah,
23:01
actually one of the dead ends, which I was kind of jokey about, is that we've implemented an LLM alert job itself. So instead of the
23:12
PyOD ML models for the anomaly
23:14
detection, we actually have an
23:16
LLM alert job that you can enable,
23:18
which basically just sends the data
23:21
to GPT and asks, does it look anomalous?
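Roughly, the idea looks something like this (a sketch using the OpenAI Python client; the prompt and model name are illustrative assumptions, not the project's actual prompt):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    values = [10, 11, 10, 12, 11, 95, 10]  # toy time series, oldest first
    prompt = (
        f"Here is a time series of metric values, oldest first: {values}. "
        "Does the most recent data look anomalous? "
        "Answer yes or no, with a one-line reason."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)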
23:23
And it's kind of more a curiosity, because
23:25
it's a good, it's a
23:27
good example of like where the limits are in
23:29
terms of language models, because I wanted to see
23:31
like, how actually useful can it, can it be,
23:34
you know, getting sense back from the language model
23:36
and time series, time series data is still a
23:38
bit sort of at the edge of what LLMs
23:40
are really able to do really well. And
23:42
so it was kind of fun playing around with that. There's
23:44
a lot of iterations of it. Like I started
23:46
with as minimal approach as possible, send the data to the
23:48
LLM, see what it gets back. And it was kind of,
23:51
it wasn't even kind of understanding the time series,
23:53
like it couldn't even get the order of the data
23:55
itself. And so there's been a few iterations of that,
23:58
like playing around with prompt engineering and giving
24:00
it all the hints it needs to do it. And
24:03
it's actually kind of surprisingly working. It works technically, and it makes sense, but when you look at it then and take a higher-level picture as a human, it's actually not that useful at all, because the
24:14
anomalies that the LLM comes up with are technically often
24:16
they are anomalies, but they're not anomalies that you would
24:18
care about as a human if you were eyeballing the data.
24:20
And so it's tricky. That was like, it was fun
24:23
to do all that. And I kind of mainly did
24:25
that just as a sort of
24:27
a joke almost, but it was something that I think is kind of interesting to see. I've just turned it off by default today.
24:36
It's an optional kind of job that you can turn
24:38
on. It's a little bit of a dead end. I
24:40
don't think it's as useful as you
24:42
might think it is. For people who
24:44
are interested in testing it out, getting
24:46
it deployed, as we already discussed, there
24:49
is a very easy on-ramp, but for
24:52
people who want to then go from, okay,
24:54
I've tested it out, it seems interesting. Now
24:56
I want to run it in production. What
24:58
does that journey look like? And what are
25:00
some of the considerations and potential sharp edges
25:02
that people need to be thinking about as
25:04
they go from proof of concept to this
25:06
is business critical now. So there's a couple
25:08
of ways to use it. You can, the
25:10
repository itself is a GitHub template. So you
25:12
can, of course, clone the repository, but you can
25:17
use the GitHub template to make a copy of it.
25:19
And then once you have that GitHub template repository, you
25:21
can then use that for your metrics and deploy
25:23
it wherever you want through Dagster
25:25
Cloud or just using your own kind
25:27
of CI CD and Docker Compose. So
25:31
some of the sharp edges are probably like this. There's
25:33
still, it's still a very immature project. It's still
25:35
very, very young. I just finished like the
25:37
first set of proper tests today. So there's
25:40
always, like, something that comes with these open source projects as well, especially when they're young like this: take it with a pinch of salt, in terms of, you're better off dogfooding it gently on stuff that's not production. And then once you're
25:57
comfortable with that, then you go from there.
25:59
So, like, that's what I do at the moment. I'm kind of dogfooding
26:03
as we go. And so there's a little bit
26:05
like it's still, there's a small little bit of
26:07
infrastructure in terms of, okay, how
26:09
are you gonna run these Docker containers? How
26:12
are you gonna monitor them? How are you
26:14
gonna have availability, things like that. These are typical
26:16
enough kind of considerations with tools like
26:18
this. So there's still, there is still
26:21
a couple of kind of, it's not completely
26:23
hands-free. It's not completely painless,
26:25
not yet, but the aim is to be
26:27
as painless as possible, basically. And
26:29
so there's definitely some typical kind of sharp
26:31
edges there in terms of like, we
26:34
don't have necessarily a standard deployment
26:36
or standard installation yet. We've
26:38
given as many options as possible. So you can use
26:40
Docker or you can use a local Python environment
26:43
yourself, or you can then use the serverless options
26:45
as well. And so we're kind of
26:47
waiting to see which approaches people
26:49
are most comfortable with as well. Data
26:53
lakes are notoriously complex. For
26:56
data engineers who battle to build and
26:58
scale high quality data workflows on the
27:00
data lake, Starburst powers petabyte scale SQL
27:03
analytics fast at a fraction of the
27:05
cost of traditional methods so that you
27:07
can meet all of your data needs,
27:09
ranging from AI to data applications to
27:12
complete analytics. Trusted by teams of all
27:14
sizes, including Comcast and DoorDash, Starburst is
27:16
a data lake analytics platform that
27:18
delivers the adaptability and flexibility a
27:20
lakehouse ecosystem promises. And
27:23
Starburst does all of this on an
27:25
open architecture with first-class support for Apache
27:27
Iceberg, Delta Lake, and Hudi. So
27:30
you always maintain ownership of your data. Want
27:33
to see Starburst in action? Go
27:35
to dataengineeringpodcast.com/Starburst and get $500
27:37
in credits to try Starburst
27:39
Galaxy today. The easiest and
27:41
fastest way to get started
27:43
using Trino. There
27:46
are multiple different flavors of open source
27:49
projects where sometimes people just want to
27:51
produce something out in the open, but
27:53
they don't really care about getting contributions.
27:56
There's the corporate open source where we're going to
27:58
release this because it furthers our business.
28:00
And if you happen to get use
28:03
out of it, that's great. And then
28:05
there are the open source projects that
28:07
are intended to be maintained and grown
28:09
by community. And I'm wondering what your
28:11
thoughts are on how you're approaching
28:14
this particular project? Are you looking for
28:16
contributions? Are you just looking for feedback?
28:19
I'm wondering what types of engagement and
28:21
community you're looking to build around in
28:23
ways that folks can contribute and help
28:25
you out with this? Yeah, you
28:27
know, I'm always looking for contributions.
28:30
I would love some contributions. And kind
28:32
of I don't necessarily have like a
28:34
software engineering background myself. So that's always
28:37
been sort of a fear I've had
28:39
around the imposter syndrome and stuff like that.
28:41
So I would love if somebody came with
28:43
a contribution that completely showed me, Oh, you
28:45
know, your tests are all wrong, or you
28:47
can do something better. Or like, here's more,
28:49
here's better abstractions we can use. There's definitely
28:51
like room for improvements across the board. And
28:53
so I would love contributions. And that's been
28:55
the aim of like keeping it as simple
28:57
as possible, where, you know,
28:59
everything is basically all the main concepts
29:01
are you have like a metric batch,
29:04
which is just the definition of
29:06
your metrics. And then you have jobs
29:08
which are like, you know, ingest, train, score, alert,
29:11
and then under the hood, when
29:13
you're looking at the code, really, it relies heavily
29:15
on pandas data frames, and every job basically,
29:17
you know, produces a pandas data frame, or it
29:19
takes in a pandas data frame and produces a
29:21
pandas data frame. So it's quite easy to reason
29:23
about. And so that's the idea is that
29:25
like, if you're someone that's a comfortable-enough
29:28
Python developer, like it's a perfect
29:30
project to do, you know, first open source
29:32
contributions on as well, which should be really
29:34
fun, like. And for people
29:37
who are looking to get engaged
29:39
with the project, and maybe they
29:41
don't necessarily want to modify the
29:44
core of what you're building, but
29:46
they are interested in extending or
29:48
augmenting its capabilities, what are some
29:50
of the interfaces that you've built
29:52
in to make it open for
29:54
extension and customization and adapting to
29:56
a particular customer or operating
29:59
environment? Yeah, so that was a good
30:01
example of where I haven't tried to be too complicated
30:03
from the start. So obviously we support,
30:05
you know, BigQuery, Snowflake, DuckDB, a couple
30:07
of other databases. And I didn't, originally
30:10
I was thinking like, okay, do I
30:12
need to build some fancy plugin architecture,
30:14
a plugin system where somebody could bring
30:16
their own plugin? And I
30:18
decided not to do that because probably
30:20
it's at the edge of my capability, but also
30:22
it makes it harder to contribute on
30:25
as well. So the way the approach would be
30:27
at the moment, for example, I'm working on Redshift
30:29
and adding Azure Blob
30:31
Storage. And, you know,
30:33
just make a fork, make
30:35
a PR, and everything's kind of
30:37
easily testable. And so that's where
30:40
we haven't gone. It's not as complicated yet in terms
30:42
of like taking, say, something like the
30:44
Airflow approach where you have plugins that you can provide,
30:46
you can install dependencies separately and stuff like that. We
30:48
haven't taken that approach yet,
30:53
mainly for that goal of having as
30:55
low a barrier as possible to contribution. But
30:57
definitely at some stage, if, you know, if the
30:59
project does become more mature and stuff like that,
31:02
then yeah, like that would be something that I
31:04
would imagine would be refactored at some stage. Digging
31:07
more into the, I'm using
31:09
this, I'm running it. I want to
31:11
feed in these different metrics. You mentioned
31:13
that it has support for pulling from
31:15
databases, running Python scripts. I'm wondering if
31:17
you can talk a little bit more
31:19
about the process of producing the metrics
31:21
that Anomstack is going to
31:23
work from and the
31:26
overall flow of data
31:28
in evaluation, alert out, or,
31:30
you know, ignore because there's
31:32
nothing to alert on. Yeah.
31:34
Yeah. So like the main approach there, the
31:37
inputs are, there's a metrics folder basically in
31:39
the root of the project. And even in
31:41
the metrics folder, then, you can have a folder for, you know, each subject area or each metric batch, or you can organize the metrics however you want, as long as they're in the metrics folder.
31:52
And then all a metric batch is,
31:55
is some ingest SQL. So
31:57
there's a template that you just define
31:59
an ingest SQL file, which is
32:01
basically just whatever
32:03
SQL you want to use
32:05
to generate your metrics. And
32:08
so basically, this is SQL that generates a
32:10
table which just has a metric name, a
32:12
metric value, and a metric timestamp. That's all
32:14
that's required. So once you have that, then
32:18
that's the basis for the ingestion. And then
32:20
there's also then a YAML configuration file. And
32:22
the YAML configuration file has all the other
32:24
things like schedules and parameters for the models.
32:26
And again, you don't have to fill any
32:28
of them. You can kind of just leave
32:30
that file pretty much empty and it'll use
32:32
the defaults. There's also like a default YAML
32:34
where you can edit your defaults as well.
32:37
So the idea is you just bring your ingest logic,
32:41
basically. And you
32:43
can use an ingest
32:46
SQL function, or you can actually, if you want,
32:48
you can also use your own, you
32:50
can make a custom Python function. So all you have to
32:53
define if you're doing something that maybe say you're
32:55
scraping metrics from a website or from some public
32:57
metrics, or even it doesn't, it could be anywhere.
32:59
But if it's a Python function, you can then
33:01
also just use it. You can just bring your
33:03
own Python function as long as that Python function
33:05
generates a kind of data frame that then has
33:07
those same three columns, metric name, metric value,
33:09
metric timestamp, that works as well.
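For example, a custom ingest function along these lines (a sketch modeled on the Hacker News example mentioned next; the only real requirement is returning those three columns):

    import pandas as pd
    import requests

    def ingest() -> pd.DataFrame:
        """Scrape the current Hacker News top story score as a metric."""
        base = "https://hacker-news.firebaseio.com/v0"
        top_ids = requests.get(f"{base}/topstories.json", timeout=10).json()
        story = requests.get(f"{base}/item/{top_ids[0]}.json", timeout=10).json()
        return pd.DataFrame(
            {
                "metric_name": ["hn_top_story_score"],
                "metric_timestamp": [pd.Timestamp.now(tz="UTC")],
                "metric_value": [float(story["score"])],
            }
        )

    print(ingest())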
33:12
So we have all the examples in the repository
33:14
that do that. Like there's examples that pull metrics
33:16
from Hacker News and Weather Metrics and Yahoo Finance
33:18
and all that sort of stuff. And once you
33:20
have that, then you can obviously, you can customize
33:23
anything under the hood. There's default templates. So like
33:25
there's a default template for the pre-processing function that
33:27
the ML uses. You don't ever have to worry
33:29
about that, but if you want to, you can
33:32
bring your own for each individual metric batch. Likewise,
33:34
for the
33:37
alert logic, you can also define your
33:39
own alert SQL template if you want, or
33:41
you can edit the default one there.
33:43
So the idea is once you bring your
33:45
ingest logic and your configuration, then
33:47
that will trigger off everything. So the ingest
33:50
jobs, the train jobs, the score jobs, and
33:52
then all that's happening behind the scenes is
33:54
it's going to kind of run that ingest script, save
33:57
the results onto a metrics table, and as it does the scoring, it also saves the scores onto the metrics table, and when it alerts, it also saves the alerts onto the metrics table. So this all then just
34:07
becomes kind of orchestration that's reading from and
34:09
writing to this metrics table in your warehouse
34:12
basically, which could be Snowflake, BigQuery, whatever. And
34:14
this is like a long format metrics table where each
34:16
row is basically a new metric. So it's kind of
34:18
easy to think about as well because as you add
34:21
new metric batches, you're just appending on to the end
34:23
of that table. Or you can of course also have
34:25
like, if you want you can have different metric batches
34:27
go into different metric tables, that's all flexible. But it's
34:29
easiest to think about just having like one start with
34:31
one single metrics table that Anomstack is reading
34:33
from and writing to. And that kind
34:35
of becomes then the
34:38
actual heart of what's going on here basically. And
34:40
you can plug that into your own tools as well. So if you have your
34:42
own BI tools, or your own alert
34:44
tools, or anything like that, that
34:47
then it's just another table in your data
34:49
warehouse. So you can kind of use it like anything else basically.
34:52
And recognizing that it's still a very
34:54
early project that you are still working
34:56
on gaining visibility and getting feedback. I'm
34:58
wondering what are some of the most
35:00
interesting or innovative or unexpected ways that
35:02
you've seen Anomstack used so
35:05
far? Well, a couple of weeks back, funnily enough. So
35:09
one of the examples we use in the
35:11
examples, yeah, out of the box examples is
35:13
like Hacker News, it scrapes the top,
35:16
the scores from the Hacker News top
35:18
stories, you know, I mean, and I was like,
35:21
as soon as all the Sam Altman drama with
35:23
OpenAI kicked off, I was kind of crossing
35:25
my fingers thinking, oh my God, this has to get picked up. If this isn't picked up in the example job, I'll be kind of hiding my face. And funny enough, it
35:35
was, as soon as all that kicked off, Hacker
35:37
News exploded. And, you know, I had anomalies
35:39
straight away from the Hacker News jobs. And
35:41
I've put them into the gallery. There's a little
35:44
gallery folder in the repository as well that has
35:46
examples of, like, real anomalies from when I've been using it on real data. And
35:50
there's a Sam Altman fired HN
35:52
explodes.png in there as well. I
35:55
was happy with that. But yeah, it's been interesting as well just recently. We're
36:00
also doing, looking at stock prices and stuff as well,
36:02
like just trying to get a wide range of as
36:04
many examples as possible to get like realistic data. And
36:06
just the other day, I noticed all of the tech
36:08
stocks were down a couple of points based on the Yahoo Finance job. And I actually Googled it and was like, yeah, actually they all closed down. I
36:16
thought it was a problem. I thought something was going
36:18
wrong somewhere, but actually, you know, it was valid. That's
36:21
an interesting use case as well, where maybe
36:23
it's not business metrics that you care about.
36:25
Maybe it's just personal curiosities and you can
36:27
build your own sort of Google trends style
36:30
of, hey, I want to know if something changes
36:33
in this particular ecosystem, as long as there's some
36:35
sort of API you can hit, then you can
36:37
build your own personal anomaly dashboard about what are
36:39
the anomalous things happening in the world today? Yeah,
36:42
yeah, no, and actually Google trends is another
36:44
example. We have a Google trends example as well.
36:47
So I'm kind of constantly building out this example
36:49
folder within the metrics folder, so that you can,
36:51
and you can turn them off as well, like
36:53
so you can, but they're just, they're useful to
36:56
kind of be realistic types of examples
36:58
that people can look at as well. Yeah,
37:00
it's definitely a very cool project in that way, where as you
37:03
mentioned, there are anomaly detection tools. A
37:06
lot of times though, they're very coupled
37:08
to the product that they're trying to
37:10
generate the alerts from. So Datadog has
37:13
some anomaly detection. I
37:15
know that the Grafana cloud product has
37:17
some ML capabilities for alerting on anomalies,
37:19
but again, all of those are very
37:21
tightly coupled to the ecosystem that they're
37:23
built for, whereas this is a little
37:25
bit more open-ended of, as long as
37:27
you can get data somewhere, we can
37:29
let you know if something is weird. Yeah,
37:31
and that was almost as well. One of
37:33
the kind of design principles here was to
37:35
have no UI; it's all
37:37
basically config-based and GitOps-based, so
37:39
that, it's what we're used to
37:41
working in as like data engineers. And it's lower
37:44
overhead. We don't have some crazy management UI
37:46
and admin console that you have to go
37:48
and click around and configure stuff. It's all
37:50
kind of your metrics as code basically, and
37:52
everything as code, and that kind of helps
37:55
make it easier to, if you want to add
37:57
new metrics, it's just a PR, and then no problem, you know.
38:00
Absolutely. And in your experience of building
38:02
this project, publishing it to the community,
38:04
looking for feedback, what are some of
38:06
the most interesting or unexpected or challenging
38:09
lessons that you've learned in the process?
38:11
So it's been fun actually, I had
38:13
to learn quite a lot about Dagster.
38:16
Dagster is really at the heart of it doing
38:18
all the orchestration, so I had to go quite
38:21
deep in terms of getting familiar with even
38:23
some edge cases and stuff around how
38:25
Dagster works and all the different configurations to
38:27
be able to support like running locally
38:30
in your own Docker versus Dagster Cloud
38:32
versus a Python environment. There's a few
38:34
different kinds of considerations there. That's
38:36
kind of been fun and been interesting
38:40
to start from new, and new technology is
38:42
always fun, especially all these modern data stacks
38:45
technologies. It's overwhelming, there's so many of them
38:47
that it's almost too much sometimes and you
38:49
kind of just put the blinkers on. But
38:51
it's been good to have an excuse to
38:53
actually then take one, just pick one and use it and go, and that's been useful. And yeah, also as well, just
39:00
my own capabilities. I would say, actually, I should preface this: probably another part of it is that projects like this are now actually easy to
39:06
do because we have all these tools that
39:08
we can use. And once you kind of
39:10
know enough to put the ingredients together, I've
39:13
also been using, you know, Copilot and Chat
39:15
GPT to help a lot with the code
39:17
as well. Like it's crazy how much more
39:19
productive you can be these days, especially with
39:21
an open source project like this, where it's
39:24
like you can develop fully in the open.
39:26
You don't have to be worried about anything
39:28
confidential or anything like that. You're just unconstrained
39:30
actually use these tools. And yeah, it's
39:32
been like I'd say probably 30% of
39:35
the code in parts has been at least
39:37
inspired by Copilot and Chat
39:39
GPT. So that's been really interesting because if you, it's like,
39:41
you know, when you used to ask for help on Stack
39:43
Overflow, you had to spend a lot of time on reproducible
39:46
examples and ask the question in the right way
39:48
and show your work and things like that. Same
39:50
thing applies for, you know, the language models. And
39:52
once you do that, they can actually be ridiculously
39:54
useful. So it actually, it hasn't been half as
39:56
much work as I thought it would be because,
39:58
you know, we have, all the tools that
40:01
we're using are quite easy to work with.
40:03
And then like this assist of, you know,
40:05
Copilot-type approach, it just means, you
40:07
know, if I have an idea, I can take the idea and then spec
40:10
the idea out and actually get it done, probably, you know,
40:12
in half the time that it would have taken originally. So
40:14
that just means you've got more time, you get more done
40:16
with it, you know. In the time you
40:18
can focus on a project like this, you can just get
40:20
so much more done with it, you know.
40:23
And for people who are interested in
40:25
Anomstack, they want to start to incorporate
40:27
some measure of anomaly detection on their business
40:29
metrics. What are the cases where it's the
40:31
wrong choice? Yeah, so I think probably the
40:33
main cases there would be if it's like
40:35
low latency, you know, per second, like that's
40:37
some of the stuff that we've done with
40:39
Netdata, it's all infrastructure per-second metrics, you
40:41
know, thousands of metrics a second. That's a
40:43
completely different domain where you have like just
40:45
different design challenges. And so Anomstack
40:47
wouldn't be right for anything like
40:49
that. And it's more typically like, you know,
40:52
hourly metrics. I do have like
40:54
10 minute metrics and things like that.
40:57
But anything below, anything too near real time, it
40:59
wouldn't make sense. And in a situation like that, you're
41:01
in more of a data observability situation where
41:03
where things like Prometheus and that sort of
41:05
would be more useful. But the other
41:09
use case would be, I guess, if you have
41:11
scale, like if you've got thousands and thousands and
41:13
thousands of metrics, I'm not sure
41:15
how well that would scale. You
41:17
know, how well, say, Dagster running in a container,
41:19
how well that would scale to if we had like hundreds
41:21
and hundreds of metric batches. I reckon that'd
41:23
be a nice problem to have, if we ever get that
41:26
far and have that problem. But I would say that's probably
41:28
another issue, where I would say it's not right
41:30
for you. And then also, like, if
41:32
you're not comfortable enough with
41:34
sort of running a Docker app, basically, then it's
41:36
a good excuse to learn, it's a good chance
41:38
to kind of get your hands dirty. And it's
41:40
not as painful as like things
41:43
used to be. But also, that's something where, like,
41:45
there's a little bit of consideration there in terms
41:47
of like, are you comfortable enough running this yourself?
41:49
Or obviously, like, you can use Dagster
41:51
Cloud, you know, if you have a Dagster
41:53
Cloud account, that works as well. But yeah, in
41:56
a situation like that, it's probably not quite the right
41:58
option. Also, if you're using Airflow, if you already have
42:00
an Airflow, you should probably look at the Airflow
42:02
anomaly detection provider, which is a different project that
42:04
I maintain. That would be really cool to get
42:06
some love in there at the moment
42:09
as well. Because that one only has, I've only
42:11
really set it up for BigQuery. But you
42:13
know, obviously there's all the different types of operators
42:15
and all this stuff already exists in Airflow. So
42:17
it's not that hard to actually use them. It's
42:19
just if somebody is motivated to, you know, come
42:21
and use it, then they might as well
42:23
actually use the Airflow that they already
42:25
have, you know. And as you
42:27
continue to build and iterate on the Anomstack
42:29
project, as you work to onboard more
42:31
contributors, what are some of the things you
42:34
have planned for the near to medium term
42:36
or any particular projects you're excited to dig
42:38
into? Yeah, so there's a
42:40
couple of open issues in the repository of ideas. And
42:42
I'm just kind of throwing issues in all the time.
42:44
And one thing I want to do, I have
42:47
a feature request open for Time
42:49
GPT. So we're still kind of shaking
42:51
out these LLM approaches. There's TimeGPT,
42:54
which is a new sort of time
42:56
series friendly large language model. And
42:58
I'm hoping to see if I can start to use it. It's still sort of in a closed beta, so I'm hoping to get access to that and actually see if we can use it, so that it
43:05
might actually be more useful. And
43:07
also just there's a few things around wanting
43:09
to let the user
43:12
run multiple models. So like at the moment,
43:14
per metric batch, you define one model.
43:17
And the default model is this PCA based model. But
43:19
actually, really, maybe you want to define like three or
43:21
four different models and actually just let
43:23
them run for a week or two. And
43:25
then you can actually see, okay, as the metric comes in,
43:28
how do the anomaly scores behave and which ones work best
43:30
for this metric. So there's definitely a whole load of stuff
43:32
where we could make the ML part of this easier as
43:34
well, I think. So if you could run multiple models, and
43:36
then over time pick them, that would be good. Or if
43:38
we could do some sort of way
43:40
where you could benchmark and simulate your metrics on different models
43:42
that could help with the ML part, I think that could
43:44
be really useful as well. Because that's always the challenge is
43:46
like, it's very hard, there is no one size fits all
43:48
model, and it can sometimes take a bit of iteration as
43:50
well. So if we could take that pain away for someone else, that could be really useful as well, and kind of fun to work on as well.
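A sketch of what running a few PyOD models side by side on the same features might look like (illustrative; this comparison loop is not a built-in Anomstack feature yet):

    import numpy as np
    from pyod.models.iforest import IForest
    from pyod.models.knn import KNN
    from pyod.models.pca import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))  # toy feature matrix for one metric batch

    # Fit each candidate model and compare scores on the latest points.
    for model in (PCA(), IForest(), KNN()):
        model.fit(X)
        scores = model.decision_function(X[-10:])
        print(type(model).__name__, scores.round(2))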
43:56
And given the time series nature of the data, it
43:58
might also be interesting to bring in some
44:01
sort of time series predictive capability, whether that's
44:03
using the Prophet library or I think there's
44:05
another one, Greykite, there are a number
44:07
of them out there now to say, this
44:09
is the current trend line. If this continues,
44:12
then this will maybe then trigger an anomaly
44:14
and so here's some kind of preemptive alerting
44:16
of something to keep an eye out for.
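For instance, a minimal sketch of that idea with Prophet (illustrative only; this is not something Anomstack does today, and the toy data here is made up):

    import pandas as pd
    from prophet import Prophet

    # Prophet expects columns ds (timestamp) and y (value).
    df = pd.DataFrame(
        {"ds": pd.date_range("2023-01-01", periods=90, freq="D"),
         "y": [float(i) for i in range(90)]}
    )
    model = Prophet()
    model.fit(df)
    future = model.make_future_dataframe(periods=7)
    forecast = model.predict(future)
    # One could alert preemptively if the projected trend crosses a threshold.
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(7))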
44:18
Yeah, yeah, and there's also
44:21
lots of other contexts on ML that we could bring
44:23
into this in terms of like, forecasting is an obvious
44:25
one as well, but then there's also like change detection
44:27
is another one where sometimes what you're interested in is
44:30
a sudden change, even if it's not an
44:32
anomaly, like maybe sudden change happen, they happen every
44:34
time, but you know, they're not gonna be flagged
44:37
as anomalous because the ML is gonna look at
44:39
those shifts as like, oh well, steps happen every
44:41
now and then, but actually if you have a
44:43
real focused area where you're interested
44:45
in, okay, what happened last night, something went
44:47
wrong, what you really wanna ask
44:49
a lot of times there is, okay, change detection, show me
44:52
the metrics that had a sudden change, and
44:54
that's like a different use case where it's like a
44:56
subset of an anomaly detection, it's not quite a little
44:58
bit different, so there's all these other kind of little
45:00
ML, time series based, you know,
45:02
ML use cases that we could for sure build
45:05
in, like over time that would be interesting. Are
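For the change-detection case, a deliberately naive sketch: flag any point whose one-step change is an outlier relative to recent steps, which surfaces level shifts even when a window-based anomaly model would shrug them off. The window size, threshold, and injected shift are arbitrary assumptions.

import numpy as np


def sudden_changes(series, window=24, z=4.0):
    # Flag indices where the step change is a z-score outlier vs recent steps.
    diffs = np.diff(series)
    flags = []
    for i in range(window, len(diffs)):
        recent = diffs[i - window:i]
        sigma = recent.std() or 1e-9  # avoid divide-by-zero on flat metrics
        if abs(diffs[i] - recent.mean()) / sigma > z:
            flags.append(i + 1)  # index in the original series
    return flags


rng = np.random.default_rng(1)
metric = np.cumsum(rng.normal(0, 0.1, 300))
metric[200:] += 5.0  # a level shift, the kind of step worth surfacing
print(sudden_changes(metric))  # expect an index right around 200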
45:07
Are there any other aspects of the Anomstack
45:09
project or this overall space of business metrics
45:11
and anomaly detection that we didn't discuss yet
45:14
that you'd like to cover before we close
45:16
out the show? No, no,
45:18
I just definitely think it's an interesting time.
45:20
There's
45:22
a lot going on in the modern data stack, there's lots
45:25
of stuff happening, it's crazy. But
45:27
I do think the technology is catching up, you
45:30
know, in terms of actually the metadata and making
45:32
sense of what's
45:34
going on in your data. That's the
45:37
hard part: we have all the plumbing, we have all the
45:39
flows, we have all the details. It's just,
45:41
how do you actually make sense of which things
45:43
matter the most? That's still sort of an open problem,
45:45
I think. Now, a
45:47
lot of these kind of AI approaches,
45:50
and I think that's the first time I've said AI,
45:53
I cringe every time I say AI, but actually
45:55
this is one case where, I
45:57
think, it really will be useful over the next couple of years in making
45:59
sense of all of the crazy business
46:01
metrics and data that companies have. All
46:04
right. Well, for anybody who wants to get in
46:06
touch with you and follow along with the work
46:08
that you're doing or contribute to the project, I'll
46:10
have you add your preferred contact information to the
46:12
show notes. And as the final question, I'd like
46:14
to get your perspective on what you see as
46:17
being the biggest gap in the tooling or technology
46:19
that's available for data management today. I
46:22
think possibly the biggest gap
46:24
is just the complexity of the space. I'm not sure where I sit
46:26
on this either. So there are point solutions
46:28
that kind of focus on one thing and
46:31
do one thing well. And then there are all
46:33
these platform options. And I
46:35
think that's the biggest complication now: just
46:37
navigating the space in terms of how do
46:39
you compose things together. There's
46:41
still work on standards and stuff
46:45
like OpenLineage and all these kinds of
46:47
standards that are trying to become a glue for
46:49
all these different solutions. But I think the
46:51
biggest challenge is actually how you
46:54
put things together, or whether you just go with
46:57
a big cloud provider
47:00
and just use whatever they have.
47:02
That's
47:04
probably the biggest gap I see. Absolutely.
47:07
All right. Well, thank you very much for taking the
47:10
time today to join me and share
47:12
the work that you've been doing on the Anomstack
47:14
project, and for building it in
47:16
the first place. It's definitely a very cool
47:18
project, and I'm excited to try it out for
47:20
my own data platform and explore the possibilities
47:23
that it opens up. So I appreciate all the
47:26
time and energy you've put into it,
47:28
and I hope you enjoy the rest of your day.
47:31
Thanks. Thanks a lot for having me on.
47:33
I'm a big fan of the show, and anyone
47:35
else who's interested, just come check out the repo,
47:37
make some issues, make some discussions. I will
47:39
be delighted to have people come along and say hi.
49:10
Thank you for listening. Don't forget to
49:12
check out our other shows, Podcast.__init__, which
49:14
covers the Python language, its community and
49:16
the innovative ways it is being used.
49:18
And the machine learning podcast, which helps
49:21
you go from idea to production with
49:23
machine learning. Visit the site at dataengineeringpodcast.com
49:25
to subscribe to the show, sign up
49:27
for the mailing list and read the
49:29
show notes. And if you've learned something
49:31
or tried out a product from the show, then tell us about
49:33
it. Email hosts@dataengineeringpodcast.com
49:36
with your story. And
49:38
to help other people find the show, please leave
49:41
a review on Apple Podcasts or tell your
49:43
friends and followers.