Addressing The Challenges Of Component Integration In Data Platform Architectures

Released Monday, 27th November 2023

Summary

Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management.
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
- Developing event-driven pipelines is going to be a lot easier - Meet Functions! Memphis functions enable developers and data engineers to build an organizational toolbox of functions to process, transform, and enrich ingested events “on the fly” in a serverless manner using AWS Lambda syntax, without boilerplate, orchestration, error handling, and infrastructure in almost any language, including Go, Python, JS, .NET, Java, SQL, and more. Go to dataengineeringpodcast.com/memphis (https://www.dataengineeringpodcast.com/memphis) today to get started!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'll be sharing an update on my own journey of building a data platform, with a particular focus on the challenges of tool integration and maintaining a single source of truth.

Interview

- Introduction
- How did you get involved in the area of data management?
- data sharing
- weight of history
- existing integrations with dbt
- switching cost for e.g. SQLMesh
- de facto standard of Airflow
- Single source of truth
- permissions management across application layers
- Database engine
- Storage layer in a lakehouse
- Presentation/access layer (BI)
- Data flows
- dbt -> table level lineage
- orchestration engine -> pipeline flows
- task based vs. asset based
- Metadata platform as the logical place for horizontal view

Contact Info

- LinkedIn (https://linkedin.com/in/tmacey)
- Website (https://www.dataengineeringpodcast.com)

Parting Question

- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
- Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
- To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers.

Links

- Monologue Episode On Data Platform Design (https://www.dataengineeringpodcast.com/data-platform-design-episode-268)
- Monologue Episode On Leaky Abstractions (https://www.dataengineeringpodcast.com/abstractions-and-technical-debt-episode-374)
- Airbyte (https://airbyte.com/)
  - Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/)
- Trino (https://trino.io/)
- Dagster (https://dagster.io/)
- dbt (https://www.getdbt.com/)
- Snowflake (https://www.snowflake.com/en/)
- BigQuery (https://cloud.google.com/bigquery)
- OpenMetadata (https://open-metadata.org/)
- OpenLineage (https://openlineage.io/)
- Data Platform Shadow IT Episode (https://www.dataengineeringpodcast.com/shadow-it-data-analytics-episode-121)
- Preset (https://preset.io/)
- LightDash (https://www.lightdash.com/)
  - Podcast Episode (https://www.dataengineeringpodcast.com/lightdash-exploratory-business-intelligence-episode-232/)
- SQLMesh (https://sqlmesh.readthedocs.io/)
  - Podcast Episode (https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380)
- Airflow (https://airflow.apache.org/)
- Spark (https://spark.apache.org/)
- Flink (https://flink.apache.org/)
- Tabular (https://tabular.io/)
- Iceberg (https://iceberg.apache.org/)
- Open Policy Agent (https://www.openpolicyagent.org/)

The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)