Designing Data Transfer Systems That Scale

Released Monday, 4th December 2023
Summary

The first step of any data pipeline is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his career to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
  • This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues in every part of your data workflow, from migration to deployment. Datafold has recently launched a 3-in-1 product experience to support accelerated data migrations. With Datafold, you can seamlessly plan, translate, and validate data across systems, massively accelerating your migration project. Datafold leverages cross-database diffing to compare tables across environments in seconds, column-level lineage for smarter migration planning, and a SQL translator to make moving your SQL scripts easier. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold) today!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high-quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs, ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake, and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Andrei Tserakhau about operationalizing high-bandwidth, low-latency change data capture

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Your most recent project involves operationalizing a generalized data transfer service. What was the original problem that you were trying to solve?
  • What were the shortcomings of other options in the ecosystem that led you to building a new system?
  • What was the design of your initial solution to the problem?
  • What are the sharp edges that you had to deal with to operate and use that initial implementation?
  • What were the limitations of the system as you started to scale it?
  • Can you describe the current architecture of your data transfer platform?
  • What are the capabilities and constraints that you are optimizing for?
  • As you move beyond the initial use case that started you down this path, what are the complexities involved in generalizing to add new functionality or integrate with additional platforms?
  • What are the most interesting, innovative, or unexpected ways that you have seen your data transfer service used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data transfer system?
  • When is DoubleCloud Data Transfer the wrong choice?
  • What do you have planned for the future of DoubleCloud Data Transfer?

Contact Info

  • LinkedIn (https://www.linkedin.com/in/andrei-tserakhau/)

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
  • Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] (mailto:[email protected]) with your story.
  • To help other people find the show, please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers.

Links

  • DoubleCloud (https://double.cloud/)
  • Kafka (https://kafka.apache.org/)
  • MapReduce (https://en.wikipedia.org/wiki/MapReduce)
  • Change Data Capture (https://en.wikipedia.org/wiki/Change_data_capture)
  • Clickhouse (https://clickhouse.com/)
    • Podcast Episode (https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/)
  • Iceberg (https://iceberg.apache.org/)
    • Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/)
  • Delta Lake (https://delta.io/)
    • Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/)
  • dbt (https://www.getdbt.com/)
  • OpenMetadata (https://open-metadata.org/)
    • Podcast Episode (https://www.dataengineeringpodcast.com/openmetadata-universal-metadata-layer-episode-237/)

The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Speaker: Andrei Tserakhau, DoubleCloud Tech Lead. He has over 10 years of IT engineering experience and for the last 4 years has been working on distributed systems with a focus on data delivery systems.
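For listeners unfamiliar with the change data capture pattern discussed in the episode, the core idea can be sketched in a few lines: a consumer tails an ordered change log from a source database and applies each event to a downstream replica. This is a minimal illustrative sketch only, not DoubleCloud's actual API; all class and field names here are hypothetical.

```python
# Minimal sketch of change data capture (CDC), for illustration only.
# A consumer reads an ordered change log and applies each event to a
# replica table, tracking a log sequence number (LSN) for resumability.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ChangeEvent:
    op: str                       # "insert", "update", or "delete"
    key: str                      # primary key of the affected row
    row: Optional[dict] = None    # new row image (None for deletes)

@dataclass
class Replica:
    table: dict = field(default_factory=dict)
    lsn: int = 0                  # last applied log sequence number

    def apply(self, lsn: int, event: ChangeEvent) -> None:
        if lsn <= self.lsn:       # skip already-applied events (idempotent replay)
            return
        if event.op == "delete":
            self.table.pop(event.key, None)
        else:                     # inserts and updates are both upserts
            self.table[event.key] = event.row
        self.lsn = lsn

# A toy change log, as it might be extracted from a database's write-ahead log.
log = [
    (1, ChangeEvent("insert", "u1", {"name": "Ada"})),
    (2, ChangeEvent("update", "u1", {"name": "Ada L."})),
    (3, ChangeEvent("insert", "u2", {"name": "Grace"})),
    (4, ChangeEvent("delete", "u2")),
]

replica = Replica()
for lsn, event in log:
    replica.apply(lsn, event)

print(replica.table)  # {'u1': {'name': 'Ada L.'}}
```

Tracking the LSN and skipping already-applied events is what lets a real transfer system resume after a failure without double-applying changes, which is one of the operational concerns the interview questions above touch on.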
