Software is increasingly automating the boring parts of data engineering
- July 22, 2022
- Posted by: Ramya Anant
- Category: Data Engineering
In 2012, if you wanted to have a sophisticated analytics practice at your VC-backed startup, you needed one or more data engineers. These engineers were responsible for extracting data from your operational systems and piping it somewhere that analysts and business users could get at it. Often they would do some transformation work to make the data easier to analyse. Without the data engineers, analysts and scientists didn’t have any data to work with, so frequently engineers were the very first members of a new data team.
Coming into 2019, you can buy technologies off-the-shelf to do most of that work. In most scenarios, you and your data analysts and scientists could build the entire pipeline without the need for anyone with hardcore data engineering experience. And you wouldn’t be building some second-rate, shitty pipeline: off-the-shelf tools are actually the best-in-class way to solve these problems today.
This ability for data analysts and scientists to build self-service pipelines is new—about 2–3 years old at this point. It took several years for the products to get good, though—back in 2016 we were still in early-adopter land.
At this point a pipeline built with off-the-shelf tools is far more reliable than one built on top of custom-built Airflow tasks. This is an empirical statement, not a theoretical one: I’m not saying it’s not possible to build a reliable Airflow infrastructure, I’m just saying that most startups don’t. At Fishtown Analytics, we’ve worked with 100+ VC-backed data teams and have seen this play out over and over again. We’re consistently migrating people from custom-built pipelines onto off-the-shelf infrastructure and in literally every single case the impact has been tremendously positive.