Managing and Optimizing Core Data Infrastructure
- April 22, 2022
- Posted by: Ramya Anant
- Category: Data Engineering
While data engineers no longer need to manage Hadoop clusters or scale hardware for Vertica at VC-backed startups, there is still real engineering to do in this area. Keeping your data technology operating at its peak yields massive improvements in performance, cost, or both. That typically involves:
- building monitoring infrastructure to give visibility into the pipeline’s status,
- monitoring all jobs for impact on cluster performance,
- running maintenance routines regularly,
- tuning table schemas (e.g. partitions, compression, distribution) to minimize costs and maximize performance, and
- developing custom data infrastructure not available off-the-shelf.
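The first bullet above can be sketched concretely. Here is a minimal data-freshness check, one of the simplest monitoring primitives a data engineer might build (table names and thresholds are hypothetical; a real deployment would read load timestamps from warehouse metadata or job logs rather than an in-memory dict):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical last-load timestamps; in practice these would come from
# warehouse metadata (e.g. an information_schema query) or orchestrator logs.
LAST_LOADED = {
    "orders": datetime.now(timezone.utc) - timedelta(minutes=30),
    "events": datetime.now(timezone.utc) - timedelta(hours=7),
}

def stale_tables(max_age: timedelta, now=None) -> list[str]:
    """Return tables whose most recent load is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return [table for table, loaded_at in LAST_LOADED.items()
            if now - loaded_at > max_age]
```

A check like `stale_tables(timedelta(hours=6))` would flag `events` here; wiring the result into an alerting channel is what turns it into monitoring infrastructure.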
These types of efforts are often overlooked at earlier stages of a data team’s maturity, but become incredibly important as that team and the dataset grow. In one project we were able to cut BigQuery costs for building a table incrementally from $500/day to $1/day by optimizing table partitions. This stuff is important.
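Savings of that order come straight from scan pruning: an incremental build over a partitioned table scans one partition instead of the whole table. A back-of-the-envelope sketch (table size and partition count are hypothetical; assumes BigQuery-style on-demand pricing of roughly $5 per TB scanned):

```python
PRICE_PER_TB = 5.0      # assumed on-demand price, USD per TB scanned
TABLE_TB = 100.0        # hypothetical total table size
NUM_PARTITIONS = 365    # hypothetical: one daily partition per day of history

def daily_cost(tb_scanned: float) -> float:
    """Cost of one build, given how many TB the query actually scans."""
    return tb_scanned * PRICE_PER_TB

# Unpartitioned incremental build: every run re-scans the full table.
full_scan_cost = daily_cost(TABLE_TB)                    # 500.0 USD/day

# Partitioned build: each run scans only the newest partition.
pruned_cost = daily_cost(TABLE_TB / NUM_PARTITIONS)      # ~1.37 USD/day
```

The exact numbers depend on table size and pricing, but the mechanism is the same: the cost scales with bytes scanned, so pruning to one partition buys a two-orders-of-magnitude reduction.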
Data engineers are also often responsible for building and maintaining the CI/CD pipeline that runs the data infrastructure. While many data teams had extremely poor version control, environment management, and testing infrastructure in 2012, that’s changing, and it’s data engineers leading this charge.
Finally, data engineers at leading companies are often also involved in building tooling that doesn’t exist off-the-shelf. For instance, data engineers at Airbnb built Airflow because they didn’t have a way to effectively build and schedule DAGs. And data engineers at Netflix are responsible for building and maintaining a sophisticated infrastructure for developing and running tens of thousands of Jupyter notebooks.
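Airflow itself is far more sophisticated, but the core problem it solves, running tasks in dependency order, reduces to a topological sort over the DAG. A toy sketch with hypothetical task names (this is not Airflow's scheduler, just the underlying idea, using Python's standard-library `graphlib`):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "load": {"extract"},
    "transform": {"load"},
    "data_quality": {"load"},
    "report": {"transform"},
}

def run_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order that respects all dependencies."""
    return list(TopologicalSorter(dag).static_order())
```

A real scheduler adds retries, backfills, parallelism, and cron-style triggering on top of this ordering, which is exactly the off-the-shelf gap Airbnb's data engineers filled.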
You can get most of your core infrastructure off-the-shelf today, but someone still needs to monitor it and make sure it’s performing. And if you’re truly a cutting-edge data organization, you’ll likely want to push the boundaries on existing tooling. Data engineers can help with both.