Learning From Microservices — as a Data Engineer | by Daniel Mateus Pires | Jan, 2021


In Data Engineering today, we tend to create fewer services and instead implement logic inside frameworks (like Airflow, Spark, Flink…). But some microservice principles hold true nonetheless:

  • The scope of a piece of software should be clearly defined and easy to reason about: loosely defined scopes invite scope creep, which can grow your application into a monolith.
  • Clear team boundaries and ownership allow teams to move faster.
  • Smaller applications allow for more granular scaling.
  • Smaller applications make it safer to “release early and release often”, and not only for code developed by the team, but also for library and programming-language updates.
  • Smaller applications make it easier to “compose” logic together, or in other words: increase re-usability. 🙌

Apache Airflow is a popular scheduler and data pipeline authoring framework. It is also a “Modular Monolith”: one application that does a lot, but with components that have clearly defined roles.

Airflow is made of:

  • Web Server(s).
  • Worker(s).
  • Scheduler(s), plural thanks to Airflow 2.0 😀.
  • Executor(s) which define how tasks run on infrastructure.
  • Hook(s) which define how tasks connect to external services.
  • Operator(s)/Sensor(s) which define what kind of work happens during a task execution.

When I started using Airflow ~2 years ago, Operators raised a red flag for me 🚩. When we write transformation, extraction, and loading logic in our Operators, we lock a lot of valuable and otherwise re-usable software inside Airflow. Other teams are then forced to depend on Airflow, or use it directly, in order to reuse these components.

We also end up introducing dependencies which can quickly pile up and create conflicts.

Learning from microservices, I think that limiting ourselves to Operators that trigger and monitor applications deployed and managed outside of Airflow is a better pattern.
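Here is a minimal sketch of that “trigger and monitor” pattern in plain Python rather than real Airflow APIs. Everything in it is hypothetical: `ExternalJobOperator`, the `client` object, and its `submit`/`status` methods stand in for whatever launches the externally deployed application (the Kubernetes API, EMR, an internal job server…). The point is that the operator holds no business logic at all.

```python
import time


class ExternalJobOperator:
    """Hypothetical thin operator: it only triggers an external job
    and polls its status. All transformation logic lives in the
    application deployed outside the scheduler."""

    def __init__(self, client, job_name, poke_interval=30):
        self.client = client            # stand-in for an external job API
        self.job_name = job_name
        self.poke_interval = poke_interval

    def execute(self):
        job_id = self.client.submit(self.job_name)   # trigger
        while True:                                  # monitor
            status = self.client.status(job_id)
            if status == "SUCCEEDED":
                return job_id
            if status == "FAILED":
                raise RuntimeError(f"{self.job_name} failed")
            time.sleep(self.poke_interval)
```

Because the operator only knows how to submit and poll, the pipeline logic itself can be packaged, versioned, and reused entirely outside of Airflow.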

Jessica Laughlin wrote a very detailed post about Bluecore Engineering’s move to the KubernetesPodOperator and packaging their logic in containers, which is one way of achieving that.

At Gilt, when we adopted Spark with Scala, we created a single Scala project. The different pipelines we built depended on shared classes defining helpers and utilities for the Spark session, reading/writing defaults, and User Defined Functions.

Because the shared project tightly coupled all the data pipelines, we had to migrate “all or nothing” when updating Spark. This is particularly painful with Spark, where errors surface only under certain loads and resource configurations, so we wanted to test the whole project under production conditions before each release.

We had only two teams working on the Spark project; I imagine that beyond that, a significant amount of coordination is required to update a full Spark project, especially as errors might pop up in pipelines that are not owned or understood by the team pushing the update. 😢

Looking back, each team, or maybe each pipeline, should have been its own Spark project that we could update and package separately. The usefulness of the shared helpers library did not outweigh the issues that came with having shared dependencies across our pipelines.

Note that having independent projects comes with its own overheads, just like for Microservices.

When slicing your applications into smaller pieces, think about your team structure, team sizes, and how you can reduce the overhead of managing small projects. For example, if you have many data pipelines and small teams, creating templates for new projects would be valuable. Shared CI/CD logic and/or a monorepo setup can also help.

The specialization and decoupling of software in Data Engineering can be seen in the evolution from pre-Cloud Data Warehouses to the new Data Lakehouse systems.

Here is a list of the tools that make up a Data Lakehouse; together, they really are a “deconstructed” Data Warehouse:

  • Specialized file formats (e.g. Avro, Parquet).
  • A distributed file system (e.g. S3, GCS).
  • A metadata store (e.g. Hive Metastore).
  • A compute engine (e.g. Spark, Presto).
  • A transaction layer (e.g. Delta Lake, Apache Hudi).
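To make the composition concrete, here is a hedged PySpark configuration sketch wiring those pieces together. It assumes a cluster with Delta Lake and a Hive Metastore available, and the bucket path is made up; treat it as an illustration of how the components plug in, not a runnable recipe.

```python
from pyspark.sql import SparkSession

# Compute engine: Spark itself.
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    # Metadata store: the Hive Metastore.
    .config("spark.sql.catalogImplementation", "hive")
    # Transaction layer: Delta Lake's session extension and catalog.
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Distributed file system + specialized file format: Parquet files
# under an S3 prefix, tracked by Delta's transaction log.
df = spark.read.format("delta").load("s3://my-bucket/events")  # path is illustrative
```

Each line maps onto one bullet above, which is exactly the “deconstructed warehouse” idea: every layer is a separate, swappable tool.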

I imagine the new generation of proprietary Cloud Data Warehouse offerings is also made of specialized components, maybe even using some version of the open-source tools I just listed, but I have limited visibility into these architectures (so I can only guess 😃).

As Data Engineers, we can learn from what microservice architecture aims to solve and be aware of the dangers of tightly coupled software.

I personally find it interesting that the modern data stack is also trending towards modular and specialized components, a trend we have seen in back-end engineering but also on the front end with “micro frontends”. 🤔

So, Keep it Simple, and maybe Small too?

Additional Gilt Microservices Resources

I want to share some more Gilt.com resources that were sent to me by a Gilt alumnus (thank you!!)
