The New Data Engineering Stack | by Kovid Rathee | Nov, 2020
Technologies for the Complete Data Engineer
Remember when the software development industry realized that a single person could take on multiple tightly coupled technologies, and came up with the notion of a Full Stack Developer — someone who does data modelling, writes backend code and also does frontend work? Something similar has happened to the data industry with the birth of the Data Engineer almost half a decade ago.
For many, the Full Stack Developer remains a mythical creature because of the never-ending list of technologies that cover frontend, backend and data. A complete Data Engineer, on the other hand, doesn’t sound as far-fetched or mythical. One of the reasons for that could be the fact that visualisation (business intelligence) has become a massive field in its own right.
A Data Engineer is supposed to build systems that make data available, make it usable, move it from one place to another and so on. Although many companies want their data engineers to do visualisations, it is not a common practice. Still, the BI skillset is definitely a good-to-have for a Data Engineer.
Here, I am going to talk about the technologies which are too important to ignore. You don’t have to master all of them. That’s not possible, anyway. It is important to be aware and somewhat skilled at most of these technologies to do good things in the data engineering space. Don’t forget that newer technologies will keep coming and older technologies, at least some of them, will keep moving out.
The philosophy behind listing these technologies comes from a simple idea borrowed from the investing world — where the world is going.
Even with the bursting on the scene of many unconventional databases, the first thing that comes to mind when we talk about databases is still relational databases and SQL.
All relational databases, more or less, work in the same way. Internal implementations differ, obviously. It's more than enough to be skilled in one or two of the four major relational databases — Oracle, MySQL, Microsoft SQL Server and PostgreSQL. I haven't heard of a company that works without a relational database, no matter how advanced and complex their systems are.
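The fundamentals carry over between engines. Here's a minimal sketch using Python's built-in sqlite3 as a stand-in for any relational database — the table and values are illustrative, and the same SQL would run, give or take dialect quirks, on any of the Big Four:

```python
import sqlite3

# In-memory database; in practice you'd swap this for a Postgres/MySQL driver.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("acme", 120.0), ("acme", 80.0), ("globex", 45.5)],
)
conn.commit()

total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # → 245.5
```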
The Big Four — Oracle, MySQL, MS SQL Server, PostgreSQL.
DB-Engines maintains rankings for all kinds of database engines. Head over just to see what kinds of databases companies are using these days.
OLTP-based relational databases are, by definition, meant for transactional loads. For analytical loads — data lakes, data warehouses, data marts — there's another list of databases. In theory, you can create data warehouses using OLTP databases, but at scale it never ends well. Been there, done that.
Data warehouses have a different set of database management systems, the most popular out of which are Google BigQuery, Amazon Redshift, Snowflake, Azure Data Warehouse and so on. The choice of a data warehouse usually defaults to the cloud service provider a company is using. For instance, if a company’s infrastructure is on AWS, they’d surely want to use Amazon Redshift as their data warehouse for reducing friction.
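The SQL shape of a warehouse query is the same as anywhere else; what differs is the engine underneath (columnar storage, massively parallel scans). A small aggregation sketch, again with sqlite3 standing in for a warehouse and made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("apac", 10.0), ("apac", 30.0), ("emea", 25.0)],
)

# Warehouses are built for exactly this scan-and-aggregate pattern,
# just across billions of rows instead of three.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # → [('apac', 40.0), ('emea', 25.0)]
```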
The Big Four — BigQuery, Redshift, Snowflake, Azure DW.
Having said that, there's a good chance the future of the cloud is not one cloud but multi-cloud, which means companies will be able to choose their data warehouses almost irrespective of where their existing infrastructure is, without worrying too much about inter-cloud friction.
Different use cases require different solutions. Geospatial data requires geospatial databases like PostGIS; time-series data sometimes requires specialised time-series databases like InfluxDB or TimescaleDB. Document-oriented databases and key-value stores have made their place in the database ecosystem by offering something that relational databases had struggled to offer for the longest time, i.e., the ability to efficiently store, retrieve and analyse semi-structured and unstructured data.
The Big Eight — MongoDB, InfluxDB, neo4j, Redis, Elasticsearch, CosmosDB, DynamoDB, Cloud Datastore.
Then there are graph databases, in-memory data stores and full-text search engines — solutions for very specific problems. It's difficult to choose from hundreds of databases, but these are the major ones. The ones I have left out are probably close seconds to these eight.
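The data models these stores offer can be illustrated in a few lines. In this sketch a plain dict stands in for a key-value store like Redis, and a list of nested records for a document store like MongoDB — the keys and documents are made up:

```python
import json

# Key-value model (Redis-style): an opaque value behind a string key.
kv_store = {}
kv_store["session:42"] = json.dumps({"user": "kovid", "ttl": 3600})

# Document model (MongoDB-style): semi-structured records, no fixed schema.
documents = [
    {"_id": 1, "user": "kovid", "tags": ["data", "sql"]},
    {"_id": 2, "user": "alex"},  # missing fields are fine
]

# A "query" over documents tolerates the irregular structure.
tagged = [d["user"] for d in documents if "tags" in d]
print(tagged)  # → ['kovid']
```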
With cloud service providers like AWS, Azure and Google Cloud going mainstream, infrastructure has been democratised to a great degree. Smaller companies don't have to worry about the CapEx incurred by infrastructure anymore.
It couldn't have been more of a blessing for data engineering: all the major providers offer a host of amazing services that charge on a pay-for-what-you-use basis. Companies have moved to the serverless computing model, where the infrastructure is up only for the time the compute & memory are needed. Persistent storage is a separate service.
The Big Three — Google Cloud, Azure, AWS.
For a data engineer, it's important to know the major data-related cloud services of at least one of the three cloud providers. Take AWS as an example: if you're a Data Engineer working on AWS, you should know about S3 & EBS (for storage), EC2 & EMR (for compute & memory), Glue & Step Functions & Lambda (for orchestration) and more. The same goes for the other cloud providers.
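Much of the day-to-day work with these services happens through an SDK like boto3. A minimal sketch of the usual first hop of a pipeline on AWS — pushing an extract to S3; the bucket and key names are illustrative, and running it for real requires boto3 and AWS credentials:

```python
def upload_extract_to_s3(local_path, bucket, key):
    """Upload a local file to S3 using boto3's managed upload.

    The bucket/key here are placeholders, not real resources.
    """
    import boto3  # imported lazily so the sketch loads without AWS set up

    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```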
For more engineering-centric teams, Airflow has been the obvious choice for an orchestrator in the last two to three years. Cloud platforms have their own orchestrators. For instance, with AWS, you can use a mix of Glue, Step Functions and Lambda. Google Cloud offers a fully-managed cloud version of Airflow called Cloud Composer. Azure also offers similar services.
The Big One — Airflow.
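At its core, an orchestrator runs tasks in dependency order. This toy runner illustrates that idea in plain Python — it is not Airflow's API, just the scheduling problem Airflow solves for you, with made-up task names:

```python
# A toy DAG runner: tasks plus their upstream dependencies, executed in order.
# Assumes an acyclic graph; a real orchestrator also handles retries,
# backfills, scheduling and distributed workers.
def run_dag(tasks, deps):
    done, order = set(), []
    while len(done) < len(tasks):
        for name in tasks:
            if name not in done and deps.get(name, set()) <= done:
                order.append(tasks[name]())
                done.add(name)
    return order

results = run_dag(
    {"extract": lambda: "extracted",
     "transform": lambda: "transformed",
     "load": lambda: "loaded"},
    {"transform": {"extract"}, "load": {"transform"}},
)
print(results)  # → ['extracted', 'transformed', 'loaded']
```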
Some of the old school orchestration, workflow engines and ETL tools have adapted well and are still very much relevant. For instance, Talend is still used widely as an orchestrator. This brings us to the much dreaded ETL.
All things considered, SQL has been the best option for doing ETL to date. Recently, other technologies like Spark have entered the space, where more compute & memory gives you quicker results by exploiting the principles of MPP computing.
Traditionally, ETL has been done mostly using proprietary software but those days are long gone now. More open-source toolkits are available in the market to be used by the community. There’s also a host of fully-managed ETL solutions provided by companies that are dedicated to data integration and ETL. Some of them are Fivetran, Panoply and Stitch. Most of these tools are purely scheduled or triggered SQL statements getting data from one database and inserting into another. This is easily achievable by using Airflow (or something similar).
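The get-data-from-one-place-and-insert-it-into-another pattern described above fits in a few lines. A minimal sketch with an inline CSV as the source and sqlite3 as the target — the table and field names are illustrative:

```python
import csv
import io
import sqlite3

# Extract: in practice this would come from a source database or an API.
raw = io.StringIO("customer,amount\nacme,120\nglobex,45.5\n")
rows = list(csv.DictReader(raw))

# Transform: cast types, derive fields — the "T" that tools often skimp on.
cleaned = [(r["customer"], float(r["amount"])) for r in rows]

# Load: into a target database (sqlite3 standing in for a warehouse).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?)", cleaned)
count = conn.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0]
print(count)  # → 2
```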
The Big Two — SQL, Spark.
Fishtown Analytics's dbt is one of the few tools that concentrates on solving the Transformation layer problems of ETL. The fact that dbt is completely SQL-based makes it so attractive to use. I'm looking forward to having cloud dbt services by the major cloud providers. Something might already be in the works.
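A dbt model is essentially a templated SELECT that dbt materializes as a table or view. This toy rendering only mimics the shape of the idea — real dbt uses Jinja, and its ref() function resolves model names to actual relations; the names here are made up:

```python
# A dbt-style model is a SELECT statement; str.format stands in
# for dbt's Jinja templating in this sketch.
model_sql = (
    "SELECT customer, SUM(amount) AS total "
    "FROM {ref_stg_orders} GROUP BY customer"
)
rendered = model_sql.format(ref_stg_orders="analytics.stg_orders")
print(rendered)
```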
The DevOps space has split into three in the past couple of years — core DevOps, DataOps and DevSecOps. Data Engineers are expected to know their infrastructure now. This means that whatever stack they are using, they should be able to resolve operational issues concerning the infrastructure — databases, data pipelines, data warehouses, orchestrators, storage and so on.
For provisioning and maintaining infrastructure, there are several cloud-platform-independent tools like Pulumi and Terraform in the market. Platform-specific tools like CloudFormation (for AWS) have also seen wide acceptance.
The Big Two — Terraform, Pulumi.
If you have drunk the kool-aid of a multi-cloud future, it’s better to know at least one of the two aforementioned Infrastructure-as-Code tools. IaC comes with its own benefits like the ease of implementing immutable infrastructure, increased speed of deployment and so on.
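To make IaC concrete, here's a minimal Terraform-style fragment; the bucket name and region are made up, and a real configuration would also need a state backend, variables and provider versions:

```hcl
# Declarative: Terraform computes the diff between this and what exists.
provider "aws" {
  region = "ap-southeast-2"
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "example-company-data-lake" # illustrative name
}
```

Because the definition lives in source control, the same file can recreate the infrastructure from scratch — the immutable-infrastructure benefit mentioned above.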
Whether it is deploying infrastructure or SQL scripts or Spark code, a continuous integration and continuous deployment pipeline is the standard way to do it. Gone are the days (or gone should be the days) when engineers used to have access to the machines and they’d log in to a database and execute the DDL for a stored procedure on the database server.
The Big Four — Jenkins, AWS CodePipeline, Google Cloud Build, Azure DevOps.
Many have realized the risk of doing that, unfortunately after many years of having suffered from unintended human errors.
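As an illustration, a CI step in Google Cloud Build's cloudbuild.yaml shape that runs database migrations on every commit instead of by hand — the container image choice and script name are illustrative:

```yaml
# Each commit runs migrations through the pipeline, not a manual session.
steps:
  - name: "python:3.11"                  # container the step runs in
    entrypoint: "python"
    args: ["scripts/run_migrations.py"]  # illustrative script
```

Jenkins, CodePipeline and Azure DevOps express the same idea with their own pipeline syntax.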
The whole point of the data engineering exercise is to make the data available to the data scientists, data analysts, and business people. Without proper testing, any project is at risk of catastrophic failure. Manual testing of data is highly inefficient and, honestly, it isn’t doable at scale.
The Big Two — Pytest, JUnit.
So the best way out is to automate the tests. Any of the test automation frameworks available for testing backend code also works for testing Data Engineering components. Tools like dbt can also be used for automated testing. Otherwise, widely used BDD tools like Cucumber and Gherkin are available. Pytest, JUnit and others can also be used.
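Automated data tests often boil down to assertions over query results. A pytest-style sketch — the table and the business rule are illustrative:

```python
import sqlite3

def check_no_null_customers(conn):
    """Fail the pipeline if a basic data-quality rule is violated."""
    nulls = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE customer IS NULL"
    ).fetchone()[0]
    assert nulls == 0, f"{nulls} orders have no customer"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES ('acme', 10.0)")
check_no_null_customers(conn)  # passes; pytest would collect a test_* version
print("data checks passed")
```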
I have already written about source control for SQL. I don't want to repeat all the information I had shared in the other piece, so I am just sharing the link here.
Source control everything. The pipelines, the database DDLs, the orchestrator code, test cases. Everything.
Although Python should be the obvious answer to the question of which language data engineers use, there is a host of technologies built on Java & Scala. The whole Hadoop ecosystem is based on Java. Talend, the orchestrator + ETL tool, is also written in Java.
Not everyone is required to know both languages, though. Most widely used technologies have a wrapper for the other language to make the product more accessible. The most common example of this is PySpark, which allows Data Engineers to use Python to interact with Spark.
The Big Three — SQL, Python, Java.
The same can be said for SQL. If there was one language that data engineers should understand, it is SQL. After all, it is the language data speaks.
A Data Engineer is not just an ETL person now. They're not just a database person either. A Data Engineer is an amalgamation of all the things we have talked about in this piece, and maybe some more. Again, remember that mastery of all these technologies is not possible, but one can certainly be aware of most of them and skilled in some. That's the need of the hour, and it will probably remain so for the next couple of years.