How to Get a Job as a Data Engineer
By Anna Anisienia, Python Engineer at TrailStone Renewables.
Data engineering is a fascinating field. You get to work with a variety of interesting data, cutting-edge technologies, as well as with diverse teams of data professionals and domain experts. The entire field of data engineering is relatively new. As a data engineer, your role is crucial to the company’s success — many data professionals, including data analysts and data scientists, rely on you in order to do their work. You are responsible for equipping them with data that is always available, reliable, and in a proper structure.
Companies need you so that they can make informed decisions based on real data and the KPIs generated from it. And they are willing to pay you well if you are good at it! Let’s look at what skills are in high demand, what factors play a large role in future career prospects, and how to approach the technical interview.
Overall, it’s usually hard to give any truly general advice, but I summarize the skills that seem to be the most relevant, from what I saw being mentioned numerous times in job ads and from my experience in the field.
1. Being a T-shaped professional
It’s considered best to aim for being a generalist (the horizontal bar in T) in the sense that you understand the general concepts of databases, cloud computing, data warehousing, big data, and that you know at least some basics of SQL, Python, Docker, and creating ETL.
At the same time, you should have stronger skills in at least one particular area (the vertical bar in T). For instance, you might be really good at writing Spark or Dask data manipulations, or you may have some particular domain knowledge required by the company you apply for, which sets you apart from other applicants.
In many cases, knowing SQL well plus the basics of Python, Linux, and AWS can already get you a fairly paid junior position.
2. Cloud services for working with data
Cloud computing has revolutionized many industries. As a data engineer, you need to know the most important services for storage, compute, networking, and databases. If you don’t know much about those, I highly recommend learning Amazon Web Services. Even if you end up using Google Cloud Platform or Microsoft Azure, the concepts learned from AWS can be easily applied when switching to a different cloud vendor, since many services across cloud vendors are analogous and their concepts are virtually the same (e.g., block storage vs. object storage vs. NFS).
If you are new to AWS, there are great free courses on AWS that are offered directly by AWS. You don’t need to pay for the extra certificate: from my experience, recruiters and engineering managers don’t really care that much about certifications. They want people with hands-on experience who know a lot and can apply it to business problems.
The most important AWS services for a data engineering position are:
- Being able to programmatically interact with files on S3 (to download and upload a CSV or Parquet file)
- Being able to spin up and SSH into an EC2 instance + knowing some Linux basics to be able to interact with it by using the CLI
- IAM: knowing how to create an IAM user, attach a policy for relevant services, and use it to configure programmatic access with the AWS CLI + the basics of how IAM roles work
- VPC: knowing what a VPC and a subnet are and the basics of how they work (your VPC exists in a specific AWS region, and a subnet in a specific Availability Zone within that region)
- RDS: knowing how to spin up, or at least interact with, a relational database such as Postgres
Additionally, it’s good to know AWS Lambda (serverless Function as a Service), ECS & EKS (running containers at scale), Amazon Redshift (cloud data warehouse), Athena (serverless query engine to query an S3 data lake), and Amazon Kinesis or Amazon MSK (both are used for real-time streaming data). But you can focus on the ones presented in the bulleted list first. The courses from edX explain most of them. Plus, remember to practice: with the AWS free tier, you get (limited) access to those basic services so that you can play around and learn by doing.
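To make the first item in the list above more concrete, here is a minimal sketch of uploading and downloading a file with boto3, the AWS SDK for Python. The helper names (`make_s3_client`, `upload_csv`, `download_csv`), the bucket, and the keys are hypothetical and chosen only for illustration. The sketch assumes that boto3 is installed and that credentials are already configured, for example via the AWS CLI.

```python
from pathlib import Path


def make_s3_client():
    """Create an S3 client; credentials are read from your AWS configuration."""
    import boto3  # third-party package: pip install boto3

    return boto3.client("s3")


def upload_csv(client, local_path, bucket, key):
    """Upload a local file to s3://<bucket>/<key>."""
    client.upload_file(str(local_path), bucket, key)


def download_csv(client, bucket, key, local_path):
    """Download s3://<bucket>/<key> to a local file and return its path."""
    local_path = Path(local_path)
    local_path.parent.mkdir(parents=True, exist_ok=True)
    client.download_file(bucket, key, str(local_path))
    return local_path


# Example usage (hypothetical bucket and keys, requires real AWS credentials):
# s3 = make_s3_client()
# upload_csv(s3, "sales.csv", "my-example-bucket", "raw/sales.csv")
# download_csv(s3, "my-example-bucket", "raw/sales.csv", "downloads/sales.csv")
```

Passing the client in as an argument keeps the helpers easy to test with a stand-in object, without touching a real bucket.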
3. Building ETL pipelines
Being a data engineer is a lot about integrating data from various sources, bringing it to a form appropriate for analysis, and then loading it to some data lake or data warehouse. You should have some experience in creating ETL pipelines. It doesn’t mean that you must have worked on a Big Data project for a large company. Even your self-driven projects shared on GitHub or in a blog post can get you far in the application process and make you stand out from the crowd.
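As an illustration of what a very small self-driven ETL project could look like, the sketch below extracts rows from CSV text, drops incomplete records, and loads the result into SQLite. It uses only the Python standard library. The sample data, the table name, and the function names are made up for this example, and a real pipeline would read from files, APIs, or databases instead.

```python
import csv
import io
import sqlite3

# Made-up sample data; the second order has no amount and will be dropped.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,alice,4.25
"""


def extract(raw_text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: skip rows without an amount and cast the column types."""
    return [
        (int(row["order_id"]), row["customer"].strip().lower(), float(row["amount"]))
        for row in rows
        if row["amount"].strip()
    ]


def load(records, connection):
    """Load: write the cleaned records into a SQLite table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    connection.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", records)
    connection.commit()


def run_pipeline(raw_text, connection):
    """Run extract, transform, and load in sequence."""
    load(transform(extract(raw_text)), connection)
```

Calling `run_pipeline(RAW_CSV, sqlite3.connect(":memory:"))` runs all three steps against an in-memory database. Using `INSERT OR REPLACE` with a primary key means rerunning the pipeline does not create duplicate rows.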
4. Managing, monitoring, and scheduling ETL pipelines
One of the main responsibilities of data engineers is to ensure that the data is always available, reliable, and in a proper structure. To achieve this, you need to schedule and monitor your data pipelines. Many companies use workflow management systems such as Apache Airflow or Prefect for this purpose, so knowing one of them may significantly improve your chances of getting a great data engineering job. If you want to learn more about those, read my previous stories, such as this one — in that article, I’m demonstrating how to easily set up a workflow management system with a serverless Kubernetes cluster on AWS.
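To show the kind of work a workflow management system takes off your hands, here is a toy sketch of two of its core features: running tasks in dependency order and retrying failed tasks with logging. It is not how Airflow or Prefect are implemented, and it leaves out scheduling, alerting, and a UI. All names are hypothetical, and it assumes Python 3.9+ for `graphlib`.

```python
import logging
from graphlib import TopologicalSorter  # standard library, Python 3.9+

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("toy_scheduler")


def run_with_retries(name, task, max_retries=2):
    """Run one task, logging every attempt and retrying on failure."""
    for attempt in range(1, max_retries + 2):
        try:
            logger.info("Running %s (attempt %d)", name, attempt)
            return task()
        except Exception:
            logger.exception("Task %s failed", name)
            if attempt > max_retries:
                raise  # retries exhausted: surface the error


def run_workflow(tasks, dependencies):
    """Run all tasks in an order that respects their dependencies.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of upstream task names
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        run_with_retries(name, tasks[name])
    return order
```

With `dependencies = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}`, the tasks run as extract, then transform, then load.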
5. Ability to work with containers: Docker & Kubernetes
If you work with Python, you know that your code may suddenly stop working because you upgraded to a new pandas version. Containerization addresses this, so being able to work with containerized workloads is one of the most crucial and most in-demand skills in (any) engineering job: it makes your code self-contained, with its dependencies packaged alongside it, and lets you deploy your code to virtually any environment.
6. Knowing basic concepts
This goes together with being a T-shaped professional: you should know the basics of data warehousing, data lakes, Big Data, REST APIs, and databases. It would be rather disappointing to fail at explaining the 3Vs of Big Data or data warehouse characteristics during your job interview. Additionally, it’s worth knowing the architectural components. For instance, in this post, I discuss data warehouse architectures and key considerations when migrating to the cloud.
7. Ability to work and learn independently
This goes without saying: with technologies evolving so fast, it’s crucial that you are a self-directed learner and that you are willing to continuously learn and experiment with new tools. It doesn’t mean that you need to follow every hype, but rather that you stay open-minded.
8. Coding skills
Programming doesn’t mean that you must be a “hacker” or that you need to spend all day doing nothing but writing code. It’s rather about being able to learn quickly and knowing how to write good abstractions. In the field of data engineering, this means that you know how to create code that is DRY (Don’t Repeat Yourself): you don’t copy and paste the same code from one script to another, but you know how to write functions or classes in a modular and reusable way. Clean code that can be reused, extended, and parametrized is easy to maintain and will save you and others time.
To give you an example: I once worked for a company where there was almost no modularity in place. In almost every Python project, people were copying over the same code to establish logging, connect to a data warehouse and load some data to and from it, or to establish an S3 client and download a CSV file from some S3 bucket. To improve this, I created a Python package:
- It included all the functions that were needed in almost any project, and I pushed it to a new GitHub repository.
- This package could then be installed anywhere via:
pip install git+https://github.com/<COMPANY>/<PACKAGE_NAME>.git
This package saved us all a lot of time in the long run and made the codebase much cleaner.
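A shared helper in such a package might look like the sketch below. This is not the actual package described above. The function name `get_logger` and the log format are assumptions for illustration. The idea is that every project imports the same logging setup instead of copying it.

```python
import logging
import sys


def get_logger(name, level=logging.INFO):
    """Return a logger with a single stdout handler and a consistent format."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s | %(name)s | %(levelname)s | %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

In a project, usage would then be a single call, for example `log = get_logger("daily_sales_etl")` followed by `log.info("Job started")`.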
If you are a Python beginner, then you don’t need to learn how to create packages. At first, it may be enough if you can write good Python functions and if you know how to work with basic packages for data manipulation such as Pandas.
Many companies also look for data engineers who know Scala, Java, R, or C (or any other language you can think of) — regardless of the programming language, you can get a much better job if you understand the basic data types for working with data, as well as the principles of functional programming and modularity.
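As a small illustration of functional programming and modularity, the sketch below builds a record-cleaning step out of small pure functions. Each function returns a new dictionary instead of modifying its input. The field names and the `compose` helper are hypothetical and only serve the example.

```python
from functools import reduce


def strip_whitespace(record):
    """Return a copy of the record with surrounding whitespace removed."""
    return {key: value.strip() for key, value in record.items()}


def lowercase_email(record):
    """Return a copy of the record with a lowercased email address."""
    return {**record, "email": record["email"].lower()}


def compose(*steps):
    """Combine single-record functions into one, applied left to right."""
    return lambda record: reduce(lambda acc, step: step(acc), steps, record)


# Small steps are combined into one reusable cleaning function.
clean_record = compose(strip_whitespace, lowercase_email)
```

Because every step is independent, new steps can be added or reordered without touching the existing ones.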
9. Command Line
Being able to work with the Linux operating system and to interact with it using bash commands is one of the most crucial skills, and it will make you much more efficient.
Many frameworks and cloud services work in such a way that we define our resources and services in declarative files (such as a Dockerfile or Kubernetes YAML files), which can then be deployed via a command-line interface (CLI). This paradigm is often known as Infrastructure as Code. For instance, the AWS CLI allows you to provision an entire cluster of resources simply by running commands in your shell that call the AWS API. Other cloud providers (such as GCP or Azure) offer similar command-line interfaces.
10. Soft skills
Some may expect a data engineer to be a person who is doing nothing but writing ETL and crunching numbers. But in every job, it pays off to have skills that complement your profile. Imagine that you have two candidates:
- An excellent coder but a poor public speaker,
- An average coder but at the same time a great public speaker.
Which one would you hire? Many companies would pick the latter. Employers look for well-rounded individuals who also have important soft skills such as project management, public speaking, documentation, or moderating and organizing events.
Factors playing a large role in your career prospects
Salaries in data engineering jobs vary depending on the location, industry, required skills, and level of experience. Below, I list the 7 most important factors that determine salary and future growth. Some of them are obvious, but others may surprise you:
- Location: even if you apply for a remote job, chances are that the company will pay you based on the standards of the country you live in, to reflect the cost of living there.
- Industry: companies in the finance, automotive, tech, or pharmaceutical industries often pay much better than startups and e-commerce.
- Years of experience: recruiters are obsessed with it, even though the years themselves don’t really tell much about how much you learned from your previous jobs.
- Expertise: years of experience are not equivalent to expertise (at least I think so). Often people are just great at Spark, Linux, Dask, or advanced SQL. And if you can prove that you really know it well, it may be worth more than 20 years of experience doing drag-and-drop ETL.
- Hands-on experience: nothing is worth more in engineering than hands-on experience. Nobody can benefit from our knowledge if we can’t apply it in real life. Do personal projects and practice. Don’t just read something and think that you already know it. If you didn’t apply it, it’s all just theory that you will soon forget.
- Education: I personally found that recruiters don’t look as much at your education as I would expect. Of course, they check whether you have a Bachelor’s or Master’s degree or even a Ph.D., but it often doesn’t matter much to recruiters which university you attended or what your subject was. The same is true of certifications: many technical managers value your actual experience with specific tools or programming languages higher than any official proof of your knowledge, and they might prefer to verify your knowledge themselves in the technical interview rather than relying on certificates.
- Your special skills, domain knowledge, and soft skills (for instance, the ability to handle conflicts) are more important than you might expect. Recruiters may reject somebody because they feel that this person simply doesn’t fit into the team’s and company’s culture.
I have heard of cases where an applicant could not say, in a phone interview, what the company he or she applied to actually does. Also, questions like “Tell me about yourself” and “Why do you want to switch to a new company?” are so common that it’s good to think about them in advance.
Additionally, if you plan to apply, you should be prepared for some (basic) technical questions. Many data engineering managers ask you to design a star schema for a given scenario, or pose coding questions such as: What are SQL window functions? What are generators, broadcasting, or list comprehensions in Python? What is the difference between a Docker image and a Docker container? How would you go about creating a Docker image and running a Docker container?
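To practice a few of these topics, a short sketch like the one below can help. It contrasts a list comprehension with a generator and runs a SQL window function through Python’s built-in sqlite3 module. The table and data are made up, and the window function part assumes that your Python is linked against SQLite 3.25 or newer, which is the case for most current installations.

```python
import sqlite3

# List comprehension: builds the complete list in memory right away.
squares_list = [n * n for n in range(5)]


# Generator: produces values lazily, one at a time, as they are requested.
def squares(limit):
    for n in range(limit):
        yield n * n


# SQL window function: a running total per customer that keeps every row,
# unlike GROUP BY, which would collapse each customer into a single row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("alice", 1, 10.0), ("alice", 2, 5.0), ("bob", 1, 7.0)],
)
running_totals = conn.execute(
    """
    SELECT customer,
           order_day,
           SUM(amount) OVER (PARTITION BY customer ORDER BY order_day) AS running_total
    FROM orders
    ORDER BY customer, order_day
    """
).fetchall()
```

Here `running_totals` holds one row per order, with the third column accumulating the amounts within each customer.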
And lastly, believe in yourself and stay confident.
Original. Reposted with permission.