Managed Apache Airflow on AWS — New AWS Service For Data Pipelines | by Anna Anisienia | Dec, 2020
There are many potential reasons why AWS may have decided to create this managed service.
1. Market share
The first, quite obvious reason could be market share: Apache Airflow has become one of the most popular workflow orchestration tools, and Google Cloud has already been offering it as a managed service (Cloud Composer).
However, I’m not sure this was one of AWS’s main concerns when deciding whether to create a service for Apache Airflow. If that were the case, they would have done it much sooner.
2. The missing piece in the AWS data engineering space as compared to GCP
So far, Google Cloud Platform has (subjectively) offered the most comprehensive range of data engineering products, with services such as BigQuery (a serverless, cost-effective, and ridiculously feature-rich DWH), Cloud Composer (Apache Airflow), Cloud Dataflow (Apache Beam), and many more.
AWS offers corresponding data services such as Redshift (DWH), AWS Glue (metadata catalog and ETL), and Amazon Athena (serverless Presto), but the missing part was an orchestrator to glue those services together. With MWAA, this gap can potentially be filled.
Side note: you may say that AWS already has the Simple Workflow Service and Step Functions. But those services are cumbersome for orchestrating ETL and ML data pipelines, which most practitioners prefer to define in Python, the language of data.
Apart from that, MWAA makes it easier to build ML pipelines with SageMaker, as it lets you combine all preprocessing and model-training steps into a single repeatable process.
3. Making open-source tools easily accessible from AWS
AWS has consistently demonstrated its commitment to making open-source technologies more accessible and easier to manage. So far, it has introduced managed services for:
- Kafka (Amazon MSK)
- Presto (Amazon Athena)
- Hadoop (Amazon EMR)
- Open-source relational databases such as Postgres, MySQL, MariaDB (RDS)
- Kubernetes (AWS EKS)
- Elasticsearch (Amazon Elasticsearch Service)
- Kibana (directly integrated within Elasticsearch service)
- Apache MXNet, PyTorch, TensorFlow & Jupyter (SageMaker)
- Redis and Memcached (ElastiCache)
Now, AWS is adding Apache Airflow to this list with the MWAA service.
4. Customer centricity & easier way of deploying secure Airflow environments
“Customers have told us they really like Apache Airflow because it speeds the development of their data processing and machine learning workflows, but they want it without the burden of scaling, operating, and securing servers,”
— Jesse Dougherty, Vice President, Application Integration, AWS. [Source]
AWS is customer-centric and security-aware. Judging by the above quote from the press release, many customers wanted Airflow on AWS, but it was simply too painful to deploy and maintain. A production-ready cloud-based Airflow environment involves many components (a metadata database, a scheduler, a webserver, workers, plus networking and access control), so many customers likely switched to competitors for an easier and more reliable setup.
Personally, I have come across Airflow UIs that were publicly exposed to anyone, with no RBAC in place. This happened because some people deployed Airflow via AWS Elastic Beanstalk without proper security mechanisms, which meant that anyone who knew the URL could access the environment.
Since the managed service introduces security by default (with encryption and single sign-on), hopefully fewer people will encounter such vulnerabilities in their Airflow deployments.
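As a rough sketch of what "secure by default" can look like in practice, here is how an MWAA environment with a private-only webserver and KMS encryption could be created via boto3's `mwaa` client. All names, ARNs, subnet and security-group IDs below are hypothetical placeholders, not values from any real account:

```python
# Sketch: creating a locked-down MWAA environment with boto3.
# Every name, ARN, and ID below is a hypothetical placeholder.
ENV_PARAMS = {
    "Name": "secure-airflow-env",
    "AirflowVersion": "1.10.12",  # the Airflow version MWAA launched with
    "ExecutionRoleArn": "arn:aws:iam::123456789012:role/mwaa-execution-role",
    "SourceBucketArn": "arn:aws:s3:::my-airflow-dags",  # S3 bucket holding DAGs
    "DagS3Path": "dags",
    # The security-relevant settings: no publicly reachable webserver URL,
    # and encryption at rest with a customer-managed KMS key.
    "WebserverAccessMode": "PRIVATE_ONLY",
    "KmsKey": "arn:aws:kms:eu-central-1:123456789012:key/example-key-id",
    "NetworkConfiguration": {
        "SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],  # two private subnets
        "SecurityGroupIds": ["sg-cccc3333"],
    },
}

def create_environment(params=ENV_PARAMS):
    """Create the MWAA environment (requires valid AWS credentials)."""
    import boto3  # imported here so the sketch is readable without boto3 installed
    return boto3.client("mwaa").create_environment(**params)
```

With `WebserverAccessMode` set to `PRIVATE_ONLY`, the Airflow UI is only reachable from within the VPC, and access to it goes through IAM rather than an open URL, which is exactly the failure mode of the Elastic Beanstalk deployments described above.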