By Deep Menasigi, Senior Consultant – Altis Sydney
Workflow management plays a vital role in data engineering. Traditionally, ETL pipelines were orchestrated using legacy, monolithic client-server applications such as Control-M, Tivoli and AutoSys. These applications gave developers limited control, with little or no room for customization.
These tools also posed major concerns from both a scalability and a cost standpoint. There was a need for a more robust, scalable and extensible tool to solve the workflow management and orchestration problems of the modern data engineering world.
Airflow was designed and developed at Airbnb in 2014 as an open-source project to address these problems. It joined the Apache Software Foundation’s Incubator program in March 2016, and the Foundation announced Apache Airflow as a Top-Level Project in January 2019. Some of the key benefits of using Airflow for pipeline orchestration are:
- Dynamic: Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This allows you to write code that instantiates pipelines dynamically.
- Extensible: Easily define your own operators or extend the library so that it fits the level of abstraction that suits your environment.
- Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful Jinja templating engine.
- Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
MWAA, which stands for Amazon Managed Workflows for Apache Airflow, is a managed offering from AWS of the same open-source Apache Airflow. With MWAA you get the features of Airflow without the overhead of managing the cluster.
It is quite easy to set up a working MWAA environment, and you can do so either through the AWS console or through CloudFormation. You can set up a standard Airflow instance that provides access to the web UI through a public endpoint, or a secure instance that uses only private subnets and can be accessed only from within the VPC.
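As a rough illustration of the CloudFormation route, the Python sketch below builds a minimal template for a single `AWS::MWAA::Environment` resource and switches the web server between public and private access. The function name `mwaa_template`, the logical ID `AirflowEnvironment`, the `dags` folder and every ARN and ID in the usage example are placeholders assumed for illustration. A real environment also needs an existing S3 bucket, execution role, subnets and security group, none of which are created here.

```python
import json


def mwaa_template(env_name, bucket_arn, role_arn, subnet_ids,
                  security_group_ids, private=False):
    """Return a minimal CloudFormation template (as a dict) for one MWAA environment.

    All arguments are placeholders supplied by the caller; nothing is looked up in AWS.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            # "AirflowEnvironment" is a hypothetical logical ID chosen for this sketch.
            "AirflowEnvironment": {
                "Type": "AWS::MWAA::Environment",
                "Properties": {
                    "Name": env_name,
                    "SourceBucketArn": bucket_arn,
                    # Assumed folder inside the bucket that holds the DAG files.
                    "DagS3Path": "dags",
                    "ExecutionRoleArn": role_arn,
                    "NetworkConfiguration": {
                        "SubnetIds": subnet_ids,
                        "SecurityGroupIds": security_group_ids,
                    },
                    # PUBLIC_ONLY exposes the web UI on a public endpoint;
                    # PRIVATE_ONLY restricts it to access from within the VPC.
                    "WebserverAccessMode": "PRIVATE_ONLY" if private else "PUBLIC_ONLY",
                },
            }
        },
    }


if __name__ == "__main__":
    template = mwaa_template(
        env_name="demo-airflow",
        bucket_arn="arn:aws:s3:::example-dag-bucket",
        role_arn="arn:aws:iam::123456789012:role/example-mwaa-role",
        subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],
        security_group_ids=["sg-cccc3333"],
        private=True,
    )
    print(json.dumps(template, indent=2))
```

The printed JSON could be saved to a file and deployed as a stack once the placeholder values are replaced with real resources.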
Security is a major concern for every enterprise, and one of the key benefits of using a managed AWS service is that you can leverage the AWS shared responsibility model to secure your application and its data.
MWAA is well integrated with IAM, and access to the application can be controlled through a combination of IAM roles and VPC security groups.
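To make the IAM side more concrete, here is a minimal sketch of an identity policy that lets a user sign in to the Airflow web UI of one environment with a given Airflow role. It assumes the `airflow:CreateWebLoginToken` action and the `role/<environment>/<airflow-role>` resource format; the helper name `web_login_policy`, the region, the account ID and the environment name are placeholders, not values from this post.

```python
import json

# Default Airflow roles that a web login token can be scoped to.
AIRFLOW_ROLES = {"Admin", "Op", "User", "Viewer", "Public"}


def web_login_policy(region, account_id, env_name, airflow_role="Viewer"):
    """Return an IAM policy document (as a dict) allowing web UI login.

    The policy grants airflow:CreateWebLoginToken on a single environment
    and a single Airflow role; all inputs are caller-supplied placeholders.
    """
    if airflow_role not in AIRFLOW_ROLES:
        raise ValueError(f"unknown Airflow role: {airflow_role}")
    resource = f"arn:aws:airflow:{region}:{account_id}:role/{env_name}/{airflow_role}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "airflow:CreateWebLoginToken",
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    policy = web_login_policy("ap-southeast-2", "123456789012", "demo-airflow")
    print(json.dumps(policy, indent=2))
```

Defaulting to the `Viewer` role keeps the sketch on the side of least privilege; network-level access is still governed separately by the VPC security groups attached to the environment.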
If you’d like to learn more, watch our webinar on MWAA below. It provides an overview of Apache Airflow and includes a small demo in which we spin up a fresh Airflow instance.
Interested in watching more of our past webinars? View our previous Free Data & Analytics webinar recordings here.