
Mage: Your New Go-To Tool for Data Orchestration

Shreyash Panchal

Data Engineering

In our journey to automate data pipelines, we've used tools like Apache Airflow, Dagster, and Prefect to manage complex workflows. However, as data automation continues to change, we've added a new tool to our toolkit: Mage AI.

Mage AI isn't just another tool; it's a solution to the evolving demands of data automation. This blog aims to explain how Mage AI is changing the way we automate data pipelines by addressing challenges and introducing innovative features. Let's explore this evolution, understand the problems we face, and see why we’ve adopted Mage AI.

What is Mage AI?

Mage is a user-friendly open-source framework created for transforming and merging data. It's a valuable tool for developers who need to handle substantial data volumes efficiently. At its heart, Mage relies on “data pipelines” made up of code blocks. These blocks can run independently or as part of a larger pipeline, and together they form a structure known as a directed acyclic graph (DAG), which manages the dependencies between them. For example, you can use Mage for tasks like loading, transforming, and exporting data.
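To make the DAG idea concrete, here is a simplified sketch of how a pipeline's metadata.yaml records block dependencies. The block names (load_invoices, clean_invoices, export_invoices) are hypothetical, and the real file contains more fields than shown here:

blocks:
- uuid: load_invoices
  type: data_loader
  language: python
  upstream_blocks: []
  downstream_blocks:
  - clean_invoices
- uuid: clean_invoices
  type: transformer
  language: python
  upstream_blocks:
  - load_invoices
  downstream_blocks:
  - export_invoices
- uuid: export_invoices
  type: data_exporter
  language: python
  upstream_blocks:
  - clean_invoices
  downstream_blocks: []
name: invoice_pipeline
type: python
uuid: invoice_pipeline

Each block runs only after everything listed in its upstream_blocks has finished, which is exactly the dependency management the DAG provides.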

Mage Architecture:

Before we delve into Mage's features, let's take a look at how it works.

When you use Mage, a request begins its journey in the Mage Server Container, which serves as the central hub for handling requests, processing data, and running validations; it is also where real-time interactions with the UI take place. The Scheduler Process ensures tasks run on time, while Executor Containers, tailored to specific runtimes such as Python, PySpark, or AWS services, carry out the actual work.

Mage's scalability is impressive, allowing it to handle growing workloads effectively. It can expand both vertically and horizontally to maintain top-notch performance. Mage efficiently manages project assets, including code, data, and logs, and takes security seriously when handling databases and sensitive information. This well-coordinated system, combined with Mage's scalability, keeps data pipelines reliable, blending technical precision with seamless orchestration.

Scaling Mage:

To enhance Mage's performance and reliability as your workload expands, it's crucial to scale its architecture effectively. In this concise guide, we'll concentrate on four key strategies for optimizing Mage's scalability:

  1. Horizontal Scaling: Ensure responsiveness by running multiple Mage Server and Scheduler instances. This approach keeps the system running smoothly, even during peak usage.
  2. Multiple Executor Containers: Deploy several Executor Containers to handle concurrent task execution. Customize them for specific executors (e.g., Python, PySpark, or AWS) to scale task processing horizontally as needed (see the sketch after this list).
  3. External Load Balancers: Utilize external load balancers to distribute client requests across Mage instances. This not only boosts performance but also ensures high availability by preventing overloading of a single server.
  4. Scaling for Larger Datasets: To efficiently handle larger datasets, consider:

a. Allocating more resources to executors, empowering them to tackle complex data transformations.

b. Transforming data directly in your data warehouse, or using Mage's native Spark integration for massive datasets.
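As a rough illustration of strategy 2, recent Mage versions let you choose an executor per block in the pipeline's metadata.yaml. The sketch below is an assumption-heavy example (the block names are hypothetical, and the exact keys and supported executor types should be checked against the Mage docs for your version): one heavy transformer is pushed onto a Kubernetes executor while the rest of the pipeline stays on the default local Python executor.

blocks:
- uuid: heavy_transformation      # hypothetical block
  type: transformer
  language: python
  executor_type: k8s              # other values include local_python, pyspark, ecs, gcp_cloud_run
  upstream_blocks:
  - load_raw_data                 # hypothetical upstream block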

Features: 

1) Interactive Coding Experience

Mage offers an interactive coding experience tailored for data preparation. Each block in the editor is a modular file that can be tested, reused, and chained together to create an executable data pipeline. This means you can build your data pipeline piece by piece, ensuring reliability and efficiency.
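For instance, a data loader block is just a small Python file built around a decorated function. The sketch below follows the general shape of Mage's block templates; the file path is our own placeholder:

import pandas as pd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_invoices(*args, **kwargs):
    # Placeholder path; any file path or URL readable by pandas would work here.
    return pd.read_csv('/home/src/invoice.csv')

Because each block is its own file with a well-defined input and output, it can be run and tested on its own before being wired into a larger pipeline.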

2) UI/IDE for Building and Managing Data Pipelines

Mage takes data pipeline development to the next level with a user-friendly integrated development environment (IDE). You can build and manage your data pipelines through an intuitive user interface, making the process efficient and accessible to both data scientists and engineers.

3) Support for Multiple Languages

Mage supports writing pipelines in multiple languages such as Python, SQL, and R. This language versatility means you can work with the languages you're most comfortable with, making your data preparation process more efficient.

4) Multiple Types of Pipelines

Mage caters to diverse data pipeline needs. Whether you require standard batch pipelines, data integration pipelines, streaming pipelines, Spark pipelines, or DBT pipelines, Mage has you covered.

5) Built-In Engineering Best Practices

Mage is not just a tool; it's a promoter of good coding practices. It enables reusable code, data validation in each block, and operationalizes data pipelines with built-in observability, data quality monitoring, and lineage. This ensures that your data pipelines are not only efficient but also maintainable and reliable.
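As a small example of per-block validation, a test function can sit in the same file as the block's main function and run automatically against the block's output via Mage's @test decorator. The column names below come from the sample invoices dataset used later in this post:

if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@test
def test_required_columns(output, *args) -> None:
    # Fail the block if it produced no rows or lost a key column.
    assert output is not None and len(output) > 0, 'Block produced no rows'
    for column in ['Product ID', 'Quantity', 'Amount']:
        assert column in output.columns, f'Missing expected column: {column}'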

6) Dynamic Blocks

Dynamic blocks in Mage allow the output of a block to dynamically create additional blocks. These blocks are spawned at runtime, with the total number of blocks created being equal to the number of items in the output data of the dynamic block multiplied by the number of its downstream blocks.
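To make the arithmetic concrete: a dynamic block that emits three items and has one downstream transformer will spawn three runs of that transformer, one per item. The sketch below follows the two-list return convention (data plus per-item metadata) described in Mage's dynamic block documentation; treat the exact signature as an assumption to verify against your version:

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_customer_ids(*args, **kwargs):
    # 3 items x 1 downstream block = 3 dynamically created block runs.
    customers = [{'customer_id': 1}, {'customer_id': 2}, {'customer_id': 3}]
    metadata = [{'block_uuid': f"customer_{c['customer_id']}"} for c in customers]
    return [customers, metadata]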

7) Triggers

  • Schedule Triggers: These triggers allow you to set specific start dates and intervals for pipeline runs. Choose from daily, weekly, or monthly runs, or define custom schedules using cron syntax (see the sketch after this list). Mage's Schedule Triggers put you in control of when your pipelines execute.
  • Event Triggers: With Event Triggers, your pipelines respond instantly to specific events, such as the completion of a database query or the creation of a new object in cloud storage services like Amazon S3 or Google Cloud Storage. Real-time automation at your fingertips.
  • API Triggers: API Triggers enable your pipelines to run in response to specific API calls. Whether it's customer requests or external system interactions, these triggers ensure your data workflows stay synchronized with the digital world.
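Beyond the UI, newer Mage versions also let you keep schedule triggers in code, in a triggers.yaml file inside the pipeline folder. The sketch below is our own hedged example (the trigger name and cron expression are placeholders, and field names may vary slightly between versions); it would run the pipeline every Monday at 06:00:

triggers:
- name: weekly_invoice_refresh          # hypothetical trigger name
  schedule_type: time
  schedule_interval: '0 6 * * 1'        # cron syntax: every Monday at 06:00
  start_datetime: 2024-01-01 00:00:00
  status: active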

Different Types of Blocks:

Data Loader: Within Mage, Data Loaders are ready-made templates designed to seamlessly link up with a multitude of data sources, including Postgres, BigQuery, Redshift, S3, and many others. Additionally, Mage allows for the creation of custom data loaders, enabling connections to APIs. The primary role of Data Loaders is to retrieve data from these designated sources.
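A hedged sketch of such a custom data loader, pulling a CSV from an API instead of a built-in source (the URL is a placeholder):

import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_invoices_from_api(*args, **kwargs):
    # Placeholder endpoint; swap in the real API that serves your data.
    url = 'https://example.com/exports/invoices.csv'
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))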

Data Transformer: Much like Data Loaders, Data Transformers provide predefined functions such as handling duplicates, managing missing data, and excluding specific columns. Alternatively, you can craft your own data transformations or merge outputs from multiple data loaders to preprocess and sanitize the data before it advances through the pipeline.

Data Exporter: Data Exporters within Mage empower you to dispatch data to a diverse array of destinations, including databases, data lakes, data warehouses, or local storage. You can opt for predefined export templates or craft custom exporters tailored to your precise requirements.

Custom Blocks: Custom blocks in the Mage framework are incredibly flexible and serve various purposes. They can store configuration data and facilitate its transmission across different pipeline stages. Additionally, they prove invaluable for logging purposes, allowing you to categorize and visually distinguish log entries for enhanced organization.

Sensor: A Sensor, a specialized block within Mage, continuously assesses a condition until it's met or until a specified time duration has passed. When a block depends on a sensor, it remains inactive until the sensor confirms that its condition has been satisfied. Sensors are especially valuable when there's a need to wait for external dependencies or handle delayed data before proceeding with downstream tasks.
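A minimal sensor sketch, assuming Mage's @sensor decorator and a hypothetical file path to poll: the function is re-evaluated until it returns True, and only then do the blocks that depend on it start running.

import os

if 'sensor' not in globals():
    from mage_ai.data_preparation.decorators import sensor


@sensor
def wait_for_invoice_file(*args, **kwargs) -> bool:
    # Downstream blocks stay idle until this returns True.
    return os.path.exists('/home/src/incoming/invoice.csv')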

Getting Started with Mage

There are two ways to run Mage: using Docker or using pip.

Docker Command

Create a new working directory where all the mage files will be stored.

Then, in that working directory, execute this command:

Windows CMD: 

docker run -it -p 6789:6789 -v %cd%:/home/src mageai/mageai /app/run_app.sh mage start [project_name]

Linux CMD:

docker run -it -p 6789:6789 -v $(pwd):/home/src mageai/mageai /app/run_app.sh mage start [project_name]

Using pip (in the working directory):

pip install mage-ai

mage start [project_name]

You can browse to http://localhost:6789/overview to get to the Mage UI.

Let's build our first pipeline: load an invoices CSV file, apply some useful transformations, and export the data to our local database.

Dataset: an invoices CSV file stored in the current working directory, with the following columns:

 (1) First Name; (2) Last Name; (3) E-mail; (4) Product ID; (5) Quantity; (6) Amount; (7) Invoice Date; (8) Address; (9) City; (10) Stock Code

Create a new pipeline from the dashboard, select the standard batch type (we’ll be implementing a batch pipeline), and give it a unique ID.

Project structure:

├── mage_data
└── [project_name]
    ├── charts
    ├── custom
    ├── data_exporters
    ├── data_loaders
    ├── dbt
    ├── extensions
    ├── pipelines
    │   └── [pipeline_name]
    │       ├── __init__.py
    │       └── metadata.yaml
    ├── scratchpads
    ├── transformers
    ├── utils
    ├── __init__.py
    ├── io_config.yaml
    ├── metadata.yaml
    └── requirements.txt

This project contains all the block files (data loaders, transformers, charts, and so on) along with the pipeline's configuration files, io_config.yaml and metadata.yaml. Each block file holds a function wrapped in one of Mage's built-in decorators (such as @data_loader, @transformer, or @data_exporter), and that function is where we write our code.

1. We begin by loading a CSV file from our local directory, specifically located at /home/src/invoice.csv. To achieve this, we select the "Local File" option from the Templates dropdown and configure the Data Loader block accordingly. Running this configuration will allow us to confirm if the CSV file loads successfully.
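The generated block looks roughly like the sketch below; it follows Mage's local-file template, which is built around the FileIO helper (import paths can shift between releases):

from mage_ai.io.file import FileIO

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_invoice_csv(*args, **kwargs):
    filepath = '/home/src/invoice.csv'
    # FileIO infers the format from the file extension and returns a pandas DataFrame.
    return FileIO().load(filepath)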

2. In the next step, we introduce a Transformer block using a generic template. On the right side of the user interface, we can observe the directed acyclic graph (DAG) tree. To establish the data flow, we set the Data Loader block as the parent of the Transformer block, either by editing the block's upstream dependencies or by connecting the two blocks in the DAG view of the user interface.

The Transformer block operates on the data frame received from the upstream Data Loader block, which is passed as the first argument to the Transformer function.
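A hedged sketch of such a Transformer block, applying a couple of clean-ups our invoices data might need (dropping duplicates and rows without an e-mail), plus an inline test:

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@transformer
def clean_invoices(data, *args, **kwargs):
    # 'data' is the DataFrame returned by the upstream Data Loader block.
    data = data.drop_duplicates()
    data = data.dropna(subset=['E-mail'])
    return data


@test
def test_no_duplicates(output, *args) -> None:
    assert output is not None, 'Transformer returned nothing'
    assert not output.duplicated().any(), 'Duplicate rows survived the clean-up'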

3. Our final step involves exporting the DataFrame to a locally hosted PostgreSQL database. We add a Data Exporter block and connect it to the Transformer block.
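The exporter follows Mage's PostgreSQL template fairly closely; the schema and table names below are our own choices, and the get_repo_path import has moved between modules across Mage versions, so treat the exact paths as assumptions:

from os import path

from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres
from mage_ai.settings.repo import get_repo_path

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_invoices_to_postgres(df, **kwargs) -> None:
    schema_name = 'public'      # hypothetical target schema
    table_name = 'invoices'     # hypothetical target table
    config_path = path.join(get_repo_path(), 'io_config.yaml')

    with Postgres.with_config(ConfigFileLoader(config_path, 'default')) as loader:
        loader.export(df, schema_name, table_name, index=False, if_exists='replace')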

To establish a connection with the PostgreSQL database, configure the database credentials in the io_config.yaml file. Alternatively, these credentials can be supplied through environment variables.
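For reference, the relevant section of io_config.yaml looks roughly like this; the key names follow Mage's default profile, while the values are placeholders (sensitive values can be pulled in through Mage's env_var templating rather than hard-coded):

default:
  POSTGRES_DBNAME: analytics
  POSTGRES_SCHEMA: public
  POSTGRES_USER: mage_user
  POSTGRES_PASSWORD: "{{ env_var('POSTGRES_PASSWORD') }}"
  POSTGRES_HOST: localhost
  POSTGRES_PORT: 5432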

With these steps completed, we have successfully constructed a foundational batch pipeline. This pipeline efficiently loads, transforms, and exports data, serving as a fundamental building block for more advanced data processing tasks.

Mage vs. Other Tools:

Consistency Across Environments: Some orchestration tools may exhibit inconsistencies between local development and production environments due to varying configurations. Mage tackles this challenge by providing a consistent and reproducible workflow environment through a single configuration file that can be executed uniformly across different environments.

Reusability: Achieving reusability in workflows can be complex in some tools. Mage simplifies this by allowing blocks and pipelines to be defined as reusable components within a Mage project, making it easy to share them across projects and teams.

Data Passing: Efficiently passing data between tasks can be a challenge in certain tools, especially when dealing with large datasets. Mage streamlines data passing through straightforward function arguments and returns, enabling seamless data flow and versatile data handling.

Testing: Some tools lack user-friendly testing utilities, resulting in manual testing and potential coverage gaps. Mage simplifies testing with a built-in framework that lets you define test cases, inputs, and expected outputs directly within each block file.

Debugging: Debugging failed tasks can be time-consuming with certain tools. Mage enhances debugging with detailed logs and error messages, offering clear insights into the causes of failures and expediting issue resolution.

Conclusion: 

Mage offers a streamlined and user-friendly approach to data pipeline orchestration, addressing common challenges with simplicity and efficiency. Its single-container deployment, visual interface, and robust features make it a valuable tool for data professionals seeking an intuitive and consistent solution for managing data workflows.


Did you like the blog? If yes, we're sure you'll also like to work with the people who write them - our best-in-class engineering team.

We're looking for talented developers who are passionate about new emerging technologies. If that's you, get in touch with us.

Explore current openings
