Starting Data Pipelines | Fundamentals of Data Engineering

Data Engineering

This article includes a comprehensive introduction with step-by-step definitions and code in data pipelines to introduce the basics of data engineering. Data pipelines are widely used in data science and machine learning and are essential in the process of machine learning to integrate data from multiple streams to gain business intelligence for competitive and profitable analysis.

What is a Data Pipeline?

Data pipeline is a set of rules that motivates and converts data from multiple sources to an area where new values ​​can be obtained. In the simplest way, the pipeline can only extract data from various sources such as REST API, websites, feeds, live streams, and so on. This is loaded locally, like a SQL table in a database. Data pipelines are the basis for statistics, reporting, and machine learning capabilities.

Data pipelines are constructed with multiple steps such as data extraction, data preprocessing, data validation, data storage, and others. They can be developed by using multiple programming languages and tools.

Well-designed data pipelines do more than just extract data from sources and load them into controllable website tables or flat files for analysts to use. They do a few steps with raw data, including cleaning, structure, familiarity, merging, merging, and more. The data pipeline also requires other functions such as monitoring, maintenance, development, and support for various infrastructure.

Data pipelines workflow.

ETL vs ELT

There may be no pattern known as ETL and its modern sibling, ELT. Both patterns are widely used in data storage and business intelligence. They encouraged data science pipeline patterns and machine learning models that have been in production for the past few years.
Both patterns are a data processing method used to feed data in a data warehouse and to make data useful to analysts and reporting tools. The extraction step collects data from various sources for upload and modification. The upload step brings raw data (in the case of ELT) or completely converted data (in the case of ETL) into a repository – a conversion step, where raw data from each source system is compiled and formatted to assist analysts. , visual aids, and any use of our active pipeline.
The combination of extraction and loading steps is often called data import.

Data Ingestion and Its Interfaces

The term data import refers to the transfer of data from one source to another. Data entry occurs in real time, in groups, or in combination (usually called lambda architecture).
When data is imported into batches, the task is organized periodically. It interacts with and communicates with multiple sources, requiring several different types of interaction with data structures.
The following are the most common import links with data formats:

  • Stream Processing Platforms: RabbitMQ, Kafka.
  • Databases: Postgres, MySQL, HDFS, or HBase database.
  • Data warehouse or data lake.
  • JSON, CSV, REST API.
  • Shared network file system or a cloud storage bucket.
 Different technology stacks.

Extracting Data from MySQL Databases

We can extract data from a MySQL database in a couple of different ways:

  • Full or incremental extraction using SQL
  • Binary Log (also known as binlog) replication

Full or Incremental Extraction using SQL

Full or incremental output using SQL is straightforward to use but very limited to large databases with common changes.
If we need to import an entire or a set of columns from a MySQL table to a data repository or data pool, we may use a full domain or additional extraction.

  • Every record in the table is extracted on each run of the extraction job.
  • High-volume tables can take a long time to run.
SELECT * FROM Customers

Binary Log (binlog) Replication

Binary log duplication is very complex to process, it is best suited in situations where the volume of data changes in source tables is high, or there is a need for standard data import from MySQL source. It is also a way to create streaming data entry [2]

SQL in a word cloud

Pipeline Orchestration

Orchestration ensures that the pipeline steps are carried out properly and that the dependence between these steps is properly controlled. Pipeline steps are always directed, which means they start with a lot of work or activities and end with a specific task or activities. Such is necessary to confirm the execution method. In other words, it ensures that tasks do not perform before all dependent tasks are successfully completed.
Pipe graphs must also be acyclic, meaning that the function cannot point back to a previously completed task. In other words, he can’t go back on the bike. If possible, the pipe can run indefinitely.

 Steps in a data pipeline.

For example, Apache Airflow is an open source data orchestrator tool designed to solve the day-to-day challenges that teams of data developers face: how to build, manage, and monitor workflows that include many highly dependent tasks.
Airflow has excellent configuration options, such as:

  • Schedulers.
  • Executors.
  • Operators.

A graph view of the ETL DAG(Airflow):

DAG steps.

DAG code implementation:

from datetime import timedelta
from airflow import DAG
from airflow.operators.bash_operator \
  import BashOperator
from airflow.operators.postgres_operator \
  import PostgresOperator
from airflow.utils.dates import days_ago

dag = DAG(
    'elt_pipeline_sample',
    description='A sample ELT pipeline',
    schedule_interval=timedelta(days=1),
    start_date = days_ago(1),
)

extract_orders_task = BashOperator(
    task_id='extract_order_data',
    bash_command='python /p/extract_orders.py',
    dag=dag,
)

extract_customers_task = BashOperator(
    task_id='extract_customer_data',
    bash_command='python /p/extract_customers.py',
    dag=dag,
)

load_orders_task = BashOperator(
    task_id='load_order_data',
    bash_command='python /p/load_orders.py',
    dag=dag,
)

load_customers_task = BashOperator(
    task_id='load_customer_data',
    bash_command='python /p/load_customers.py',
    dag=dag,
)

revenue_model_task = PostgresOperator(
    task_id='build_data_model',
    postgres_conn_id='redshift_dw',
    sql='/sql/order_revenue_model.sql',
    dag=dag,
)

extract_orders_task >> load_orders_task
extract_customers_task >> load_customers_task
load_orders_task >> revenue_model_task
load_customers_task >> revenue_model_task

Data Validation in Pipelines

Indeed, even with well-designed data pipelines, something can go wrong. A few problems can be prevented, or at least minimized, by good process design, orchestration, and infrastructure. Data verification is also a necessary step in ensuring data quality and validity because untested data is not safe to use in statistics.
Detecting a data quality problem at the end of a pipe and tracing it back to the beginning is a very serious situation. By confirming each step in the pipeline, we are more likely to find the origin of the current step than the previous one.
In the development of quality issues in the source system, there is the possibility of the data import process itself leading to a data quality problem. These are just a few of the common data import risks or import measures

  • A logical error in incremental ingestions.
  • Parsing issues in an extracted file.
Data validation steps

Verifying the data at each step of the pipeline is important. Even if the source data exceeds the verification from the beginning, it is always a good practice to start validating the data models built at the end of the pipeline.
The following steps show how to evaluate this process:

  • Assuring a metric is within certain lower and upper bounds.
  • Reviewing row count growth (or reduction) in the data model.
  • Checking to see if there is unexpected inconstancy in the value of a particular metric.

Python scripts can be written to perform data validation, also there are several frameworks available:

  • dbt: It is a command-line tool that enables data analysts and engineers to transform data in their warehouse more effectively.
  • Tensor Flow Data Validation: TensorFlow Data Verification identifies any confusion in data entry by comparing data statistics with schema.
  • Yahoo’s Validator

TensorFlow data validation (TFDV) can analyze training and serving to:

Best Practices to Maintain Data Pipelines

There are a few challenges to maintaining a data pipeline. One of the most popular challenges for data development developers is addressing the fact that the systems in which they import data are not static. Developers often make changes to their operating system, adding features, updating codebase, or fixing bugs.
If those changes change the schema or definition of data to be entered, the pipe is at risk of failure or accuracy. The reality of modern data infrastructure is that data is derived from many different sources. It is difficult to find a one-size-fits-all solution to manage schema and business mindset changes in resource systems. Therefore best practices are needed to build a measurable data pipeline.
Here are some of the best ways to recommend:

  • Add abstraction.
  • Maintain data contract.
  • Move from a schema-on-write design to schema-on-read.

Why Add Abstraction?

It is very useful to install an access layer between the source system and the import process whenever possible. It is also necessary for the owner of the source system to take care of or monitor the extraction method.
For example:
Instead of entering data directly from the Postgres website, consider working with a website owner to create a REST API that draws on the site and can be asked to extract data.

Why Maintain Data Contract?

f we import data directly from the source system website or in a way that is not explicitly designed for our issuance, creating and maintaining a data contract is a solution under schema management technology and conceptual changes.
Basically, a data contract is a written agreement between the source system owner and the importing team from that system for use in the data line. The data contract can be written in text, but the best is in a standard configuration file.

Example:

{
  ingestion_jobid: "orders_postgres",
  source_host: "my_host.com",
  source_db: "ecommerce",
  source_table: "orders",
  ingestion_type: "full",
  ingestion_frequency_minutes: "60",
  source_owner: "dev-team@mycompany.com",
  ingestion_owner: "data-eng@mycompany.com"
};

Move from a schema-on-write Design to Schema-on-Read

Schema-on-read pattern where data is written in a data pool, S3 bucket, or other storage systems that do not have a solid schema.
For example:
An event that defines an order entered into a system may be defined as a JSON item, but features of that item may change over time as new ones are added, or existing ones are removed.
So, in this case, the data schema is not known until it is read, hence it is called schema-on-read.
Differences between schema-on-read and schema-on-write:

Schema-on-Write

  • Fast reads.
  • Slower loads.
  • Not agile.
  • Structured.
  • Fewer errors.
  • SQL.

Schema-on-Read

  • Slower reads.
  • Fast loads.
  • Structured/Unstructured.
  • More errors.
  • NoSQL.

Standardizing Data Ingestion

When it comes to complexity, the number of programs we import is usually smaller than the fact that each system is completely different.
There are two challenges to repairing pipelines:

  • Ingestion should be written to manage a combination of source system types (Postgres, Kafka, etc.). The types of additional resources we need to access, our main codebase and its storage.
  • Ingestion Jobs of the same source system type cannot be simplified. For example, even if we only feed on REST APIs, if those APIs do not have standard encryption methods, incremental data access, and other features, data engineers may create “single” import functions that do not re-code and share logically. can be stored in one place.

Here are some approaches to control:

  • Standardize whatever code we can, and reuse.
  • Strive for config-driven data ingestions.
  • Consider our abstractions.

Conclusion

Data pipeline sets rules that move and convert data from different sources to a place where new content can be found. They are the result of mathematical, reporting, and machine learning capabilities.


The complexity of a data pipeline depends on the size, shape, and structure of the source data and the requirements of the mathematical project. In the simplest way, pipelines can extract data from only one source, such as the REST API, and load it into targets such as a SQL table in a data repository.


The data developer policy is not just about loading data into a database. Data engineers work closely with data scientists and analysts to determine what the data will gain and help bring their needs into a critical production environment.


Data engineers take pride in ensuring the accuracy and timeliness of the data they deliver. That means checking, warning, and creating emergency plans in case something goes wrong. Plus, yes, something is going to go wrong in the end!

Important Notice for college students

If you’re a college student and have skills in programming languages, Want to earn through blogging? Mail us at geekycomail@gmail.com

For more Programming related blogs Visit Us Geekycodes . Follow us on Instagram.

By geekycodesco

Leave a Reply

%d bloggers like this: