Data Processing Pipeline Patterns

Data Processing Pipeline
Data produced by applications, devices, or humans must be processed before it is consumed. By definition, a data pipeline represents the flow of data between two or more systems. It is a set of instructions that determine how and when to move data between these systems.
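To make the definition concrete, here is a minimal sketch of such a set of instructions in Python: it extracts records from one system (assumed here to be a CSV file), applies a simple transformation, and loads them into another system (a JSON-lines file). The file names and the transformation are illustrative assumptions, not a prescribed implementation.

```python
import csv
import json

def extract(path):
    """Read source records from a CSV file (hypothetical source system)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(records):
    """Normalize field names and drop empty rows before loading."""
    for row in records:
        cleaned = {k.strip().lower(): v.strip() for k, v in row.items()}
        if any(cleaned.values()):
            yield cleaned

def load(records, path):
    """Write transformed records to a JSON-lines file (hypothetical target)."""
    with open(path, "w") as f:
        for row in records:
            f.write(json.dumps(row) + "\n")

if __name__ == "__main__":
    # The pipeline's "instructions": how to move the data (extract -> transform -> load)
    # and, implicitly, when (each run of this script).
    load(transform(extract("orders.csv")), "orders.jsonl")
```

In practice, the "when" part of the instructions would typically be handled by a scheduler or an event trigger rather than a manual run.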
My last blog described how connectivity is foundational to a data platform. In this blog, I will describe the different kinds of data processing pipelines and how they leverage capabilities of the data platform, such as connectivity and processing engines.
There are many kinds of data processing pipelines. A pipeline may:
  • “Integrate” data from multiple sources
  • Perform data quality checks or standardize data
  • Apply data security-related transformations such as masking, anonymization, or encryption (a masking sketch follows this list)
  • Match, merge, and master records to perform entity resolution
  • Share data with partners and customers in the required format, such as HL7
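As referenced above, the sketch below shows how two of these steps, data quality checks and masking, might be composed in Python. The field names (customer_id, email, phone), the specific checks, and the hash-based masking are illustrative assumptions, not any particular product's behavior.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_quality(record):
    """Basic data quality checks: required fields present and email well-formed."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    if record.get("email") and not EMAIL_RE.match(record["email"]):
        errors.append("malformed email")
    return errors

def mask_record(record):
    """Mask direct identifiers: hash the email, truncate the phone number."""
    masked = dict(record)
    if masked.get("email"):
        masked["email"] = hashlib.sha256(masked["email"].encode()).hexdigest()[:16]
    if masked.get("phone"):
        masked["phone"] = "***-***-" + masked["phone"][-4:]
    return masked

def process(records):
    """Run quality checks, then mask the records that pass; skip the rest."""
    for record in records:
        if not check_quality(record):
            yield mask_record(record)

if __name__ == "__main__":
    sample = [
        {"customer_id": "42", "email": "jane@example.com", "phone": "555-867-5309"},
        {"customer_id": "", "email": "not-an-email", "phone": ""},
    ]
    for row in process(sample):
        print(row)
```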
Consumers or “targets” of data pipelines may include:
  • Data warehouses like Redshift, Snowflake, SQL data warehouses, or Teradata
  • Reporting tools like Tableau or Power BI
  • Another application in the case of application integration or application migration
  • Data lakes on Amazon S3, Microsoft ADLS, or Hadoop – typically for further exploration
  • Artificial intelligence algorithms
  • Temporary repositories or publish/subscribe queues like Kafka, for consumption by a downstream data pipeline (a minimal publish sketch follows this list)
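For that last target type, here is a minimal publish sketch, assuming the kafka-python client and a locally reachable broker; the topic name and record shape are hypothetical.

```python
import json
from kafka import KafkaProducer  # kafka-python client; broker address below is an assumption

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish(records, topic="cleansed-orders"):  # hypothetical topic name
    """Publish each processed record so a downstream pipeline can consume it."""
    for record in records:
        producer.send(topic, value=record)
    producer.flush()  # block until buffered messages are delivered

if __name__ == "__main__":
    publish([{"customer_id": "42", "status": "cleansed"}])
```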