Data Processing Pipeline Patterns

Data Processing Pipeline
Data produced by applications, devices, or humans must be processed before it is consumed. By definition, a data pipeline represents the flow of data between two or more systems. It is a set of instructions that determine how and when to move data between these systems.
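To make the definition concrete, here is a minimal sketch of such a set of instructions in Python: it extracts records from one system (assumed here to be a CSV file), applies a simple transformation, and loads them into another system (a JSON-lines file). The file names and the transformation are illustrative assumptions, not a prescribed implementation.

```python
import csv
import json

def extract(path):
    """Read source records from a CSV file (hypothetical source system)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(records):
    """Normalize field names and drop empty rows before loading."""
    for row in records:
        cleaned = {k.strip().lower(): v.strip() for k, v in row.items()}
        if any(cleaned.values()):
            yield cleaned

def load(records, path):
    """Write transformed records to a JSON-lines file (hypothetical target)."""
    with open(path, "w") as f:
        for row in records:
            f.write(json.dumps(row) + "\n")

if __name__ == "__main__":
    # The pipeline's "instructions": how to move the data (extract -> transform -> load)
    # and, implicitly, when (each run of this script).
    load(transform(extract("orders.csv")), "orders.jsonl")
```

In practice, the "when" part of the instructions would typically be handled by a scheduler or an event trigger rather than a manual run.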
My last blog described how connectivity is foundational to a data platform. In this blog, I will describe the different kinds of data processing pipelines and how they leverage capabilities of the data platform, such as connectivity and processing engines.
There are many kinds of data processing pipelines. A pipeline may:
  • “Integrate” data from multiple sources
  • Perform data quality checks or standardize data
  • Apply data security-related transformations such as masking, anonymization, or encryption (a masking sketch follows this list)
  • Match, merge, and master records to perform entity resolution
  • Share data with partners and customers in the required format, such as HL7
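As referenced above, the sketch below shows how two of these steps, data quality checks and masking, might be composed in Python. The field names (customer_id, email, phone), the specific checks, and the hash-based masking are illustrative assumptions, not any particular product's behavior.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_quality(record):
    """Basic data quality checks: required fields present and email well-formed."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    if record.get("email") and not EMAIL_RE.match(record["email"]):
        errors.append("malformed email")
    return errors

def mask_record(record):
    """Mask direct identifiers: hash the email, truncate the phone number."""
    masked = dict(record)
    if masked.get("email"):
        masked["email"] = hashlib.sha256(masked["email"].encode()).hexdigest()[:16]
    if masked.get("phone"):
        masked["phone"] = "***-***-" + masked["phone"][-4:]
    return masked

def process(records):
    """Run quality checks, then mask the records that pass; skip the rest."""
    for record in records:
        if not check_quality(record):
            yield mask_record(record)

if __name__ == "__main__":
    sample = [
        {"customer_id": "42", "email": "jane@example.com", "phone": "555-867-5309"},
        {"customer_id": "", "email": "not-an-email", "phone": ""},
    ]
    for row in process(sample):
        print(row)
```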
Consumers or “targets” of data pipelines may include:
  • Data warehouses like Redshift, Snowflake, SQL data warehouses, or Teradata
  • Reporting tools like Tableau or Power BI
  • Another application in the case of application integration or application migration
  • Data lakes on Amazon S3, Microsoft ADLS, or Hadoop – typically for further exploration
  • Artificial intelligence algorithms
  • Temporary repositories or publish/subscribe queues like Kafka, for consumption by a downstream data pipeline (a minimal publish sketch follows this list)
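For that last target type, here is a minimal publish sketch, assuming the kafka-python client and a locally reachable broker; the topic name and record shape are hypothetical.

```python
import json
from kafka import KafkaProducer  # kafka-python client; broker address below is an assumption

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish(records, topic="cleansed-orders"):  # hypothetical topic name
    """Publish each processed record so a downstream pipeline can consume it."""
    for record in records:
        producer.send(topic, value=record)
    producer.flush()  # block until buffered messages are delivered

if __name__ == "__main__":
    publish([{"customer_id": "42", "status": "cleansed"}])
```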