Showing posts with the label Data Ingestion

10 Reasons to Choose Apache Pulsar Over Apache Kafka

Apache Pulsar's unique features, such as tiered storage, stateless brokers, geo-aware replication, and multi-tenancy, may be reasons to choose it over Apache Kafka. Today, many data architects, engineers, DevOps practitioners, and business leaders struggle to understand the pros and cons of Apache Pulsar and Apache Kafka. As someone who has worked with Kafka in the past, I wanted to compare the two technologies. If you are looking for insights on when to use Pulsar, here are 10 advantages of the technology that might be the deciding factors for you. Continue reading >>>

Data Processing Pipeline Patterns

Data produced by applications, devices, or humans must be processed before it is consumed. By definition, a data pipeline represents the flow of data between two or more systems. It is a set of instructions that determine how and when to move data between these systems. My last blog conveyed how connectivity is foundational to a data platform. In this blog, I will describe the different data processing pipelines that leverage different capabilities of the data platform, such as connectivity and data engines for processing. There are many data processing pipelines. One may:
- "Integrate" data from multiple sources
- Perform data quality checks or standardize data
- Apply data security-related transformations, which include masking, anonymizing, or encryption
- Match, merge, master, and do entity resolution
- Share data with partners and customers in the required format, such as HL7
Consumers or "targets" of data pipelines may include:
- Data warehouses like ...
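To make the pattern concrete, here is a minimal sketch (not from the post) of a pipeline as an ordered list of processing steps applied to each record. The field names and the two steps, a data-quality standardization and a masking transformation, are hypothetical examples of the pipeline types listed above.

```python
# Illustrative sketch: a data pipeline as an ordered sequence of steps.
# Field names ("name", "email") are hypothetical.

def standardize(record):
    """Data quality step: trim whitespace and normalize casing."""
    return {k: v.strip().lower() if isinstance(v, str) else v
            for k, v in record.items()}

def mask_email(record):
    """Security step: mask the local part of an email address."""
    if "email" in record and "@" in record["email"]:
        _, domain = record["email"].split("@", 1)
        record = {**record, "email": "***@" + domain}
    return record

def run_pipeline(records, steps):
    """Apply each step to every record, in order."""
    for record in records:
        for step in steps:
            record = step(record)
        yield record

source = [{"name": "  Alice ", "email": "alice@example.com"}]
results = list(run_pipeline(source, [standardize, mask_email]))
print(results)  # [{'name': 'alice', 'email': '***@example.com'}]
```

The same structure accommodates the other patterns above: an entity-resolution or format-conversion step is just another function appended to the list.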

Gartner Market Guide for Data Preparation Tools 2019

Data preparation tools have matured from initially being self-service-focused to now supporting data integration, analytics, and data science use cases in production. Data and analytics leaders must use this research to understand the dynamics of, and popular vendors in, this rapidly evolving market.

Key Findings:
- The market for data preparation tools has evolved from being able to support only self-service use cases. Modern data preparation tools now enable data and analytics teams to build agile datasets at an enterprise scale, for a range of distributed content authors.
- The market for data preparation tools remains crowded and complex. The choices range from stand-alone specialists to vendors that embed data preparation, as a key capability, into their broader analytics/BI, data science, or data integration tools.
- While most data preparation tool capabilities have been maturing at a steady state, organizations continue to cite "operationalization", the ability to promote ...

Self-Service Data Preparation: Research to Practice

The story of self-service data preparation and the academic research behind Trifacta, which is also available as a SaaS offering in GCP (Dataprep): http://sites.computer.org/debull/A18june/p23.pdf

Processing streams of data with Apache Kafka and Spark

Data
Data is produced every second; it comes from millions of sources and is constantly growing. Have you ever thought about how much data you personally generate every day?

Data: direct result of our actions
There's data generated as a direct result of our actions and activities:
- Browsing Twitter
- Using mobile apps
- Performing financial transactions
- Using a navigator in your car
- Booking a train ticket
- Creating an online document
- Starting a YouTube live stream
Obviously, that's not it.

Data: produced as a side effect
For example, performing a purchase where it seems like we're buying just one thing – might generat...
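The processing model the post builds toward can be sketched without Kafka or Spark installed: consume events one at a time and maintain a running aggregate, instead of waiting for a complete dataset. This is a self-contained illustration only; the event sources below are hypothetical, and in the post itself Kafka carries the events while Spark performs the aggregation.

```python
# Minimal sketch of incremental stream processing (hypothetical events).
from collections import Counter

def event_stream():
    """Stand-in for a Kafka topic: yields events as they are produced."""
    for source in ["mobile_app", "twitter", "mobile_app", "payments"]:
        yield {"source": source}

def count_by_source(stream):
    """Stand-in for a streaming job: update per-source counts one event
    at a time, never materializing the whole stream."""
    counts = Counter()
    for event in stream:
        counts[event["source"]] += 1
    return counts

print(dict(count_by_source(event_stream())))
# {'mobile_app': 2, 'twitter': 1, 'payments': 1}
```

In a real deployment the loop never ends, which is exactly why the aggregate must be updated incrementally rather than computed in one batch.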