Posts

Showing posts with the label Streaming

10 Reasons to Choose Apache Pulsar Over Apache Kafka

Image
Apache Pulsar's unique features such as tiered storage, stateless brokers, geo-aware replication, and multi-tenancy may be a reason to choose it over Apache Kafka. Today, many data architects, engineers, dev-ops, and business leaders are struggling to understand the pros and cons of Apache Pulsar and Apache Kafka. As someone who has worked with Kafka in the past, I wanted to compare these two technologies.  If you are looking for insights on when to use Pulsar, here are 10 advantages of the technology that might be the deciding factors for you. Continue reading >>>

ETL and How it Changed Over Time

Image
Modern world data and its usage has drastically changed when compared to a decade ago. There is a gap caused by the traditional ETL processes when processing modern data. The following are some of the main reasons for this:  Modern data processes often include real-time streaming data, and organizations need real-time insights into processes.  The systems need to perform ETL on data streams without using batch processing, and they should handle high data rates by scaling the system. Some single-server databases are now replaced by distributed data platforms ( e.g., Cassandra, MongoDB, Elasticsearch, SAAS apps ), message brokers( e.g., Kafka, ActiveMQ, etc. ) and several other types of endpoints. The system should have the capability to plugin additional sources or sinks to connect on the go in a manageable way. Repeated data processing due to ad hoc architecture has to be eliminated. Change data capture technologies used with traditional ETL has to be integ...

The Forrester Wave™: Streaming Analytics, Q3 2019

Image
Key Takeaways Software AG, IBM, Microsoft, Google, And TIBCO Software Lead The Pack Forrester's research uncovered a market in which Software AG, IBM, Microsoft, Google, and TIBCO Software are Leaders; Cloudera, SAS, Amazon Web Services, and Impetus are Strong Performers; and EsperTech and Alibaba are Contenders. Analytics Prowess, Scalability, And Deployment Freedom Are Key Differentiators Depth and breadth of analytics types on streaming data are critical. But that is all for naught if streaming analytics vendors cannot also scale to handle potentially huge volumes of streaming data. Also, it's critical that streaming analytics can be deployed where it is most needed, such as on-premises, in the cloud, and/or at the edge. Read report >>>

Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (Martin Fowler)

Image
Many enterprises are investing in their next generation data lake, with the hope of democratizing data at scale to provide business insights and ultimately make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. To address these failure modes we need to shift from the centralized paradigm of a lake, or its predecessor data warehouse. We need to shift to a paradigm that draws from modern distributed architecture: considering domains as the first class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product. Becoming a data-driven organization remains one of the top strategic goals of many companies I work with. My clients are well aware of the benefits of becoming intelligently empowered: providing the best customer experience based on data and hyper-personalization; reducing operational costs and time through data-driven optimi...

What’s Behind Lyft’s Choices in Big Data Tech

Image
Lyft was a late entrant to the ride-sharing business model, at least compared to its competitor Uber, which pioneered the concept and remains the largest provider. That delay in starting out actually gave Lyft a bit of an advantage in terms of architecting its big data infrastructure in the cloud, as it was able to sidestep some of the challenges that Uber faced in building out its on-prem system. Lyft and Uber, like many of the young Silicon Valley companies shaking up established business models, aren’t shy about sharing information about their computer infrastructure. They both share an ethos of openness in regards to using and developing technology. That openness is also pervasive at Google, Facebook, Twitter, and other Valley outfits that created much of the big data ecosystem, most of which is, of course, open source. So when the folks at Lyft were blueprinting how to construct a system that could do all the things that a ride-sharing app has to do – tracking and connectin...

Real-Time Stock Processing With Apache NiFi and Apache Kafka

Image
Implementing Streaming Use Case From REST to Hive With Apache NiFi and Apache Kafka Part 1 With Apache Kafka 2.0 and Apache NiFi 1.8, there are many new features and abilities coming out. It's time to put them to the test. So to plan out what we are going to do, I have a high-level architecture diagram. We are going to ingest a number of sources including REST feeds, Social Feeds, Messages, Images, Documents, and Relational Data. We will ingest with NiFi and then filter, process, and segment it into Kafka topics. Kafka data will be in Apache Avro format with schemas specified in the Hortonworks Schema Registry. Spark and NiFi will do additional event processing along with machine learning and deep learning. This will be stored in Druid for real-time analytics and summaries. Hive, HDFS, and S3 will store the data for permanent storage. We will do dashboards with Superset and Spark SQL + Zeppelin. We will also push back cleaned and aggregated data to subscribers via Kafka ...

How Pinterest runs Kafka at scale

Image
Pinterest runs one of the largest Kafka deployments in the cloud. We use Apache Kafka extensively as a message bus to transport data and to power real-time streaming services, ultimately helping more than 250 million Pinners around the world discover and do what they love. As mentioned in an earlier post, we use Kafka to transport data to our data warehouse, including critical events like impressions, clicks, close-ups, and repins. We also use Kafka to transport visibility metrics for our internal services. If the metrics-related Kafka clusters have any glitches, we can’t accurately monitor our services or generate alerts that signal issues. On the real-time streaming side, Kafka is used to power many streaming applications, such as fresh content indexing and recommendation, spam detection and filtering, real-time advertiser budget computation, and so on. We’ve shared out experiences at the Kafka Summit 2018 on incremental db ingestion using Kafka, and building real-time ads pl...

What’s new in Hortonworks DataFlow 3.3?

With the upcoming HDP 3.1 release, we also bring about some exciting innovations to enhance our Kafka offering – New Hive Kafka Storage Handler (for SQL Analytics) – View Kafka topics as tables and execute SQL via Hive with full SQL Support for joins, windowing, aggregations, etc. New Druid Kafka Indexing Service (for OLAP Analytics) – View Kafka topics as cubes and perform OLAP style analytics on streaming events in Kafka using Druid. HDF 3.3 includes the following major innovations and enhancements: Core HDF Enhancements Support for Kafka 2.0, the latest Kafka release in the Apache community, with lots of enhancements into security, reliability and performance. Support for Kafka 2.0 NiFi processors NiFi Connection load balancing – This feature allows for bottleneck connections in the NiFi workflow to spread the queued-up flow files across the NiFi cluster and increase the processing speed and therefore lessen the effect of the bottleneck. MQTT performance improvements inc...

KSQL: The Open Source SQL Streaming Engine for Apache Kafka

The rapidly expanding world of stream processing can be daunting, with new concepts such as various types of time semantics, windowed aggregates, changelogs, and programming frameworks to master. KSQL is an open-source, Apache 2.0 licensed streaming SQL engine on top of Apache Kafka which aims to simplify all this and make stream processing available to everyone. Even though it is simple to use, KSQL is built for mission-critical and scalable production deployments (using Kafka Streams under the hood). Benefits of using KSQL include: no coding required; no additional analytics cluster needed; streams and tables as first-class constructs; access to the rich Kafka ecosystem. This session introduces the concepts and architecture of KSQL. Use cases such as streaming ETL, real-time stream monitoring, and anomaly detection are discussed. A live demo shows how to set up and use KSQL quickly and easily on top of your Kafka ecosystem. Key takeaways: KSQL includes access to the rich...

Processing streams of data with Apache Kafka and Spark

Image
Data Data is produced every second, it comes from millions of sources and is constantly growing. Have you ever thought how much data you personally are generating every day? Data: direct result of our actions There’s data generated as a direct result of our actions and activities: Browsing twitter Using mobile apps Performing financial transactions Using a navigator in your car Booking a train ticket Creating an online document Starting a YouTube live stream Obviously, that’s not it. Data: produced as a side effect For example, performing a purchase where it seems like we’re buying just one thing – might generat...