Posts

Showing posts with the label Stream processing

10 Reasons to Choose Apache Pulsar Over Apache Kafka

Apache Pulsar's unique features, such as tiered storage, stateless brokers, geo-aware replication, and multi-tenancy, may be a reason to choose it over Apache Kafka. Today, many data architects, engineers, DevOps practitioners, and business leaders are struggling to understand the pros and cons of Apache Pulsar and Apache Kafka. As someone who has worked with Kafka in the past, I wanted to compare these two technologies. If you are looking for insights on when to use Pulsar, here are 10 advantages of the technology that might be the deciding factors for you. Continue reading >>>
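
To make the multi-tenancy point concrete, here is a minimal sketch using the Pulsar Java client; the tenant, namespace, and topic names are hypothetical, and the broker URL assumes a local standalone Pulsar.

    import org.apache.pulsar.client.api.Producer;
    import org.apache.pulsar.client.api.PulsarClient;

    public class PulsarTenancySketch {
        public static void main(String[] args) throws Exception {
            // Assumes a local standalone Pulsar broker; names are hypothetical.
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();

            // Tenant and namespace are first-class parts of the topic name;
            // this is how Pulsar's multi-tenancy surfaces to producers and consumers.
            Producer<byte[]> producer = client.newProducer()
                    .topic("persistent://my-tenant/my-namespace/trades")
                    .create();

            producer.send("hello pulsar".getBytes());
            producer.close();
            client.close();
        }
    }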

The Forrester Wave™: Streaming Analytics, Q3 2019

Key Takeaways: Software AG, IBM, Microsoft, Google, and TIBCO Software lead the pack. Forrester's research uncovered a market in which Software AG, IBM, Microsoft, Google, and TIBCO Software are Leaders; Cloudera, SAS, Amazon Web Services, and Impetus are Strong Performers; and EsperTech and Alibaba are Contenders. Analytics prowess, scalability, and deployment freedom are key differentiators: depth and breadth of analytics types on streaming data are critical, but that is all for naught if streaming analytics vendors cannot also scale to handle potentially huge volumes of streaming data. It is also critical that streaming analytics can be deployed where it is most needed, whether on-premises, in the cloud, or at the edge. Read report >>>

Real-Time Stock Processing With Apache NiFi and Apache Kafka

Implementing a Streaming Use Case From REST to Hive With Apache NiFi and Apache Kafka, Part 1. With Apache Kafka 2.0 and Apache NiFi 1.8, there are many new features and abilities coming out. It's time to put them to the test. To plan out what we are going to do, I have a high-level architecture diagram. We are going to ingest a number of sources, including REST feeds, social feeds, messages, images, documents, and relational data. We will ingest with NiFi and then filter, process, and segment the data into Kafka topics. Kafka data will be in Apache Avro format, with schemas specified in the Hortonworks Schema Registry. Spark and NiFi will do additional event processing along with machine learning and deep learning. The results will be stored in Druid for real-time analytics and summaries. Hive, HDFS, and S3 will provide permanent storage. We will build dashboards with Superset and Spark SQL + Zeppelin. We will also push cleaned and aggregated data back to subscribers via Kafka ...
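
As a taste of the Kafka side of this pipeline, here is a minimal Java producer sketch; the broker address, topic name, and JSON payload are hypothetical, and it uses plain string serialization rather than the Avro-plus-Schema-Registry setup the article describes.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class StockProducerSketch {
        public static void main(String[] args) {
            // Hypothetical broker and topic; the real pipeline uses Avro with a schema registry.
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Key by ticker symbol so all events for a stock land in the same partition.
                producer.send(new ProducerRecord<>("stocks", "CLDR",
                        "{\"symbol\": \"CLDR\", \"price\": 12.34}"));
            }
        }
    }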

Event Sourcing with AWS Lambda

This article will explore one possible way to implement the event sourcing pattern in AWS using AWS services. Event sourcing is a pattern that involves saving every state change to your application, allowing you to rebuild the application state from scratch via event playback. When to use: event sourcing adds extra complexity to your application and is overkill for many use cases, but it can be invaluable in the right circumstances. The following are some instances of when you would want to implement the event sourcing pattern:
- When you need an audit log of every event in the system or microservice, as opposed to only the current state of the system.
- When you need to be able to replay events after code bug fixes to fix data.
- When you need to be able to reverse events.
- When you want to use the event log for debugging.
- When it's valuable to expose the event log to support personnel to fix account-specific issues by knowing exactly how the account got into a compromised state. ...
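
To illustrate the core idea independent of any AWS service, here is a minimal in-memory Java sketch; the event and account types are hypothetical stand-ins for whatever durable store (DynamoDB, Kinesis, etc.) would hold the log in practice.

    import java.util.ArrayList;
    import java.util.List;

    public class EventSourcingSketch {
        // Hypothetical event type; in practice events would be persisted to a durable log.
        record Deposited(String accountId, long amountCents) {}

        public static void main(String[] args) {
            // The append-only event log is the source of truth, not the current state.
            List<Deposited> eventLog = new ArrayList<>();
            eventLog.add(new Deposited("acct-1", 10_000));
            eventLog.add(new Deposited("acct-1", 5_000));

            // Rebuild state from scratch via event playback.
            long balance = eventLog.stream()
                    .filter(e -> e.accountId().equals("acct-1"))
                    .mapToLong(Deposited::amountCents)
                    .sum();
            System.out.println("acct-1 balance: " + balance); // 15000
        }
    }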

What’s new in Hortonworks DataFlow 3.3?

With the upcoming HDP 3.1 release, we also bring some exciting innovations to enhance our Kafka offering:
- New Hive Kafka Storage Handler (for SQL analytics): view Kafka topics as tables and execute SQL via Hive, with full SQL support for joins, windowing, aggregations, etc.
- New Druid Kafka Indexing Service (for OLAP analytics): view Kafka topics as cubes and perform OLAP-style analytics on streaming events in Kafka using Druid.
HDF 3.3 includes the following major innovations and enhancements:
- Core HDF enhancements: support for Kafka 2.0, the latest Kafka release in the Apache community, with many enhancements to security, reliability, and performance.
- Support for Kafka 2.0 NiFi processors.
- NiFi connection load balancing: this feature allows bottleneck connections in a NiFi workflow to spread their queued-up flow files across the NiFi cluster, increasing processing speed and thereby lessening the effect of the bottleneck.
- MQTT performance improvements inc...

Processing streams of data with Apache Kafka and Spark

Data is produced every second; it comes from millions of sources and is constantly growing. Have you ever thought about how much data you personally generate every day? Data as a direct result of our actions: there's data generated directly by our actions and activities, for example:
- Browsing Twitter
- Using mobile apps
- Performing financial transactions
- Using a navigator in your car
- Booking a train ticket
- Creating an online document
- Starting a YouTube live stream
Obviously, that's not all. Data produced as a side effect: for example, performing a purchase, where it seems like we're buying just one thing, might generat...
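
For a sense of where the post is headed, here is a minimal sketch of reading such an event stream with Spark Structured Streaming; the broker address and topic name are hypothetical.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class KafkaStreamSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("kafka-stream-sketch")
                    .getOrCreate();

            // Hypothetical broker and topic; each Kafka record arrives as a row
            // with key/value byte columns plus topic, partition, and offset metadata.
            Dataset<Row> events = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "localhost:9092")
                    .option("subscribe", "user-actions")
                    .load();

            // Echo the raw stream to the console while developing.
            events.selectExpr("CAST(value AS STRING)")
                    .writeStream()
                    .format("console")
                    .start()
                    .awaitTermination();
        }
    }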

Improving Netflix’s Operational Visibility with Real-Time Insight Tools

For Netflix to be successful, we have to be vigilant in supporting the tens of millions of connected devices that are used by our 40+ million members throughout 40+ countries. These members consume more than one billion hours of content every month and account for nearly a third of the downstream Internet traffic in North America during peak hours. From an operational perspective, our system environments at Netflix are large, complex, and highly distributed. At our scale, humans cannot continuously monitor the status of all of our systems. To maintain high availability across such a complicated system, and to help us continuously improve the experience for our customers, it is critical for us to have exceptional tools coupled with intelligent analysis to proactively detect and communicate system faults and identify areas of improvement. In this post, we will talk about our plans to build a new set of insight tools and systems that create greater visibility int...

KSQL, the new streaming SQL engine for Apache Kafka

The recently introduced KSQL, the streaming SQL engine for Apache Kafka, substantially lowers the bar to entry for the world of stream processing. Instead of writing a lot of programming code, all you need to get started with stream processing is a simple SQL statement, such as: SELECT * FROM payments_kafka_stream WHERE fraud_probability > 0.8; That's it! And while this might not be immediately obvious, the above streaming query of KSQL is distributed, scalable, elastic, and real-time to meet the data needs of businesses today. Of course, you can do much more with KSQL than I have shown in the simple example above. KSQL is open source (Apache 2.0 licensed) and built on top of Kafka's Streams API. This means it supports a wide range of powerful stream processing operations, including filtering, transformations, aggregations, joins, windowing, and sessionization. This way you can detect anomalies and fraudulent activities in real time, monitor infrastructure and...
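
Since KSQL is built on top of Kafka's Streams API, the same filter can be written directly against that API. Here is a rough Java equivalent; the topic names and the Payment value type are hypothetical.

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;

    // Rough Kafka Streams equivalent of:
    //   SELECT * FROM payments_kafka_stream WHERE fraud_probability > 0.8;
    // Payment is a hypothetical value type exposing a fraud probability field.
    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, Payment> payments = builder.stream("payments_kafka_stream");
    payments
            .filter((key, payment) -> payment.getFraudProbability() > 0.8)
            .to("suspicious_payments");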

Gartner Hype Cycle for Data Science and Machine Learning, 2017

The hype around data science and machine learning has increased from already high levels in the past year. Data and analytics leaders should use this Hype Cycle to understand technologies generating excitement and inflated expectations, as well as significant movements in adoption and maturity. The Hype Cycle: The Peak of Inflated Expectations is crowded and the Trough of Disillusionment remains sparse, though several highly hyped technologies are beginning to hear the first disillusioned rumblings from the market. In general, the faster a technology moves from the Innovation Trigger to the peak, the faster it moves into the trough, as organizations quickly come to see it as just another passing fad. This Hype Cycle is especially relevant to data and analytics leaders, chief data officers, and heads of data science teams who are implementing machine-learning programs and looking to understand next-generation innovations. Technology provider product marketers and strategists...

Apache Beam: A unified model for batch and stream processing data

Apache Beam, a new distributed processing tool currently being incubated at the ASF, provides an abstraction layer that lets developers focus on their processing logic, expressed in the Beam programming model. Thanks to Apache Beam, an implementation is agnostic to the runtime technologies being used, meaning you can switch technologies quickly and easily. The Beam programming model is also unified in terms of coverage, allowing developers to implement both batch and streaming data processing. That is actually where the Apache Beam name comes from: B (for Batch) and EAM (for strEAM). To implement your data processes using the Beam programming model, you use an SDK or DSL provided by Beam. For now, there is really only one SDK: the Java SDK. However, a Python SDK is expected to be released, and Beam will provide a Scala SDK and additional DSLs (a declarative DSL with XML, for instance) soon. With Apache Beam, first ...
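
Here is a minimal sketch of what a pipeline looks like with the Beam Java SDK; the file paths are hypothetical, and the runner is chosen via pipeline options rather than in the code, which is what makes the same pipeline portable across runtimes.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Filter;

    public class BeamPipelineSketch {
        public static void main(String[] args) {
            // The runner (Spark, Flink, Dataflow, direct, ...) comes from the options,
            // not the pipeline code, so the same code runs on different engines.
            PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
            Pipeline p = Pipeline.create(options);

            p.apply(TextIO.read().from("input.txt"))               // hypothetical path
             .apply(Filter.by((String line) -> !line.isEmpty()))   // drop blank lines
             .apply(TextIO.write().to("output"));                  // hypothetical prefix

            p.run().waitUntilFinish();
        }
    }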

Why Apache Beam? A Google Perspective

When we made the decision (in partnership with data Artisans, Cloudera, Talend, and a few other companies) to move the Google Cloud Dataflow SDK and runners into the Apache Beam incubator project, we did so with the following goal in mind: provide the world with an easy-to-use but powerful model for data-parallel processing, both streaming and batch, portable across a variety of runtime platforms. Now that the dust from the initial code drops is starting to settle, we wanted to talk briefly about why this makes sense for us at Google and how we got here, given that Google hasn't historically been directly involved in the OSS world of data processing. Why does this make sense for Google? Google is a business, and as such, it should come as no surprise that there's a business motivation for us behind the Apache Beam move. That motivation hinges primarily on the desire to get as many Apache Beam pipelines as possible running on Cloud Dataflow. Given that, it may not seem intu...