Posts

Showing posts with the label Real-time

10 Reasons to Choose Apache Pulsar Over Apache Kafka

Apache Pulsar's distinctive features, such as tiered storage, stateless brokers, geo-aware replication, and multi-tenancy, may be reasons to choose it over Apache Kafka. Today, many data architects, engineers, DevOps practitioners, and business leaders are struggling to understand the pros and cons of Apache Pulsar and Apache Kafka. As someone who has worked with Kafka in the past, I wanted to compare these two technologies. If you are looking for insights on when to use Pulsar, here are 10 advantages of the technology that might be the deciding factors for you.

ETL and How it Changed Over Time

Data and the way it is used have changed drastically compared to a decade ago. Traditional ETL processes leave a gap when they are applied to modern data. The following are some of the main reasons for this: Modern data processes often include real-time streaming data, and organizations need real-time insights into processes. The systems need to perform ETL on data streams without using batch processing, and they should handle high data rates by scaling the system. Some single-server databases have been replaced by distributed data platforms (e.g., Cassandra, MongoDB, Elasticsearch, SaaS apps), message brokers (e.g., Kafka, ActiveMQ), and several other types of endpoints. The system should be able to plug in additional sources or sinks on the go in a manageable way. Repeated data processing due to ad hoc architecture has to be eliminated. Change data capture technologies used with traditional ETL have to be integ...

The Future of Data Engineering

Data engineering’s job is to help an organization move and process data. This generally requires two different systems, broadly speaking: a data pipeline and a data warehouse. The data pipeline is responsible for moving the data, and the data warehouse is responsible for processing it. I acknowledge that this is a bit overly simplistic. You can do processing in the pipeline itself by doing transformations between extraction and loading with batch and stream processing. The “data warehouse” now includes many storage and processing systems (Flink, Spark, Presto, Hive, BigQuery, Redshift, etc.), as well as auxiliary systems such as data catalogs, job schedulers, and so on. Still, I believe the paradigm holds. The industry is working through changes in how these systems are built and managed. There are four areas, in particular, where I expect to see shifts over the next few years.
Timeliness: From batch to real time
Connectivity: From one:one bespoke integrations to many:many
Cen...

What’s new in Hortonworks DataFlow 3.3?

With the upcoming HDP 3.1 release, we are also bringing some exciting innovations to enhance our Kafka offering:
New Hive Kafka Storage Handler (for SQL analytics): view Kafka topics as tables and execute SQL via Hive, with full SQL support for joins, windowing, aggregations, etc.
New Druid Kafka Indexing Service (for OLAP analytics): view Kafka topics as cubes and perform OLAP-style analytics on streaming events in Kafka using Druid.
HDF 3.3 includes the following major innovations and enhancements:
Core HDF Enhancements
Support for Kafka 2.0, the latest Kafka release in the Apache community, with many enhancements to security, reliability, and performance.
Support for Kafka 2.0 NiFi processors.
NiFi connection load balancing: this feature allows bottleneck connections in the NiFi workflow to spread the queued-up flow files across the NiFi cluster, which increases processing speed and lessens the effect of the bottleneck.
MQTT performance improvements inc...

Improving Netflix’s Operational Visibility with Real-Time Insight Tools

For Netflix to be successful, we have to be vigilant in supporting the tens of millions of connected devices that are used by our 40+ million members throughout 40+ countries. These members consume more than one billion hours of content every month and account for nearly a third of the downstream Internet traffic in North America during peak hours. From an operational perspective, our system environments at Netflix are large, complex, and highly distributed. And at our scale, humans cannot continuously monitor the status of all of our systems. To maintain high availability across such a complicated system, and to help us continuously improve the experience for our customers, it is critical for us to have exceptional tools coupled with intelligent analysis to proactively detect and communicate system faults and identify areas of improvement. In this post, we will talk about our plans to build a new set of insight tools and systems that create greater visibility int...

Balance Between Collecting Data and Connecting to Data

Because data is the most valuable resource in the digital business era, collecting it using only a centralized management approach is no longer viable. Data and analytics leaders need to take an aggressive approach that creates an appropriate balance between data collection and data connection.
Key Challenges
Data is distributed between cloud and premises, and hybrid deployments are becoming the default approach.
The scale and pace of creation of data, as well as the need to harness it in real time, make it impossible to always collect data and then process it for a single value proposition or use case.
As organizations prioritize operational efficiency and analytics, these two forces are making organizations rethink their data management strategies and investments.
Data governance and regulatory requirements need to span all use cases, and data distribution is further challenging centralized data governance approaches.
Deploying different data management ...

KSQL the new streaming SQL engine for Apache Kafka

The recently introduced KSQL, the streaming SQL engine for Apache Kafka, substantially lowers the bar to entry for the world of stream processing. Instead of writing a lot of programming code, all you need to get started with stream processing is a simple SQL statement, such as:
SELECT * FROM payments-kafka-stream WHERE fraud_probability > 0.8
That’s it! And while this might not be immediately obvious, the above streaming query of KSQL is distributed, scalable, elastic, and real time to meet the data needs of businesses today. Of course, you can do much more with KSQL than I have shown in the simple example above. KSQL is open source (Apache 2.0 licensed) and built on top of Kafka’s Streams API. This means it supports a wide range of powerful stream processing operations, including filtering, transformations, aggregations, joins, windowing, and sessionization. This way you can detect anomalies and fraudulent activities in real time, monitor infrastructure and...
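To make the semantics of that filter concrete, here is a minimal plain-Python sketch of what the query does to each record as it arrives. It is not KSQL and does not talk to Kafka; the function name `filter_suspicious`, the record layout, and the sample payments are all hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, Iterator

Payment = Dict[str, float]


def filter_suspicious(payments: Iterable[Payment],
                      threshold: float = 0.8) -> Iterator[Payment]:
    """Yield each payment whose fraud_probability exceeds the threshold.

    Records are handled one at a time as they arrive, which mirrors the
    continuous (non-batch) nature of a streaming WHERE clause.
    """
    for payment in payments:
        if payment["fraud_probability"] > threshold:
            yield payment


# Hypothetical sample records standing in for events on a payments stream.
sample = [
    {"id": 1, "fraud_probability": 0.10},
    {"id": 2, "fraud_probability": 0.93},
    {"id": 3, "fraud_probability": 0.80},
    {"id": 4, "fraud_probability": 0.85},
]

flagged = list(filter_suspicious(sample))
print([p["id"] for p in flagged])  # prints [2, 4]
```

Because the function is a generator, it can consume an unbounded iterator of records just as well as a finite list. What this sketch leaves out is exactly what the excerpt credits KSQL with providing: distribution, scaling, and elasticity.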

Apache Flink: API, runtime, and project roadmap

Detailed presentation on Apache Flink: https://www.slideshare.net/KostasTzoumas/apache-flink-api-runtime-and-project-roadmap