Posts

Showing posts from December, 2018

Design Thinking, Lean Startup and Agile

What is the difference between Design Thinking, Lean Startup and Agile? I often get asked what the difference is between those terms. “Is lean startup the opposite of design thinking? Oh no, maybe it is the same?” and “Ah ok, so you mean agile?” or “I think Agile is a better word for it”. Those are some of the comments I get whenever I talk about one of the terms above. I will try to clarify what these terms relate to, and how they can be integrated with each other.

Design thinking

Design thinking is an iterative process in which we strive to understand the user’s pain, challenge assumptions, and redefine problems in order to create new strategies and solutions. As opposed to “Brainstorming”, Design thinking promotes “Painstorming”, in order to fully understand the user’s pain. The usual Design thinking phases are the following:

Empathize with your users
Define your users’ needs, their problem, and your insights
Ideate by challenging assumptions and creating ideas for in...

How Pinterest runs Kafka at scale

Pinterest runs one of the largest Kafka deployments in the cloud. We use Apache Kafka extensively as a message bus to transport data and to power real-time streaming services, ultimately helping more than 250 million Pinners around the world discover and do what they love. As mentioned in an earlier post, we use Kafka to transport data to our data warehouse, including critical events like impressions, clicks, close-ups, and repins. We also use Kafka to transport visibility metrics for our internal services. If the metrics-related Kafka clusters have any glitches, we can’t accurately monitor our services or generate alerts that signal issues. On the real-time streaming side, Kafka is used to power many streaming applications, such as fresh content indexing and recommendation, spam detection and filtering, real-time advertiser budget computation, and so on. We’ve shared our experiences at the Kafka Summit 2018 on incremental DB ingestion using Kafka, and building real-time ads pl...

Event Sourcing with AWS Lambda

This article explores one possible way to implement the event sourcing pattern on AWS using AWS services. Event sourcing is a pattern that involves saving every state change to your application, allowing you to rebuild the application state from scratch via event playback.

When to Use

Event sourcing adds extra complexity to your application and is overkill for many use cases, but it can be invaluable in the right circumstances. The following are some instances of when you would want to implement the event sourcing pattern:

When you need an audit log of every event in the system or microservice, as opposed to only the current state of the system.
When you need to be able to replay events after code bug fixes to fix data.
When you need to be able to reverse events.
When you want to use the Event Log for debugging.
When it’s valuable to expose the Event Log to support personnel to fix account-specific issues by knowing exactly how the account got into a compromised state.
...
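The idea of rebuilding state from event playback can be illustrated with a minimal in-memory sketch. This is not the article's AWS implementation: the `Event` shape, the `apply_event` and `replay` helpers, and the account example are all hypothetical names chosen for illustration, and a plain Python list stands in for a durable event store.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    """One immutable state change (hypothetical shape, not from the article)."""
    account_id: str
    kind: str    # "deposited" or "withdrawn"
    amount: int


def apply_event(balance: int, event: Event) -> int:
    """Return the new balance after applying a single event."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events: Iterable[Event], account_id: str) -> int:
    """Rebuild one account's balance from scratch by replaying the log in order."""
    balance = 0
    for event in events:
        if event.account_id == account_id:
            balance = apply_event(balance, event)
    return balance


# The append-only log is the source of truth; current state is derived from it.
event_log: List[Event] = [
    Event("acct-1", "deposited", 100),
    Event("acct-1", "withdrawn", 30),
    Event("acct-2", "deposited", 50),
    Event("acct-1", "deposited", 5),
]

if __name__ == "__main__":
    print(replay(event_log, "acct-1"))
```

Because the balance is never stored directly, fixing a bug in `apply_event` and replaying the same log is enough to correct derived state, which is the replay use case listed above.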

What's the future of the pandas library?

Pandas is a powerful, open source Python library for data analysis, manipulation, and visualization. I've been teaching data scientists to use pandas since 2014, and in the years since, it has grown in popularity to an estimated 5 to 10 million users and become a "must-use" tool in the Python data science toolkit. I started using pandas around version 0.14.0, and I've followed the library as it has significantly matured to its current version, 0.23.4. But numerous data scientists have asked me questions like these over the years: "Is pandas reliable?" "Will it keep working in the future?" "Is it buggy? They haven't even released version 1.0!" Version numbers can be used to signal the maturity of a product, and so I understand why someone might be hesitant to rely on "pre-1.0" software. But in the world of open source, version numbers don't necessarily tell you anything about the maturity or reliability ...

Deep Speech With Apache NiFi 1.8

Tools: Python 3.6, PyAudio, TensorFlow, Deep Speech, Shell, Apache NiFi
Why: Speech-to-Text
Use Case: Voice control and recognition.
Series: Holiday Use Case: Turn on Holiday Lights and Music on command.
Cool Factor: Ever want to run a query on Live Ingested Voice Commands?
Other Options: Voice Controlled with AIY Voice and NiFi

We are using Python 3.6 to write some code around PyAudio, TensorFlow, and Deep Speech to capture audio, store it in a wave file, and then process it with Deep Speech to extract some text. This example is running on OSX without a GPU on TensorFlow v1.11. The Mozilla GitHub repo for their Deep Speech implementation has nice getting-started information that I used to integrate our flow with Apache NiFi.

Apache NiFi Flow
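The "store it in a wave file" step can be sketched with Python's standard `wave` module. This is only an illustration under stated assumptions: the microphone capture through PyAudio is replaced here by a synthesized tone so the sketch runs without audio hardware, and the 16 kHz, mono, 16-bit format is an assumed input format that should be checked against the Deep Speech model's documentation. The `synth_tone` and `write_wav` helpers are hypothetical names, not part of the original flow.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # Hz; assumed model input rate, verify for your model
CHANNELS = 1         # mono
SAMPLE_WIDTH = 2     # bytes per sample (16-bit PCM)


def synth_tone(seconds: float, freq: float = 440.0) -> bytes:
    """Stand-in for captured audio: a sine tone as little-endian 16-bit PCM."""
    n_samples = int(SAMPLE_RATE * seconds)
    samples = (
        int(0.3 * 32767 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE))
        for i in range(n_samples)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


def write_wav(path: str, pcm: bytes) -> None:
    """Write raw PCM bytes to a WAV file with the header fields set above."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH)
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(pcm)


if __name__ == "__main__":
    write_wav("command.wav", synth_tone(1.0))
```

In a real flow, the bytes passed to `write_wav` would come from the audio capture step instead of `synth_tone`, and the resulting file would be handed to the speech-to-text step.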

What’s new in Hortonworks DataFlow 3.3?

With the upcoming HDP 3.1 release, we also bring some exciting innovations to enhance our Kafka offering:

New Hive Kafka Storage Handler (for SQL Analytics) – View Kafka topics as tables and execute SQL via Hive with full SQL support for joins, windowing, aggregations, etc.
New Druid Kafka Indexing Service (for OLAP Analytics) – View Kafka topics as cubes and perform OLAP-style analytics on streaming events in Kafka using Druid.

HDF 3.3 includes the following major innovations and enhancements:

Core HDF Enhancements

Support for Kafka 2.0, the latest Kafka release in the Apache community, with lots of enhancements to security, reliability and performance.
Support for Kafka 2.0 NiFi processors
NiFi Connection load balancing – This feature allows bottleneck connections in the NiFi workflow to spread the queued-up flow files across the NiFi cluster, increasing the processing speed and lessening the effect of the bottleneck.
MQTT performance improvements inc...

KSQL: The Open Source SQL Streaming Engine for Apache Kafka

The rapidly expanding world of stream processing can be daunting, with new concepts such as various types of time semantics, windowed aggregates, changelogs, and programming frameworks to master. KSQL is an open-source, Apache 2.0 licensed streaming SQL engine on top of Apache Kafka which aims to simplify all this and make stream processing available to everyone. Even though it is simple to use, KSQL is built for mission-critical and scalable production deployments (using Kafka Streams under the hood). Benefits of using KSQL include: no coding required; no additional analytics cluster needed; streams and tables as first-class constructs; access to the rich Kafka ecosystem. This session introduces the concepts and architecture of KSQL. Use cases such as streaming ETL, real-time stream monitoring, and anomaly detection are discussed. A live demo shows how to set up and use KSQL quickly and easily on top of your Kafka ecosystem. Key takeaways: KSQL includes access to the rich...