Posts

Showing posts from 2018

Design Thinking, Lean Startup and Agile

What is the difference between Design Thinking, Lean Startup and Agile? I often get asked what the difference is between these terms. “Is Lean Startup the opposite of Design Thinking? Oh no, maybe it is the same?”, “Ah, ok, so you mean Agile?” or “I think Agile is a better word for it” are some of the comments I get whenever I talk about one of the terms above. Here I will try to clarify what these terms refer to, and how they can be integrated with each other. Design thinking Design thinking is an iterative process in which we strive to understand the user’s pain, challenge assumptions, and redefine problems in order to create new strategies and solutions. As opposed to “brainstorming”, Design thinking promotes “painstorming”, in order to fully understand the user’s pain. The usual Design thinking phases are the following: Empathize with your users; Define your users’ needs, their problem, and your insights; Ideate by challenging assumptions and creating ideas for in

How Pinterest runs Kafka at scale

Pinterest runs one of the largest Kafka deployments in the cloud. We use Apache Kafka extensively as a message bus to transport data and to power real-time streaming services, ultimately helping more than 250 million Pinners around the world discover and do what they love. As mentioned in an earlier post, we use Kafka to transport data to our data warehouse, including critical events like impressions, clicks, close-ups, and repins. We also use Kafka to transport visibility metrics for our internal services. If the metrics-related Kafka clusters have any glitches, we can’t accurately monitor our services or generate alerts that signal issues. On the real-time streaming side, Kafka is used to power many streaming applications, such as fresh content indexing and recommendation, spam detection and filtering, real-time advertiser budget computation, and so on. We’ve shared our experiences at the Kafka Summit 2018 on incremental db ingestion using Kafka, and building real-time ads pl
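For readers unfamiliar with the "message bus" idea, here is a minimal sketch of publishing an event to a Kafka topic with the kafka-python client. The broker address, topic name, and event fields are illustrative placeholders, not Pinterest's actual pipeline code.

```python
# Minimal sketch of publishing an event to Kafka with kafka-python.
# Broker address, topic name, and event fields are illustrative only.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# An "impression"-style event; downstream consumers (e.g. warehouse loaders
# or streaming jobs) would read it from the same topic.
event = {"type": "impression", "pin_id": 12345, "user_id": 678, "ts": 1546300800}
producer.send("events", event)
producer.flush()
```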

Event Sourcing with AWS Lambda

This article will explore one possible way to implement the event sourcing pattern in AWS using AWS services. Event sourcing is a pattern that involves saving every state change to your application, allowing you to rebuild the application state from scratch via event playback. When to Use: Event sourcing adds extra complexity to your application and is overkill for many use cases, but it can be invaluable in the right circumstances. The following are some instances of when you would want to implement the event sourcing pattern: when you need an audit log of every event in the system or microservice, as opposed to only the current state of the system; when you need to be able to replay events after code bug fixes to fix data; when you need to be able to reverse events; when you want to use the Event Log for debugging; when it’s valuable to expose the Event Log to support personnel to fix account-specific issues by knowing exactly how the account got into a compromised state; whe
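As a rough illustration of the pattern itself (not tied to any particular AWS service), the sketch below keeps an append-only event log and derives the current state purely by replaying it; the event names and account model are hypothetical.

```python
# Minimal event-sourcing sketch: state is never stored directly;
# it is rebuilt by replaying the append-only event log.
# Event names and the account model are hypothetical.

events = []  # the event log (in AWS this might live in Kinesis, DynamoDB, or S3)

def append(event_type, payload):
    events.append({"type": event_type, "payload": payload})

def rebuild_balance(event_log):
    """Replay every event from scratch to derive the current state."""
    balance = 0
    for event in event_log:
        if event["type"] == "Deposited":
            balance += event["payload"]["amount"]
        elif event["type"] == "Withdrawn":
            balance -= event["payload"]["amount"]
    return balance

append("Deposited", {"amount": 100})
append("Withdrawn", {"amount": 30})
append("Deposited", {"amount": 5})

print(rebuild_balance(events))  # 75 -- current state derived purely from the log
```

Because the log is the source of truth, fixing a bug in `rebuild_balance` and replaying the events is enough to repair derived data, which is exactly the "replay after code bug fixes" use case above.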

What's the future of the pandas library?

Pandas is a powerful, open source Python library for data analysis, manipulation, and visualization. I've been teaching data scientists to use pandas since 2014, and in the years since, it has grown in popularity to an estimated 5 to 10 million users and become a "must-use" tool in the Python data science toolkit. I started using pandas around version 0.14.0, and I've followed the library as it has significantly matured to its current version, 0.23.4. But numerous data scientists have asked me questions like these over the years: "Is pandas reliable?" "Will it keep working in the future?" "Is it buggy? They haven't even released version 1.0!" Version numbers can be used to signal the maturity of a product, and so I understand why someone might be hesitant to rely on "pre-1.0" software. But in the world of open source, version numbers don't necessarily tell you anything about the maturity or reliability

Deep Speech With Apache NiFi 1.8

Tools: Python 3.6, PyAudio, TensorFlow, Deep Speech, Shell, Apache NiFi
Why: Speech-to-Text
Use Case: Voice control and recognition
Series: Holiday Use Case: Turn on holiday lights and music on command
Cool Factor: Ever want to run a query on live ingested voice commands?
Other Options: Voice controlled with AIY Voice and NiFi
We are using Python 3.6 to write some code around PyAudio, TensorFlow, and Deep Speech to capture audio, store it in a wave file, and then process it with Deep Speech to extract some text. This example runs on OS X without a GPU on TensorFlow v1.11. The Mozilla GitHub repo for their Deep Speech implementation has nice getting-started information that I used to integrate our flow with Apache NiFi. Apache NiFi Flow   Read full article >>>
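A minimal sketch of the capture-and-transcribe step described above: record a few seconds of microphone audio with PyAudio, write a 16 kHz mono WAV file (the artifact a NiFi flow could pick up), and pass the samples to Deep Speech. File and model paths are placeholders, and the deepspeech `Model`/`stt()` signatures have changed between releases, so check the version you install.

```python
# Sketch: record audio with PyAudio, save it as a 16 kHz mono WAV file,
# then transcribe it with Mozilla Deep Speech.
# Paths are placeholders; the deepspeech API differs across releases.
import wave
import numpy as np
import pyaudio
from deepspeech import Model

RATE, SECONDS, WAV_PATH = 16000, 5, "command.wav"

# --- capture audio from the default input device ---
pa = pyaudio.PyAudio()
sample_width = pa.get_sample_size(pyaudio.paInt16)
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=1024)
frames = [stream.read(1024) for _ in range(int(RATE / 1024 * SECONDS))]
stream.stop_stream()
stream.close()
pa.terminate()

# --- write the wave file (this is what the NiFi flow would ingest) ---
with wave.open(WAV_PATH, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(sample_width)
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))

# --- transcribe with Deep Speech (model path is a placeholder) ---
model = Model("output_graph.pbmm")   # newer releases take just the model path
audio = np.frombuffer(b"".join(frames), dtype=np.int16)
print(model.stt(audio))              # older releases also took a sample rate
```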

What’s new in Hortonworks DataFlow 3.3?

With the upcoming HDP 3.1 release, we also bring some exciting innovations to enhance our Kafka offering: a new Hive Kafka Storage Handler (for SQL analytics), which lets you view Kafka topics as tables and execute SQL via Hive, with full SQL support for joins, windowing, aggregations, etc.; and a new Druid Kafka Indexing Service (for OLAP analytics), which lets you view Kafka topics as cubes and perform OLAP-style analytics on streaming events in Kafka using Druid. HDF 3.3 includes the following major innovations and enhancements: Core HDF enhancements: support for Kafka 2.0, the latest Kafka release in the Apache community, with many enhancements to security, reliability and performance; support for Kafka 2.0 NiFi processors; NiFi connection load balancing, which allows bottleneck connections in a NiFi workflow to spread their queued-up flow files across the NiFi cluster, increasing processing speed and lessening the effect of the bottleneck; MQTT performance improvements inc
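To make the "Kafka topics as tables" idea concrete, here is a rough sketch of declaring an external Hive table backed by the Kafka storage handler, submitted through the PyHive client; the table schema, topic name, broker list, and HiveServer2 host are placeholders, not part of the HDF announcement.

```python
# Sketch: exposing a Kafka topic as a Hive external table so it can be
# queried with SQL. Host, schema, topic, and broker addresses are placeholders.
from pyhive import hive

ddl = """
CREATE EXTERNAL TABLE clicks_stream (
  user_id BIGINT,
  page STRING,
  ts TIMESTAMP
)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  'kafka.topic' = 'clicks',
  'kafka.bootstrap.servers' = 'kafkabroker:9092'
)
"""

conn = hive.Connection(host="hiveserver2-host", port=10000, username="analyst")
cursor = conn.cursor()
cursor.execute(ddl)

# Once declared, the topic can be queried like any other table:
cursor.execute("SELECT page, COUNT(*) FROM clicks_stream GROUP BY page")
print(cursor.fetchall())
```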

KSQL: The Open Source SQL Streaming Engine for Apache Kafka

The rapidly expanding world of stream processing can be daunting, with new concepts to master such as various types of time semantics, windowed aggregates, changelogs, and programming frameworks. KSQL is an open-source, Apache 2.0-licensed streaming SQL engine on top of Apache Kafka that aims to simplify all this and make stream processing available to everyone. Even though it is simple to use, KSQL is built for mission-critical and scalable production deployments (using Kafka Streams under the hood). Benefits of using KSQL include: no coding required; no additional analytics cluster needed; streams and tables as first-class constructs; access to the rich Kafka ecosystem. This session introduces the concepts and architecture of KSQL. Use cases such as streaming ETL, real-time stream monitoring, and anomaly detection are discussed. A live demo shows how to set up and use KSQL quickly and easily on top of your Kafka ecosystem. Key takeaways: KSQL includes access to the rich Apa
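The "no coding required" point is that KSQL statements are plain SQL; the sketch below simply submits one such statement to a KSQL server's REST endpoint from Python. The server URL, topic, stream, and column names are placeholders, so check the KSQL documentation for your release for the exact endpoint and headers.

```python
# Sketch: submitting a KSQL statement to the KSQL server REST endpoint.
# Server URL, stream name, and columns are placeholders.
import json
import requests

statement = """
CREATE STREAM pageviews_clean AS
  SELECT userid, pageid
  FROM pageviews
  WHERE pageid IS NOT NULL;
"""

resp = requests.post(
    "http://localhost:8088/ksql",
    headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
    data=json.dumps({"ksql": statement, "streamsProperties": {}}),
)
print(resp.status_code, resp.json())
```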

AGILE-IoT: More Than Just Another IoT Project

The AGILE-IoT project ( www.agile-iot.eu ), co-funded by the Horizon 2020 programme of the European Union, aims to address this concern by providing a solution based on four main pillars: Agnosticity: depending on the technical background of the platform user, the background of their organisation, or the software components they have to (re-)use, we cannot predict what the programming language of the solution will be. It might be built with a combination of languages, some of them compiled (e.g., C, C++), others translated into intermediate languages (e.g., Java, Python) and, finally, some others interpreted (e.g., JavaScript). If users choose a platform because of the programming language(s) it supports, they may limit their options for developing their solution. AGILE-IoT, by leveraging a micro-service-based architecture, supports all the programming languages a platform user might require to implement their solution. Openness: Lots of platforms are provided by

IEEE IoT - Nine IoT Predictions for 2019

By 2020, the Internet of Things (IoT) is predicted to generate an additional $344B in revenues, as well as to drive $177B in cost reductions. IoT and smart devices are already increasing the performance metrics of major US-based factories. They are in the hands of employees, covering routine management issues and boosting their productivity by 40-60% [1]. The following list of predictions (Figure 1) explores the state of IoT in 2019 and covers IoT’s impact on many aspects of business and technology, including Digital Transformation, Blockchain, AI, and 5G. Read full article >>>

Getting started with Apache Airflow

In this post, I am going to discuss Apache Airflow, a workflow management system developed by Airbnb. Earlier I discussed writing basic ETL pipelines in Bonobo. Bonobo is cool for writing ETL pipelines, but the world is not all about writing ETL pipelines to automate things. There are other use cases in which you have to perform tasks in a certain order, once or periodically. For instance: monitoring cron jobs; transferring data from one place to another; automating your DevOps operations; periodically fetching data from websites and updating the database for your awesome price comparison system; data processing for recommendation-based systems; machine learning pipelines. The possibilities are endless. Before we move on to implementing Airflow in our systems, let’s discuss what Airflow actually is and its terminology. What is Airflow? From the website: Airflow is a platform to programmatically author, schedule and monitor workflows. Us
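To give a feel for what "programmatically author" means, here is a minimal DAG definition with two bash tasks that run daily, one after the other; the DAG id, schedule, and commands are examples only, using the Airflow 1.x import paths current at the time of writing.

```python
# Minimal Airflow DAG sketch: two bash tasks run daily, one after the other.
# DAG id, schedule, and commands are examples only (Airflow 1.x import paths).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {"owner": "airflow", "start_date": datetime(2018, 1, 1)}

dag = DAG("price_comparison_etl", default_args=default_args,
          schedule_interval="@daily", catchup=False)

fetch = BashOperator(task_id="fetch_prices",
                     bash_command="python /opt/etl/fetch_prices.py",
                     dag=dag)

load = BashOperator(task_id="load_database",
                    bash_command="python /opt/etl/load_db.py",
                    dag=dag)

fetch >> load  # fetch_prices must finish before load_database starts
```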

Understanding How Apache Pulsar Works

I will be writing a series of blog posts about Apache Pulsar, including some Kafka vs Pulsar posts. First up though, I will be running some chaos tests on a Pulsar cluster, like I have done with RabbitMQ and Kafka, to see what failure modes it has and what its message loss scenarios are. I will try to do this by exploiting design defects, implementation bugs, or poor configuration on the part of the admin or developer. In this post we’ll go through the Apache Pulsar design so that we can better design the failure scenarios. This post is not for people who want to understand how to use Apache Pulsar, but for those who want to understand how it works. I have struggled to write a clear overview of its architecture in a way that is simple and easy to understand. I appreciate any feedback on this write-up. Claims The main claims that I am interested in are: guarantees of no message loss (if the recommended configuration is applied and your whole data center doesn’t burn to the ground); strong