Posts

Showing posts from July, 2016

The road to a collaborative self-service model

In a previous blog we discussed how to enable a highly collaborative and data-driven organization through the concepts of multi-speed or bimodal IT. We then expanded on this with a discussion of the overall information and analytics lifecycle and the interaction of five personas across that lifecycle. You can read those blogs here:

Multi-speed IT drives fast business experiments and empowered citizen analysts
Enabling a highly collaborative and data-driven organization

Interestingly enough, Forrester Research recently published a report titled “The False Promise of Bimodal IT”, which was referenced in an article on CIO.com. Forrester argues that this paradigm is fundamentally a mistake because it creates a two-class system: a slow-moving entity focused on back-office systems (IT) alongside a second group focused on fast rollout of digital products. From an organizational perspective the arguments being made are valid, but when I think of th…

Benchmarking and Latency

An article by Tyler Treat (bravenewgeek.com) explains why you should be very conscious of your monitoring and benchmarking tools and the data they report. HdrHistogram is a tool that lets you capture latency measurements while retaining high resolution. It also includes facilities for correcting coordinated omission and for plotting latency distributions. The original version of HdrHistogram was written in Java, but ports exist for many other languages. Details: https://bravenewgeek.com/2015/12/
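To make that concrete, here is a minimal sketch using the Java version of HdrHistogram. The measured operation and the one-millisecond expected interval are hypothetical stand-ins; recordValueWithExpectedInterval is the library's facility for compensating for coordinated omission.

    import org.HdrHistogram.Histogram;

    public class LatencyExample {
        public static void main(String[] args) {
            // Track values from 1 ns up to 1 hour, with 3 significant digits of precision
            Histogram histogram = new Histogram(3600_000_000_000L, 3);

            long expectedIntervalNanos = 1_000_000L; // hypothetical 1 ms expected interval

            for (int i = 0; i < 100_000; i++) {
                long start = System.nanoTime();
                doWork(); // hypothetical operation under test
                long latency = System.nanoTime() - start;

                // Corrects for coordinated omission: if a response stalls past the
                // expected interval, synthetic samples are back-filled for the misses
                histogram.recordValueWithExpectedInterval(latency, expectedIntervalNanos);
            }

            // Print the percentile distribution, scaling nanoseconds to milliseconds
            histogram.outputPercentileDistribution(System.out, 1_000_000.0);
        }

        private static void doWork() {
            // placeholder for the operation being measured
        }
    }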

Top 5 Mistakes to Avoid When Writing Apache Spark Applications

Be careful in managing the DAG. People often make mistakes in controlling the DAG, so keep the following points in mind (a short sketch follows this list):

Always use reduceByKey instead of groupByKey: reduceByKey and groupByKey perform almost the same function, but groupByKey shuffles the entire data set across the network, while reduceByKey combines values locally on each executor first. Prefer reduceByKey wherever possible.

Stay away from shuffles as much as possible: keep the map side as small as possible, do not waste time in excessive partitioning, avoid unnecessary shuffles, and watch out for skewed data and skewed partitions.

Prefer treeReduce over reduce: treeReduce does much more of the aggregation work on the executors, whereas reduce brings every partial result back to the driver.

Maintain the required size of the shuffle blocks: in a shuffle operation, the task that emits the data on the source executor is the “mapper”, the task that consumes the data on the target executor is the “reducer”, and what happens between t…
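To illustrate the reduceByKey and treeReduce points above, here is a minimal sketch against the Spark Java API; the word data and the local master are hypothetical stand-ins, not part of the original post.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    import java.util.Arrays;

    public class SparkDagExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("dag-example").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaPairRDD<String, Integer> pairs = sc.parallelize(
                        Arrays.asList("a", "b", "a", "c", "b", "a"))
                        .mapToPair(word -> new Tuple2<>(word, 1));

                // Preferred: reduceByKey combines values on each executor before the
                // shuffle, so only one partial sum per key crosses the network
                JavaPairRDD<String, Integer> counts = pairs.reduceByKey(Integer::sum);

                // Anti-pattern: groupByKey would ship every (key, value) pair across
                // the network before any aggregation happens:
                // pairs.groupByKey().mapValues(...)

                counts.collect().forEach(System.out::println);

                // treeReduce aggregates in multiple rounds on the executors instead
                // of pulling all partial results straight back to the driver
                JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
                int total = numbers.treeReduce(Integer::sum);
                System.out.println("total = " + total);
            }
        }
    }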

Apache Beam: A unified model for batch and stream processing data

Apache Beam, a new distributed processing tool currently being incubated at the ASF, provides an abstraction layer that lets developers focus on Beam code, using the Beam programming model. Thanks to Apache Beam, an implementation is agnostic to the runtime technology being used, meaning you can switch technologies quickly and easily. Apache Beam also offers a programming model that is agnostic in terms of coverage: the programming model is unified, which allows developers to implement both batch and streaming data processing. That is actually where the Apache Beam name comes from: B (for Batch) and EAM (for strEAM). To implement your data processes using the Beam programming model, you use an SDK or DSL provided by Beam. Right now there is really only one SDK: the Java SDK. However, a Python SDK is expected, and Beam plans to provide a Scala SDK and additional DSLs (a declarative DSL with XML, for instance) soon. With Apache Beam, first…
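As a taste of the unified model, here is a minimal word-count sketch with the Java SDK. The API shown is the one that stabilized after incubation, so details may differ from early releases; the input and output paths are placeholders.

    import java.util.Arrays;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BeamWordCount {
        public static void main(String[] args) {
            PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
            Pipeline pipeline = Pipeline.create(options);

            pipeline
                .apply("ReadLines", TextIO.read().from("input.txt"))       // placeholder path
                .apply("SplitWords", FlatMapElements
                        .into(TypeDescriptors.strings())
                        .via((String line) -> Arrays.asList(line.split("\\s+"))))
                .apply("CountWords", Count.perElement())
                .apply("Format", MapElements
                        .into(TypeDescriptors.strings())
                        .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
                .apply("WriteCounts", TextIO.write().to("word-counts"));   // placeholder prefix

            // The same pipeline runs unchanged on any supported runner; the runner
            // is selected through the pipeline options, not in the code itself
            pipeline.run().waitUntilFinish();
        }
    }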

The role of Apache Atlas in the open metadata ecosystem

Introducing Apache Atlas

Apache Atlas emerged as an Apache incubator project in May 2015. It is scoped to provide an open source implementation for metadata management and governance. The initial focus was the Apache Hadoop environment, although Apache Atlas has no dependencies on the Hadoop platform itself. At its core, Apache Atlas has a graph database for storing metadata, a search capability based on Apache Lucene, and a simple notification service based on Apache Kafka. There is a type definition language for describing the metadata stored in the graph, and standard APIs for populating that metadata: business glossary terms, classification tags, data sources, and lineage.

The start of an ecosystem

What makes Apache Atlas different from other metadata solutions is that it is designed to ship with the platform where the data is stored. It is, in fact, a core component of the data platform. This means the different processes and engines that run on the platform…
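As a small illustration of those standard APIs, here is a hedged Java sketch that queries an Atlas server over REST. The host is hypothetical, authentication is omitted, and the DSL search endpoint shown comes from the v2 REST API of later Atlas releases, so adjust it to the version you run.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class AtlasDslSearch {
        public static void main(String[] args) throws Exception {
            // Hypothetical Atlas server; 21000 is the usual default port.
            // The DSL search endpoint is from the v2 REST API of later releases.
            URL url = new URL("http://atlas-host:21000/api/atlas/v2/search/dsl?query=hive_table");

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("Accept", "application/json");

            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON listing of matching entities
                }
            }
        }
    }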