Showing posts from May, 2016

Why Apache Beam? A Google Perspective

When we made the decision (in partnership with data Artisans, Cloudera, Talend, and a few other companies) to move the Google Cloud Dataflow SDK and runners into the Apache Beam incubator project, we did so with the following goal in mind: provide the world with an easy-to-use but powerful model for data-parallel processing, both streaming and batch, portable across a variety of runtime platforms. Now that the dust on the initial code drops is starting to settle, we wanted to talk briefly about why this makes sense for us at Google and how we got here, given that Google hasn’t historically been directly involved in the OSS world of data processing.

Why does this make sense for Google?

Google is a business, and as such, it should come as no surprise that there’s a business motivation for us behind the Apache Beam move. That motivation hinges primarily on the desire to get as many Apache Beam pipelines as possible running on Cloud Dataflow. Given that, it may not seem intu

Genome Analysis Toolkit and Apache Spark

Users of the latest release of the Genome Analysis Toolkit, an open source framework for analyzing high-throughput DNA sequencing data, can now choose Apache Spark for data processing. Ever since the Human Genome Project produced the first draft sequence of the human genome in 2000, the cost of sequencing has dropped exponentially, from around US$100 million per genome then to around US$1,000 today. Over the same period, we have seen massive growth in the storage and processing capabilities of big data technologies like Apache Hadoop. It’s very fitting, then, to use tools from the Hadoop ecosystem for genomics, which is why Cloudera, in cooperation with the Broad Institute and other industry partners, is pleased to announce the alpha release of the Genome Analysis Toolkit (GATK) version 4 running on Apache Spark.

Details: http://blog.cloudera.com/blog/2016/04/genome-analysis-toolkit-now-using-apache-spark-for-data-processing/