Posts

Showing posts from March, 2019

Building and Scaling Data Lineage at Netflix

Netflix Data Landscape: Freedom & Responsibility (F&R) is the linchpin of Netflix’s culture, empowering teams to move fast, deliver on innovation, and operate with the freedom to satisfy their mission. Central engineering teams provide paved paths (secure, vetted and supported options) and guardrails that reduce variance in the choice of tools and technologies and support the development of scalable technical architectures. Nonetheless, the Netflix data landscape (see below) is complex, and many teams share responsibility for managing our data systems. Building a complete and accurate data lineage system that maps out all the data artifacts (including in-motion and at-rest data repositories, Kafka topics, apps, reports and dashboards, interactive and ad-hoc analysis queries, ML and experimentation models) is therefore a monumental task that requires a scalable architecture, robust design, a strong engineering team and above all, amazing cross-f

How Facebook Scales Machine Learning

A look at the software and hardware considerations Facebook made to successfully scale its AI/ML infrastructure, based on an excellent talk given by Yangqing Jia, Facebook’s Director of AI Infrastructure, at the Scaled Machine Learning Conference.

Serverless Computing: One Step Forward, Two Steps Back

Serverless computing offers the potential to program the cloud in an autoscaling, pay-as-you-go manner. In this paper we address critical gaps in first-generation serverless computing, which place its autoscaling potential at odds with dominant trends in modern computing: notably data-centric and distributed computing, but also open source and custom hardware. Put together, these gaps make current serverless offerings a bad fit for cloud innovation and particularly bad for data systems innovation. In addition to pinpointing some of the main shortfalls of current serverless architectures, we raise a set of challenges we believe must be met to unlock the radical potential that the cloud, with its exabytes of storage and millions of cores, should offer to innovative developers.

Data scientist salaries and jobs in Europe - 2018 snapshot

Glassdoor names “Data Scientist” as the best job in the United States for 2019 and LinkedIn ranks it number one among the top 10. Topping the list for four years in a row, Data Scientist has a job score of 4.7 and a job satisfaction rating of 4.3, with 6,510 open positions paying a median base salary of $108,000 in the U.S. But what is the scenario for Data Scientists in Europe? What is the demand and supply? Which countries in the EU are the best destinations for Data Scientists and what salaries can they expect? A recent report titled Data Science Salary Report 2019 Europe by Big Cloud answers some of these critical questions. First, a little flashback: According to a report by the European Commission in 2017, the number of data workers in Europe will increase up to 10.43 million, with a compound average growth rate of 14.1% by 2020. The EU is forecast to face a data skills gap corresponding to 769,000 unfilled positions by 2020 in the baseline scenario, concentrated in particula

Effective Spark DataFrames With Alluxio

Many organizations deploy Alluxio together with Spark for performance gains and data manageability benefits. Qunar recently deployed Alluxio in production, and their Spark streaming jobs sped up by 15x on average and up to 300x during peak times. They noticed that some Spark jobs would slow down or would not finish, but with Alluxio, those jobs could finish quickly. In this blog post, we investigate how Alluxio helps Spark be more effective. Alluxio increases performance of Spark jobs, helps Spark jobs perform more predictably, and enables multiple Spark jobs to share the same data from memory. Previously, we investigated how Alluxio is used for Spark RDDs. In this article, we investigate how to effectively use Spark DataFrames with Alluxio. Alluxio and Spark Cache Storing Spark DataFrames in Alluxio memory is very simple, and only requires saving the DataFrame as a file to Alluxio. This is very simple with the Spark DataFrame write API. DataFrames are commonly written as parquet fi
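As a rough illustration of the "save the DataFrame as a file to Alluxio" step mentioned in the excerpt above, here is a minimal PySpark-style sketch. It assumes the Alluxio client jar is already on the Spark classpath and that the Alluxio master listens on its default RPC port 19998. The helper names `alluxio_uri`, `save_to_alluxio` and `load_from_alluxio`, as well as the host and path values, are hypothetical and not taken from the original post. The only Spark calls used are `df.write.parquet(...)` and `spark.read.parquet(...)`.

```python
def alluxio_uri(host: str, port: int, path: str) -> str:
    """Build an alluxio:// URI for a path in the Alluxio namespace.

    Hypothetical helper; host, port and path are supplied by the caller.
    """
    if not path.startswith("/"):
        path = "/" + path
    return f"alluxio://{host}:{port}{path}"


def save_to_alluxio(df, uri: str) -> str:
    """Write a Spark DataFrame to Alluxio as Parquet via the DataFrame write API.

    `df` is expected to be a pyspark.sql.DataFrame (anything exposing
    `df.write.parquet(path)` works). Returns the URI that was written.
    """
    df.write.parquet(uri)
    return uri


def load_from_alluxio(spark, uri: str):
    """Read the Parquet data back; `spark` is expected to be a SparkSession."""
    return spark.read.parquet(uri)
```

With a live SparkSession named `spark` and a DataFrame `df`, usage would look something like `uri = alluxio_uri("localhost", 19998, "/data/events.parquet")`, then `save_to_alluxio(df, uri)` in one job and `load_from_alluxio(spark, uri)` in another, which is how several jobs could share the same data through Alluxio.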

Introducing Ludwig, a Code-Free Deep Learning Toolbox

Over the last decade, deep learning models have proven highly effective at performing a wide variety of machine learning tasks in vision, speech, and language. At Uber we are using these models for a variety of tasks, including customer support, object detection, improving maps, streamlining chat communications, forecasting, and preventing fraud. Many open source libraries, including TensorFlow, PyTorch, CNTK, MXNET, and Chainer, among others, have implemented the building blocks needed to build such models, allowing for faster and less error-prone development. This, in turn, has propelled the adoption of such models both by the machine learning research community and by industry practitioners, resulting in fast progress in both architecture design and industrial solutions. At Uber AI, we decided to avoid reinventing the wheel and to develop packages built on top of the strong foundations open source libraries provide. To this end, in 2017 we released Pyro, a deep probabilistic program

Gartner - 2019 Magic Quadrant for Data Management Solutions for Analytics

Gartner defines a data management solution for analytics (DMSA) as a complete software system that supports and manages data in one or many file management systems, most commonly a database or multiple databases. These management systems include specific optimization strategies designed for supporting analytical processing — including, but not limited to, relational processing, nonrelational processing (such as graph processing), and machine learning or programming languages such as Python or R. Data is not necessarily stored in a relational structure, and can use multiple data models — relational, XML, JavaScript Object Notation (JSON), key-value, graph, geospatial and others. Our definition also states that: A DMSA is a system for storing, accessing, processing and delivering data intended for one or more of the four primary use cases Gartner identifies that support analytics (see Note 1). A DMSA is not a specific class or type of technology; it is a use case. A DMSA ma

Cloud Migration Best Practices: How to Move Your Project to Kubernetes

Moving your app or web services to the cloud is more of a must than an option these days. The cloud infrastructure we have today is not only more capable and more stable, but also more scalable. By moving to the cloud, you gain a lot of benefits while also significantly reducing operational stress and costs. A more popular option now is to move to a container-based cloud environment, with Kubernetes being the most popular method to do so — and the most scalable in the long run. There are three approaches you can take and certain best practices to follow, which we are going to look into in this article. Why Kubernetes? Before we get to the best practices and methods you can use to migrate to Kubernetes, it is worth taking the time to understand why the container-based environment provided by Kubernetes is the way to go. For starters, Kubernetes offers the most flexibility when it comes to setting up your cloud environment. Kubernetes has two major parts: the master cluster and no