Posts

Showing posts from May, 2019

Microsoft best practices of software engineering for machine learning

This paper explains best practices that Microsoft teams discovered and compiled while creating large-scale AI solutions for the marketplace.

Decoding ‘Game of Thrones’ by way of data science

With the final season of the television series ‘Game of Thrones’ upon us, it is a good opportunity to take a closer look at the books the series is based on. We will discover how numerical processing of the books can help us reveal patterns that lie hidden in ‘A Song of Ice and Fire’. How does one begin to objectively measure a book? Isn’t it all about the subjective experience in the mind of the reader? Indeed, there are many ways in which literary critics have tried to capture and communicate the essence and value of a book. A book, along with other forms of art, is often valued by the extent to which it can give us new and nuanced insights into our own human experience. A fantasy novel series such as ‘A Song of Ice and Fire’ sets its story in a more boundless landscape, allowing even more freedom to explore the hopes and fears that lie within us all. However, this article is not a literary critic’s review, but rather a data science exploration. This numerical exploration…
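As a minimal illustration of what "numerical processing" of a text can look like, the sketch below counts how often a few character names appear in a plain-text copy of one book. The file path and the character list are assumptions for illustration, not material from the article itself.

import re
from collections import Counter

# Hypothetical inputs: a plain-text copy of one book and a few character names.
BOOK_PATH = "a_game_of_thrones.txt"
CHARACTERS = ["Jon", "Tyrion", "Arya", "Daenerys", "Sansa"]

with open(BOOK_PATH, encoding="utf-8") as f:
    text = f.read()

# Tokenize into words and count how often each character name is mentioned.
words = re.findall(r"[A-Za-z']+", text)
counts = Counter(words)

for name in CHARACTERS:
    print(f"{name}: {counts[name]} mentions")

Even a frequency count this simple starts to surface patterns, such as which point-of-view characters dominate each book.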

What’s Behind Lyft’s Choices in Big Data Tech

Lyft was a late entrant to the ride-sharing business model, at least compared to its competitor Uber, which pioneered the concept and remains the largest provider. That delay in starting out actually gave Lyft a bit of an advantage in architecting its big data infrastructure in the cloud, as it was able to sidestep some of the challenges that Uber faced in building out its on-premises system. Lyft and Uber, like many of the young Silicon Valley companies shaking up established business models, aren’t shy about sharing information about their computing infrastructure. They both share an ethos of openness with regard to using and developing technology. That openness is also pervasive at Google, Facebook, Twitter, and other Valley outfits that created much of the big data ecosystem, most of which is, of course, open source. So when the folks at Lyft were blueprinting how to construct a system that could do all the things that a ride-sharing app has to do – tracking and connecting…

Python at Netflix

As many of us prepare to go to PyCon, we wanted to share a sampling of how Python is used at Netflix. We use Python throughout the full content lifecycle, from deciding which content to fund all the way to operating the CDN that serves the final video to 148 million members. We use and contribute to many open-source Python packages, some of which are mentioned below. If any of this interests you, check out the jobs site or find us at PyCon. We have donated a few Netflix Originals posters to the PyLadies Auction and look forward to seeing you all there. Open Connect: Open Connect is Netflix’s content delivery network (CDN). An easy, though imprecise, way of thinking about Netflix infrastructure is that everything that happens before you press Play on your remote control (e.g., are you logged in? what plan do you have? what have you watched so we can recommend new titles to you? what do you want to watch?) takes place in Amazon Web Services (AWS), whereas everything that happens afterwards (i.e., video streaming) takes place in the Open Connect network. …

How companies adopt and apply cloud native infrastructure

Survey results reveal the path organizations take as they integrate cloud native infrastructure and harness the full power of the cloud. Driven by the need for agility, scaling, and resiliency, organizations have spent more than a decade moving from “trying out the cloud” to a deeper, more sustained commitment to the cloud, including adopting cloud native infrastructure. This shift is an important part of a trend we call the Next Architecture, with organizations embracing the combination of cloud, containers, orchestration, and microservices to meet customer expectations for availability, features, and performance. To learn more about the motivations and challenges companies face in adopting cloud native infrastructure, we conducted a survey of 590 practitioners, managers, and CxOs from across the globe.[1] Key findings from the survey include: nearly 50% of respondents cited lack of skills as the top challenge their organizations face in adopting cloud native infrastructure. …

Scalable Log Analytics with Apache Spark: A Comprehensive Case-Study

Introduction: One of the most popular and effective enterprise use cases for analytics today is log analytics. Almost every organization, small or large, has multiple systems and pieces of infrastructure running day in and day out. To keep their business running effectively, organizations need to know whether their infrastructure is performing to its maximum potential. This involves analyzing system and application logs, and perhaps even applying predictive analytics to log data. The amount of log data is typically massive, depending on the type of organizational infrastructure and the applications running on it. Gone are the days when we were limited to analyzing a sample of the data on a single machine due to compute constraints. Powered by better distributed computing, big data processing, and open-source analytics frameworks like Spark, we can perform scalable log analytics on potentially millions or billions of log messages daily. …
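To make the idea concrete, here is a minimal PySpark sketch of the kind of parsing and aggregation such a case study walks through. The log file path and the Common Log Format regular expressions are assumptions for illustration, not code from the article.

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col

spark = SparkSession.builder.appName("log-analytics-sketch").getOrCreate()

# Hypothetical path to Apache access logs in Common Log Format.
logs = spark.read.text("access_log.txt")

# Pull out the host, HTTP status code, and response size with regular expressions.
parsed = logs.select(
    regexp_extract("value", r"^(\S+)", 1).alias("host"),
    regexp_extract("value", r"\s(\d{3})\s", 1).alias("status"),
    regexp_extract("value", r"\s(\d+)$", 1).cast("long").alias("bytes"),
)

# A typical log-analytics question: how are responses distributed by status code?
parsed.groupBy("status").count().orderBy(col("count").desc()).show()

Because the parsing and aggregation are expressed as Spark DataFrame operations, the same script scales from a sample file on a laptop to billions of log lines on a cluster.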

Amundsen — Lyft’s data discovery & metadata engine

The problem: Unprecedented growth in data volumes has led to two big challenges. Productivity — whether it’s building a new model, instrumenting a new metric, or doing ad hoc analysis, how can I most productively and effectively make use of this data? Compliance — when collecting data about a company’s users, how do organizations comply with increasing regulatory and compliance demands and uphold the trust of their users? The key to solving these problems lies not in the data, but in the metadata. To show you how, let’s go through the journey of how we solved part of the productivity problem at Lyft using metadata. Productivity: At a 50,000-foot level, the data scientist workflow looks like the following. Read full article >>>
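As a rough illustration of the kind of metadata that powers a discovery tool like Amundsen, the sketch below defines a generic table-metadata record and a naive keyword search over a handful of records. The fields and example data are assumptions for illustration, not Amundsen’s actual schema or API.

from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    # A minimal, generic record of what a table contains and who owns it.
    name: str
    description: str
    owner: str
    tags: list = field(default_factory=list)

CATALOG = [
    TableMetadata("rides.trips", "One row per completed ride", "data-eng", ["core"]),
    TableMetadata("finance.payouts", "Driver payout ledger", "finance", ["pii"]),
]

def search(catalog, keyword):
    """Return records whose name, description, or tags mention the keyword."""
    keyword = keyword.lower()
    return [
        t for t in catalog
        if keyword in t.name.lower()
        or keyword in t.description.lower()
        or any(keyword in tag.lower() for tag in t.tags)
    ]

for match in search(CATALOG, "ride"):
    print(match.name, "-", match.description)

Even this toy catalog shows the productivity argument: once descriptions, owners, and tags live alongside table names, finding the right dataset becomes a search problem rather than a round of Slack messages.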

Blockchain Requires Industry Collaboration: The Launch of INATBA

When the web was developed over 25 years ago, the technologies it put in place significantly lowered the cost of building a global company. Thanks to the internet, it has become possible to reach a large part of the global population simply from behind your computer. The companies that first understood the power of the web, and managed to execute their vision correctly, are now the leading global monopolies we are so familiar with: we use Google for finding information, Facebook or WeChat for social activities, Amazon to shop, Apple for our hardware, and so on. But times have been changing since Satoshi Nakamoto distributed a paper among a small group of cryptography enthusiasts. Fast forward 11 years, and the underlying technology of the proposed bitcoin is rapidly changing how we run our organisations. Blockchain is a fundamental technology that changes how we perform transactions, how we collaborate and how we build our organisations. Knowing what blockchain is and how it can contribute to improving…

How to Use Data Preparation to Accelerate Cloud Data Lake Adoption

This TDWI Checklist offers six steps for data preparation processes and solutions that can help accelerate cloud data lake adoption.  

Tuning Snowflake Performance Using the Query Cache

In terms of performance tuning in Snowflake, there are very few options available. However, it is worth understanding how the Snowflake architecture includes various levels of caching to help speed up your queries. This article provides an overview of the techniques used, and some best practice tips on how to maximise system performance using caching. Snowflake Database Architecture: Before starting, it’s worth considering the underlying Snowflake architecture and explaining when Snowflake caches data. The diagram below illustrates the overall architecture, which consists of three layers. Service Layer: accepts SQL requests from users, coordinates queries, and manages transactions and results. Logically, this can be assumed to hold the result cache – a cached copy of the results of every query executed. Compute Layer: does the heavy lifting. This is where the actual SQL is executed across the…
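A quick way to see the result cache in action is to run the same query twice and compare wall-clock times. The sketch below uses the snowflake-connector-python package with placeholder credentials and a placeholder query, all of which are assumptions for illustration rather than code from the article.

import time
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder connection details; replace with your own account and objects.
conn = snowflake.connector.connect(
    user="MY_USER",
    password="MY_PASSWORD",
    account="MY_ACCOUNT",
    warehouse="MY_WH",
    database="MY_DB",
    schema="PUBLIC",
)

QUERY = "SELECT COUNT(*) FROM my_large_table"  # hypothetical table

def timed_run(cur, sql):
    # Execute the statement, fetch all rows, and report elapsed seconds.
    start = time.perf_counter()
    cur.execute(sql)
    rows = cur.fetchall()
    return rows, time.perf_counter() - start

cur = conn.cursor()
try:
    _, first = timed_run(cur, QUERY)   # computed on the virtual warehouse
    _, second = timed_run(cur, QUERY)  # typically served from the result cache
    print(f"first run: {first:.2f}s, second run: {second:.2f}s")
finally:
    cur.close()
    conn.close()

On an unchanged table, the second run usually returns almost instantly because the service layer can answer it from the cached result rather than re-executing the query.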

Awesome ML interpretability resources

A curated list of awesome machine learning interpretability resources. Categories include: Comprehensive Software Examples and Tutorials; Explainability- or Fairness-Enhancing Software Packages (Browser, Python, R); Free Books; Other Interpretability and Fairness Resources and Lists; Review and General Papers; Limitations of Interpretability; Teaching Resources; and Interpretable ("Whitebox") or Fair Modeling Packages (C/C++, Python, R).
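As a small taste of the kind of Python tooling such a list covers, here is a sketch of one common model-agnostic interpretability technique, permutation importance, using scikit-learn. The dataset and model choice are illustrative assumptions, not a recommendation from the list itself.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Illustrative dataset and model; any fitted estimator works the same way.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much test accuracy drops.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Print the five features whose shuffling hurts the model the most.
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"{X.columns[idx]}: {result.importances_mean[idx]:.4f}")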