Posts

Showing posts with the label Best Practices

Data Reliability at Scale: How Fox Digital Architected its Modern Data Stack

As distributed architectures continue to become the new gold standard for data-driven organizations, this kind of self-serve motion would be a dream come true for many data leaders. So when the Monte Carlo team got the chance to sit down with Alex, we took a deep dive into how he made it happen. Here’s how his team architected a hybrid data architecture that prioritizes democratization and access, while ensuring reliability and trust at every turn. Exercise “controlled freedom” when dealing with stakeholders: Alex has built decentralized access to data at Fox on a foundation he calls “controlled freedom.” In fact, he believes using your data team as the single source of truth within an organization actually creates the biggest silo. So instead of becoming a guardian and bottleneck, Alex and his data team focus on setting certain parameters around how data is ingested and supplied to stakeholders. Within the framework, internal data consumers at Fox have the freedom to cr...
Over the past few years, companies have been shifting their data and applications to the cloud on a massive scale, which has raised a whole community of data users. They are encouraged to capture, gather, analyze, and save data for business insights and decision-making. More organizations are moving toward multi-cloud, and the threat of losing data or failing to secure it has become challenging. Managing security policies, rules, metadata details, and content traits is therefore becoming critical in the multi-cloud. Enterprises are consequently searching for expertise and cloud tool vendors capable of providing the fundamental cloud security data governance competencies with excellence. Start by building policies and writing them into code, or scripts that can be executed. This requires compliance and cloud security experts working together to build a framework for your complex business. You cannot start from scratch, as it will be error-prone and take too long. Try to in...
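To make the "write policies into code" advice concrete, here is a minimal, hypothetical policy-as-code sketch in Python; the resource fields and policy names are invented for illustration and not tied to any particular cloud vendor or tool.

```python
# Minimal policy-as-code sketch (resource fields and policy names are hypothetical).
# Each policy is a function that inspects a resource's metadata dict and returns
# a violation message, or None if the resource is compliant.

def require_encryption_at_rest(resource):
    if not resource.get("encrypted", False):
        return "storage must be encrypted at rest"

def forbid_public_access(resource):
    if resource.get("public", False):
        return "resource must not be publicly accessible"

POLICIES = [require_encryption_at_rest, forbid_public_access]

def evaluate(resources):
    """Run every policy against every resource and collect violations."""
    violations = []
    for resource in resources:
        for policy in POLICIES:
            message = policy(resource)
            if message:
                violations.append((resource["id"], policy.__name__, message))
    return violations

if __name__ == "__main__":
    inventory = [
        {"id": "bucket-analytics", "encrypted": True, "public": False},
        {"id": "bucket-exports", "encrypted": False, "public": True},
    ]
    for resource_id, policy_name, message in evaluate(inventory):
        print(f"{resource_id}: {policy_name}: {message}")
```

In practice the inventory would be pulled from the cloud providers' APIs and the checks run on a schedule, but the shape of the framework stays the same: policies as versioned code, evaluated against resource metadata.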

Turbocharging Analytics at Uber with Data Science Workbench

Millions of Uber trips take place each day across nearly 80 countries, generating information on traffic, preferred routes, estimated times of arrival/delivery, drop-off locations, and more that enables us to facilitate better experiences for users. To make our data exploration and analysis more streamlined and efficient, we built Uber’s data science workbench (DSW), an all-in-one toolbox for interactive analytics and machine learning that leverages aggregate data. DSW centralizes everything a data scientist needs to perform data exploration, data preparation, ad-hoc analyses, model exploration, workflow scheduling, dashboarding, and collaboration in a single-pane, web-based graphical user interface (GUI). Leveraged by data science, engineering, and operations teams across the company, DSW has quickly scaled to become Uber’s go-to data analytics solution. Current DSW use cases include pricing, safety, fraud detection, and navigation, among other foundational elements of the trip experi...

Shopify's approach to data discovery

Humans generate a lot of data. Every two days we create as much data as we did from the beginning of time until 2003! The International Data Corporation estimates the global datasphere totaled 33 zettabytes (one trillion gigabytes each) in 2018. The estimate for 2025 is 175 ZB, an increase of 430%. This growth is challenging organizations across all industries to rethink their data pipelines. The nature of data usage is problem-driven, meaning data assets (tables, reports, dashboards, etc.) are aggregated from underlying data assets to help make decisions about a particular business problem, feed a machine learning algorithm, or serve as an input to another data asset. This process is repeated multiple times, sometimes for the same problems, and results in a large number of data assets serving a wide variety of purposes. Data discovery and management is the practice of cataloguing these data assets and all of the applicable metadata, which saves time for data professionals, increasing data ...
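As a rough sketch of what cataloguing a data asset and its metadata can look like, the following Python example uses invented field names; it is not Shopify's actual schema.

```python
# Illustrative data-catalog record (field names are hypothetical, not Shopify's schema).
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataAsset:
    name: str                      # e.g. "orders_daily_summary"
    asset_type: str                # "table", "report", "dashboard", ...
    owner: str                     # team or person responsible
    description: str               # what question this asset answers
    upstream: List[str] = field(default_factory=list)   # assets it is derived from
    tags: List[str] = field(default_factory=list)       # search keywords

catalog = [
    DataAsset(
        name="orders_daily_summary",
        asset_type="table",
        owner="merchant-analytics",
        description="Daily order counts and GMV per shop.",
        upstream=["raw_orders", "raw_shops"],
        tags=["orders", "gmv", "daily"],
    ),
]

# Discovery is then just a search over the metadata, not over the data itself.
def search(catalog, keyword):
    keyword = keyword.lower()
    return [
        a for a in catalog
        if keyword in a.name.lower()
        or keyword in a.description.lower()
        or any(keyword in t.lower() for t in a.tags)
    ]

print([a.name for a in search(catalog, "gmv")])
```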

The Ins and Outs of Data Acquisition: Beliefs and Best Practices

Data acquisition involves the set of activities that are required to qualify and obtain external data — and also data that may be available elsewhere in an organization — and then to arrange for it to be brought into or accessed by the company. This strategy is on the rise as organizations leverage this data to get access to prospects, learn information about customers they already work with, be more competitive, develop new products and more. Standardizing data acquisition can be an afterthought at the tail-end of a costly data journey — if it’s considered at all. Now we see companies starting to pay more attention to the finer, critical points of data acquisition as the need for more data grows. And this is a good thing because ignoring acquisition best practices and proper oversight will lead to a whole host of problems that can outweigh the benefits of bringing in new data. The costly, problematic issues we see organizations grapple with include: Data purchases that, once brought i...

What is Microsoft's Team Data Science Process?

The Team Data Science Process (TDSP) is an agile, iterative data science methodology for delivering predictive analytics solutions and intelligent applications efficiently. TDSP helps improve team collaboration and learning by suggesting how team roles work best together. TDSP includes best practices and structures from Microsoft and other industry leaders to help teams implement data science initiatives successfully. The goal is to help companies fully realize the benefits of their analytics program. This article provides an overview of TDSP and its main components. We provide a generic description of the process here that can be implemented with different kinds of tools. A more detailed description of the project tasks and roles involved in the lifecycle of the process is provided in additional linked topics. Guidance on how to implement the TDSP using a specific set of Microsoft tools and infrastructure that we use in our teams is also provi...

Operating a Large, Distributed System in a Reliable Way

"The article is the collection of the practices I've found useful to reliably operate a large system at Uber, while working here. My experience is not unique - people working on similar sized systems go through a similar journey. I've talked with engineers at Google, Facebook, and Netflix, who shared similar experiences and solutions. Many of the ideas and processes listed here should apply to systems of similar scale, regardless of running on own data centers (like Uber mostly does) or on the cloud (where Uber sometimes scales to). However, the practices might be an overkill for smaller or less mission-critical systems." There's much ground to cover: Monitoring Oncall, Anomaly Detection & Alerting Outages & Incident Management Processes Postmortems, Incident Reviews & a Culture of Ongoing Improvements Failover Drills, Capacity Planning & Blackbox Testing SLOs, SLAs & Reporting on Them SRE as an Independent Team Reliability as an...

Deep dive into how Uber uses Spark

Apache Spark is a foundational piece of Uber’s Big Data infrastructure that powers many critical aspects of our business. We currently run more than one hundred thousand Spark applications per day, across multiple different compute environments. Spark’s versatility, which allows us to build applications and run them everywhere that we need, makes this scale possible. However, our ever-growing infrastructure means that these environments are constantly changing, making it increasingly difficult for both new and existing users to give their applications reliable access to data sources, compute resources, and supporting tools. Also, as the number of users grow, it becomes more challenging for the data team to communicate these environmental changes to users, and for us to understand exactly how Spark is being used. We built the Uber Spark Compute Service (uSCS) to help manage the complexities of running Spark at this scale. This Spark-as-a-service solution leverages Apache Livy, cu...
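Since uSCS builds on Apache Livy, a bare-bones sketch of submitting a Spark batch job through Livy's REST API looks roughly like the following; the host, jar path, and class name are placeholders, and this is plain Livy rather than the uSCS interface described in the article.

```python
# Minimal sketch of submitting a Spark batch job through Apache Livy's REST API.
# The Livy host, application jar, and class name below are placeholders.
import time
import requests

LIVY_URL = "http://livy.example.com:8998"   # placeholder host

payload = {
    "file": "hdfs:///apps/example/my-spark-app.jar",   # placeholder jar
    "className": "com.example.MySparkJob",             # placeholder main class
    "args": ["2024-01-01"],
    "conf": {"spark.executor.memory": "4g"},
}

resp = requests.post(f"{LIVY_URL}/batches", json=payload)
resp.raise_for_status()
batch_id = resp.json()["id"]

# Poll until the batch reaches a terminal state.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}").json()["state"]
    print("batch", batch_id, "state:", state)
    if state in ("success", "dead", "killed"):
        break
    time.sleep(10)
```

A service layer like uSCS can sit in front of this kind of call to inject the right compute environment, credentials, and Spark configuration for each application.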

Understanding Apache Spark Failures and Bottlenecks

Apache Spark is a powerful open-source distributed computing framework for scalable and efficient analysis of big data on commodity compute clusters. Spark provides a framework for programming entire clusters with built-in data parallelism and fault tolerance while hiding the underlying complexities of distributed systems. Spark has seen a massive spike in adoption by enterprises across a wide swath of verticals, applications, and use cases. Spark provides speed (up to 100x faster in-memory execution than Hadoop MapReduce) and easy access to all Spark components (write apps in R, Python, Scala, and Java) via unified high-level APIs. Spark also handles a wide range of workloads (ETL, BI, analytics, ML, graph processing, etc.) and performs interactive SQL queries, batch processing, streaming data analytics, and data pipelines. Spark is also replacing MapReduce as the processing engine component of Hadoop. Spark applications are easy to write and easy to understa...
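As a small illustration of the unified high-level API mentioned above, here is a minimal PySpark job that mixes the DataFrame API with an ad-hoc SQL query; the input path and column names are made up.

```python
# Minimal PySpark sketch of the unified high-level API: DataFrame ops plus SQL.
# The input path and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

events = spark.read.json("hdfs:///data/events/2024-01-01/")   # placeholder path
events.createOrReplaceTempView("events")

# The same engine serves both the DataFrame API and ad-hoc SQL.
daily_counts = spark.sql("""
    SELECT event_type, COUNT(*) AS n
    FROM events
    GROUP BY event_type
    ORDER BY n DESC
""")
daily_counts.show()

spark.stop()
```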

Microsoft's best practices in software engineering for machine learning

This paper explains best practices that Microsoft teams discovered and compiled in creating large-scale AI solutions for the marketplace.

How Facebook Scales Machine Learning

The software and hardware considerations Facebook made to successfully scale its AI/ML infrastructure, as covered in an excellent talk given by Yangqing Jia, Facebook's Director of AI Infrastructure, at the Scaled Machine Learning Conference. Watch Video >>> Full Article >>>

Facebook Marketplace powered by artificial intelligence

Facebook Marketplace was introduced in 2016 as a place for people to buy and sell items within their local communities. Today in the U.S., more than one in three people on Facebook use Marketplace, buying and selling products in categories ranging from cars to shoes to dining tables. Managing the posting and selling of that volume of products with speed and relevance is a daunting task, and the fastest, most scalable way to handle that is to incorporate custom AI solutions. On Marketplace’s second anniversary, we are sharing how we use AI to power it. Whether someone is discovering an item to buy, listing a product to sell, or communicating with a buyer or seller, AI is behind the scenes making the experience better. In addition to the product index and content retrieval systems, which leverage our AI-based computer vision and natural language processing (NLP) platforms, we recently launched some new features that make the process simpler for both buyers and sellers. Multimodal...

27 Great Resources About Logistic Regression

27 Great Resources About Logistic Regression:
Customer Churn – Logistic Regression with R
Predicting Flights Delay Using Supervised Learning, Logistic Regression
Logistic Regression vs Decision Trees vs SVM: Part II
Logistic Regression Vs Decision Trees Vs SVM: Part I
Making data science accessible – Logistic Regression
Logistic Regression using python
Logistic Regression and Maximum Entropy explained with examples
Decision tree vs Logistic Regression
Excluding variables from a logistic regression model based on correlation
Regression, Logistic Regression and Maximum Entropy
Oversampling/Undersampling in Logistic Regression
Fraud Detection using logistic regression
Explaining variability in logistic regression
Handling Imbalanced data when building regression models
Multiple logistic Regression Power Analysis
Model Accuracy - In logistic Regression
Outliers in Logistic Regression
Logistic Regression - Hosmer Lemeshow test
Logistic regression intercept term n...

Analytics maturity powers company performance

The fact that we believe analytics drive performance isn’t enough. This report by David Alles (International Institute for Analytics) provides a range of supporting evidence, using IIA’s proprietary analytics maturity data – from 74 leading companies like Amazon, Apple, Netflix and Google – and publicly available financial and company data, to illustrate the positive association between analytics maturity and superior company performance. Details: http://bit.ly/2tba3u2

Gartner - 2017 Market Guide for Asset Performance Management

CIOs in utilities and other asset-intensive organizations can use this research to support the development of enterprise APM strategies. APM is a key element of the foundational technology that can help their organizations achieve higher levels of operational reliability, safety and efficiency. Key Findings:
Asset performance management (APM) solutions are widening in scope and decreasing in deployment cost due to market acceptance, increasing competition and maturation of enabling technologies such as advanced analytics, algorithms, cloud and the Internet of Things (IoT).
As APM solutions mature and cloud deployment increases, asset management will become a more collaborative process. Activities will be shared among asset owners, operators, service providers and OEMs.
Asset management strategies are beginning to shift from preventive to predictive, driven by innovation in enabling technologies and streamlined access to consistent operational technology (OT) data resulting from I...

Enterprise data integration with an operational data hub

Big data (often called NoSQL) technologies facilitate the ingestion, processing, and search of data with no regard to schema (database structure). Web companies such as Google, LinkedIn, and Facebook use big data technologies to process the tremendous amount of data from every possible source without regard to structure, and offer a searchable interface to access it. Modern NoSQL technologies have evolved to offer capabilities to govern, process, secure, and deliver data, and have facilitated the development of an integration pattern called the operational data hub (ODH). The Centers for Medicare and Medicaid Services (CMS) and other organizations (public and private) in the health, finance, banking, entertainment, insurance, and defense sectors (amongst others) utilize the capabilities of ODH technologies for enterprise data integration. This gives them the ability to access, integrate, master, process, and deliver data across the enterprise. Traditional mode...
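A very rough sketch of the idea behind an operational data hub, ingesting schema-less records and wrapping them in a governance envelope, might look like the following Python; all field names are invented for illustration and no particular ODH product is implied.

```python
# Rough sketch of an operational-data-hub envelope: heterogeneous records are
# ingested as-is and wrapped with provenance and governance metadata.
# All field names are invented for illustration.
import json
import uuid
from datetime import datetime, timezone

def ingest(record, source, sensitivity="internal"):
    """Wrap any schema-less record with provenance and governance metadata."""
    return {
        "hub_id": str(uuid.uuid4()),
        "source_system": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "sensitivity": sensitivity,   # drives access-control policies downstream
        "payload": record,            # original document, untouched
    }

hub = [
    ingest({"claim_id": "C-1", "amount": 120.5}, source="claims-system", sensitivity="restricted"),
    ingest({"member": "M-9", "plan": "gold"}, source="enrollment-feed"),
]

# Downstream delivery can filter on the governance envelope, not the payload schema.
deliverable = [doc for doc in hub if doc["sensitivity"] != "restricted"]
print(json.dumps(deliverable, indent=2))
```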

Gartner - Analytics Center of Excellence Capabilities

The analytics center of excellence has a new mandate for making the entire organization proficient in generating and leveraging automated insights. Data and analytics leaders should include a broad spectrum of organizational, project, data, educational and technological capabilities in their ACE. Key Challenges:
Data and analytics leaders often struggle to define, establish and communicate the range of analytics capabilities their teams can offer the organization.
IT leaders such as CIOs regularly contend that they want to "get out of the report writing business" and expand the notion of analytics from BI application development to enable the entire organization to benefit from data and analytics.
Tactical business intelligence competency centers (BICCs), having formed within and emerged from IT organizations, are too limited in scope and technology-focused to provide broad-spectrum analytic enablement.
Leading enterprises in most industries generally are more successf...

Top 5 Mistakes to Avoid When Writing Apache Spark Applications

Be careful in managing the DAG. People often make mistakes in DAG control; to avoid them, do the following. Prefer reduceByKey over groupByKey: both can perform similar functions, but groupByKey shuffles the entire data set while reduceByKey combines values on each partition first, so use reduceByKey wherever possible. Stay away from shuffles as much as possible: keep the map-side output as small as possible, don't waste time over-partitioning, avoid unnecessary shuffles, and keep away from skewed keys and partitions. Prefer treeReduce over reduce: treeReduce performs more of the aggregation work on the executors instead of pulling everything back to the driver. Maintain the required size of the shuffle blocks: in a shuffle operation, the task that emits the data in the source executor is the "mapper", the task that consumes the data in the target executor is the "reducer", an...
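To illustrate the reduceByKey-versus-groupByKey advice, here is a minimal PySpark word count written both ways; the input path is a placeholder.

```python
# Word count two ways: reduceByKey combines values on each partition before the
# shuffle, while groupByKey ships every (word, 1) pair across the network first.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reduce-vs-group").getOrCreate()
sc = spark.sparkContext

# Build (word, 1) pairs from a text file; the path is a placeholder.
pairs = (
    sc.textFile("hdfs:///data/corpus.txt")
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
)

# Preferred: map-side combine keeps the shuffled data small.
counts_reduce = pairs.reduceByKey(lambda a, b: a + b)

# Works, but shuffles every single pair and can exhaust executor memory on hot keys.
counts_group = pairs.groupByKey().mapValues(lambda ones: sum(ones))

print(counts_reduce.take(5))
spark.stop()
```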

Adopting Self-Service BI with Tableau - Notes from the field

(originally this article was created and posted by me on March 7, 2016 at datasciencecentral.com; I am now transferring it here) I have spent many hours planning and executing an in-company self-service BI implementation, which enabled me to gain several insights. Now that the ideas have become mature enough and field-proven, I believe they are worth sharing. No matter how far you are in toying with potential approaches (possibly you are already in the thick of it!), I hope my attempt at describing feasible scenarios provides a decent foundation. All scenarios presume that IT plays its main role by owning the infrastructure and managing scalability, data security, and governance. Scenario 1. Tableau Desktop + departmental/cross-functional data schemas. This scenario involves data analysts gaining insights on a daily basis. They might be either independent individuals or a team. Business users' interaction with published workbooks is applicable, but limited to simple filterin...