Posts

What is Microsoft's Team Data Science Process?

The Team Data Science Process (TDSP) is an agile, iterative data science methodology for delivering predictive analytics solutions and intelligent applications efficiently. TDSP helps improve team collaboration and learning by suggesting how team roles work best together. TDSP includes best practices and structures from Microsoft and other industry leaders to support the successful implementation of data science initiatives. The goal is to help companies fully realize the benefits of their analytics programs. This article provides an overview of TDSP and its main components. We provide a generic description of the process here that can be implemented with different kinds of tools. A more detailed description of the project tasks and roles involved in the lifecycle of the process is provided in additional linked topics. Guidance on how to implement the TDSP using a specific set of Microsoft tools and infrastructure that we use to implement the TDSP in our teams is also provi...

Announcing the dbt IDE: orchestrate the entire analytics engineering workflow in your browser

Today we released the dbt Integrated Developer Environment (IDE) into general availability in dbt Cloud. With the IDE, you can build, run, test, and version control dbt projects from your browser. There’s no wrestling with pip, Homebrew, hidden files in your home directory, or coordinating upgrades across large teams. If you haven’t already, be sure to check out Tristan’s post on why we built the IDE and why we think it’s such a meaningful development in the analytics engineering space. Otherwise, read on to learn more about what you can do with the IDE and what’s next for dbt Cloud. Read full post >>>

2019 Was the Year Data Visualization Hit the Mainstream

There’s always something going on in the field of data visualization, but until recently it was only something that people in the field noticed. To the outside world, beyond perhaps an occasional Amazing Map®, Tufte workshop, or funny pie chart, these trends were invisible. Not so in 2019, when data visualization featured prominently in major news stories and key players in the field created work that didn’t just do well on Dataviz Twitter but everywhere. 2019 saw the United States President amend a data visualization product with a Sharpie. That should have been enough to make 2019 special, but the year also saw the introduction of a data visualization-focused fashion line, a touching book that uses data visualization to express some of the anxieties and feelings we all struggle with, and the creation of the first holistic professional society focused on data visualization. Original Article >>>

Why Not Airflow? An overview of the Prefect engine for Airflow users

Airflow is a historically important tool in the data engineering ecosystem. It introduced the ability to combine a strict Directed Acyclic Graph (DAG) model with Pythonic flexibility in a way that made it appropriate for a wide variety of use cases. However, Airflow’s applicability is limited by its legacy as a monolithic batch scheduler aimed at data engineers principally concerned with orchestrating third-party systems employed by others in their organizations. Today, many data engineers are working more directly with their analytical counterparts. Compute and storage are cheap, so friction is low and experimentation prevails. Processes are fast, dynamic, and unpredictable. Airflow got many things right, but its core assumptions never anticipated the rich variety of data applications that has emerged. It simply does not have the requisite vocabulary to describe many of those activities. The seed that would grow into Prefect was first planted all the way back in 2016, in a seri...
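The strict DAG model mentioned above can be illustrated with a toy scheduler. This is not Airflow's or Prefect's API, just a minimal sketch using Python's standard library; the task names and the dependency graph are hypothetical:

```python
# Toy illustration of strict-DAG execution (not Airflow's or Prefect's
# API): tasks run only after all of their upstream dependencies finish.
from graphlib import TopologicalSorter

# Each key lists the tasks it depends on: "transform" runs only after
# "extract" has completed, and so on. These names are invented.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

def run(dag):
    """Run tasks in a valid topological order and return that order."""
    order = list(TopologicalSorter(dag).static_order())
    for task in order:
        pass  # a real scheduler would invoke the task's callable here
    return order

print(run(dag))  # ['extract', 'transform', 'load', 'report']
```

Because this graph is a simple chain, the topological order is unique; with branching dependencies a scheduler could run independent tasks in parallel, which is where the batch-scheduler assumptions discussed above start to matter.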

Researchers love PyTorch and TensorFlow

In a recent survey—AI Adoption in the Enterprise, which drew more than 1,300 respondents—we found significant usage of several machine learning (ML) libraries and frameworks. About half indicated they used TensorFlow or scikit-learn, and a third reported they were using PyTorch or Keras. I recently attended an interesting RISELab presentation delivered by Caroline Lemieux describing recent work on AutoPandas and automation tools that rely on program synthesis. In the course of her presentation, Lemieux reviewed usage statistics they had gathered on different deep learning frameworks and data science libraries. She kindly shared some of that data with me, which I used to draw this chart: The numbers are based on simple full-text searches of papers posted on the popular e-print service arXiv.org. Specifically, they reflect the number of papers that mention each of the frameworks in a full-text search. Using this metric, the two most popular deep learning frameworks among resear...
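The counting metric described above can be sketched in a few lines, assuming the paper full texts are already available as strings (the arXiv retrieval step is omitted, and the sample texts below are invented):

```python
# Sketch of the metric: count how many papers mention each framework
# at least once in a full-text search. The sample texts are invented.
def count_paper_mentions(papers, frameworks):
    """Return {framework: number of papers whose text mentions it}."""
    counts = {name: 0 for name in frameworks}
    for text in papers:
        lowered = text.lower()
        for name in frameworks:
            # A paper counts once per framework, however often it
            # repeats the name.
            if name.lower() in lowered:
                counts[name] += 1
    return counts

papers = [
    "We implement our model in PyTorch and compare against TensorFlow.",
    "All experiments use TensorFlow 1.14.",
    "Training code is written in PyTorch.",
]
print(count_paper_mentions(papers, ["PyTorch", "TensorFlow"]))
# {'PyTorch': 2, 'TensorFlow': 2}
```

As the article notes, this is a deliberately simple proxy for popularity: a single passing mention counts the same as a paper built entirely on one framework.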

Operating a Large, Distributed System in a Reliable Way

"The article is a collection of practices I've found useful for reliably operating a large system at Uber. My experience is not unique - people working on similarly sized systems go through a similar journey. I've talked with engineers at Google, Facebook, and Netflix, who shared similar experiences and solutions. Many of the ideas and processes listed here should apply to systems of similar scale, regardless of whether they run on their own data centers (as Uber mostly does) or in the cloud (where Uber sometimes scales to). However, the practices might be overkill for smaller or less mission-critical systems." There's much ground to cover:

- Monitoring
- Oncall, Anomaly Detection & Alerting
- Outages & Incident Management Processes
- Postmortems, Incident Reviews & a Culture of Ongoing Improvements
- Failover Drills, Capacity Planning & Blackbox Testing
- SLOs, SLAs & Reporting on Them
- SRE as an Independent Team
- Reliability as an...
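As a small illustration of the "SLOs, SLAs & Reporting on Them" topic, here is a hedged sketch of the arithmetic that turns an availability SLO into a monthly error budget. The function name and the 30-day month are assumptions for the example, not from the article:

```python
# Hypothetical sketch: converting an availability SLO into a monthly
# error budget (allowed downtime). A 30-day month is assumed.
def monthly_error_budget_minutes(slo_percent, days=30):
    """Allowed downtime per month, in minutes, for a given SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

print(monthly_error_budget_minutes(99.9))   # about 43.2 minutes
print(monthly_error_budget_minutes(99.99))  # about 4.3 minutes
```

This kind of budget is what makes SLO reporting actionable: once the month's downtime exceeds the budget, reliability work takes priority over new features.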

The Apache Spark 3.0 Preview is here!

To enable wide-scale community testing of the upcoming Spark 3.0 release, the Apache Spark community has posted a preview release of Spark 3.0. This preview is not a stable release in terms of either API or functionality, but it is meant to give the community early access to try the code that will become Spark 3.0. If you would like to test the release, please download it and send feedback using either the mailing lists or JIRA. The Spark issue tracker already contains a list of features in 3.0.

Azure SQL Data Warehouse is now Azure Synapse Analytics

On November 4th, we announced Azure Synapse Analytics, the next evolution of Azure SQL Data Warehouse. Azure Synapse is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources—at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. With Azure Synapse, data professionals can query both relational and non-relational data using the familiar SQL language. This can be done using either serverless on-demand queries for data exploration and ad hoc analysis or provisioned resources for your most demanding data warehousing needs. A single service for any workload. In fact, it’s the first and only analytics system to have run all the TPC-H queries at petabyte scale. For current SQL Data Warehouse custome...

Neo4j Aura: A New Graph Database as a Service

Aura is an entirely new, built-from-the-ground-up, multi-tenant graph DBaaS based on Neo4j. It lets any developer take advantage of the best graph database in the world via a frictionless service in the cloud. When we began building Neo4j all those years ago, we wanted to give developers a database that was very powerful, flexible… and accessible to all. We believed that open source was the best way to bring this product to developers worldwide. Since then, the vast majority of our paying customers have started out with data practitioners leading the way. Individual developers downloaded Neo4j, experimented with it, and realized graphs were an ideal way to model and traverse connected data. However, only a few of those developers had direct access to a budget to make the leap to our Enterprise Edition. Neo4j Aura bridges that gap for individuals, small teams, and established startups. I believe this is the next logical step in Neo4j’s vision to help the world make sense of data....

CSA - Cloud Controls Matrix v3

The CCM is the only meta-framework of cloud-specific security controls, mapped to leading standards, best practices, and regulations. CCM provides organizations with the needed structure, detail, and clarity relating to information security tailored to cloud computing. CCM is currently considered a de facto standard for cloud security assurance and compliance. Download >>>