Posts

Showing posts from June, 2019

The big interview with Martin Kleppmann: “Figuring out the future of distributed data systems”

Dr. Martin Kleppmann is a researcher in distributed systems at the University of Cambridge, and the author of the highly acclaimed “Designing Data-Intensive Applications” (O’Reilly Media, 2017). Kevin Scott, CTO at Microsoft, once said: “This book should be required reading for software engineers. Designing Data-Intensive Applications is a rare resource that connects theory and practice to help developers make smart decisions as they design and implement data infrastructure and systems.” Martin’s main research interests include collaboration software, CRDTs, and formal verification of distributed algorithms. Previously he was a software engineer and an entrepreneur at several Internet companies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure. Vadim Tsesko (@incubos) is a lead software engineer at Odnoklassniki who works on the Core Platform team. Vadim’s scientific and engineering interests include distributed systems, data warehouses and verification o

Model governance and model operations: building and deploying robust, production-ready machine learning models

O'Reilly's surveys over the past couple of years have shown growing interest in machine learning (ML) among organizations from diverse industries. A few factors are contributing to this strong interest in implementing ML in products and services. First, the machine learning community has conducted groundbreaking research in many areas of interest to companies, and much of this research has been conducted out in the open via preprints and conference presentations. We are also beginning to see researchers share sample code written in popular open source libraries, and some even share pre-trained models. Organizations now also have more use cases and case studies from which to draw inspiration—no matter what industry or domain you are interested in, chances are there are many interesting ML applications you can learn from. Finally, modeling tools are improving, and automation is beginning to allow new users to tackle problems that used to be the province of experts. With the s
Here’s a curated list of resources for data engineers, with sections for algorithms and data structures, SQL, databases, programming, tools, distributed systems, and more.

Useful articles
- The AI Hierarchy of Needs
- The Rise of Data Engineer
- The Downfall of the Data Engineer
- A Beginner’s Guide to Data Engineering: Part I, Part II, Part III
- Functional Data Engineering — a modern paradigm for batch data processing
- How to become a Data Engineer (in Russian)

Talks
- Data Engineering Principles - Build frameworks not pipelines by Gatis Seja
- Functional Data Engineering - A Set of Best Practices by Maxime Beauchemin
- Advanced Data Engineering Patterns with Apache Airflow by Maxime Beauchemin
- Creating a Data Engineering Culture by Jesse Anderson

Algorithms & Data Structures
- Algorithmic Toolbox in Russian
- Data Structures in Russian
- Data Structures & Algorithms Specialization on Coursera
- Algorithms Specialization from Stanford on Coursera

SQL
Com

Test data quality at scale with AWS Deequ

You generally write unit tests for your code, but do you also test your data? Incorrect or malformed data can have a large impact on production systems. Examples of data quality issues are:
- Missing values can lead to failures in production systems that require non-null values (NullPointerException).
- Changes in the distribution of data can lead to unexpected outputs of machine learning models.
- Aggregations of incorrect data can lead to wrong business decisions.
In this blog post, we introduce Deequ, an open source tool developed and used at Amazon. Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look. Deequ supports you by suggesting checks for you. Deequ is implemented on top of Apache Spark and is designed to scale with large datasets (th
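To make the idea of "unit tests for data" concrete, here is a minimal sketch in plain Python. It is not Deequ's API and does not use Spark; the helper names (`completeness`, `is_unique`, `run_checks`) and the sample `orders` rows are hypothetical, chosen only to illustrate declaring constraints on a dataset and verifying them.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def completeness(rows: List[Row], column: str) -> float:
    """Return the fraction of rows whose `column` value is not None."""
    if not rows:
        return 1.0  # assumption: an empty dataset counts as fully complete
    present = sum(1 for row in rows if row.get(column) is not None)
    return present / len(rows)


def is_unique(rows: List[Row], column: str) -> bool:
    """Return True if no non-null value of `column` appears more than once."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))


def run_checks(
    rows: List[Row], checks: Dict[str, Callable[[List[Row]], bool]]
) -> Dict[str, bool]:
    """Evaluate each named constraint against the dataset."""
    return {name: check(rows) for name, check in checks.items()}


# Hypothetical sample data: one missing price and one duplicated id.
orders = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": None},
    {"id": 3, "price": 12.0},
    {"id": 3, "price": 4.0},
]

# Constraints describe how the data should look, not how to compute it.
checks = {
    "price is complete": lambda rows: completeness(rows, "price") == 1.0,
    "id is unique": lambda rows: is_unique(rows, "id"),
    "price is non-negative": lambda rows: all(
        row["price"] >= 0 for row in rows if row["price"] is not None
    ),
}

results = run_checks(orders, checks)
for name, passed in results.items():
    print(f"{name}: {'ok' if passed else 'FAILED'}")
```

On this sample the completeness and uniqueness constraints fail while the non-negativity constraint passes. A tool built for large datasets would compute such metrics in a distributed fashion rather than over an in-memory list.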

DataOps Principles: How Startups Do Data The Right Way

If you have been trying to harness the power of data science and machine learning but, like many teams, struggling to produce results, there’s a secret you are missing out on. All of those models and sophisticated insights require lots of good data, and the best way to get good data quickly is by using DataOps.

What is DataOps?
It’s a way of thinking about how an organization deals with data. It’s a set of tools to automate processes and empower individuals. And it’s a new DataOps Engineer role designed to make that thinking real by managing and building those tools.

DataOps Principles
DataOps was inspired by DevOps, which brought the power of agile development to operations (infrastructure management and production deployment). DevOps transformed the way that software development is done, and now DataOps is transforming the way that data management is done. For larger enterprises with a dedicated data engineering team, DataOps is about breaking down barriers and re-alig

Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (Martin Fowler)

Many enterprises are investing in their next generation data lake, with the hope of democratizing data at scale to provide business insights and ultimately make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. To address these failure modes we need to shift from the centralized paradigm of a lake, or its predecessor, the data warehouse. We need to shift to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product. Becoming a data-driven organization remains one of the top strategic goals of many companies I work with. My clients are well aware of the benefits of becoming intelligently empowered: providing the best customer experience based on data and hyper-personalization; reducing operational costs and time through data-driven optimi