Posts

Showing posts with the label Data Engineering

Data Reliability at Scale: How Fox Digital Architected its Modern Data Stack

As distributed architectures continue to become the new gold standard for data-driven organizations, this kind of self-serve motion would be a dream come true for many data leaders. So when the Monte Carlo team got the chance to sit down with Alex, we took a deep dive into how he made it happen. Here’s how his team built a hybrid data architecture that prioritizes democratization and access while ensuring reliability and trust at every turn.

Exercise “Controlled Freedom” when dealing with stakeholders

Alex has built decentralized access to data at Fox on a foundation he calls “controlled freedom.” In fact, he believes using your data team as the single source of truth within an organization actually creates the biggest silo. So instead of becoming a guardian and a bottleneck, Alex and his data team focus on setting certain parameters around how data is ingested and supplied to stakeholders. Within that framework, internal data consumers at Fox have the freedom to cr...

Emerging Architectures for Modern Data Infrastructure

As an industry, we’ve gotten exceptionally good at building large, complex software systems. We’re now starting to see the rise of massive, complex systems built around data – where the primary business value of the system comes from the analysis of data, rather than the software directly. We’re seeing quick-moving impacts of this trend across the industry, including the emergence of new roles, shifts in customer spending, and new startups providing infrastructure and tooling around data. In fact, many of today’s fastest-growing infrastructure startups build products to manage data. These systems enable data-driven decision making (analytic systems) and drive data-powered products, including with machine learning (operational systems). They range from the pipes that carry data, to storage solutions that house data, to SQL engines that analyze data, to dashboards that make data easy to understand – from data science and machine learning libraries, to automated data pipe...

Announcing Databricks Serverless SQL

Databricks SQL already provides a first-class user experience for BI and SQL directly on the data lake, and today we are excited to announce another step in making data and AI simple with Databricks Serverless SQL. This new capability for Databricks SQL provides instant compute to users for their BI and SQL workloads, with minimal management required and capacity optimizations that can lower overall cost by an average of 40%. This makes it even easier for organizations to expand adoption of the lakehouse for business analysts who are looking to access the rich, real-time datasets of the lakehouse with a simple and performant solution. Under the hood of this capability is an active server fleet, fully managed by Databricks, that can supply compute capacity to user queries, typically in about 15 seconds. The best part? You only pay for Serverless SQL when users start running reports or queries. Organizations with business analysts who want to analyze data in the data lake with t...

2021 Gartner Magic Quadrant for Data Integration Tools

Strategic Planning Assumptions

- Through 2022, manual data management tasks will be reduced by 45% through the addition of machine learning and automated service-level management.
- By 2023, AI-enabled automation in data management and integration will reduce the need for IT specialists by 20%.

Read report >>>

The DataOps Landscape

Data has emerged as a foundational asset for all organizations. Data fuels significant initiatives such as digital transformation and the adoption of analytics, machine learning, and AI. Organizations that are able to tame, manage, and unlock their data assets stand to benefit in myriad ways, including improvements to decision-making and operational efficiency, better fraud prediction and prevention, better risk management and control, and more. In addition, data products and services can often lead to new or additional revenue. As companies increasingly depend on data to power essential products and services, they are investing in tools and processes to manage essential operations and services. In this post, we describe these tools as well as the community of practitioners using them. One sign of the growing maturity of these tools and practices is that a community of engineers and developers is beginning to coalesce around the term “DataOps” (data operations). Our conver...

Visualizing Data Timeliness at Airbnb

Imagine you are a business leader ready to start your day, but you wake up to find that your daily business report is empty — the data is late, so now you are blind. Over the last year, multiple teams came together to build SLA Tracker, a visual analytics tool to facilitate a culture of data timeliness at Airbnb. This data product enabled us to address and systematize the following challenges of data timeliness:

- When should a dataset be considered late?
- How frequently are datasets late?
- Why is a dataset late?

This project is a critical part of our efforts to achieve high data quality, and it required overcoming many technical, product, and organizational challenges to build. In this article, we focus on the product design: the journey of how we designed and built data visualizations that could make sense of the deeply complex data of data timeliness. Continue reading >>>
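
As a rough illustration of the first two questions, here is a minimal sketch (not the SLA Tracker implementation; the dataset names, SLA cutoffs, and landing times below are invented) that flags late landings against a fixed SLA and computes how often each dataset misses it:

```python
from datetime import time

# Hypothetical landing times for two daily datasets; in a real system these
# would come from pipeline metadata rather than hard-coded values.
landings = {
    "bookings_daily": [time(5, 40), time(6, 10), time(5, 55), time(8, 30)],
    "listings_daily": [time(4, 50), time(5, 5), time(5, 0), time(5, 10)],
}

# Hypothetical SLA: each dataset must land by 06:00.
sla = {"bookings_daily": time(6, 0), "listings_daily": time(6, 0)}

for dataset, observed in landings.items():
    late = [t for t in observed if t > sla[dataset]]
    print(f"{dataset}: {len(late)}/{len(observed)} landings late "
          f"({100 * len(late) / len(observed):.0f}%)")
```

The third question, why a dataset is late, is the hard one: it requires tracing lateness back through upstream dependencies, which is exactly what a lineage-aware tool has to visualize.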

Data Discovery Platforms and Their Open Source Solutions

In the past year or two, many companies have shared their data discovery platforms (the latest being Facebook’s Nemo). Based on this list, we now know of more than 10 implementations. I hadn’t been paying much attention to these developments in data discovery and wanted to catch up. I was interested in:

- The questions these platforms help answer
- The features developed to answer these questions
- How they compare with each other
- What open source solutions are available

By the end of this, we’ll learn about the key features that solve 80% of data discoverability problems. We’ll also see how the platforms compare on these features, and take a closer look at the open source solutions available.

Questions we ask in the data discovery process

Before discussing platform features, let’s briefly go over some common questions in the data discovery process. Where can I find data about ____? If we don’t know the right terms, this is especially challenging. For user browsing behavior, do we search for “c...

6 Data Integration Tools Vendors to Watch in 2021

Solutions Review’s Data Integration Tools Vendors to Watch is an annual listing of solution providers we believe are worth monitoring. Companies are commonly included if they demonstrate a product roadmap aligning with our meta-analysis of the marketplace. Other criteria include recent and significant funding, talent acquisition, a disruptive or innovative new technology or product, or inclusion in a major analyst publication. Data integration tools vendors are increasingly being disrupted by cloud connectivity, self-service, and the encroachment of data management functionality. As data volumes grow, we expect to see a continued push by providers in this space to adopt core capabilities of horizontal technology sectors. Organizations are keen on adopting these changes as well, and continue to allocate resources toward the providers that can not only connect data lakes and Hadoop to their analytic frameworks, but also cleanse, prepare, and govern data. The next generation of tools will offe...

Nemo: Data discovery at Facebook

Large-scale companies serve millions or even billions of people who depend on the services these companies provide for their everyday needs. To keep these services running and delivering meaningful experiences, the teams behind them need to find the most relevant and accurate information quickly so that they can make informed decisions and take action. Finding the right information can be hard for several reasons. The problem might be discovery — the relevant table might have an obscure or nondescript name, or different teams might have constructed overlapping data sets. Or, the problem could be one of confidence — the dashboard someone is looking at might have been superseded by another source six months ago. Many companies, such as Airbnb, Lyft, Netflix, and Uber, have built their own custom solutions for this challenge. For us, it was important to make the data discovery process simple and fast. Funneling everything through data experts to locate the necessary data each time we...

The unreasonable importance of data preparation

We know data preparation requires a ton of work and thought. In this provocative article, Hugo Bowne-Anderson provides a formal rationale for why that work matters, why data preparation is particularly important for reanalyzing data, and why you should stay focused on the question you hope to answer. Along the way, Hugo introduces how tools and automation can help augment analysts and better enable real-time models. In a world focused on buzzword-driven models and algorithms, you’d be forgiven for forgetting about the unreasonable importance of data preparation and quality: your models are only as good as the data you feed them. This is the garbage in, garbage out principle: flawed data going in leads to flawed results, algorithms, and business decisions. If a self-driving car’s decision-making algorithm is trained on traffic data collected only during the day, you wouldn’t put it on the roads at night. To take it a step further, if such an algorithm is trained in an environment with car...
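
To make the garbage in, garbage out point concrete, here is a minimal, hypothetical sketch (the column names and validity rules are invented for illustration) of the kind of checks a preparation step might run before any modeling:

```python
import pandas as pd

# Hypothetical raw data; in practice this would be loaded from a file or database.
raw = pd.DataFrame({
    "speed_kmh": [42.0, -5.0, 130.0, None],   # negative and missing values are suspect
    "hour_of_day": [9, 14, 23, 11],
})

# Basic preparation: drop rows with missing or physically impossible values.
clean = raw.dropna(subset=["speed_kmh"])
clean = clean[clean["speed_kmh"].between(0, 250)]

# Check coverage of the conditions the model must handle (e.g., night-time driving).
night = clean["hour_of_day"].between(20, 23) | clean["hour_of_day"].between(0, 5)
print(f"kept {len(clean)}/{len(raw)} rows; {night.mean():.0%} are night-time samples")
```

A coverage number like the last one is what tells you the day-only training set above would be unfit for night-time roads.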

Project Hop - Exploring the future of data integration

Project Hop was announced at KCM19 back in November 2019. The first preview release has been available since April 10th. We’ve been posting about it on our social media accounts, but what exactly is Project Hop? Let’s explore the project in a bit more detail. In this post, we’ll look at what Project Hop is, why the project was started, and why know.bi wants to go all in on it.

What is Project Hop?

As the project’s tagline says, Project Hop intends to explore the future of data integration. We take that quite literally. We’ve seen massive changes in the data processing landscape over the last decade (the rise and fall of the Hadoop ecosystem, to name just one). All of these changes need to be supported and integrated into your data engineering and data processing systems. Apart from these purely technical challenges, the data processing life cycle has become a software life cycle. Robust and reliable data processing requires testing, a fast and flexible deployment...

ETL and How it Changed Over Time

Data and its usage in the modern world have changed drastically compared to a decade ago, and traditional ETL processes leave a gap when processing modern data. The following are some of the main reasons for this:

- Modern data processes often include real-time streaming data, and organizations need real-time insights into processes.
- Systems need to perform ETL on data streams without using batch processing, and they should handle high data rates by scaling.
- Some single-server databases are now replaced by distributed data platforms (e.g., Cassandra, MongoDB, Elasticsearch, SaaS apps), message brokers (e.g., Kafka, ActiveMQ), and several other types of endpoints.
- The system should have the capability to plug in additional sources or sinks on the go in a manageable way.
- Repeated data processing due to ad hoc architecture has to be eliminated.
- Change data capture technologies used with traditional ETL have to be integ...
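
To make the streaming point concrete, here is a minimal sketch of ETL applied to a stream rather than a batch, using the kafka-python client; the topic names, broker address, and record fields are invented for illustration:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Hypothetical topics and broker address, for illustration only.
consumer = KafkaConsumer(
    "orders_raw",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

# Transform each record as it arrives instead of waiting for a nightly batch job.
for record in consumer:
    order = record.value
    order["amount_usd"] = round(order["amount_cents"] / 100, 2)  # simple cleanup step
    producer.send("orders_clean", order)
```

Scaling this pattern to high data rates is then a matter of partitioning the topic and running more consumer instances, rather than rewriting the batch job.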

Announcing the dbt IDE: orchestrate the entire analytics engineering workflow in your browser

Today we released the dbt Integrated Development Environment (IDE) into general availability in dbt Cloud. With the IDE, you can build, run, test, and version control dbt projects from your browser. There’s no wrestling with pip, Homebrew, hidden files in your home directory, or coordinating upgrades across large teams. If you haven’t already, be sure to check out Tristan’s post on why we built the IDE and why we think it’s such a meaningful development in the analytics engineering space. Otherwise, read on to learn more about what you can do with the IDE and what’s next for dbt Cloud. Read full post >>>

Why Not Airflow? An overview of the Prefect engine for Airflow users

Airflow is a historically important tool in the data engineering ecosystem. It introduced the ability to combine a strict Directed Acyclic Graph (DAG) model with Pythonic flexibility in a way that made it appropriate for a wide variety of use cases. However, Airflow’s applicability is limited by its legacy as a monolithic batch scheduler aimed at data engineers principally concerned with orchestrating third-party systems employed by others in their organizations. Today, many data engineers are working more directly with their analytical counterparts. Compute and storage are cheap, so friction is low and experimentation prevails. Processes are fast, dynamic, and unpredictable. Airflow got many things right, but its core assumptions never anticipated the rich variety of data applications that has emerged. It simply does not have the requisite vocabulary to describe many of those activities. The seed that would grow into Prefect was first planted all the way back in 2016, in a seri...
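
For Airflow users curious what that flexibility looks like in code, here is a minimal sketch using the Prefect 1.x API from the era of this post; the task names and logic are invented for illustration:

```python
from prefect import task, Flow

# Hypothetical tasks; in Airflow each of these would be an operator in a DAG file.
@task
def extract():
    return [1, 2, 3]

@task
def transform(values):
    return [v * 10 for v in values]

@task
def load(values):
    print(f"loaded {values}")

# Dependencies are inferred from the data passed between tasks,
# rather than declared with explicit set_upstream/set_downstream calls.
with Flow("etl") as flow:
    load(transform(extract()))

flow.run()  # executes locally; in production an agent would pick this up
```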

Data Processing Pipeline Patterns

Data produced by applications, devices, or humans must be processed before it is consumed. By definition, a data pipeline represents the flow of data between two or more systems. It is a set of instructions that determine how and when to move data between these systems. My last blog conveyed how connectivity is foundational to a data platform. In this blog, I will describe the different data processing pipelines that leverage different capabilities of the data platform, such as connectivity and data engines for processing. There are many kinds of data processing pipelines. A pipeline may:

- “Integrate” data from multiple sources
- Perform data quality checks or standardize data
- Apply data security-related transformations, which include masking, anonymizing, or encrypting data
- Match, merge, master, and do entity resolution
- Share data with partners and customers in the required format, such as HL7

Consumers or “targets” of data pipelines may include: Data warehouses like ...
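
As a toy illustration of the quality-check and masking patterns above, a pipeline stage might look like the following sketch (the field names and rules are hypothetical):

```python
import hashlib

def check_quality(record):
    """Reject records that fail basic standards before they move downstream."""
    return record.get("email") is not None and record.get("age", 0) >= 0

def mask_pii(record):
    """Replace direct identifiers with irreversible hashes before sharing."""
    masked = dict(record)
    masked["email"] = hashlib.sha256(record["email"].encode()).hexdigest()[:12]
    return masked

records = [
    {"email": "jane@example.com", "age": 34},
    {"email": None, "age": 29},  # fails the quality check and is dropped
]

clean = [mask_pii(r) for r in records if check_quality(r)]
print(clean)
```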

The Future of Data Engineering

Data engineering’s job is to help an organization move and process data. This generally requires two different systems, broadly speaking: a data pipeline, and a data warehouse. The data pipeline is responsible for moving the data, and the data warehouse is responsible for processing it. I acknowledge that this is a bit overly simplistic. You can do processing in the pipeline itself by doing transformations between extraction and loading with batch and stream processing. The “data warehouse” now includes many storage and processing systems (Flink, Spark, Presto, Hive, BigQuery, Redshift, etc.), as well as auxiliary systems such as data catalogs, job schedulers, and so on. Still, I believe the paradigm holds. The industry is working through changes in how these systems are built and managed. There are four areas, in particular, where I expect to see shifts over the next few years:

- Timeliness: from batch to realtime
- Connectivity: from one:one bespoke integrations to many:many
- Cen...

Deep dive into how Uber uses Spark

Apache Spark is a foundational piece of Uber’s Big Data infrastructure that powers many critical aspects of our business. We currently run more than one hundred thousand Spark applications per day, across multiple different compute environments. Spark’s versatility, which allows us to build applications and run them everywhere that we need, makes this scale possible. However, our ever-growing infrastructure means that these environments are constantly changing, making it increasingly difficult for both new and existing users to give their applications reliable access to data sources, compute resources, and supporting tools. Also, as the number of users grows, it becomes more challenging for the data team to communicate these environmental changes to users, and for us to understand exactly how Spark is being used. We built the Uber Spark Compute Service (uSCS) to help manage the complexities of running Spark at this scale. This Spark-as-a-service solution leverages Apache Livy, cu...
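
Since uSCS builds on Apache Livy, which exposes Spark job submission as a REST service, a generic Livy submission (not the uSCS API; the host, file path, and configuration below are placeholders) looks roughly like this:

```python
import requests

# Placeholder Livy endpoint and application path.
livy_url = "http://livy-server:8998/batches"

payload = {
    "file": "hdfs:///apps/example/etl_job.py",  # the Spark application to run
    "args": ["2020-01-01"],
    "conf": {"spark.executor.memory": "4g"},     # per-job Spark configuration
}

# Livy accepts the job and schedules it on the cluster; the response
# contains a batch id that can be polled for state and logs.
resp = requests.post(livy_url, json=payload)
batch = resp.json()
print(batch["id"], batch["state"])
```

A service layer like uSCS can sit in front of this interface to inject the right environment-specific configuration for each submission.
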
A Curated List of Resources for Data Engineers

Here’s a curated list of resources for data engineers, with sections for algorithms and data structures, SQL, databases, programming, tools, distributed systems, and more.

Useful articles
- The AI Hierarchy of Needs
- The Rise of the Data Engineer
- The Downfall of the Data Engineer
- A Beginner’s Guide to Data Engineering (Parts I, II, and III)
- Functional Data Engineering — a modern paradigm for batch data processing
- How to become a Data Engineer (in Russian)

Talks
- Data Engineering Principles - Build frameworks not pipelines, by Gatis Seja
- Functional Data Engineering - A Set of Best Practices, by Maxime Beauchemin
- Advanced Data Engineering Patterns with Apache Airflow, by Maxime Beauchemin
- Creating a Data Engineering Culture, by Jesse Anderson

Algorithms & Data Structures
- Algorithmic Toolbox (in Russian)
- Data Structures (in Russian)
- Data Structures & Algorithms Specialization on Coursera
- Algorithms Specialization from Stanford on Coursera

SQL
- Com...

DataOps Principles: How Startups Do Data The Right Way

If you have been trying to harness the power of data science and machine learning — but, like many teams, struggling to produce results — there’s a secret you are missing out on. All of those models and sophisticated insights require lots of good data, and the best way to get good data quickly is by using DataOps. What is DataOps? It’s a way of thinking about how an organization deals with data. It’s a set of tools to automate processes and empower individuals. And it’s a new DataOps Engineer role designed to make that thinking real by managing and building those tools.

DataOps Principles

DataOps was inspired by DevOps, which brought the power of agile development to operations (infrastructure management and production deployment). DevOps transformed the way that software development is done, and now DataOps is transforming the way that data management is done. For larger enterprises with a dedicated data engineering team, DataOps is about breaking down barriers and re-...

Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (Martin Fowler)

Many enterprises are investing in their next generation data lake, with the hope of democratizing data at scale to provide business insights and ultimately make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. To address these failure modes we need to shift from the centralized paradigm of a lake, or its predecessor, the data warehouse. We need to shift to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product. Becoming a data-driven organization remains one of the top strategic goals of many companies I work with. My clients are well aware of the benefits of becoming intelligently empowered: providing the best customer experience based on data and hyper-personalization; reducing operational costs and time through data-driven optimi...