Posts

Showing posts with the label Data Lake

Emerging Architectures for Modern Data Infrastructure

As an industry, we’ve gotten exceptionally good at building large, complex software systems. We’re now starting to see the rise of massive, complex systems built around data – where the primary business value of the system comes from the analysis of data, rather than the software directly. We’re seeing quick-moving impacts of this trend across the industry, including the emergence of new roles, shifts in customer spending, and new startups providing infrastructure and tooling around data. In fact, many of today’s fastest-growing infrastructure startups build products to manage data. These systems enable data-driven decision making (analytic systems) and drive data-powered products, including with machine learning (operational systems). They range from the pipes that carry data, to storage solutions that house data, to SQL engines that analyze data, to dashboards that make data easy to understand – from data science and machine learning libraries, to automated data pipe...

2021 Gartner Magic Quadrant for Data Integration Tools

Strategic Planning Assumptions:
- Through 2022, manual data management tasks will be reduced by 45% through the addition of machine learning and automated service-level management.
- By 2023, AI-enabled automation in data management and integration will reduce the need for IT specialists by 20%.

Read report >>>

Data Mesh Principles and Logical Architecture v2

Our aspiration to augment and improve every aspect of business and life with data demands a paradigm shift in how we manage data at scale. While the technology advances of the past decade have addressed the scale of data volume and data processing compute, they have failed to address scale in other dimensions: changes in the data landscape, proliferation of data sources, diversity of data use cases and users, and speed of response to change. Data mesh addresses these dimensions, founded in four principles: domain-oriented decentralized data ownership and architecture, data as a product, self-serve data infrastructure as a platform, and federated computational governance. Each principle drives a new logical view of the technical architecture and organizational structure. The original writeup, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh - which I encourage you to read before joining me back here - empathized with today’s pain points of architectural and or...

Dremio 4.8 is released

Today we are excited to announce the release of Dremio 4.8! This month’s release delivers multiple features such as external query, a new authorization service API, AWS Edition enhancements and more. This blog post highlights the following updates:
- External query
- Default reflections
- Runtime filtering GA
- Documented JMX metrics and provided sample exporters
- Ability to customize projects in Dremio AWS Edition
- Support for Dremio AWS Edition deployments without public IP addresses

Read full article >>>

Data Processing Pipeline Patterns

Data produced by applications, devices, or humans must be processed before it is consumed. By definition, a data pipeline represents the flow of data between two or more systems. It is a set of instructions that determine how and when to move data between these systems. My last blog conveyed how connectivity is foundational to a data platform. In this blog, I will describe the different data processing pipelines that leverage different capabilities of the data platform, such as connectivity and data engines for processing. There are many data processing pipelines. One may:
- “Integrate” data from multiple sources
- Perform data quality checks or standardize data
- Apply data security-related transformations, which include masking, anonymizing, or encryption
- Match, merge, master, and do entity resolution
- Share data with partners and customers in the required format, such as HL7

Consumers or “targets” of data pipelines may include:
- Data warehouses like ...
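As a minimal sketch of one such pipeline step (not from the original post; the field names, masking rule, and quality check below are hypothetical), the following Python fragment reads records from a source, drops rows that fail a basic quality check, masks a sensitive identifier, and yields rows ready for a downstream target such as a warehouse loader.

```python
import hashlib
from typing import Iterable, Iterator


def mask(value: str, salt: str = "pipeline-salt") -> str:
    """Irreversibly mask a sensitive value with a salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def quality_check(record: dict) -> bool:
    """Basic data quality rule: required fields must be present and non-empty."""
    return bool(record.get("patient_id")) and bool(record.get("event_type"))


def transform(records: Iterable[dict]) -> Iterator[dict]:
    """Pipeline step: validate, then mask the identifier before it leaves the source zone."""
    for record in records:
        if not quality_check(record):
            continue  # a real pipeline would route this to a reject/quarantine target
        yield {**record, "patient_id": mask(record["patient_id"])}


if __name__ == "__main__":
    source = [
        {"patient_id": "A-1001", "event_type": "admission"},
        {"patient_id": "", "event_type": "discharge"},  # fails the quality check
    ]
    for row in transform(source):
        print(row)  # in practice, written to a warehouse, data lake, or shared feed
```

Keeping each step a plain generator makes it easy to chain further transformations, or to swap the in-memory source for a file, queue, or database reader.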

Dremio 4.0 Data Lake Engine

Dremio’s Data Lake Engine delivers lightning-fast query speed and a self-service semantic layer operating directly against your data lake storage. No moving data to proprietary data warehouses or creating cubes, aggregation tables and BI extracts. Just flexibility and control for Data Architects, and self-service for Data Consumers. This release, also known as Dremio 4.0, dramatically accelerates query performance on S3 and ADLS, and provides deeper integration with the security services of AWS and Azure. In addition, this release simplifies the ability to query data across a broader range of data sources, including multiple lakes (with different Hive versions) and through community-developed connectors offered in Dremio Hub. Read full article >>>

Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (Martin Fowler)

Many enterprises are investing in their next generation data lake, with the hope of democratizing data at scale to provide business insights and ultimately make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. To address these failure modes, we need to shift from the centralized paradigm of a lake, or its predecessor, the data warehouse. We need to shift to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product. Becoming a data-driven organization remains one of the top strategic goals of many companies I work with. My clients are well aware of the benefits of becoming intelligently empowered: providing the best customer experience based on data and hyper-personalization; reducing operational costs and time through data-driven optimi...

How to Use Data Preparation to Accelerate Cloud Data Lake Adoption

This TDWI Checklist offers six steps for data preparation processes and solutions that can help accelerate cloud data lake adoption.  

Gartner - Market Guide for Information Stewardship Applications

The critical need for information governance continues to drive a diversified market for information stewardship solutions that support it. Data and analytics leaders must assess the capabilities these solutions offer to select vendors that will best suit their needs.

Key Findings:
- Policy setting in information governance programs is still so different and inconsistent that no market of offerings is forming as yet. Furthermore, policy enforcement in information stewardship initiatives is conforming to a market, but now across a wider set of use cases.
- Information stewardship applications available in the market do not yet fully support the information steward's wider role and tasks.
- Growth in the market for information stewardship applications is being disrupted by new technology capabilities in adjacent markets, such as data quality and metadata management, and new regulatory requirements, such as GDPR.

Recommendations:
For data and analytics leaders working with dat...

Gartner Hype Cycle for Data Science and Machine Learning, 2017

The hype around data science and machine learning has increased from already high levels in the past year. Data and analytics leaders should use this Hype Cycle to understand technologies generating excitement and inflated expectations, as well as significant movements in adoption and maturity.

The Hype Cycle
The Peak of Inflated Expectations is crowded and the Trough of Disillusionment remains sparse, though several highly hyped technologies are beginning to hear the first disillusioned rumblings from the market. In general, the faster a technology moves from the innovation trigger to the peak, the faster the technology moves into the trough as organizations quickly see it as just another passing fad. This Hype Cycle is especially relevant to data and analytics leaders, chief data officers, and heads of data science teams who are implementing machine-learning programs and looking to understand the next-generation innovations. Technology provider product marketers and strategists...

Enterprise data integration with an operational data hub

Big data (also called NoSQL) technologies facilitate the ingestion, processing, and search of data with no regard to schema (database structure). Web-scale companies such as Google, LinkedIn, and Facebook use big data technologies to process the tremendous amount of data from every possible source without regard to structure, and offer a searchable interface to access it. Modern NoSQL technologies have evolved to offer capabilities to govern, process, secure, and deliver data, and have facilitated the development of an integration pattern called the operational data hub (ODH). The Centers for Medicare and Medicaid Services (CMS) and other organizations (public and private) in the health, finance, banking, entertainment, insurance, and defense sectors utilize the capabilities of ODH technologies for enterprise data integration. This gives them the ability to access, integrate, master, process, and deliver data across the enterprise. Traditional mode...
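To make the pattern concrete, here is a deliberately toy Python sketch (not tied to CMS or to any particular NoSQL product; the class and field names are invented) of the two core ODH behaviors the excerpt describes: ingesting records as-is with no fixed schema, and offering a searchable interface over whatever fields each record happens to carry.

```python
from collections import defaultdict


class OperationalDataHub:
    """Toy ODH: schema-agnostic ingestion plus a simple field/value search index."""

    def __init__(self) -> None:
        self.documents = {}               # doc_id -> raw document, stored as received
        self.index = defaultdict(set)     # (field, value) -> set of doc_ids
        self._next_id = 0

    def ingest(self, document: dict) -> int:
        """Store the document as-is and index every top-level field/value pair."""
        doc_id = self._next_id
        self._next_id += 1
        self.documents[doc_id] = document
        for field, value in document.items():
            self.index[(field, value)].add(doc_id)
        return doc_id

    def search(self, field: str, value) -> list:
        """Return every document carrying the given field/value, whatever its shape."""
        return [self.documents[i] for i in sorted(self.index[(field, value)])]


if __name__ == "__main__":
    hub = OperationalDataHub()
    # Records from different sources, with different shapes, land in the same hub.
    hub.ingest({"source": "claims", "member_id": "M42", "amount": 120.50})
    hub.ingest({"source": "portal", "member_id": "M42", "last_login": "2020-05-01"})
    print(hub.search("member_id", "M42"))
```

A real operational data hub layers governance, security, and mastering on top of this schema-agnostic core, but the ingest-then-index flow is the heart of the pattern.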

To Schema On Read or to Schema On Write, That is the Hadoop Data Lake Question

The Hadoop data lake concept can be summed up as, “Store it all in one place, figure out what to do with it later.” But while this might be the general idea of your Hadoop data lake, you won’t get any real value out of that data until you figure out a logical structure for it. And you’d better keep track of your metadata one way or another. It does no good to have a lake full of data if you have no idea what lies under the shiny surface. At some point, you have to give that data a schema, especially if you want to query it with SQL or something like it. The eternal Hadoop question is whether to apply the brave new strategy of schema on read, or to stick with the tried and true method of schema on write.

What is Schema on Write?
Schema on write has been the standard for many years in relational databases. Before any data is written in the database, the structure of that data is strictly defined, and that metadata is stored and tracked. Irrelevant data is discarded, data types, lengths and...
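A minimal sketch of the contrast, using invented field names and plain Python rather than any specific Hadoop or SQL engine: schema on write validates and coerces each record against a declared structure before it is stored, while schema on read keeps the raw records and applies a structure only at query time.

```python
import json

# Declared up front: the "write schema" every stored record must satisfy.
WRITE_SCHEMA = {"user_id": int, "country": str, "spend": float}


def schema_on_write(raw: str) -> dict:
    """Coerce the record to the declared schema before it is written; bad data never lands."""
    record = json.loads(raw)
    return {field: cast(record[field]) for field, cast in WRITE_SCHEMA.items()}


def schema_on_read(raw_rows: list, read_schema: dict) -> list:
    """Store raw strings as-is; apply whatever schema the query needs, tolerating gaps."""
    projected = []
    for raw in raw_rows:
        record = json.loads(raw)
        projected.append({f: cast(record[f]) if f in record else None
                          for f, cast in read_schema.items()})
    return projected


if __name__ == "__main__":
    rows = ['{"user_id": "7", "country": "DE", "spend": "12.5"}',
            '{"user_id": "8", "country": "FR"}']           # second row is missing a field

    print(schema_on_write(rows[0]))                         # coerced to the declared types
    # schema_on_write(rows[1]) would raise KeyError: the write path rejects it up front.

    # Schema on read: the same raw rows, given structure only when queried.
    print(schema_on_read(rows, {"user_id": int, "spend": float}))
```

The trade-off the post goes on to discuss falls out directly: the write path rejects malformed data before it lands, while the read path accepts everything and defers the cost (and the risk) to whoever queries it later.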