Posts

Showing posts with the label Analytics

2022 Gartner Magic Quadrant for Analytics and Business Intelligence Platforms

  Today’s analytics and BI platforms are augmented throughout and enable users to compose low/no-code workflows and applications. Cloud ecosystems and alignment with digital workplace tools are key selection factors. This research helps data and analytics leaders plan for and select these platforms. Analytics and business intelligence (ABI) platforms enable less technical users, including businesspeople, to model, analyze, explore, share and manage data, and collaborate and share findings, enabled by IT and augmented by artificial intelligence (AI). ABI platforms may optionally include the ability to create, modify or enrich a semantic model including business rules. Today’s ABI platforms have an emphasis on visual self-service for end users, augmented by AI to deliver automated insights. Increasingly, the focus of augmentation is shifting from the analyst persona to the consumer or decision maker. To achieve this, automated insights must not only be statistically relevant, but the...

Data Reliability at Scale: How Fox Digital Architected its Modern Data Stack

As distributed architectures continue to become a new gold standard for data-driven organizations, this kind of self-serve motion would be a dream come true for many data leaders. So when the Monte Carlo team got the chance to sit down with Alex, we took a deep dive into how he made it happen. Here’s how his team architected a hybrid data architecture that prioritizes democratization and access, while ensuring reliability and trust at every turn. Exercise “controlled freedom” when dealing with stakeholders: Alex has built decentralized access to data at Fox on a foundation he calls “controlled freedom.” In fact, he believes using your data team as the single source of truth within an organization actually creates the biggest silo. So instead of becoming a guardian and bottleneck, Alex and his data team focus on setting certain parameters around how data is ingested and supplied to stakeholders. Within this framework, internal data consumers at Fox have the freedom to cr...

Emerging Architectures for Modern Data Infrastructure

As an industry, we’ve gotten exceptionally good at building large, complex software systems. We’re now starting to see the rise of massive, complex systems built around data – where the primary business value of the system comes from the analysis of data, rather than the software directly. We’re seeing quick-moving impacts of this trend across the industry, including the emergence of new roles, shifts in customer spending, and the emergence of new startups providing infrastructure and tooling around data. In fact, many of today’s fastest growing infrastructure startups build products to manage data. These systems enable data-driven decision making (analytic systems) and drive data-powered products, including with machine learning (operational systems). They range from the pipes that carry data, to storage solutions that house data, to SQL engines that analyze data, to dashboards that make data easy to understand – from data science and machine learning libraries, to automated data pipe...

The DataOps Landscape

Data has emerged as an imperative foundational asset for all organizations. Data fuels significant initiatives such as digital transformation and the adoption of analytics, machine learning, and AI. Organizations that are able to tame, manage, and unlock their data assets stand to benefit in myriad ways, including improvements to decision-making and operational efficiency, better fraud prediction and prevention, better risk management and control, and more. In addition, data products and services can often lead to new or additional revenue. As companies increasingly depend on data to power essential products and services, they are investing in tools and processes to manage essential operations and services. In this post, we describe these tools as well as the community of practitioners using them. One sign of the growing maturity of these tools and practices is that a community of engineers and developers are beginning to coalesce around the term “DataOps” (data operations). Our conver...

Automated Data Wrangling

A growing array of techniques apply machine learning directly to the problems of data wrangling. They often start out as open research projects but then become proprietary. How can we build automated data wrangling systems for open data? We work with a lot of messy public data. In theory it’s already “structured” and published in machine readable forms like Microsoft Excel spreadsheets, poorly designed databases, and CSV files with no associated schema. In practice it ranges from almost unstructured to… almost structured. Someone working on one of our take-home questions for the data wrangler & analyst position recently noted of the FERC Form 1: “This database is not really a database – more like a bespoke digitization of a paper form that happened to be built using a database.” And I mean, yeah. Pretty much. The more messy datasets I look at, the more I’ve started to question Hadley Wickham’s famous Tolstoy quip about the uniqueness of messy data. There’s a taxonomy of diffe...
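To make the “CSV files with no associated schema” problem concrete, here is a toy sketch of the simplest kind of automated wrangling: guessing a type for each column by trying progressively looser parses. This is an illustration only, not the system the article describes; the sample data and column names are invented.

```python
import csv
import io

def infer_column_types(csv_text):
    """Guess a crude type (int, float, or str) for each column
    by attempting progressively looser parses of every value."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    def infer(values):
        # Try the strictest parse first; fall back to str.
        for caster, name in ((int, "int"), (float, "float")):
            try:
                for v in values:
                    caster(v)
                return name
            except ValueError:
                continue
        return "str"

    return {col: infer([row[i] for row in data])
            for i, col in enumerate(header)}

sample = "plant_id,capacity_mw,fuel\n101,45.5,coal\n102,12,gas\n"
print(infer_column_types(sample))
# → {'plant_id': 'int', 'capacity_mw': 'float', 'fuel': 'str'}
```

Real automated wranglers go far beyond this (handling nulls, dates, units, and merged header rows), but the parse-and-fall-back pattern is the common starting point.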

The Growing Importance of Metadata Management Systems

As companies embrace digital technologies to transform their operations and products, many are using best-of-breed software, open source tools, and software as a service (SaaS) platforms to rapidly and efficiently integrate new technologies. This often means that data required for reports, analytics, and machine learning (ML) reside on disparate systems and platforms. As such, IT initiatives in companies increasingly involve tools and frameworks for data fusion and integration. Examples include tools for building data pipelines, data quality and data integration solutions, customer data platforms (CDPs), master data management, and data markets. Collecting, unifying, preparing, and managing data from diverse sources and formats has become imperative in this era of rapid digital transformation. Organizations that invest in foundational data technologies are much more likely to build solid foundation applications, ranging from BI and analytics to machine learn...

Visualizing Data Timeliness at Airbnb

Imagine you are a business leader ready to start your day, but you wake up to find that your daily business report is empty — the data is late, so now you are blind. Over the last year, multiple teams came together to build SLA Tracker, a visual analytics tool to facilitate a culture of data timeliness at Airbnb. This data product enabled us to address and systematize the following challenges of data timeliness: When should a dataset be considered late? How frequently are datasets late? Why is a dataset late? This project is a critical part of our efforts to achieve high data quality, and building it required overcoming many technical, product, and organizational challenges. In this article, we focus on the product design: the journey of how we designed and built data visualizations that could make sense of the deeply complex data of data timeliness. Continue reading >>>
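The first question the excerpt raises ("when should a dataset be considered late?") reduces to comparing each partition's landing time against a deadline. A minimal sketch, assuming a daily partition with a fixed SLA offset from midnight; the dataset names, times, and six-hour threshold are illustrative, not Airbnb's actual SLA Tracker schema:

```python
from datetime import datetime, timedelta

SLA = timedelta(hours=6)  # hypothetical: partition must land within 6h of midnight UTC

# Hypothetical landing times for a daily-partitioned dataset.
landings = {
    "2021-06-01": datetime(2021, 6, 1, 5, 40),
    "2021-06-02": datetime(2021, 6, 2, 7, 15),  # missed the 06:00 deadline
    "2021-06-03": datetime(2021, 6, 3, 4, 55),
}

def lateness_report(landings, sla):
    """For each partition date, report how far past its SLA it landed
    (zero if it landed on time)."""
    report = {}
    for ds, landed_at in landings.items():
        deadline = datetime.strptime(ds, "%Y-%m-%d") + sla
        report[ds] = max(landed_at - deadline, timedelta(0))
    return report

for ds, late_by in lateness_report(landings, SLA).items():
    print(ds, "late by", late_by)
```

Aggregating such per-partition lateness over time is what answers the second question ("how frequently are datasets late?"); the "why" requires tracing upstream dependencies, which is where the visualization work comes in.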

Data Management maturity models: a comparative analysis

At first glance, you can see that there are seven key Subject Areas where the Subject domains are located. These are:

- Data
- Data and System Design
- Technology
- Governance
- Data Quality
- Security
- Related Capabilities

You can see that the differences in approaches to defining the key domains are rather big. It is not the purpose of this article to deliver a detailed analysis, but there is one striking observation I would like to share: the Subject domains and the deliverables of these domains are being mixed with one another. For example, let us have a look at data governance. The domain ‘Data governance’ exists in four different models. Other domains, such as ‘Data management strategy’ (which appears in three models), are considered a deliverable of the Data governance domain in other models, for example in the DAMA model. Such a big difference of opinions on the key Subject domains is rather confusing. Subject domain dimensions: Subject domain dimensions are characteristics of (sub-)domains. It ...

Data Discovery Platforms and Their Open Source Solutions

In the past year or two, many companies have shared their data discovery platforms (the latest being Facebook’s Nemo). Based on this list, we now know of more than 10 implementations. I haven’t been paying much attention to these developments in data discovery and wanted to catch up. I was interested in: The questions these platforms help answer The features developed to answer these questions How they compare with each other What open source solutions are available By the end of this, we’ll learn about the key features that solve 80% of data discoverability problems. We’ll also see how the platforms compare on these features, and take a closer look at open source solutions available. Questions we ask in the data discovery process Before discussing platform features, let’s briefly go over some common questions in the data discovery process. Where can I find data about ____? If we don’t know the right terms, this is especially challenging. For user browsing behavior, do we search for “c...

How DataOps Amplifies Data and Analytics Business Value

DataOps techniques can provide a more agile and collaborative approach to building and managing data pipelines. The pandemic has accelerated the need for data and analytics leaders to deliver data and analytics insight faster, with higher quality and resiliency in the face of constant change. Organizations need to make better-informed and faster decisions with a focus on automation, real-time risk assessment and mitigation, continuous value delivery and agility. The point of DataOps is to change how people collaborate around data and how it is used in the organization. As a result, data and analytics leaders are increasingly applying DataOps techniques that provide a more agile and collaborative approach to building and managing data pipelines. What is DataOps? Gartner defines DataOps as a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization...

Data Mesh Principles and Logical Architecture v2

Our aspiration to augment and improve every aspect of business and life with data, demands a paradigm shift in how we manage data at scale. While the technology advances of the past decade have addressed the scale of volume of data and data processing compute, they have failed to address scale in other dimensions: changes in the data landscape, proliferation of sources of data, diversity of data use cases and users, and speed of response to change. Data mesh addresses these dimensions, founded in four principles: domain-oriented decentralized data ownership and architecture, data as a product, self-serve data infrastructure as a platform, and federated computational governance. Each principle drives a new logical view of the technical architecture and organizational structure. The original writeup, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh - which I encourage you to read before joining me back here - empathized with today’s pain points of architectural and or...

Gartner - 2020 Magic Quadrant for Metadata Management Solutions

Metadata management is a core aspect of an organization’s ability to manage its data and information assets. The term “metadata” describes the various facets of an information asset that can improve its usability throughout its life cycle. Metadata and its uses go far beyond technical matters. Metadata is used as a reference for business-oriented and technical projects, and lays the foundations for describing, inventorying and understanding data for multiple use cases. Use-case examples include data governance, security and risk, data analysis and data value. The market for metadata management solutions is complex because these solutions are not all identical in scope or capability. Vendors include companies with one or more of the following functional capabilities in their stand-alone metadata management products (not all vendors offer all these capabilities, and not all vendor solutions offer these capabilities in one product): Metadata repositories — Used to document and manage meta...

Nemo: Data discovery at Facebook

Large-scale companies serve millions or even billions of people who depend on the services these companies provide for their everyday needs. To keep these services running and delivering meaningful experiences, the teams behind them need to find the most relevant and accurate information quickly so that they can make informed decisions and take action. Finding the right information can be hard for several reasons. The problem might be discovery — the relevant table might have an obscure or nondescript name, or different teams might have constructed overlapping data sets. Or, the problem could be one of confidence — the dashboard someone is looking at might have been superseded by another source six months ago.  Many companies, such as Airbnb, Lyft, Netflix, and Uber, have built their own custom solutions for this challenge. For us, it was important to make the data discovery process simple and fast. Funneling everything through data experts to locate the necessary data each time we...

Gartner - Information as a Second Language: Enabling Data Literacy for Digital Society

Digital society expects its citizens to “speak data.” Unless data and analytics leaders treat information as the new second language of business, government and communities, they will not be able to deliver the competitive advantage and agility demanded by their enterprises. Key Challenges:

- Poor data literacy is the second-highest inhibitor to progress, as reported by respondents to Gartner’s third annual Chief Data Officer Survey, behind culture change and just ahead of lack of talent and skills.
- An information language barrier exists across business units and IT functions, rooted in ineffective communication across a wide range of diverse stakeholders. As a result, data and analytics leaders struggle to get their message across and information assets go underutilized.
- Although academic and professional programs are beginning to address the disparity in talent and skills, in many cases they reinforce the information language barrier with narrow content...

Dremio 4.8 is released

Today we are excited to announce the release of Dremio 4.8! This month’s release delivers multiple features such as external query, a new authorization service API, AWS Edition enhancements and more. This blog post highlights the following updates:

- External query
- Default reflections
- Runtime filtering GA
- Documented JMX metrics and provided sample exporters
- Ability to customize projects in Dremio AWS Edition
- Support for Dremio AWS Edition deployments without public IP addresses

Read full article >>>
Once an outsider category, cloud computing now powers every industry. Look no further than this year’s Forbes Cloud 100 list, the annual ranking of the world’s top private cloud companies, where this year's standouts are keeping businesses surviving—and thriving—from real estate to retail, data to design. Produced for the fifth consecutive year in partnership with Bessemer Venture Partners and Salesforce Ventures, the Cloud 100 recognizes standouts in tech’s hottest category from small startups to private-equity-backed giants, from Silicon Valley to Australia and Hong Kong. The companies on the list are selected for their growth, sales, valuation and culture, as well as a reputation score derived in consultation with 43 CEO judges and executives from their public-cloud-company peers. This year’s new No. 1 has set a record for shortest time running atop the list. Database leader Snowflake takes the top slot, up from No. 2 last year and just hours before graduating from the list by g...

The unreasonable importance of data preparation

We know data preparation requires a ton of work and thought. In this provocative article, Hugo Bowne-Anderson provides a formal rationale for why that work matters, why data preparation is particularly important for reanalyzing data, and why you should stay focused on the question you hope to answer. Along the way, Hugo introduces how tools and automation can help augment analysts and better enable real-time models. In a world focused on buzzword-driven models and algorithms, you’d be forgiven for forgetting about the unreasonable importance of data preparation and quality: your models are only as good as the data you feed them. This is the garbage in, garbage out principle: flawed data going in leads to flawed results, algorithms, and business decisions. If a self-driving car’s decision-making algorithm is trained on traffic data collected during the day, you wouldn’t put it on the roads at night. To take it a step further, if such an algorithm is trained in an environment with car...

Shopify's approach to data discovery

Humans generate a lot of data. Every two days we create as much data as we did from the beginning of time until 2003! The International Data Corporation estimates the global datasphere totaled 33 zettabytes (one trillion gigabytes) in 2018. The estimate for 2025 is 175 ZBs, an increase of 430%. This growth is challenging organizations across all industries to rethink their data pipelines. The nature of data usage is problem driven, meaning data assets (tables, reports, dashboards, etc.) are aggregated from underlying data assets to help decision making about a particular business problem, feed a machine learning algorithm, or serve as an input to another data asset. This process is repeated multiple times, sometimes for the same problems, and results in a large number of data assets serving a wide variety of purposes. Data discovery and management is the practice of cataloguing these data assets and all of the applicable metadata that saves time for data professionals, increasing data ...
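The excerpt describes data assets being aggregated from underlying assets, which forms a dependency graph that a discovery tool must be able to traverse ("what does this dashboard actually depend on?"). A minimal sketch of that traversal, with an invented toy catalog; the asset names are illustrative, not Shopify's actual systems:

```python
from collections import deque

# Toy lineage graph: asset -> direct upstream assets it is derived from.
upstream = {
    "revenue_dashboard": ["orders_agg"],
    "orders_agg": ["raw_orders", "raw_refunds"],
    "churn_model_features": ["orders_agg", "raw_sessions"],
}

def all_upstream(asset, upstream):
    """Return the transitive closure of an asset's dependencies (BFS)."""
    seen, queue = set(), deque(upstream.get(asset, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(upstream.get(dep, []))
    return seen

print(sorted(all_upstream("revenue_dashboard", upstream)))
# → ['orders_agg', 'raw_orders', 'raw_refunds']
```

Production discovery platforms attach metadata (owners, freshness, usage) to each node of such a graph; the graph walk itself stays this simple.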

Swarm64: Open source PostgreSQL on steroids

PostgreSQL is a big deal. The most common SQL open source database that you have never heard of, as ZDNet's own Tony Baer called it. Besides being the framework on which a number of commercial offerings were built, PostgreSQL has a user base of its own. According to DB Engines, PostgreSQL is the 4th most popular database in the world. Swarm64, on the other hand, is a small vendor. So small, actually, that we have shared the stage with CEO Thomas Richter in a local Berlin Meetup a few years back. Back then, Richter was not CEO, and Swarm64 was even smaller. But its value proposition still sounded attractive: boost PostgreSQL's performance for free. Swarm64 is an acceleration layer for PostgreSQL. There's no such thing as a free lunch of course, so the "for free" part is a figure of speech. Swarm64 is a commercial vendor. Until recently, however, the real gotcha was hardware: Swarm64 Database Acceleration (DA) required a specialized chip called FPGA to be able ...

The Forrester Wave™: Data Management For Analytics, Q1 2020

While traditional data warehouses often took years to build, deploy, and reap benefits from, today's organizations want simple, agile, integrated, cost-effective, and highly automated solutions to support insights. In addition, traditional architectures are failing to meet new business requirements, especially around high-speed data streaming, real-time analytics, large volumes of messy and complex data sets, and self-service. As a result, firms are revisiting their data architectures, looking for ways to modernize to support new requirements. Data management for analytics (DMA) is a modern architecture that minimizes the complexity of messy data and hides heterogeneity by embodying a trusted model and integrated policies and by adapting to changing business requirements. It leverages metadata, in-memory, and distributed data repositories, running on-premises or in the cloud, to deliver scalable and integrated analytics. Adoption of DMA will grow further as enterprise architects look at overcoming data challeng...