Posts

Showing posts from 2021

Dec 2021 Gartner Magic Quadrant for Cloud Database Management Systems

Database management systems continue their move to the cloud — a move that is producing an increasingly complex landscape of vendors and offerings. This Magic Quadrant will help data and analytics leaders make the right choices in a complex and fast-evolving market.

Strategic Planning Assumptions: By 2025, cloud preference for data management will substantially reduce the vendor landscape, while the growth in multicloud will increase the complexity of data governance and integration. By 2022, cloud database management system (DBMS) revenue will account for 50% of the total DBMS market revenue.

These DBMSs reflect optimization strategies designed to support transactions and/or analytical processing for one or more of the following use cases:

- Traditional and augmented transaction processing
- Traditional and logical data warehouse
- Data science exploration/deep learning
- Stream/event processing
- Operational intelligence

This market does not include vendors that only provide…

AWS vs Azure vs GCP: Cloud Web Services Comparison in Detail

The following post focuses on AWS, MS Azure, and GCP in detail. Learn more about each cloud service and how to choose the best one for your business needs.

Digitalization is being embraced across the globe, especially cloud computing technology. Whether because of its scalability, its security, or its reduced costs, cloud platforms have grown enormously over the past few years. Gone are the days when businesses wondered whether to choose a cloud service provider at all; now the question is which cloud service provider to use. AWS, Azure, and Google Cloud are our top three contenders. Recently, I happened to stumble upon an informative post focusing on AWS Lambda vs Azure Functions. It was detailed and well-structured, and it covered the aspects that matter most when comparing Lambda and Azure Functions. I am pretty sure that considering both posts together will act as a…
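
To give a flavor of what that Lambda-vs-Functions comparison covers, here is a minimal sketch of the two programming models in Python. Both handlers are illustrative only: the Azure example assumes the azure-functions package plus an HTTP-trigger binding, and the event fields are invented.

```python
# Minimal AWS Lambda handler: the platform invokes it with an event dict
# and a context object; the returned dict becomes the HTTP response.
def lambda_handler(event, context):
    name = event.get("name", "world")  # "name" is an invented event field
    return {"statusCode": 200, "body": f"Hello, {name}!"}


# Minimal Azure Functions HTTP handler (assumes the azure-functions
# package and a function.json HTTP-trigger binding in the function app).
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    name = req.params.get("name", "world")
    return func.HttpResponse(f"Hello, {name}!", status_code=200)
```

The handler shapes are similar; the differences that matter in practice (cold starts, triggers and bindings, pricing granularity) sit around the handler rather than inside it.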

Cloud Data Warehouse Comparison: Redshift vs. BigQuery vs. Azure vs. Snowflake for Real-Time Workloads

Data helps companies take the guesswork out of decision-making. Teams can use data-driven evidence to decide which products to build, which features to add, and which growth initiatives to pursue. Such insights-driven businesses grow at an annual rate of over 30%. But there's a difference between being merely data-aware and insights-driven. Discovering insights requires a way to analyze data in near real-time, which is where cloud data warehouses play a vital role. As scalable repositories of data, warehouses allow businesses to find insights by storing and analyzing huge amounts of structured and semi-structured data. And running a data warehouse is more than a technical initiative: it is vital to the overall business strategy and can inform an array of future product, marketing, and engineering decisions. But choosing a cloud data warehouse provider can be challenging. Users have to evaluate costs, performance, the ability to handle real-time workloads, and other parameters…
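
Real-time suitability is ultimately an empirical question, so part of any evaluation is timing queries on your own workload. Below is a minimal latency probe against BigQuery as one example; it assumes the google-cloud-bigquery package with default credentials, and the project, dataset, and table names are placeholders.

```python
# Time a single aggregate query round-trip as a crude latency probe.
import time
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

start = time.perf_counter()
rows = client.query(
    "SELECT status, COUNT(*) AS n "
    "FROM `my_project.sales.orders` GROUP BY status"  # placeholder table
).result()  # blocks until the query completes
elapsed = time.perf_counter() - start

for row in rows:
    print(row.status, row.n)
print(f"query round-trip: {elapsed:.2f}s")
```

A fair comparison would run a representative query mix against each candidate warehouse, with warm and cold caches, rather than a single statement.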

Take A Product Management Approach To Data Monetization

Central to treating data as an asset, data monetization should align with familiar research and development (R&D) and product management/marketing approaches. Without oversimplifying the many challenges and activities involved in monetizing data, certain basic concepts will reap significant rewards if executed well.

Evolve from Data Project Management to Data Product Management

Although you may already have a data leader such as a chief data officer (CDO) or an analytics leader, the first step toward data monetization is to designate a team tasked with identifying and pursuing opportunities for, and generating demonstrable economic benefits from, available data assets. They may report to a data and analytics executive, the enterprise architecture group, a chief digital officer, or perhaps even a business unit head. Creating a distinct, dedicated data product management role is vital, especially when business and data leaders agree on pursuing direct data monetization…

Emerging Architectures for Modern Data Infrastructure

As an industry, we've gotten exceptionally good at building large, complex software systems. We're now starting to see the rise of massive, complex systems built around data – where the primary business value of the system comes from the analysis of data, rather than the software directly. We're seeing quick-moving impacts of this trend across the industry, including the emergence of new roles, shifts in customer spending, and the rise of new startups providing infrastructure and tooling around data. In fact, many of today's fastest growing infrastructure startups build products to manage data. These systems enable data-driven decision making (analytic systems) and drive data-powered products, including with machine learning (operational systems). They range from the pipes that carry data, to storage solutions that house data, to SQL engines that analyze data, to dashboards that make data easy to understand – from data science and machine learning libraries, to automated data pipelines…

Top 9 Data Modeling Tools & Software 2021

Data modeling is the procedure of crafting a visual representation of an entire information system, or portions of it, in order to convey the connections between data points and structures. The objective is to portray the types of data used and stored within the system, the ways the data can be organized and grouped, the relationships among these data types, and their attributes and formats. Data modeling uses abstraction to better understand and represent the nature of the flow of data within an enterprise-level information system.

The types of data models include:

- Conceptual data models
- Logical data models
- Physical data models

Database and information system design begins with the creation of these data models.

What is a Data Modeling Tool?

A data modeling tool enables quick and efficient database design while minimizing human error. Data modeling software helps craft a high-performance database, generate reports that are useful for stakeholders, and create data definition (a.k.a. DDL)…
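
As a toy illustration of what such tools automate (turning a logical model into physical DDL), here is a short, purely hypothetical sketch; the tables, columns, and types are invented, and real tools also handle constraints, indexes, and vendor-specific dialects.

```python
# Toy logical model: table name -> {column name: SQL type}. Invented names.
logical_model = {
    "customer": {"customer_id": "INTEGER", "name": "VARCHAR(100)", "email": "VARCHAR(255)"},
    "orders":   {"order_id": "INTEGER", "customer_id": "INTEGER", "total": "DECIMAL(10,2)"},
}

def to_ddl(model: dict) -> str:
    """Render each logical table as a CREATE TABLE statement."""
    statements = []
    for table, columns in model.items():
        cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
        statements.append(f"CREATE TABLE {table} (\n  {cols}\n);")
    return "\n\n".join(statements)

print(to_ddl(logical_model))
```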

Mainframe Modernization to Cloud

What Is Mainframe Modernization?

Mainframe modernization is the process of migrating or improving IT operations to reduce IT spending. In the sense of improvement, it means enhancing legacy infrastructure by incorporating modern interfaces, code modernization, and performance modernization. In the sense of migration, it means shifting the enterprise's code and functionality to a newer platform technology, such as cloud systems. The strategy employed to modernize mainframe structures depends on factors like business and customer objectives, IT budgets, and the cost of running new technology versus the cost incurred by not modernizing.

Benefits of Mainframe Modernization to Cloud

Cloud can offer economies of scale and new functions that are not available through mainframe computing. The benefits of cloud technologies, and the law of diminishing returns on the mainframe, are driving increased demand for migration strategies…

Announcing Databricks Serverless SQL

Databricks SQL already provides a first-class user experience for BI and SQL directly on the data lake, and today we are excited to announce another step in making data and AI simple with Databricks Serverless SQL. This new capability for Databricks SQL provides instant compute to users for their BI and SQL workloads, with minimal management required and capacity optimizations that can lower overall cost by an average of 40%. This makes it even easier for organizations to expand adoption of the lakehouse for business analysts who are looking to access the rich, real-time datasets of the lakehouse with a simple and performant solution. Under the hood of this capability is an active server fleet, fully managed by Databricks, that can transfer compute capacity to user queries, typically in about 15 seconds. The best part? You only pay for Serverless SQL when users start running reports or queries. Organizations with business analysts who want to analyze data in the data lake with their…
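
For a sense of the developer experience, here is a hedged sketch of running a query against a Databricks SQL endpoint with the databricks-sql-connector package; the hostname, HTTP path, and token are placeholders for your workspace's values.

```python
# Connect to a Databricks SQL endpoint and run one statement.
from databricks import sql

with sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/endpoints/<endpoint-id>",        # placeholder
    access_token="<personal-access-token>",              # placeholder
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT current_date()")
        print(cursor.fetchall())
```

The serverless announcement is about who provisions the compute behind the endpoint, not about the client-side query path sketched here.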

2021 Gartner Magic Quadrant for Data Integration Tools

Strategic Planning Assumptions: Through 2022, manual data management tasks will be reduced by 45% through the addition of machine learning and automated service-level management. By 2023, AI-enabled automation in data management and integration will reduce the need for IT specialists by 20%.

Read report >>>

Cost-Efficient Open Source Big Data Platform at Uber

In this blog post, we shared efforts and ideas for improving the platform efficiency of Uber's Big Data Platform, including file format improvements, HDFS erasure coding, YARN scheduling policy improvements, load balancing, query engines, and Apache Hudi. These improvements have resulted in significant savings. In addition, we explored some open challenges, like analytics and online colocation, and pricing mechanisms. However, as the framework outlined in our previous post established, platform efficiency improvements alone do not guarantee efficient operation. Controlling the supply and the demand of data is equally important, which we will address in an upcoming post. As Uber's business has expanded, the underlying pool of data that powers it has grown exponentially, and has thus become ever more expensive to process. When Big Data rose to become one of our largest operational expenses, we began an initiative to reduce costs on our data platform, which divides the challenges into 3 broad pillars…
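
To make the file-format lever concrete, here is a hedged PySpark sketch that rewrites a dataset as Parquet with ZSTD compression, the kind of format-level saving named above. It assumes a Spark 3.x environment where Parquet ZSTD support is available; the paths and source format are placeholders.

```python
# Rewrite a raw dataset into ZSTD-compressed Parquet for cheaper storage and IO.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-zstd-rewrite").getOrCreate()

df = spark.read.json("/data/raw/events")   # placeholder source path and format
(df.write
   .mode("overwrite")
   .option("compression", "zstd")          # typically better ratio than snappy at comparable speed
   .parquet("/data/optimized/events"))     # placeholder destination
```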

30 ways to leave your data center: key migration guides, in one place

One of the challenges with cloud migration is that you're solving a puzzle with multiple pieces. In addition to the number of workloads you could migrate, you're also solving for the challenges you're facing, the use cases driving you to migrate, and the benefits you're looking to gain. Each organization's puzzle will likely get solved in its own unique way, but thankfully there is plenty of guidance on how to migrate common workloads successfully. In addition to working directly with our Rapid Assessment and Migration Program (RAMP), we also offer a plethora of self-service guides to help you succeed! Some of these guides, which we'll cover below, are designed to help you identify the best ways to migrate, including meeting common organizational goals like minimizing time and risk during your migration, identifying the most enterprise-grade infrastructure for your workloads, picking a cloud that aligns with your organization's sustainability goals, and more. Continue reading >>>

The DataOps Landscape

Data has emerged as a foundational asset for all organizations. Data fuels significant initiatives such as digital transformation and the adoption of analytics, machine learning, and AI. Organizations that are able to tame, manage, and unlock their data assets stand to benefit in myriad ways, including improvements to decision-making and operational efficiency, better fraud prediction and prevention, better risk management and control, and more. In addition, data products and services can often lead to new or additional revenue. As companies increasingly depend on data to power essential products and services, they are investing in tools and processes to manage essential operations and services. In this post, we describe these tools as well as the community of practitioners using them. One sign of the growing maturity of these tools and practices is that a community of engineers and developers is beginning to coalesce around the term "DataOps" (data operations). Our conversations…

Automated Data Wrangling

A growing array of techniques apply machine learning directly to the problems of data wrangling. They often start out as open research projects but then become proprietary. How can we build automated data wrangling systems for open data? We work with a lot of messy public data. In theory it's already "structured" and published in machine-readable forms like Microsoft Excel spreadsheets, poorly designed databases, and CSV files with no associated schema. In practice it ranges from almost unstructured to… almost structured. Someone working on one of our take-home questions for the data wrangler & analyst position recently noted of the FERC Form 1: "This database is not really a database – more like a bespoke digitization of a paper form that happened to be built using a database." And I mean, yeah. Pretty much. The more messy datasets I look at, the more I've started to question Hadley Wickham's famous Tolstoy quip about the uniqueness of messy data. There's a taxonomy of different…
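
In that spirit, here is a small sketch of one automated-wrangling step: reading a schemaless CSV entirely as text, then promoting columns whose values overwhelmingly parse as numbers. The file name is a placeholder, and real systems (including the research projects mentioned above) go much further, e.g. learning transformations from examples.

```python
# Naive automated type inference over a schemaless CSV.
import pandas as pd

df = pd.read_csv("ferc_form1_extract.csv", dtype=str)  # placeholder file; start untyped

for col in df.columns:
    numeric = pd.to_numeric(df[col], errors="coerce")  # unparseable values become NaN
    # Promote the column only if almost every value parses as a number.
    if numeric.notna().mean() > 0.95:
        df[col] = numeric

print(df.dtypes)
```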

What is a Vector Database?

The meteoric rise of Machine Learning in the last few years has led to the increasing use of vector embeddings. They are fundamental to many models and approaches, and are a potent tool for applications such as semantic search, similarity search, and anomaly detection. The unique nature, growing volume, and rising importance of vector embeddings make it necessary to find new methods of storage and retrieval. We need a new kind of database. Continue reading >>>
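
The core operation such a database accelerates can be shown naively in a few lines of NumPy: brute-force cosine-similarity search over stored embeddings. Vector databases replace this linear scan with approximate nearest-neighbor indexes and add storage, filtering, and updates; the sizes here are arbitrary.

```python
# Brute-force cosine similarity search over normalized embeddings.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 128))                  # stored vectors
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

query = rng.normal(size=128)
query /= np.linalg.norm(query)

scores = embeddings @ query                                  # cosine similarity per vector
top5 = np.argsort(scores)[-5:][::-1]                         # indices of the best matches
print("nearest ids:", top5, "scores:", scores[top5])
```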

The State of serverless computing 2021

Serverless computing is redefining the way organizations develop, deploy, and integrate cloud-native applications. According to an industry report, the market size of serverless computing is expected to reach $7.72 billion by 2021. A new and compelling paradigm for the deployment of cloud applications, serverless computing sits at the forefront of the enterprise shift towards containers and microservices. In 2021, the serverless paradigm presents exciting opportunities to organizations by providing a simplified programming model for creating cloud applications, abstracting away most operational concerns. The major cloud vendors, Microsoft, Google, and Amazon, are already in the game with their respective offerings, and there is no reason you shouldn't board the train.

2021 is the year of FaaS

All major providers of serverless computing offer several types and tiers of database and storage services to their customers. In addition, all major cloud players such as Amazon, Microsoft, and Google…

Snowflake Data Sharing and Data Marketplace

Snowflake data sharing and the data marketplace support modern data sharing techniques and eliminate the need for data movement. In Snowflake, there is no need to extract the data from the provider database and use some secure data transfer mechanism to share it with consumers. Snowflake supports data sharing embedded into its SQL language, so databases can be shared from within SQL commands. On top of that, the data provider can update the data in real time, ensuring that all consumers will have a consistent, up-to-date view of the data sets.

How Data Sharing Works

Snowflake can share regular and external tables, as well as secure views and secure materialized views. Snowflake enables the sharing of databases through the concept of shares. Continue reading >>>
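
As a sketch of the provider-side workflow just described, the statements below create a share and grant a table to it, run here through snowflake-connector-python. The credentials, database, and consumer account identifier are placeholders.

```python
# Provider side: create a share, grant objects to it, add a consumer account.
import snowflake.connector

conn = snowflake.connector.connect(
    user="<user>", password="<password>", account="<provider_account>"  # placeholders
)
cur = conn.cursor()

cur.execute("CREATE SHARE sales_share")
cur.execute("GRANT USAGE ON DATABASE sales_db TO SHARE sales_share")
cur.execute("GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share")
cur.execute("GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share")
# Make the share visible to a consumer account; no data is copied or moved.
cur.execute("ALTER SHARE sales_share ADD ACCOUNTS = consumer_org_account")  # placeholder account
```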

The Growing Importance of Metadata Management Systems

As companies embrace digital technologies to transform their operations and products, many are using best-of-breed software, open source tools, and software as a service (SaaS) platforms to rapidly and efficiently integrate new technologies. This often means that the data required for reports, analytics, and machine learning (ML) resides on disparate systems and platforms. As such, IT initiatives in companies increasingly involve tools and frameworks for data fusion and integration. Examples include tools for building data pipelines, data quality and data integration solutions, customer data platforms (CDP), master data management, and data markets. Collecting, unifying, preparing, and managing data from diverse sources and formats has become imperative in this era of rapid digital transformation. Organizations that invest in foundational data technologies are much more likely to build solid foundation applications, ranging from BI and analytics to machine learning and AI…

Visualizing Data Timeliness at Airbnb

Imagine you are a business leader ready to start your day, but you wake up to find that your daily business report is empty — the data is late, so now you are blind. Over the last year, multiple teams came together to build SLA Tracker, a visual analytics tool to facilitate a culture of data timeliness at Airbnb. This data product enabled us to address and systematize the following challenges of data timeliness:

- When should a dataset be considered late?
- How frequently are datasets late?
- Why is a dataset late?

This project is a critical part of our efforts to achieve high data quality, and it required overcoming many technical, product, and organizational challenges to build. In this article, we focus on the product design: the journey of how we designed and built data visualizations that could make sense of the deeply complex data of data timeliness. Continue reading >>>
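
As a toy illustration of the first question (when should a dataset be considered late?), the sketch below compares a partition's actual landing time against a declared SLA. The dataset name, SLA, and timestamps are invented.

```python
# Flag a daily dataset as late if it lands after its declared SLA.
from datetime import datetime, timedelta

sla = {"bookings_daily": timedelta(hours=6)}  # expected within 6h of midnight UTC

landings = {"bookings_daily": datetime(2021, 7, 14, 9, 30)}  # landed 09:30 UTC

for dataset, landed_at in landings.items():
    midnight = landed_at.replace(hour=0, minute=0, second=0, microsecond=0)
    deadline = midnight + sla[dataset]
    status = "LATE" if landed_at > deadline else "on time"
    print(f"{dataset}: landed {landed_at:%H:%M}, deadline {deadline:%H:%M} -> {status}")
```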

ThoughtWorks Decoder puts tech into a business context

The tech landscape changes pretty fast. There are always new terms, techniques, and tools emerging. But don't let tech be an enigma: ThoughtWorks Decoder is here to help. Simply search for the term you're interested in, and we'll give you the lowdown on what it is, what it can do for your enterprise, and what the potential drawbacks are. ThoughtWorks Decoder >>>

Ten Use Cases to Enable an Organization with Metadata and Catalogs

Enterprises are modernizing their data platforms and associated tool-sets to serve the fast-evolving needs of data practitioners, including data scientists, data analysts, business intelligence and reporting analysts, and self-service-embracing business and technology personnel. However, as the tool-stack in most organizations gets modernized, so does the variety of metadata generated. As the volume of data increases every day, the metadata associated with it expands, as does the need to manage it. The first thought that strikes us when we look at a data landscape and hear about a catalog is: "It scans any database, ranging from relational to NoSQL or graph, and gives out useful information," such as:

- Name
- Modeled data type
- Inferred data types
- Patterns of data
- Length, with minimum and maximum thresholds
- Minimum and maximum values
- Other profiling characteristics of the data, like the frequency of values and their distribution

What Is the Basic Benefit of Metadata Managed in Catalogs?

1. Increa…
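
For a concrete sense of the profiling outputs listed above, here is a minimal pass with pandas over a toy in-memory table; the column names and data are invented, and real catalog scanners produce far richer profiles.

```python
# Compute a few catalog-style profile facts per column.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "country":     ["US", "DE", "US", "FR"],
})

for col in df.columns:
    series = df[col]
    lengths = series.astype(str).str.len()  # value lengths for the length thresholds
    print(f"{col}: inferred={series.dtype}, "
          f"min={series.min()}, max={series.max()}, "
          f"len=[{lengths.min()}, {lengths.max()}], "
          f"top values={series.value_counts().head(3).to_dict()}")
```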

Gartner Magic Quadrant for Data Science and Machine Learning Platforms 2021

This report assesses 20 vendors of platforms that data scientists and others can use to source data, build models, and operationalize machine learning. It will help them make the right choice from a crowded field in a maturing DSML platform market that continues to show rapid product development.

Market Definition/Description

Gartner defines a data science and machine learning (DSML) platform as a core product and supporting portfolio of coherently integrated products, components, libraries, and frameworks (including proprietary, partner-sourced, and open-source). Its primary users are data science professionals, including expert data scientists, citizen data scientists, data engineers, application developers, and machine learning (ML) specialists. The core product and supporting portfolio:

- Are sufficiently well-integrated to provide a consistent "look and feel."
- Create a user experience in which all components are reasonably interoperable in support of an analytics pipeline.

The DSML platform…

Data Management maturity models: a comparative analysis

At first glance, you can see that there are seven key Subject Areas where the Subject domains are located. These are:

- Data
- Data and System Design
- Technology
- Governance
- Data Quality
- Security
- Related Capabilities

You can see that the differences in approaches to defining the key domains are rather big. It is not the purpose of this article to deliver a detailed analysis, but there is one striking observation I would like to share: the Subject domains and the deliverables of these domains are being mixed with one another. For example, let us have a look at data governance. The domain 'Data governance' exists in four different models. Other domains, like 'Data management strategy', which appears in three models, are considered a deliverable of the Data governance domain in other models, for example in the DAMA model. Such a big difference of opinions on key Subject domains is rather confusing.

Subject domain dimensions

Subject domain dimensions are characteristics of (sub-)domains…

Data Discovery Platforms and Their Open Source Solutions

In the past year or two, many companies have shared their data discovery platforms (the latest being Facebook's Nemo). Based on this list, we now know of more than 10 implementations. I haven't been paying much attention to these developments in data discovery and wanted to catch up. I was interested in:

- The questions these platforms help answer
- The features developed to answer these questions
- How they compare with each other
- What open source solutions are available

By the end of this, we'll learn about the key features that solve 80% of data discoverability problems. We'll also see how the platforms compare on these features, and take a closer look at the open source solutions available.

Questions we ask in the data discovery process

Before discussing platform features, let's briefly go over some common questions in the data discovery process. Where can I find data about ____? If we don't know the right terms, this is especially challenging. For user browsing behavior, do we search for "c…
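
On the first of those questions (where can I find data about ____?), here is a toy keyword search over a handful of catalog entries; the table names and descriptions are made up, and real platforms add ranking, lineage, and usage statistics.

```python
# Keyword search over a tiny, invented catalog.
catalog = [
    {"table": "web.page_views",   "description": "raw user browsing behavior, one row per page view"},
    {"table": "web.sessions",     "description": "sessionized clickstream aggregated from page views"},
    {"table": "finance.invoices", "description": "billed invoices per customer per month"},
]

def search(term: str) -> list:
    term = term.lower()
    return [entry["table"] for entry in catalog
            if term in entry["table"].lower() or term in entry["description"].lower()]

print(search("clickstream"))  # -> ['web.sessions']
print(search("browsing"))     # -> ['web.page_views']
```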

A Rudimentary Guide to Metadata Management

According to IDC, the size of the global datasphere is projected to reach 163 ZB by 2025, leading to disparate data sources in legacy systems, new system deployments, and the creation of data lakes and data warehouses. Most organizations do not utilize the entirety of the data at their disposal for strategic and executive decision making. Identifying, classifying, and analyzing data has historically relied on manual processes and therefore consumes a lot of resources, in terms of both time and monetary value. Defining metadata for the data owned by the organization is the first step in unleashing the organizational data's maximum potential. The numerous data types and data sources that have been embedded in different systems and technologies over time were seldom designed to work together. Thus, the applications or models used on multiple data types and data sources can potentially be compromised, rendering inaccurate analyses and conclusions. Having consistency across…

Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores

Netflix has more than 195 million subscribers that generate petabytes of data every day. Data scientists and engineers collect this data from our subscribers and videos, and implement data analytics models to discover customer behaviour, with the goal of maximizing user joy. Usually, data scientists and engineers write Extract-Transform-Load (ETL) jobs and pipelines using big data compute technologies, like Spark or Presto, to process this data and periodically compute key information for a member or a video. The processed data is typically stored as data warehouse tables in AWS S3. Iceberg is widely adopted in Netflix as a data warehouse table format that addresses many of the usability and performance problems with Hive tables. At Netflix, we also heavily embrace a microservice architecture that emphasizes separation of concerns. Many of these services often have the requirement to do a fast lookup for this fine-grained data which is generated periodically. For example, in order to enhance…
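
As a stripped-down illustration of that warehouse-to-key-value movement (the problem Bulldozer automates at scale), the sketch below reads precomputed rows, derives a lookup key, and loads the values into a KV store; a plain dict stands in for the real store, and the rows and key scheme are invented.

```python
# Batch-load precomputed warehouse rows into a key-value store for fast lookups.
warehouse_rows = [
    {"member_id": "m1", "top_genres": ["drama", "sci-fi"]},
    {"member_id": "m2", "top_genres": ["comedy"]},
]

kv_store = {}  # stand-in for an online key-value store

for row in warehouse_rows:
    # Key design mirrors the online access pattern: one record per member.
    kv_store[f"member:{row['member_id']}:top_genres"] = row["top_genres"]

print(kv_store["member:m1:top_genres"])  # the fast online lookup path
```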