Posts

Showing posts from 2020

Dremio December 2020 released!

This month’s release delivers very useful features like Apache Arrow Flight with Python, full support for CDP 7.1, security enhancements for Oracle connections, a new support bundle and much more. This blog post highlights the following updates: Arrow Flight clients, the query support bundle, Kerberos support for Dremio-Oracle connections, and user/job metrics available in the UI. Continue reading >>>
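For readers who want to try the new Arrow Flight clients from Python, here is a minimal, illustrative sketch using pyarrow's Flight client. It is not taken from the post; the coordinator host, the Flight port (32010) and the credentials are placeholder assumptions.

```python
# pip install pyarrow
# Minimal sketch: fetch a query result from a Dremio coordinator whose Arrow Flight
# endpoint is assumed to listen on port 32010; credentials are placeholders.
import pyarrow.flight as flight

client = flight.FlightClient("grpc+tcp://localhost:32010")

# Exchange username/password for a bearer token to attach to each call.
bearer = client.authenticate_basic_token(b"dremio_user", b"dremio_password")
options = flight.FlightCallOptions(headers=[bearer])

# Ask the server how to fetch the result set for a SQL command...
query = "SELECT 1 AS sanity_check"
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)

# ...then stream the Arrow record batches back and materialize them.
reader = client.do_get(info.endpoints[0].ticket, options)
table = reader.read_all()
print(table.to_pandas())
```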

Gartner Predicts 2021: COVID-19 Drives Accelerated Shift to Digital and Commerce Model Evolution

According to Gartner: “By 2022, organizations using multiple go-to-market approaches for digital commerce will outperform noncommerce organizations by 30 percentage points in sales growth.” COVID-19 has forced many brands to accelerate their digital-first commerce strategies sooner than they had planned, in an effort to keep up with customer demands and drive revenue. Whether your brand was an early adopter or is still struggling to implement a robust digital commerce strategy, the Gartner Predicts 2021: COVID-19 Drives Accelerated Shift to Digital and Commerce Model Evolution report will help you prepare for 2021. We believe that in this report you will discover: 5 key digital commerce predictions for 2021 and beyond, the market implications of these predictions, and how you can embrace new market realities to propel your business. Get the report >>>

How DataOps Amplifies Data and Analytics Business Value

DataOps techniques can provide a more agile and collaborative approach to building and managing data pipelines. The pandemic has accelerated the need for data and analytics leaders to deliver data and analytics insight faster, with higher quality and resiliency in the face of constant change. Organizations need to make better-informed and faster decisions with a focus on automation, real-time risk assessment and mitigation, continuous value delivery and agility. The point of DataOps is to change how people collaborate around data and how it is used in the organization. As a result, data and analytics leaders are increasingly applying DataOps techniques that provide a more agile and collaborative approach to building and managing data pipelines. What is DataOps? Gartner defines DataOps as a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization.

6 Data Integration Tools Vendors to Watch in 2021

Solutions Review’s Data Integration Tools Vendors to Watch is an annual listing of solution providers we believe are worth monitoring. Companies are commonly included if they demonstrate a product roadmap aligning with our meta-analysis of the marketplace. Other criteria include recent and significant funding, talent acquisition, a disruptive or innovative new technology or product, or inclusion in a major analyst publication. Data integration tools vendors are increasingly being disrupted by cloud connectivity, self-service, and the encroachment of data management functionality. As data volumes grow, we expect to see a continued push by providers in this space to adopt core capabilities of horizontal technology sectors. Organizations are keen on adopting these changes as well, and continue to allocate resources toward the providers that can not only connect data lakes and Hadoop to their analytic frameworks, but cleanse, prepare, and govern data. The next generation of tools will offe

Data Mesh Principles and Logical Architecture v2

Our aspiration to augment and improve every aspect of business and life with data, demands a paradigm shift in how we manage data at scale. While the technology advances of the past decade have addressed the scale of volume of data and data processing compute, they have failed to address scale in other dimensions: changes in the data landscape, proliferation of sources of data, diversity of data use cases and users, and speed of response to change. Data mesh addresses these dimensions, founded in four principles: domain-oriented decentralized data ownership and architecture, data as a product, self-serve data infrastructure as a platform, and federated computational governance. Each principle drives a new logical view of the technical architecture and organizational structure. The original writeup, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh - which I encourage you to read before joining me back here - empathized with today’s pain points of architectural and or

Gamification in Technology Adoption

Adoption is the use of a new technology. Engagement is the amount of involvement with a technology. This small semantic difference is the key to unlocking the full potential of new applications. Some applications will have inherently higher engagement than others. For example, a frontline healthcare worker will have a high level of engagement with an electronic medical record because it contains essential information for treating patients. There are other technologies we introduce to make processes easier and faster, even if they are not required. For example, a data analyst may or may not choose to use a metadata management application to learn about the data they use every day. While using the application will make their work easier and faster, they can choose to do their work without it. Engagement is about utilization — increasing the likelihood that people will use the application. Continue reading >>>

Data Observability Ushers In A New Era Enabling Golden Age Of Data

Have we entered the Golden Age of Data? Modern enterprises are collecting, producing, and processing more data than ever before. According to a February 2020 IDG survey of data professionals, average corporate data volumes are increasing by 63% per month; 10% of respondents even reported that their data volumes double every month. Large companies are investing heavily to transform themselves into data-driven organizations that can quickly adapt to the fast pace of a modern economy. They gather huge amounts of data from customers and generate reams of data from transactions. They continuously process data in an attempt to personalize customer experiences, optimize business processes, and drive strategic decisions. The real challenge with data is that, in theory, breakthrough open-source technologies such as Spark, Kafka, and Druid are supposed to help just about any organization benefit from massive amounts of customer and operational data just like they benefit Facebook, Apple, Google, Microsoft…

Gartner - 2020 Magic Quadrant for Metadata Management Solutions

Metadata management is a core aspect of an organization’s ability to manage its data and information assets. The term “metadata” describes the various facets of an information asset that can improve its usability throughout its life cycle. Metadata and its uses go far beyond technical matters. Metadata is used as a reference for business-oriented and technical projects, and lays the foundations for describing, inventorying and understanding data for multiple use cases. Use-case examples include data governance, security and risk, data analysis and data value. The market for metadata management solutions is complex because these solutions are not all identical in scope or capability. Vendors include companies with one or more of the following functional capabilities in their stand-alone metadata management products (not all vendors offer all these capabilities, and not all vendor solutions offer these capabilities in one product): Metadata repositories — Used to document and manage meta

Nemo: Data discovery at Facebook

Large-scale companies serve millions or even billions of people who depend on the services these companies provide for their everyday needs. To keep these services running and delivering meaningful experiences, the teams behind them need to find the most relevant and accurate information quickly so that they can make informed decisions and take action. Finding the right information can be hard for several reasons. The problem might be discovery — the relevant table might have an obscure or nondescript name, or different teams might have constructed overlapping data sets. Or, the problem could be one of confidence — the dashboard someone is looking at might have been superseded by another source six months ago.  Many companies, such as Airbnb, Lyft, Netflix, and Uber, have built their own custom solutions for this challenge. For us, it was important to make the data discovery process simple and fast. Funneling everything through data experts to locate the necessary data each time we nee

10 Reasons to Choose Apache Pulsar Over Apache Kafka

Apache Pulsar's unique features such as tiered storage, stateless brokers, geo-aware replication, and multi-tenancy may be a reason to choose it over Apache Kafka. Today, many data architects, engineers, dev-ops, and business leaders are struggling to understand the pros and cons of Apache Pulsar and Apache Kafka. As someone who has worked with Kafka in the past, I wanted to compare these two technologies.  If you are looking for insights on when to use Pulsar, here are 10 advantages of the technology that might be the deciding factors for you. Continue reading >>>
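To make the comparison a little more concrete, this is roughly what producing and consuming a message looks like with the pulsar-client Python library; the broker URL, tenant, namespace, topic and subscription names below are placeholders, and the topic URI shows the tenant/namespace addressing behind Pulsar's multi-tenancy.

```python
# pip install pulsar-client
# Minimal sketch; broker address, tenant, namespace, topic and subscription
# names are all placeholders.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Topics are addressed as persistent://<tenant>/<namespace>/<topic>,
# which is where Pulsar's built-in multi-tenancy shows up.
topic = "persistent://acme/analytics/page-views"

producer = client.create_producer(topic)
producer.send(b"user-42 viewed /pricing")

consumer = client.subscribe(topic, subscription_name="reporting")
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)  # ack so the broker can delete or offload the message

client.close()
```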

The State of Open-Source Data Integration and ETL

Open-source data integration started 16 years ago with Talend. Since then, the whole industry has changed. Let's compare the different actors. Open-source data integration is not new. It started 16 years ago with Talend. But since then, the whole industry has changed. The likes of Snowflake, BigQuery, and Redshift have changed how data is being hosted, managed, and accessed, while making it easier and a lot cheaper. But the data integration industry has evolved as well. On one hand, new open-source projects emerged, such as Singer.io in 2017. This enabled more data integration connectors to become accessible to more teams, even though it still required a significant amount of manual work. On the other hand, data integration was made accessible to more teams (analysts, scientists, business intelligence teams). Indeed, companies like Fivetran benefited from Snowflake’s rise, empowering non-engineering teams to set up and manage their data integration connectors by themselves, so they can…

Gartner - Information as a Second Language: Enabling Data Literacy for Digital Society

Digital society expects its citizens to “speak data.” Unless data and analytics leaders treat information as the new second language of business, government and communities, they will not be able to deliver the competitive advantage and agility demanded by their enterprises. Key Challenges: ■ Poor data literacy is the second highest inhibitor to progress, as reported by respondents to Gartner's third annual Chief Data Officer Survey, behind culture change and just ahead of lack of talent and skills. ■ An information language barrier exists across business units and IT functions, rooted in ineffective communication across a wide range of diverse stakeholders. As a result, data and analytics leaders struggle to get their message across and information assets go underutilized. ■ Although academic and professional programs are beginning to address the disparity in talent and skills, in many cases they reinforce the information language barrier with narrow content focus, bias toward too…

Barriers to Effective Information Asset Management (Research)

In the knowledge-based economy the wealth-creating capacity of organisations is no longer based on tangible assets such as buildings, equipment, and vehicles alone. Intangible assets are key contributors to securing sustainable competitive advantage. It is therefore critically important that intangible Information Assets (IA) such as data, documents, content on web sites, and knowledge are understood and well managed. The sound management of these assets allows an organisation to run faster and better, resulting in products and services that are of a higher quality at a lower cost with the benefits of reduced risk, improved competitive position, and higher return on investment. The initial stage of this research found that executive level managers acknowledge the existence and importance of Information Assets in their organisations, but that hardly any mechanisms are in place for the management and governance of these valuable assets. This paper discusses the reasons for this situation

AIOps Platforms (Gartner)

AIOps is an emerging technology and addresses something I’m a big fan of – improving IT Operations. So I asked fellow Gartner analyst Colin Fletcher for a guest blog on the topic… Roughly three years ago, it was looking like we were going to see many enterprise IT operations leaders put themselves in the precarious role of “the cobbler’s children” by forgoing investment in Artificial Intelligence (AI) to help them do their work better, faster, and cheaper. We were hearing from many IT ops leaders who were building incredibly sophisticated Big Data and Advanced Analytics systems for business stakeholders, but were themselves using rudimentary, reactive red/yellow/green lights and manual steps to help run the infrastructure required to keep those same systems up and running. Further, we’re all now familiar in our personal lives with dynamic recommendations from online retailers, search providers, virtual personal assistants, and entertainment services. Talk about a paradox! Now I wouldn’t say…

The Forrester Wave™: Machine Learning Data Catalogs, Q4 2020

Key Takeaways: Alation, Collibra, Alex Solutions, and IBM Lead the Pack. Forrester’s research uncovered a market in which Alation, Collibra, Alex Solutions, and IBM are Leaders; data.world, Informatica, Io-Tahoe, and Hitachi Vantara are Strong Performers; and Infogix and erwin are Contenders. Collaboration, Lineage, and Data Variety Are Key Differentiators. As metadata and business glossary technology becomes outdated and less effective, improved machine learning will dictate which providers lead the pack. Vendors that can provide scale-out collaboration, offer detailed data lineage, and interpret any type of data will position themselves to successfully deliver contextualized, trusted, accessible data to their customers. Read full report >>>

Dremio 4.8 is released

Today we are excited to announce the release of Dremio 4.8! This month’s release delivers multiple features such as external query, a new authorization service API, AWS Edition enhancements and more. This blog post highlights the following updates: external query, default reflections, runtime filtering GA, documented JMX metrics with sample exporters, the ability to customize projects in Dremio AWS Edition, and support for Dremio AWS Edition deployments without public IP addresses. Read full article >>>

The 2020 Forbes Cloud 100

Once an outsider category, cloud computing now powers every industry. Look no further than this year’s Forbes Cloud 100 list, the annual ranking of the world’s top private cloud companies, where this year's standouts are keeping businesses surviving—and thriving—from real estate to retail, data to design. Produced for the fifth consecutive year in partnership with Bessemer Venture Partners and Salesforce Ventures, the Cloud 100 recognizes standouts in tech’s hottest category, from small startups to private-equity-backed giants, from Silicon Valley to Australia and Hong Kong. The companies on the list are selected for their growth, sales, valuation and culture, as well as a reputation score derived in consultation with 43 CEO judges and executives from their public-cloud-company peers. This year’s new No. 1 has set a record for the shortest time running atop the list: database leader Snowflake takes the top slot, up from No. 2 last year and just hours before graduating from the list by going public…

Only 3% of Companies’ Data Meets Basic Quality Standards

Our analyses confirm that data is in far worse shape than most managers realize — and than we feared — and carry enormous implications for managers everywhere: On average, 47% of newly created data records have at least one critical (e.g., work-impacting) error. A full quarter of the scores in our sample are below 30% and half are below 57%. In today’s business world, work and data are inextricably tied to one another. No manager can claim that his area is functioning properly in the face of data quality issues. It is hard to see how businesses can survive, never mind thrive, under such conditions. Only 3% of the DQ scores in our study can be rated “acceptable” using the loosest-possible standard. We often ask managers (both in these classes and in consulting engagements) how good their data needs to be. While a fine-grained answer depends on their uses of the data, how much an error costs them, and other company- and department-specific considerations, none has ever thought a score…
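As a rough illustration of the kind of scoring the study describes (the share of recently created records with no critical, work-impacting error), here is a hypothetical sketch; the sample records and the error rules are invented for the example, not taken from the article.

```python
# Hypothetical sketch of a simple data quality (DQ) score: the percentage of
# sampled records that contain no critical, work-impacting error.
import pandas as pd

records = pd.DataFrame({
    "customer_id": [101, None, 103, 104],
    "email": ["a@example.com", "b@example.com", "not-an-email", "d@example.com"],
    "order_total": [25.0, 40.0, -5.0, 12.5],
})

def critical_errors(row) -> int:
    """Count work-impacting problems in one record (placeholder rules)."""
    errors = 0
    if pd.isna(row["customer_id"]):
        errors += 1
    if "@" not in str(row["email"]):
        errors += 1
    if pd.isna(row["order_total"]) or row["order_total"] < 0:
        errors += 1
    return errors

error_counts = records.apply(critical_errors, axis=1)
dq_score = (error_counts == 0).mean() * 100  # records with zero critical errors
print(f"DQ score: {dq_score:.0f}/100")       # here: 50/100 (2 of 4 clean records)
```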

Quantum computing: an illustrated guide

At the smallest scales in the universe, at the level of an atom, the laws of physics are weird. You can know precisely where something like an electron is, but not how fast it is going. If you know exactly how fast it is going, you cannot know where it is. As for location, an electron could be in many places at once, each with a different probability. Describing this is the job of quantum physics. Quantum physics works together with computer science to make a new type of computer called a quantum computer. It uses quantum weirdness to solve problems we have not been able to solve with supercomputers. It can crack codes way faster than supercomputers. It might even help us build better drugs and materials. Why are some problems harder to solve than others? Who thought of making a quantum computer and who is making it? Are they very different to the ones we use now? Are there things it cannot do? Take a visual tour of its evolution: the people, the physics and a flavour of how we might p

The unreasonable importance of data preparation

We know data preparation requires a ton of work and thought. In this provocative article, Hugo Bowne-Anderson provides a formal rationale for why that work matters, why data preparation is particularly important for reanalyzing data, and why you should stay focused on the question you hope to answer. Along the way, Hugo introduces how tools and automation can help augment analysts and better enable real-time models. In a world focused on buzzword-driven models and algorithms, you’d be forgiven for forgetting about the unreasonable importance of data preparation and quality: your models are only as good as the data you feed them. This is the garbage in, garbage out principle: flawed data going in leads to flawed results, algorithms, and business decisions. If a self-driving car’s decision-making algorithm is trained on data of traffic collected during the day, you wouldn’t put it on the roads at night. To take it a step further, if such an algorithm is trained in an environment with car
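As a small, generic illustration of the preparation work the article is talking about (the dataset and the cleaning rules here are invented, not from the article), a few routine steps in pandas:

```python
# Invented example of routine data preparation: parse types, drop exact
# duplicates, and handle missing or garbled values before any modeling happens.
import pandas as pd

raw = pd.DataFrame({
    "ts": ["2020-03-01", "2020-03-01", "2020-03-02", None],
    "sensor": ["a", "a", "b", "b"],
    "reading": ["1.5", "1.5", "oops", "2.0"],
})

prepared = (
    raw.assign(
        ts=pd.to_datetime(raw["ts"], errors="coerce"),           # parse timestamps
        reading=pd.to_numeric(raw["reading"], errors="coerce"),  # garbage -> NaN
    )
    .drop_duplicates()        # exact duplicate rows add no information
    .dropna(subset=["ts"])    # rows without a timestamp are unusable here
)
prepared["reading"] = prepared["reading"].fillna(prepared["reading"].median())
print(prepared)
```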

Shopify's approach to data discovery

Humans generate a lot of data. Every two days we create as much data as we did from the beginning of time until 2003! The International Data Corporation estimates the global datasphere totaled 33 zettabytes (one trillion gigabytes) in 2018. The estimate for 2025 is 175 ZBs, an increase of 430%. This growth is challenging organizations across all industries to rethink their data pipelines. The nature of data usage is problem driven, meaning data assets (tables, reports, dashboards, etc.) are aggregated from underlying data assets to help decision making about a particular business problem, feed a machine learning algorithm, or serve as an input to another data asset. This process is repeated multiple times, sometimes for the same problems, and results in a large number of data assets serving a wide variety of purposes. Data discovery and management is the practice of cataloguing these data assets and all of the applicable metadata that saves time for data professionals, increasing data

Open sourcing DataHub: LinkedIn’s metadata search and discovery platform

Finding the right data quickly is critical for any company that relies on big data insights to make data-driven decisions. Not only does this impact the productivity of data users (including analysts, machine learning developers, data scientists, and data engineers), but it also has a direct impact on end products that rely on a quality machine learning (ML) pipeline. Additionally, the trend towards adopting or building ML platforms naturally begs the question: what is your method for internal discovery of ML features, models, metrics, datasets, etc.? In this blog post, we will share the journey of open sourcing DataHub, our metadata search and discovery platform, starting with the project’s early days as WhereHows. LinkedIn maintains an in-house version of DataHub separate from the open source version. We will start by explaining why we need two separate development environments, followed by a discussion on the early approaches for open sourcing WhereHows, and a comparison of our inte

The Ins and Outs of Data Acquisition: Beliefs and Best Practices

Data acquisition involves the set of activities that are required to qualify and obtain external data — and also data that may be available elsewhere in an organization — and then to arrange for it to be brought into or accessed by the company. This strategy is on the rise as organizations leverage this data to get access to prospects, learn information about customers they already work with, be more competitive, develop new products and more. Standardizing data acquisition can be an afterthought at the tail-end of a costly data journey — if it’s considered at all. Now we see companies starting to pay more attention to the finer, critical points of data acquisition as the need for more data grows. And this is a good thing because ignoring acquisition best practices and proper oversight will lead to a whole host of problems that can outweigh the benefits of bringing in new data. The costly, problematic issues we see organizations grapple with include: Data purchases that, once brought i

Technical Guide to Ocean Compute-to-Data

With the v2 Compute-to-Data release, Ocean Protocol provides a means to exchange data while preserving privacy. This guide explains Compute-to-Data without requiring deep technical know-how. Private data is data that people or organizations keep to themselves. It can mean any personal, personally identifiable, medical, lifestyle, financial, sensitive or regulated information. Benefits of Private Data. Private data can help research, leading to life-altering innovations in science and technology. For example, more data improves the predictive accuracy of modern Artificial Intelligence (AI) models. Private data is often considered the most valuable data because it’s so hard to get at, and using it can lead to potentially big payoffs. Risks of Private Data. Sharing or selling private data comes with risk. What if you don’t get hired because of your private medical history? What if you are persecuted for private lifestyle choices? Large organizations that have massive datasets know their d

Diagram as Code

Diagrams lets you draw the cloud system architecture in Python code. It was born for prototyping a new system architecture without any design tools. You can also describe or visualize the existing system architecture as well. Diagram as Code allows you to track the architecture diagram changes in any version control system. Diagrams currently supports six major providers: AWS, Azure, GCP, Kubernetes, Alibaba Cloud and Oracle Cloud. It now also supports On-Premise nodes. diagrams.mingrammer.com >>>
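For a sense of the API, here is a quick-start style sketch with the diagrams package (Graphviz must also be installed; the node choices are just an example):

```python
# pip install diagrams   (Graphviz is required as well)
# Renders web_service.png in the working directory; show=False skips opening a viewer.
from diagrams import Diagram
from diagrams.aws.compute import EC2
from diagrams.aws.database import RDS
from diagrams.aws.network import ELB

with Diagram("Web Service", show=False):
    ELB("lb") >> EC2("web") >> RDS("userdb")
```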

Hitchhiker’s Guide to Ocean Protocol

This guide will help you understand what Ocean Protocol is and how it works. It will be useful for developers who want a simple introduction to Ocean architecture and understand how data exchange works in Ocean. Finally, it details the components of our tech stack. After reading, you will understand the various actors and components in the Ocean Ecosystem, their roles and responsibilities — the perfect preface to dive further into our Documentation. Read guide >>>

Project Hop - Exploring the future of data integration

Project Hop was announced at KCM19 back in November 2019, and the first preview release has been available since April 10th. We’ve been posting about it on our social media accounts, but what exactly is Project Hop? Let’s explore the project in a bit more detail. In this post, we'll have a look at what Project Hop is, why the project was started and why know.bi wants to go all in on it. What is Project Hop? As the project’s tagline says, Project Hop intends to explore the future of data integration. We take that quite literally. We’ve seen massive changes in the data processing landscape over the last decade (the rise and fall of the Hadoop ecosystem, just to name one). All of these changes need to be supported and integrated into your data engineering and data processing systems. Apart from these purely technical challenges, the data processing life cycle has become a software life cycle. Robust and reliable data processing requires testing, a fast and flexible deployment proc…

Swarm64: Open source PostgreSQL on steroids

PostgreSQL is a big deal. The most common SQL open source database that you have never heard of, as ZDNet's own Tony Baer called it. Besides being the framework on which a number of commercial offerings were built, PostgreSQL has a user base of its own. According to DB Engines, PostgreSQL is the 4th most popular database in the world. Swarm64, on the other hand, is a small vendor. So small, actually, that we have shared the stage with CEO Thomas Richter in a local Berlin Meetup a few years back. Back then, Richter was not CEO, and Swarm64 was even smaller. But its value proposition still sounded attractive: boost PostgreSQL's performance for free. Swarm64 is an acceleration layer for PostgreSQL. There's no such thing as a free lunch of course, so the "for free" part is a figure of speech. Swarm64 is a commercial vendor. Until recently, however, the real gotcha was hardware: Swarm64 Database Acceleration (DA) required a specialized chip called FPGA to be able

14 ways AWS beats Microsoft Azure and Google Cloud

Microsoft Azure and Google Cloud have their advantages, but they don’t match the breadth and depth of the Amazon cloud. The reason is simple: AWS has built out so many products and services that it’s impossible to begin to discuss them in a single article or even a book. Many of them were amazing innovations when they first appeared, and the hits keep coming. Every year Amazon adds new tools that make it harder and harder to justify keeping those old boxes pumping out heat and overstressing the air conditioner in the server room down the hall. For all of its dominance, though, Amazon has strong competitors. Companies like Microsoft, Google, IBM, Oracle, SAP, Rackspace, Linode, and Digital Ocean know that they must establish a real presence in the cloud, and they are finding clever ways to compete and excel in what is less and less a commodity business. These rivals offer great products with different and sometimes better approaches. In many cases, they’re running neck and neck wi…

IEEE DataPort

IEEE DataPort™ is a valuable and easily accessible data platform that enables users to store, search, access and manage data. The data platform is designed to accept all formats and sizes of datasets (up to 2TB), and it provides both downloading capabilities and access to datasets in the Cloud. IEEE DataPort™ is a universally accessible web-based portal that serves four primary purposes: enable individuals and institutions to indefinitely store and make datasets easily accessible to a broad set of researchers, engineers and industry; enable researchers, engineers and industry to gain access to datasets that can be analyzed to advance technology; facilitate data analysis by enabling access to data in the AWS Cloud and by enabling the downloading of datasets; and support reproducible research. IEEE DataPort™ is an online data platform created and supported by IEEE, and it supports IEEE’s overall mission of Advancing Technology for Humanity. IEEE DataPort >>>

ETL and How it Changed Over Time

Modern data and its usage have drastically changed compared to a decade ago, and there is a gap caused by traditional ETL processes when processing modern data. The following are some of the main reasons for this: Modern data processes often include real-time streaming data, and organizations need real-time insights into processes. Systems need to perform ETL on data streams without using batch processing, and they should handle high data rates by scaling the system. Some single-server databases are now replaced by distributed data platforms (e.g., Cassandra, MongoDB, Elasticsearch, SaaS apps), message brokers (e.g., Kafka, ActiveMQ, etc.) and several other types of endpoints. The system should have the capability to plug in additional sources or sinks to connect on the go in a manageable way. Repeated data processing due to ad hoc architecture has to be eliminated. Change data capture technologies used with traditional ETL have to be integrated to…
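To make the streaming point concrete, here is a minimal, illustrative sketch of a transform applied record by record as events arrive, using the kafka-python client; the broker address, topic names and the derived field are all invented for the example.

```python
# pip install kafka-python
# Illustrative only: broker, topics and record fields are invented placeholders.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "orders_raw",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Transform each event as it arrives instead of waiting for a nightly batch job.
for msg in consumer:
    order = msg.value
    order["total_usd"] = round(order["amount"] * order.get("fx_rate", 1.0), 2)
    producer.send("orders_clean", order)
```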