Posts

The unreasonable importance of data preparation

We know data preparation requires a ton of work and thought. In this provocative article, Hugo Bowne-Anderson provides a formal rationale for why that work matters, why data preparation is particularly important for reanalyzing data, and why you should stay focused on the question you hope to answer. Along the way, Hugo introduces how tools and automation can help augment analysts and better enable real-time models. In a world focused on buzzword-driven models and algorithms, you’d be forgiven for forgetting about the unreasonable importance of data preparation and quality: your models are only as good as the data you feed them. This is the garbage in, garbage out principle: flawed data going in leads to flawed results, algorithms, and business decisions. If a self-driving car’s decision-making algorithm is trained on traffic data collected during the day, you wouldn’t put it on the roads at night. To take it a step further, if such an algorithm is trained in an environment with car...
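As a minimal sketch of the kind of preparation the article has in mind (the column names, values, and thresholds below are hypothetical, not taken from the article), a few pandas checks can catch obviously flawed rows before they ever reach a model:

```python
import pandas as pd

# Toy stand-in for a raw sensor log; in practice this would come from a real source.
df = pd.DataFrame({
    "speed_kmh": [42.0, 42.0, -5.0, None, 130.0],
    "ambient_light_lux": [12000.0, 12000.0, 8000.0, 5.0, 3.0],
})

# Drop exact duplicates and rows missing values the model depends on.
df = df.drop_duplicates()
df = df.dropna(subset=["speed_kmh", "ambient_light_lux"])

# Remove physically implausible readings rather than letting them skew training.
df = df[(df["speed_kmh"] >= 0) & (df["speed_kmh"] <= 250)]

# Check whether the data actually covers night-time driving before trusting the model at night.
night_share = (df["ambient_light_lux"] < 10).mean()
print(f"Share of night-time samples: {night_share:.1%}")
```

Checks like the last one speak directly to the self-driving example: if almost no night-time samples survive preparation, the model has no business making decisions at night.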

Shopify's approach to data discovery

Humans generate a lot of data. Every two days we create as much data as we did from the beginning of time until 2003! The International Data Corporation estimates the global datasphere totaled 33 zettabytes in 2018 (a zettabyte is one trillion gigabytes). The estimate for 2025 is 175 ZBs, an increase of 430%. This growth is challenging organizations across all industries to rethink their data pipelines. The nature of data usage is problem-driven, meaning data assets (tables, reports, dashboards, etc.) are aggregated from underlying data assets to support decision making about a particular business problem, feed a machine learning algorithm, or serve as an input to another data asset. This process is repeated multiple times, sometimes for the same problems, and results in a large number of data assets serving a wide variety of purposes. Data discovery and management is the practice of cataloguing these data assets and all of the applicable metadata that saves time for data professionals, increasing data ...
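One way to picture what such a catalogue tracks (the field names below are illustrative, not Shopify's actual schema) is a structured record per data asset, with lineage and tags that make the asset searchable:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataAsset:
    """One entry in a simple data catalogue; fields are illustrative only."""
    name: str            # e.g. "orders_daily_summary"
    asset_type: str      # table, report, dashboard, ...
    owner: str           # team or person accountable for the asset
    description: str     # what question the asset answers
    upstream_assets: List[str] = field(default_factory=list)  # lineage: what it is built from
    tags: List[str] = field(default_factory=list)             # keywords for search

catalogue: List[DataAsset] = [
    DataAsset(
        name="orders_daily_summary",
        asset_type="table",
        owner="finance-analytics",
        description="Daily order counts and revenue, aggregated from raw orders.",
        upstream_assets=["raw.orders"],
        tags=["orders", "revenue"],
    ),
]

# Discovery then reduces to searching names, descriptions, and tags.
hits = [a for a in catalogue if "revenue" in a.tags or "revenue" in a.description.lower()]
print([a.name for a in hits])
```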

Open sourcing DataHub: LinkedIn’s metadata search and discovery platform

Finding the right data quickly is critical for any company that relies on big data insights to make data-driven decisions. Not only does this impact the productivity of data users (including analysts, machine learning developers, data scientists, and data engineers), but it also has a direct impact on end products that rely on a quality machine learning (ML) pipeline. Additionally, the trend towards adopting or building ML platforms naturally begs the question: what is your method for internal discovery of ML features, models, metrics, datasets, etc.? In this blog post, we will share the journey of open sourcing DataHub, our metadata search and discovery platform, starting with the project’s early days as WhereHows. LinkedIn maintains an in-house version of DataHub separate from the open source version. We will start by explaining why we need two separate development environments, followed by a discussion on the early approaches for open sourcing WhereHows, and a comparison of our inte...

The Ins and Outs of Data Acquisition: Beliefs and Best Practices

Data acquisition involves the set of activities that are required to qualify and obtain external data — and also data that may be available elsewhere in an organization — and then to arrange for it to be brought into or accessed by the company. This strategy is on the rise as organizations leverage this data to get access to prospects, learn information about customers they already work with, be more competitive, develop new products and more. Standardizing data acquisition can be an afterthought at the tail-end of a costly data journey — if it’s considered at all. Now we see companies starting to pay more attention to the finer, critical points of data acquisition as the need for more data grows. And this is a good thing because ignoring acquisition best practices and proper oversight will lead to a whole host of problems that can outweigh the benefits of bringing in new data. The costly, problematic issues we see organizations grapple with include: Data purchases that, once brought i...

Technical Guide to Ocean Compute-to-Data

With the v2 Compute-to-Data release, Ocean Protocol provides a means to exchange data while preserving privacy. This guide explains Compute-to-Data without requiring deep technical know-how. Private data is data that people or organizations keep to themselves. It can mean any personal, personally identifiable, medical, lifestyle, financial, sensitive or regulated information. Benefits of private data: private data can help research, leading to life-altering innovations in science and technology. For example, more data improves the predictive accuracy of modern Artificial Intelligence (AI) models. Private data is often considered the most valuable data because it’s so hard to get at, and using it can lead to potentially big payoffs. Risks of private data: sharing or selling private data comes with risk. What if you don’t get hired because of your private medical history? What if you are persecuted for private lifestyle choices? Large organizations that have massive datasets know their d...
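The core pattern is easier to see in a toy sketch (this illustrates the general idea only, not Ocean's actual API): the algorithm travels to where the private data lives, and only aggregate results travel back:

```python
from statistics import mean

# The owner's private records never leave this module's scope.
_PRIVATE_RECORDS = [
    {"age": 34, "outcome": 1},
    {"age": 51, "outcome": 0},
    {"age": 29, "outcome": 1},
]

def run_compute_job(algorithm):
    """Run a consumer-supplied algorithm next to the data; return only its result."""
    return algorithm(_PRIVATE_RECORDS)

# The consumer only ever sees the aggregate, never the individual rows.
average_age = run_compute_job(lambda records: mean(r["age"] for r in records))
print(average_age)
```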

Diagram as Code

Diagrams lets you draw the cloud system architecture in Python code. It was born for prototyping a new system architecture without any design tools. You can also describe or visualize an existing system architecture. Diagram as Code allows you to track architecture diagram changes in any version control system. Diagrams currently supports six major providers: AWS, Azure, GCP, Kubernetes, Alibaba Cloud and Oracle Cloud. It now also supports On-Premise nodes. diagrams.mingrammer.com >>>
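A minimal example in the spirit of the project's quickstart (it assumes the diagrams package and Graphviz are installed) renders a three-node AWS architecture to a PNG:

```python
from diagrams import Diagram
from diagrams.aws.compute import EC2
from diagrams.aws.database import RDS
from diagrams.aws.network import ELB

# Writes web_service.png to the working directory; show=False skips opening a viewer.
with Diagram("Web Service", show=False):
    ELB("lb") >> EC2("web") >> RDS("userdb")
```

Because the diagram is just Python, renaming a node or adding a provider is an ordinary code change that shows up in version control diffs.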

Hitchhiker’s Guide to Ocean Protocol

This guide will help you understand what Ocean Protocol is and how it works. It will be useful for developers who want a simple introduction to the Ocean architecture and to how data exchange works in Ocean. Finally, it details the components of our tech stack. After reading, you will understand the various actors and components in the Ocean Ecosystem, along with their roles and responsibilities: the perfect preface to dive further into our Documentation. Read guide >>>

Project Hop - Exploring the future of data integration

Project Hop was announced at KCM19 back in November 2019. The first preview release has been available since April 10th. We’ve been posting about it on our social media accounts, but what exactly is Project Hop? Let’s explore the project in a bit more detail. In this post, we'll have a look at what Project Hop is, why the project was started and why know.bi wants to go all in on it. What is Project Hop? As the project’s tagline says, Project Hop intends to explore the future of data integration. We take that quite literally. We’ve seen massive changes in the data processing landscape over the last decade (the rise and fall of the Hadoop ecosystem, just to name one). All of these changes need to be supported and integrated into your data engineering and data processing systems. Apart from these purely technical challenges, the data processing life cycle has become a software life cycle. Robust and reliable data processing requires testing, a fast and flexible deployment...

Swarm64: Open source PostgreSQL on steroids

PostgreSQL is a big deal. The most common SQL open source database that you have never heard of, as ZDNet's own Tony Baer called it. Besides being the framework on which a number of commercial offerings were built, PostgreSQL has a user base of its own. According to DB Engines, PostgreSQL is the 4th most popular database in the world. Swarm64, on the other hand, is a small vendor. So small, actually, that we shared the stage with CEO Thomas Richter at a local Berlin Meetup a few years back. Back then, Richter was not CEO, and Swarm64 was even smaller. But its value proposition still sounded attractive: boost PostgreSQL's performance for free. Swarm64 is an acceleration layer for PostgreSQL. There's no such thing as a free lunch of course, so the "for free" part is a figure of speech. Swarm64 is a commercial vendor. Until recently, however, the real gotcha was hardware: Swarm64 Database Acceleration (DA) required a specialized chip called an FPGA to be able ...

14 ways AWS beats Microsoft Azure and Google Cloud

Microsoft Azure and Google Cloud have their advantages, but they don’t match the breadth and depth of the Amazon cloud. The reason is simple: AWS has built out so many products and services that it’s impossible to begin to discuss them all in a single article or even a book. Many of them were amazing innovations when they first appeared, and the hits keep coming. Every year Amazon adds new tools that make it harder and harder to justify keeping those old boxes pumping out heat and overstressing the air conditioner in the server room down the hall. For all of its dominance, though, Amazon has strong competitors. Companies like Microsoft, Google, IBM, Oracle, SAP, Rackspace, Linode, and Digital Ocean know that they must establish a real presence in the cloud, and they are finding clever ways to compete and excel in what is less and less a commodity business. These rivals offer great products with different and sometimes better approaches. In many cases, they’re running neck and neck wi...