Posts

Showing posts from 2019

2019 Was the Year Data Visualization Hit the Mainstream

There’s always something going on in the field of data visualization, but until recently it was only something that people in the field noticed. To the outside world, beyond perhaps an occasional Amazing Map®, Tufte workshop, or funny pie chart, these trends are invisible. Not so in 2019, when data visualization featured prominently in major news stories and key players in the field created work that didn’t just do well on Dataviz Twitter but resonated far beyond it. 2019 saw the United States President amend a data visualization product with a Sharpie. That should have been enough to make 2019 special, but the year also saw the introduction of a data visualization-focused fashion line, a touching book that uses data visualization to express some of the anxieties and feelings we all struggle with, and the creation of the first holistic professional society focused on data visualization. Original Article >>>

Why Not Airflow? An overview of the Prefect engine for Airflow users

Airflow is a historically important tool in the data engineering ecosystem. It introduced the ability to combine a strict Directed Acyclic Graph (DAG) model with Pythonic flexibility in a way that made it appropriate for a wide variety of use cases. However, Airflow’s applicability is limited by its legacy as a monolithic batch scheduler aimed at data engineers principally concerned with orchestrating third-party systems employed by others in their organizations. Today, many data engineers are working more directly with their analytical counterparts. Compute and storage are cheap, so friction is low and experimentation prevails. Processes are fast, dynamic, and unpredictable. Airflow got many things right, but its core assumptions never anticipated the rich variety of data applications that has emerged. It simply does not have the requisite vocabulary to describe many of those activities. The seed that would grow into Prefect was first planted all the way back in 2016, in a series…
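To make the contrast concrete, here is a minimal sketch of a Prefect flow using the Python API Prefect shipped around 2019 (task, Flow, flow.run()); the task names and logic are illustrative, not taken from the article.

```python
# Minimal sketch of a Prefect flow (circa-2019 API); task names and
# logic are illustrative placeholders, not the article's example.
from prefect import task, Flow

@task
def extract():
    # Pretend this pulls rows from an upstream system.
    return [1, 2, 3]

@task
def transform(rows):
    return [r * 10 for r in rows]

@task
def load(rows):
    print(f"loaded {len(rows)} rows")

with Flow("etl-example") as flow:
    # Dependencies are inferred from the Python call graph,
    # rather than declared up front as a static DAG.
    load(transform(extract()))

if __name__ == "__main__":
    flow.run()
```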

Researchers love PyTorch and TensorFlow

In a recent survey—AI Adoption in the Enterprise, which drew more than 1,300 respondents—we found significant usage of several machine learning (ML) libraries and frameworks. About half indicated they used TensorFlow or scikit-learn, and a third reported they were using PyTorch or Keras. I recently attended an interesting RISELab presentation delivered by Caroline Lemieux describing recent work on AutoPandas and automation tools that rely on program synthesis. In the course of her presentation, Lemieux reviewed usage statistics they had gathered on different deep learning frameworks and data science libraries. She kindly shared some of that data with me, which I used to draw a chart. The numbers are based on simple full-text searches of papers posted on the popular e-print service arXiv.org. Specifically, they reflect the number of papers that mention (in a full-text search) each of the frameworks. Using this metric, the two most popular deep learning frameworks among researchers…
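The counting method described is simple full-text matching; the sketch below shows what such a count might look like over a local directory of downloaded paper texts. The directory name and framework list are illustrative assumptions, not the authors' actual tooling.

```python
# Hypothetical sketch of the counting method described above: given a
# directory of paper full texts, count how many papers mention each framework.
from pathlib import Path

FRAMEWORKS = ["tensorflow", "pytorch", "keras", "scikit-learn"]

def count_mentions(paper_dir: str) -> dict:
    counts = {name: 0 for name in FRAMEWORKS}
    for paper in Path(paper_dir).glob("*.txt"):
        text = paper.read_text(errors="ignore").lower()
        for name in FRAMEWORKS:
            if name in text:          # one hit per paper, not per occurrence
                counts[name] += 1
    return counts

print(count_mentions("./arxiv_fulltext"))
```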

Operating a Large, Distributed System in a Reliable Way

"The article is the collection of the practices I've found useful to reliably operate a large system at Uber, while working here. My experience is not unique - people working on similar sized systems go through a similar journey. I've talked with engineers at Google, Facebook, and Netflix, who shared similar experiences and solutions. Many of the ideas and processes listed here should apply to systems of similar scale, regardless of running on own data centers (like Uber mostly does) or on the cloud (where Uber sometimes scales to). However, the practices might be an overkill for smaller or less mission-critical systems." There's much ground to cover: Monitoring Oncall, Anomaly Detection & Alerting Outages & Incident Management Processes Postmortems, Incident Reviews & a Culture of Ongoing Improvements Failover Drills, Capacity Planning & Blackbox Testing SLOs, SLAs & Reporting on Them SRE as an Independent Team Reliability as an

The Apache Spark 3.0 Preview is here!

Preview release of Spark 3.0: To enable wide-scale community testing of the upcoming Spark 3.0 release, the Apache Spark community has posted a preview release of Spark 3.0. This preview is not a stable release in terms of either API or functionality, but it is meant to give the community early access to try the code that will become Spark 3.0. If you would like to test the release, please download it and send feedback using either the mailing lists or JIRA. The Spark issue tracker already contains a list of features in 3.0.

Azure SQL Data Warehouse is now Azure Synapse Analytics

On November fourth, we announced Azure Synapse Analytics, the next evolution of Azure SQL Data Warehouse. Azure Synapse is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources—at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. With Azure Synapse, data professionals can query both relational and non-relational data using the familiar SQL language. This can be done using either serverless on-demand queries for data exploration and ad hoc analysis or provisioned resources for your most demanding data warehousing needs. A single service for any workload. In fact, it’s the first and only analytics system to have run all the TPC-H queries at petabyte-scale. For current SQL Data Warehouse customers…

Neo4j Aura: A New Graph Database as a Service

Aura is an entirely new, built-from-the-ground-up, multi-tenant graph DBaaS based on Neo4j. It lets any developer take advantage of the best graph database in the world via a frictionless service in the cloud. When we began building Neo4j all those years ago, we wanted to give developers a database that was very powerful, flexible… and accessible to all. We believed that open source was the best way to bring this product to developers worldwide. Since then, the vast majority of our paying customers have started out with data practitioners leading the way. Individual developers downloaded Neo4j, experimented with it, and realized graphs were an ideal way to model and traverse connected data. However, only a few of those developers had direct access to a budget to make the leap to our Enterprise Edition. Neo4j Aura bridges that gap for individuals, small teams and established startups. I believe this is the next logical step in Neo4j’s vision to help the world make sense of data…

CSA - Cloud Controls Matrix v3

The CCM is the only meta-framework of cloud-specific security controls, mapped to leading standards, best practices, and regulations. CCM provides organizations with the needed structure, detail, and clarity relating to information security, tailored to cloud computing. CCM is currently considered a de facto standard for cloud security assurance and compliance. Download >>>

Q3 2019 BARC - Data & Analytics market update (by Carsten Bange)

This quarter: Investments - a record-breaking third quarter for investments in data & analytics companies. M&A - the Hadoop market consolidates quickly, and data science/AI companies add portfolio capabilities by acquisition. B2B software brand Idera has bought WhereScape. WhereScape develops and markets automation software for modern data warehouses deployed in the cloud or on premises. Idera, a parent company for several database, development, and testing software companies, announced it will integrate WhereScape into its Database Tools unit. Other software providers in the same Idera unit are AquaFold, featuring an IDE for visual database queries, and Webyog, featuring MySQL monitoring and management tools. With the acquisition of WhereScape, Idera improves its capabilities for empowering data professionals in DevOps use cases in complex data environments. WhereScape was a very visible player in the Data & Analytics ecosystem. It will be interesting to watch whether Idera…

Beast: Moving Data from Kafka to BigQuery

In order to serve customers across 19+ products, GOJEK places a lot of emphasis on data. Our Data Warehouse, built by integrating data from multiple applications and sources, helps our team of data scientists, as well as business and product analysts, make solid, data-driven decisions. This post explains our open source solution for easy movement of data from Kafka to BigQuery. Data Warehouse setup at GOJEK: we use Google BigQuery (BQ) as our Data Warehouse, which serves as a powerful tool for interactive analysis. This has proven extremely valuable for our use cases. Our approach to push data to our warehouse is to first push the data to Kafka. We rely on multiple Kafka clusters to ingest relevant events across teams. A common approach to push data from Kafka to BigQuery is to first push it to GCS, and then import said data into BigQuery from GCS. While this solves the use case of running analytics on historical data, we also use BigQuery for near-real-time analytics…
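The sketch below is not Beast itself; it only illustrates the general Kafka-to-BigQuery streaming pattern the post describes, using kafka-python and the google-cloud-bigquery client. Topic, project, and table names are placeholders.

```python
# Minimal sketch of streaming Kafka messages into BigQuery; topic,
# project, dataset, and table names are placeholders, not GOJEK's.
import json
from kafka import KafkaConsumer
from google.cloud import bigquery

consumer = KafkaConsumer(
    "booking-events",                      # placeholder topic
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
bq = bigquery.Client()
table_id = "my-project.analytics.booking_events"  # placeholder table

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 500:                  # flush in small batches
        errors = bq.insert_rows_json(table_id, batch)
        if errors:
            raise RuntimeError(f"BigQuery insert failed: {errors}")
        batch.clear()
```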

Data Processing Pipeline Patterns

Data produced by applications, devices, or humans must be processed before it is consumed. By definition, a data pipeline represents the flow of data between two or more systems. It is a set of instructions that determine how and when to move data between these systems. My last blog conveyed how connectivity is foundational to a data platform. In this blog, I will describe the different data processing pipelines that leverage different capabilities of the data platform, such as connectivity and data engines for processing. There are many data processing pipelines. One may:
- “Integrate” data from multiple sources
- Perform data quality checks or standardize data
- Apply data security-related transformations, which include masking, anonymizing, or encryption
- Match, merge, master, and do entity resolution
- Share data with partners and customers in the required format, such as HL7
Consumers or “targets” of data pipelines may include:
- Data warehouses like Redshift…
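As a rough illustration of two of the patterns listed above (a quality check followed by a security-related masking transform), here is a small self-contained sketch; the field names and rules are invented for the example.

```python
# Illustrative sketch (not from the article) of two pipeline patterns:
# a data quality check followed by a masking transform for PII.
import hashlib

def quality_check(record: dict) -> bool:
    # Reject records missing required fields.
    return bool(record.get("id")) and bool(record.get("email"))

def mask_pii(record: dict) -> dict:
    # Replace the raw email with a one-way hash before sharing downstream.
    masked = dict(record)
    masked["email"] = hashlib.sha256(record["email"].encode()).hexdigest()
    return masked

source = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},            # fails the quality check
]
pipeline = [mask_pii(r) for r in source if quality_check(r)]
print(pipeline)
```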

Modern applications at AWS

Innovation has always been part of the Amazon DNA, but about 20 years ago, we went through a radical transformation with the goal of making our iterative process—"invent, launch, reinvent, relaunch, start over, rinse, repeat, again and again"—even faster. The changes we made affected both how we built applications and how we organized our company. Back then, we had only a small fraction of the number of customers that Amazon serves today. Still, we knew that if we wanted to expand the products and services we offered, we had to change the way we approached application architecture. The giant, monolithic "bookstore" application and giant database that we used to power Amazon.com limited our speed and agility. Whenever we wanted to add a new feature or product for our customers, like video streaming, we had to edit and rewrite vast amounts of code on an application that we'd designed specifically for our first product—the bookstore. This was a long, unwieldy process…

2019 Datanami Readers’ and Editors’ Choice Awards

Datanami is pleased to announce the results of its fourth annual Readers’ and Editors’ Choice Awards, which recognize the companies, products, and projects that have made a difference in the big data community this year. These awards, which are nominated and voted on by Datanami readers, give us insight into the state of the community. We’d like to thank our dedicated readers for weighing in on their top picks for the best in big data. It’s been a privilege for us to present these awards, and we extend our congratulations to this year’s winners.
- Best Big Data Product or Technology: Machine Learning - Readers’ Choice: Elastic; Editors’ Choice: SAS Visual Data Mining & Machine Learning
- Best Big Data Product or Technology: Internet of Things - Readers’ Choice: SAS Analytics for IoT; Editors’ Choice: The Striim Platform
- Best Big Data Product or Technology: Big Data Security - Readers’ Choice: Cloudera Enterprise; Editors’ Choice: Elastic Stack
- Best Big Data Product or…

The Forrester Wave™: Streaming Analytics, Q3 2019

Key takeaways: Software AG, IBM, Microsoft, Google, and TIBCO Software lead the pack. Forrester's research uncovered a market in which Software AG, IBM, Microsoft, Google, and TIBCO Software are Leaders; Cloudera, SAS, Amazon Web Services, and Impetus are Strong Performers; and EsperTech and Alibaba are Contenders. Analytics prowess, scalability, and deployment freedom are key differentiators: depth and breadth of analytics types on streaming data are critical. But that is all for naught if streaming analytics vendors cannot also scale to handle potentially huge volumes of streaming data. Also, it's critical that streaming analytics can be deployed where it is most needed, such as on-premises, in the cloud, and/or at the edge. Read report >>>

Gartner - The CIO’s Guide to Blockchain 2019

More than $4 trillion in goods are shipped globally each year. The roughly 80% of those goods carried via ocean shipping creates a lot of paperwork: the trade documentation required to process and administer all the goods costs approximately one-fifth of the actual physical transportation costs. Last year, a logistics business and a large technology company developed a joint global trade digitalization platform built using blockchain technology. It will enable them to establish a shared, immutable record of all transactions and provide all disparate partners access to that information at any time. Although the distributed, immutable, encrypted nature of blockchain solutions can help with such business issues, blockchain can achieve much more than that. Large companies looking to explore new disruptive business opportunities need to think beyond efficiency gains. And to do so, they need real blockchain solutions. Full article >>>

Neo4j Backs Launch of GQL Project: First New ISO Database Language Since SQL

Neo4j, the leader in graph databases, announced today that the international committees that develop the SQL standard have voted to initiate GQL (Graph Query Language) as a new database query language. To be codified as the international standard declarative query language for property graphs, GQL represents the culmination of years of effort by Neo4j and the broader database community. GQL will incorporate and consider several graph database languages. In Cypher: (:Neo4j)-[:BACKS]->(GQL:Project)<-[:STARTED]-(:ISO)-[:STANDARDIZED]->(SQL:Project). The initiative for GQL was first advanced in the GQL Manifesto in May 2018. A year later, the project was considered at an international gathering in June. Ten countries, including the United States, Germany, the UK, Korea, and China, have now voted in favor, with seven countries promising active participation by national experts. It has been well over 30 years since ISO/IEC began the SQL project. SQL went on to become the…

Distributed SQL System Review: Snowflake vs Splice Machine

After many years of Big Data, NoSQL, and Schema-on-Read detours, there is a clear return to SQL as the lingua franca for data operations. Developers need the comprehensive expressiveness that SQL provides. A world without SQL ignores more than 40 years of database research and results in hard-coded spaghetti code in applications to handle functionality that SQL handles extremely efficiently, such as joins, groupings, aggregations, and (most importantly) rollback when updates go wrong. Luckily, there is a modern architecture for SQL called Distributed SQL that no longer suffers from the challenges of traditional SQL systems (cost, scalability, performance, elasticity, and schema flexibility). The key attribute of Distributed SQL is that data is stored across many distributed storage locations and computation takes place across a cluster of networked servers. This yields unprecedented performance and scalability because it distributes work on each worker node in the cluster in parallel…
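As a toy reminder of what the excerpt means by joins, groupings, aggregations, and rollback, here is a small single-node example using Python's standard-library sqlite3 module (obviously not a distributed SQL engine; the schema and data are made up).

```python
# Toy illustration of a join, an aggregation, and a rollback when an
# update goes wrong, using the standard-library sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# Join + grouping + aggregation in one declarative statement.
for row in conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name
"""):
    print(row)

# Rollback when an update goes wrong.
try:
    with conn:                                   # transaction scope
        conn.execute("UPDATE orders SET amount = amount * 1.1")
        raise ValueError("simulated failure")    # triggers rollback
except ValueError:
    pass
print(conn.execute("SELECT SUM(amount) FROM orders").fetchone())  # unchanged: 22.5
```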

Dremio 4.0 Data Lake Engine

Dremio’s Data Lake Engine delivers lightning fast query speed and a self-service semantic layer operating directly against your data lake storage. No moving data to proprietary data warehouses or creating cubes, aggregation tables and BI extracts. Just flexibility and control for Data Architects, and self-service for Data Consumers. This release, also known as Dremio 4.0, dramatically accelerates query performance on S3 and ADLS, and provides deeper integration with the security services of AWS and Azure. In addition, this release simplifies the ability to query data across a broader range of data sources, including multiple lakes (with different Hive versions) and through community-developed connectors offered in Dremio Hub. Read full article >>>

Gartner Hype Cycle for Emerging Technologies, 2019

The Gartner Hype Cycle highlights the 29 emerging technologies CIOs should experiment with over the next year. Today, companies detect insurance fraud using a combination of claim analysis, computer programs and private investigators. The FBI estimates the total cost of non-healthcare-related insurance fraud to be around $40 billion per year. But a maturing emerging technology called emotion artificial intelligence (AI) might make it possible to detect insurance fraud based on audio analysis of the caller. Some technologies will provide “superhuman capabilities”: in addition to catching fraud, this technology can improve customer experience by tracking happiness, more accurately directing callers, enabling better diagnostics for dementia, detecting distracted drivers, and even adapting education to a student’s current emotional state. Though still relatively new, emotion AI is one of 21 new technologies added to the Gartner Hype Cycle for Emerging Technologies, 2019. Original article >>>

Who is interested in Apache Pulsar?

With companies producing data from an increasing number of systems and devices, messaging and event streaming solutions—particularly Apache Kafka—have gained widespread adoption. Over the past year, we’ve been tracking the progress of Apache Pulsar (Pulsar), a less well-known but highly capable open source solution originated by Yahoo. Pulsar is designed to intelligently process, analyze, and deliver data from an expanding array of services and applications, and thus it fits nicely into modern data platforms. Pulsar is also designed to ease the operational burdens normally associated with complex, distributed systems. Of the thousands of recent visitors to the site, 33% are from the Americas, 36% from Asia-Pacific, and 27% from the EMEA region. While Apache Kafka is by far the most popular pub/sub solution, over the last year we’ve started to come across numerous companies that use Pulsar. It turns out that Pulsar has a few features these companies value, including…
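For readers who have not tried Pulsar, a minimal produce/consume round trip with the pulsar-client Python library looks roughly like this; the broker URL, topic, and subscription names are placeholders.

```python
# Minimal Apache Pulsar pub/sub sketch using the pulsar-client library;
# broker URL, topic, and subscription names are placeholders.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

producer = client.create_producer("my-topic")
producer.send(b"hello pulsar")

consumer = client.subscribe("my-topic", subscription_name="my-subscription")
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)      # mark the message as processed

client.close()
```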

The Future of Data Engineering

Data engineering’s job is to help an organization move and process data. This generally requires two different systems, broadly speaking: a data pipeline, and a data warehouse. The data pipeline is responsible for moving the data, and the data warehouse is responsible for processing it. I acknowledge that this is a bit overly simplistic. You can do processing in the pipeline itself by doing transformations between extraction and loading with batch and stream processing. The “data warehouse” now includes many storage and processing systems (Flink, Spark, Presto, Hive, BigQuery, Redshift, etc.), as well as auxiliary systems such as data catalogs, job schedulers, and so on. Still, I believe the paradigm holds. The industry is working through changes in how these systems are built and managed. There are four areas, in particular, where I expect to see shifts over the next few years:
- Timeliness: from batch to real-time
- Connectivity: from one:one bespoke integrations to many:many
- Cen…

Deep dive into how Uber uses Spark

Apache Spark is a foundational piece of Uber’s Big Data infrastructure that powers many critical aspects of our business. We currently run more than one hundred thousand Spark applications per day, across multiple different compute environments. Spark’s versatility, which allows us to build applications and run them everywhere that we need, makes this scale possible. However, our ever-growing infrastructure means that these environments are constantly changing, making it increasingly difficult for both new and existing users to give their applications reliable access to data sources, compute resources, and supporting tools. Also, as the number of users grows, it becomes more challenging for the data team to communicate these environmental changes to users, and for us to understand exactly how Spark is being used. We built the Uber Spark Compute Service (uSCS) to help manage the complexities of running Spark at this scale. This Spark-as-a-service solution leverages Apache Livy…
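uSCS is internal to Uber, but the article notes it builds on Apache Livy. The sketch below shows how a Spark batch job can be submitted through Livy's public REST API; the Livy host, job artifact path, and configuration values are assumptions for illustration.

```python
# Hedged sketch of submitting a Spark batch job via Apache Livy's REST API;
# host, file path, args, and conf values are placeholders.
import time
import requests

LIVY_URL = "http://livy-server:8998"

resp = requests.post(
    f"{LIVY_URL}/batches",
    json={
        "file": "hdfs:///jobs/my_spark_job.py",   # placeholder job artifact
        "args": ["2019-12-01"],
        "conf": {"spark.executor.memory": "4g"},
    },
)
batch = resp.json()
print("submitted batch", batch["id"])

# Poll the batch state until it reaches a terminal state.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch['id']}/state").json()["state"]
    if state in ("success", "dead", "killed"):
        break
    time.sleep(10)
print("final state:", state)
```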

Understanding Apache Spark Failures and Bottlenecks

Apache Spark is a powerful open-source distributed computing framework for scalable and efficient analysis of big data apps running on commodity compute clusters. Spark provides a framework for programming entire clusters with built-in data parallelism and fault tolerance while hiding the underlying complexities of using distributed systems. Spark has seen a massive spike in adoption by enterprises across a wide swath of verticals, applications, and use cases. Spark provides speed (up to 100x faster in-memory execution than Hadoop MapReduce) and easy access to all Spark components (write apps in R, Python, Scala, and Java) via unified high-level APIs. Spark also handles a wide range of workloads (ETL, BI, analytics, ML, graph processing, etc.) and performs interactive SQL queries, batch processing, streaming data analytics, and data pipelines. Spark is also replacing MapReduce as the processing engine component of Hadoop. Spark applications are easy to write and easy to understand…
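As a small example of the unified high-level API the excerpt refers to, here is a PySpark sketch that reads JSON events and aggregates them; the input path and column names are illustrative.

```python
# Small PySpark sketch of the unified high-level DataFrame API;
# the input path and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.read.json("s3://my-bucket/events/")     # placeholder path
daily = (
    df.filter(F.col("status") == "ok")
      .groupBy("event_date")
      .agg(F.count("*").alias("events"))
)
daily.show()

spark.stop()
```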

Data Management Portfolio for Improvement of Privacy in Fog-to-cloud Computing Systems

With the challenge of the vast amount of data generated by devices at the edge of networks, new architectures need a well-established data service model that accounts for privacy concerns. This paper presents an architecture of data transmission and a data portfolio with privacy for fog-to-cloud (DPPforF2C). We would like to propose a practical data model with privacy from a digitalized-information perspective at fog nodes. In addition, we also propose an architecture for implementing the privacy of DPPforF2C in fog computing. Technically, we design a data portfolio based on the Message Queuing Telemetry Transport (MQTT) and the Advanced Message Queuing Protocol (AMQP). We aim to propose sample data models with a privacy architecture because there are some differences in the data obtained from IoT devices and sensors. Thus, we propose an architecture with the privacy of DPPforF2C for publishing data from edge devices to fog and to cloud servers that could be applied to fog architecture…
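Since the proposed portfolio is built on MQTT and AMQP, a minimal example of the MQTT half (publishing an edge-device reading toward a fog node with paho-mqtt) might look like the following; the broker hostname, topic, and payload fields are invented for illustration and are not the paper's actual schema.

```python
# Minimal sketch of publishing an edge-device reading to a fog node over
# MQTT with paho-mqtt; broker, topic, and field names are invented.
import json
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("fog-node.local", 1883)              # placeholder fog broker

reading = {
    "device_id": "sensor-42",
    "temperature_c": 21.7,
    "consent": {"share_with_cloud": False},          # privacy flag travels with the data
}
client.publish("edge/sensor-42/telemetry", json.dumps(reading))
client.disconnect()
```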