Posts

Showing posts from April, 2019

Gartner Market Guide for Data Preparation Tools 2019

Data preparation tools have matured from initially being self-service-focused to now supporting data integration, analytics and data science use cases in production. Data and analytics leaders must use this research to understand the dynamics of, and popular vendors in, this rapidly evolving market.

Key Findings: The market for data preparation tools has evolved beyond supporting only self-service use cases. Modern data preparation tools now enable data and analytics teams to build agile datasets at enterprise scale, for a range of distributed content authors. The market remains crowded and complex: choices range from stand-alone specialists to vendors that embed data preparation, as a key capability, into their broader analytics/BI, data science or data integration tools. While most data preparation tool capabilities have been maturing at a steady state, organizations continue to cite “operationalization”, the ability to promote…

Harnessing Organizational Knowledge for Machine Learning

One of the biggest bottlenecks in developing machine learning (ML) applications is the need for the large, labeled datasets used to train modern ML models. Creating these datasets involves significant time and expense, and requires annotators with the right expertise. Moreover, as real-world applications evolve, labeled datasets often need to be thrown out or re-labeled. In collaboration with Stanford and Brown University, we present "Snorkel Drybell: A Case Study in Deploying Weak Supervision at Industrial Scale," which explores how existing knowledge in an organization can be used as noisier, higher-level supervision (often termed weak supervision) to quickly label large training datasets. In this study, we use an experimental internal system, Snorkel Drybell, which adapts the open-source Snorkel framework to use diverse organizational knowledge resources, like internal models, ontologies, legacy rules, knowledge graphs and more…
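The core idea behind this style of weak supervision can be sketched in plain Python. The heuristics and labels below are invented for illustration, and a simple majority vote stands in for Snorkel's learned generative label model; this is not the actual Snorkel Drybell system.

```python
# Illustrative sketch of weak supervision with labeling functions.
# The rules and labels are made up; real Snorkel combines votes with
# a learned label model rather than the majority vote used here.
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_keyword(text):
    # Legacy rule: messages mentioning "refund" look like complaints.
    return POSITIVE if "refund" in text.lower() else ABSTAIN

def lf_exclamations(text):
    # Heuristic: two or more exclamation marks suggest a complaint.
    return POSITIVE if text.count("!") >= 2 else ABSTAIN

def lf_short(text):
    # Heuristic: very short messages are rarely complaints.
    return NEGATIVE if len(text.split()) < 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_keyword, lf_exclamations, lf_short]

def weak_label(text):
    """Majority vote over the non-abstaining labeling functions."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN
```

A training set labeled this way is noisy, but it can be large enough to train a discriminative model that generalizes beyond the hand-written rules.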

50 of the most popular Python libraries and frameworks that are used in data science

This article introduces a landscape diagram showing 50 or so of the most popular Python libraries and frameworks used in data science. Landscape diagrams illustrate the components within a technology stack alongside their complementary technologies; in other words, “How do the parts fit together?” They provide useful learning material, helping people conceptualize and discuss complex technology topics. Of course, it’s important to keep this diagram curated and updated as the Python ecosystem evolves; we’ll do that. One caveat: trying to fit lots of complex, interconnected parts into a neatly formatted 2D grid is a challenge. Any diagram must “blur the lines” of definitions to simplify the illustration, and those definitions could be debated at length. The diagram is not an exhaustive list: we chose popular libraries among widely used categories, but had to skip some. For example, we didn’t go into the varied universe of audio processing libraries…

O’Reilly and TensorFlow are teaming up for the first-ever TensorFlow World

O’Reilly and TensorFlow are teaming up for the first-ever TensorFlow World, happening October 28–31 in Santa Clara, where today's top minds bring machine learning to life. From data centers to edge devices, and from diagnosing diseases to environmental conservation, TensorFlow is powering the machine learning revolution. Growing from its origins at Google, TensorFlow is a fast-moving and expansive open source ecosystem, covering many platforms and programming languages in industry, education, and research. O'Reilly Media and TensorFlow are teaming up to present the first TensorFlow World, bringing together the entire community to explore the latest developments, from research to production, and application areas spanning healthcare, finance, robotics, IoT, and more. We'll hear how data scientists, engineers, developers, and product managers are leveraging TensorFlow to build products and services that help transform their companies. Executives, CTOs, and innovators will share…

Mozilla releases Iodide, an open source browser tool for publishing dynamic data science

Mozilla wants to make it easier to create, view, and replicate data visualizations on the web, and toward that end, it today unveiled Iodide, an “experimental tool” meant to help scientists and engineers write and share interactive documents using an iterative workflow. It’s currently in alpha, and available on GitHub as open source. “In the last ten years, there has been an explosion of interest in ‘scientific computing’ and ‘data science’: that is, the application of computation to answer questions and analyze data in the natural and social sciences,” Brendan Colloran, staff data scientist at Mozilla, wrote in a blog post. “To address these needs, we’ve seen a renaissance in programming languages, tools, and techniques that help scientists and researchers explore and understand data and scientific concepts, and to communicate their findings. But to date, very few tools have focused on helping scientists gain unfiltered access to the full communication potential of modern web…”

The Forrester Wave™: Big Data NoSQL, Q1 2019

Key Takeaways: MongoDB, Microsoft, Couchbase, AWS, Google, and Redis Labs lead the pack. Forrester's research uncovered a market in which MongoDB, Microsoft, Couchbase, AWS, Google, and Redis Labs are Leaders; MarkLogic, DataStax, Aerospike, Oracle, Neo4j, and IBM are Strong Performers; and SAP, ArangoDB, and RavenDB are Contenders. Performance, scalability, multimodel support, and security are key differentiators: the Leaders we identified support a broader set of use cases, automation, good scalability and performance, and security offerings. The Strong Performers have turned up the heat on the incumbents, and the Contenders offer lower costs while ramping up their core NoSQL functionality. The rise of big data NoSQL platforms: NoSQL is more than a decade old. It has gone from supporting simple schemaless apps to becoming a mission-critical data platform for large Fortune 1000 companies. It has already disrupted the database market, which was dominated for decades by relational databases…

Self-Service Data Preparation: Research to Practice

The story of Self-Service Data Preparation and the academic research behind Trifacta, which is also available as a SaaS offering in GCP Dataprep: http://sites.computer.org/debull/A18june/p23.pdf

Extending an Amazon S3 Integration to Google Cloud Storage With the Interop API

Getting Started With the S3-Interop API for GCS

To start the process, enable the Google Cloud Storage service in the Google Cloud console and create a project and bucket for testing. You can then enable the S3-interoperable API in the Interoperability tab within Project Settings. Google enables the S3-interoperability API on a per-user basis for each project. This means that you’ll want to ensure you have a unique credential or user account for each end-user or service if you want more meaningful access logs. While you’re in the interoperability settings, create an access key and save it locally for reference. As the Interoperability settings page describes, the secret key lets you authorize requests with HMAC authentication. You can calculate the signatures for each request manually in your shell or REPL, or use a library. Read full article >>>
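As a sketch of what the manual route looks like, the snippet below builds a legacy AWS-v2-style HMAC-SHA1 signature, one of the schemes the XML interop API accepts. The access key, secret, and bucket name are placeholders, not real credentials.

```python
# Sketch: signing an S3-interop request with HMAC-SHA1 (AWS signature
# v2 style). Key, secret, and bucket below are placeholder values.
import base64
import hashlib
import hmac
from email.utils import formatdate

def sign_request(secret, verb, resource, date, content_md5="", content_type=""):
    """Base64 HMAC-SHA1 signature over the canonical string-to-sign."""
    string_to_sign = "\n".join([verb, content_md5, content_type, date, resource])
    digest = hmac.new(secret.encode(), string_to_sign.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

access_key, secret = "GOOG1EXAMPLE", "example-secret"   # from the Interoperability tab
date = formatdate(usegmt=True)                          # RFC 1123 Date header
signature = sign_request(secret, "GET", "/my-test-bucket/", date)
auth_header = f"AWS {access_key}:{signature}"           # Authorization header value
```

With a library such as boto3, the same HMAC credentials can be used instead by pointing the S3 client's endpoint_url at https://storage.googleapis.com, so standard S3 calls are signed and sent to GCS.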