Posts

Showing posts from February, 2019

Unprovability comes to machine learning

Scenarios have been discovered in which it is impossible to prove whether or not a machine-learning algorithm could solve a particular problem. This finding might have implications for both established and future learning algorithms.

During the twentieth century, discoveries in mathematical logic revolutionized our understanding of the very foundations of mathematics. In 1931, the logician Kurt Gödel showed that, in any system of axioms that is expressive enough to model arithmetic, some true statements will be unprovable [1]. And in the following decades, it was demonstrated that the continuum hypothesis — which states that no set of distinct objects has a size larger than that of the integers but smaller than that of the real numbers — can be neither proved nor refuted using the standard axioms of mathematics [2–4]. Writing in Nature Machine Intelligence, Ben-David et al. [5] show that the field of machine learning, although seemingly distant from mathematical logic, shares this limitation.
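In symbols, the continuum hypothesis described above is the assertion that no set S exists whose cardinality lies strictly between the two:

```latex
\neg \exists S : \aleph_0 < |S| < 2^{\aleph_0}
```

where \(\aleph_0\) is the cardinality of the integers and \(2^{\aleph_0}\) that of the real numbers.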

The Copernican Principle and How to Use Statistics to Figure Out How Long Anything Will Last

Statistics, the lifetime equation, and when data science will end

The pursuit of astronomy has been a gradual process of uncovering the insignificance of humanity. We started out at the center of the universe, with the cosmos literally revolving around us. Then we were rudely relegated to one of eight planets orbiting the Sun, a sun which subsequently was revealed to be just one of billions of stars (and not even a large one) in our galaxy. This galaxy, the majestic Milky Way, seemed pretty impressive until Hubble discovered that all those fuzzy objects in the sky are billions of other galaxies, each of which has billions of stars (potentially with their own intelligent life). The demotion has only continued in the 21st century, as mathematicians and physicists have hypothesized that our universe may be just one of an infinity of universes, collectively called the multiverse. Read full article >>>
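The "lifetime equation" the title alludes to is usually attributed to J. Richard Gott's delta-t argument: if the moment you observe a phenomenon is a uniformly random point in its total lifetime, then with 95% confidence its remaining lifetime lies between 1/39 and 39 times its current age. A minimal sketch (the function name is ours, not from the article):

```python
def gott_interval(t_past, confidence=0.95):
    """Gott's delta-t argument: assume 'now' is uniformly distributed over a
    phenomenon's total lifetime. Then the fraction f = t_past / T lies in the
    central `confidence` band of (0, 1), which bounds the remaining lifetime.

    Returns (t_min, t_max), the remaining-lifetime interval in the same
    units as t_past.
    """
    lo_frac = (1 - confidence) / 2      # e.g. 0.025 at 95% confidence
    hi_frac = 1 - lo_frac               # e.g. 0.975
    t_min = t_past * (1 / hi_frac - 1)  # t_past / 39 at 95%
    t_max = t_past * (1 / lo_frac - 1)  # 39 * t_past at 95%
    return t_min, t_max
```

For example, something that has existed for 39 years has, by this argument, between 1 and 1,521 years left at 95% confidence.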

Confluo: Millisecond-level Queries on Large-scale Live Data

Confluo is a system for real-time distributed analysis of multiple data streams. It simultaneously supports high-throughput concurrent writes, online queries at millisecond timescales, and CPU-efficient ad-hoc queries via a combination of data structures carefully designed for the specialized case of multiple data streams and an end-to-end optimized system design. We are excited to release Confluo as an open-source C++ project, comprising: Confluo's data-structure library, which supports high-throughput ingestion of logs along with a wide range of online (live aggregates, conditional trigger executions, etc.) and offline (ad-hoc filters, aggregates, etc.) queries; and a Confluo server implementation, which encapsulates the data structures and exposes their operations via an RPC interface, along with client libraries in C++, Java and Python. We have evaluated Confluo for several different application scenarios, including a network monitoring and diagnosis framework. Read more >>>
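As an illustration only (this is not Confluo's API), the style of online query the post describes — a constant-time live aggregate plus a conditional trigger over a stream — can be sketched in a few lines of Python:

```python
from collections import deque

class StreamMonitor:
    """Toy sketch of Confluo-style online queries: a windowed live
    aggregate maintained in O(1) per record, plus a conditional trigger
    that fires a callback when the aggregate crosses a threshold.
    All names here are hypothetical, not Confluo's."""

    def __init__(self, window, threshold, on_alert):
        self.window = deque(maxlen=window)  # last `window` records
        self.threshold = threshold
        self.on_alert = on_alert
        self.total = 0.0                    # running sum over the window

    def append(self, value):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]    # evict oldest value's contribution
        self.window.append(value)
        self.total += value
        agg = self.live_aggregate()
        if agg > self.threshold:            # conditional trigger execution
            self.on_alert(agg)

    def live_aggregate(self):
        return self.total / len(self.window)  # O(1) windowed mean
```

A real system like Confluo applies the same idea — incrementally maintained aggregates rather than per-query scans — across many concurrent streams and writers.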

The Forrester Wave™: Cloud Hadoop/Spark Platforms, Q1 2019

Cloud Hadoop/Spark (HARK) platforms accelerate insights by automating the storage, processing, and accessing of big data. In our 25-criterion evaluation of HARK providers, we identified the 11 most significant ones — Amazon Web Services (AWS), Cloudera, Google, Hortonworks, Huawei, MapR, Microsoft, Oracle, Qubole, Rackspace, and SAP — and researched, analyzed, and scored them. This report shows how each provider measures up and helps enterprise architecture (EA) professionals select the right one for their needs. Note: Cloudera and Hortonworks completed their planned merger on January 3, 2019, and will continue as Cloudera. This Forrester Wave reflects our evaluation of each company's independent HARK platforms prior to the completion of the merger. Full report available here >>>

2019 Gartner Magic Quadrant for Analytics and Business Intelligence Platforms

The Five Use Cases and 15 Critical Capabilities of an Analytics and BI Platform

We define and assess product capabilities across the following five use cases:
- Agile, centralized BI provisioning: supports an agile, IT-enabled workflow from data to centrally delivered and managed analytic content, using the platform's self-contained data management capabilities.
- Decentralized analytics: supports a workflow from data to self-service analytics, and includes analytics for individual business units and users.
- Governed data discovery: supports a workflow from data to self-service analytics to system-of-record (SOR), IT-managed content, with governance, reusability and promotability of user-generated content to certified data and analytics content.
- OEM or embedded analytics: supports a workflow from data to embedded BI content in a process or application.
- Extranet deployment: supports a workflow similar to agile, centralized BI provisioning, but for the external customer or, in the public sector, for citizen access to analytic content.

Looking Back at Google’s Research Efforts in 2018

2018 was an exciting year for Google's research teams, with our work advancing technology in many ways, including fundamental computer science research results and publications, the application of our research to emerging areas new to Google (such as healthcare and robotics), open-source software contributions and strong collaborations with Google product teams, all aimed at providing useful tools and services. Below, we highlight just some of our efforts from 2018, and we look forward to what will come in the new year:
- Ethical Principles and AI
- AI for Social Good
- Assistive Technology
- Quantum computing
- Natural Language Understanding
- Perception
- Computational Photography
- Algorithms and Theory
- Software Systems
- AutoML
- Tensor Processing Units (TPUs)
- Open Source Software and Datasets
- Robotics
- Applications of AI to Other Fields

Read more >>>

Google announces Kubernetes Operator for Apache Spark

The beta release of "Spark Operator" allows native execution of Spark applications on Kubernetes clusters -- no Hadoop or Mesos required. Apache Spark is a hugely popular execution framework for running data engineering and machine learning workloads. It powers the Databricks platform and is available in both on-premises and cloud-based Hadoop services, like Azure HDInsight, Amazon EMR and Google Cloud Dataproc. It can run on Mesos clusters too. But what if you just want to run your Spark workloads on a Kubernetes (k8s) cluster, sans Mesos and without the Hadoop YARN strings attached? While Spark first added Kubernetes-specific features in its 2.3 release, and improved them in 2.4, getting Spark to run natively on k8s, in a fully integrated fashion, can still be a challenge. Today, Google, which created Kubernetes in the first place, is announcing the beta release of the Kubernetes Operator for Apache Spark -- "Spark Operator" for short. Read more >>>