Posts

Dremio 3.0 adds new capabilities and security features, and dramatically improves performance

Here’s what’s NEW:
• Up to 100x performance improvement for a wide range of query workloads, using Apache Arrow’s new kernel – Gandiva. Gandiva performs just-in-time compilation of SQL queries to machine code to get the fastest possible performance. (Our blog post explains more about how.)
• Support for Teradata, Azure Data Lake Store, AWS S3 GovCloud, and the latest version of Elasticsearch. Expect more soon! We’ve got a new connector framework that improves performance, stability, and development velocity for all data sources.
• Cluster Workload Manager, which lets you deploy diverse workloads on a single operational cluster while ensuring critical SLAs for performance and availability.
• More data catalog features, including wikis and tags for your data sets. That makes it even easier to discover, organize, curate, and share datasets from all your data sources.
• Improved security and governance controls, like end-to-end encryption over TLS, and integration with Apache Ranger, ...

Progress for big data in Kubernetes

Kubernetes is really cool because managing services as flocks of little containers is a really cool way to make computing happen. We can get away from the idea that the computer will run the program and get into the idea that a service happens because a lot of little computing just happens. This idea is crucial to making reliable services that don’t require a ton of heroism to stand up or keep running. But there is a dark side here. Containers want to be agile because that is the point of containers in the first place. We want containers because we want to make computing more like a gas made up of indistinguishable atoms instead of like a few billiard balls with colors and numbers on their sides. Stopping or restarting containers should be cheap so we can push flocks of containers around easily and upgrade processes incrementally. If ever a container becomes heavy enough that we start thinking about that specific container, the whole metaphor kind of dissolves. So that metap...

Azure HDInsight brings next generation Apache Hadoop 3.0

Preview of Apache Hadoop 3.0 in Azure HDInsight 4.0
Led by Hortonworks, Apache Hadoop 3.0 represents over 5 years of work across the community since the last major update to the Hadoop stack. Enterprises can now realize their data lake vision while efficiently incorporating deep learning frameworks into their applications, all on the same Hadoop stack that they are comfortable with. Some of the key enhancements include:
• With ACID semantics enabled by default, Apache Hive 3.0 becomes more like a traditional database, making it easier for customers to build LOB applications on top of very large data sets.
• Apache Druid is an open source data store with indexing/caching capabilities on top of a column-oriented storage layout. With Apache Hive and Apache Druid (now available by default), customers can do near real time exploratory analytics on incoming data.
• With TensorFlow, available by default, and GPU support, Apache Hadoop 3.0 squarely targets the machine learning...

Collection of data governance resources

Learning about data governance
Use these introductory books, videos, and articles to understand the basics of data governance.
• Data Governance: What You Need to Know - Jon Bruner explains how a data governance program provides the intellectual and institutional grounding to address the data needs across an organization, anticipate new issues, and provide for development according to the company’s strategic plan.
• Data Governance - John Adler leads you through the maze of data governance issues facing companies today: security breaches, regulatory agencies, in-house turf battles over who controls the data, monetizing data, and more.
• The Rise of Big Data Governance: Insight on this Emerging Trend from Active Open Source Initiatives - John Mertic and Maryna Strelchuk detail the benefits of a vendor-neutral approach to data governance.
• Understanding the Chief Data Officer - Through interviews with current and former chief data officers (CDO), Julie Steele lo...

Dremio 2.1 ships with many new features!

This is a major release that includes many new features, performance improvements, and hundreds of stability enhancements - see the highlights and more details below.
• Elasticsearch 6. Dremio now supports the latest versions of Elasticsearch. Enjoy full SQL support, including JOINs, Window functions, and accelerated analytics through any BI tool, including Tableau and Power BI. We also added support for compressing Elasticsearch responses to minimize network traffic.
• Approximate count distinct acceleration. Dremio now supports accelerating count distinct queries based on an approximation-based algorithm (HyperLogLog). This provides a faster and more memory efficient way of providing distinct counts and is especially useful in high cardinality scenarios with very large datasets.
• Faster ORC performance. Data encoded in ORC is now significantly faster to access and more memory efficient for ORC managed in Hive sources.
• Support for AWS GovClou...
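To make the HyperLogLog idea mentioned above more concrete, here is a minimal Python sketch of the general technique. It is not Dremio's implementation: the class name, the choice of SHA-1 as the hash, and the default precision `p=12` are all assumptions made for illustration. The sketch keeps a fixed array of small registers, so memory stays constant no matter how many distinct values are seen.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical class, not a Dremio API)."""

    def __init__(self, p=12):
        # p bits of the hash select a register; m = 2**p registers in total.
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Assumption: take the first 64 bits of SHA-1 as the hash value.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                    # top p bits pick the register
        rest = h & ((1 << width) - 1)       # remaining bits
        # Rank = position of the leftmost 1 bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        # Harmonic mean of 2**register across all registers.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return estimate


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100000):
        hll.add(f"user-{i}")
    print(round(hll.count()))  # an estimate close to, but not exactly, 100000
```

With 4,096 registers the typical relative error is roughly 1.04 / sqrt(4096), or about 1.6%. Adding a value that has already been seen never changes the registers, which is why the structure suits distinct counts over very large, high-cardinality datasets.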

The DGI Data Governance Framework

The DGI Data Governance Framework is a logical structure for classifying, organizing, and communicating complex activities involved in making decisions about and taking action on enterprise data.

Top 15 Investment Priorities for CDOs in Financial Services 2018

Source >>

Gartner - Market Guide for Information Stewardship Applications

The critical need for information governance continues to drive a diversified market for information stewardship solutions that support it. Data and analytics leaders must assess the capabilities these solutions offer to select vendors that will best suit their needs.
Key Findings
• Policy setting in information governance programs is still so different and inconsistent that no market of offerings is forming as yet. Furthermore, policy enforcement in information stewardship initiatives is conforming to a market, but now across a wider set of use cases.
• Information stewardship applications available in the market do not yet fully support the information steward's wider role and tasks.
• Growth in the market for information stewardship applications is being disrupted by new technology capabilities in adjacent markets, such as data quality and metadata management, and new regulatory requirements, such as GDPR.
Recommendations
For data and analytics leaders working with dat...

Comparing Top Deep Learning Frameworks

Comparing Top Deep Learning Frameworks: Deeplearning4j, PyTorch, TensorFlow, Caffe, Keras, MxNet, Gluon & CNTK
Skymind bundles Deeplearning4j and Python deep learning libraries such as TensorFlow and Keras (using a managed Conda environment) in the Skymind Intelligence Layer (SKIL), which offers ETL, training and one-click deployment on a managed GPU cluster. The SKIL Community Edition is free and downloadable here. Eclipse Deeplearning4j is distinguished from other frameworks in its API languages, intent and integrations. DL4J is a JVM-based, industry-focused, commercially supported, distributed deep-learning framework that solves problems involving massive amounts of data in a reasonable amount of time. It integrates with Kafka, Hadoop and Spark using an arbitrary number of GPUs or CPUs, and it has a number you can call if anything breaks. DL4J is portable and platform neutral, rather than being optimized on a specific cloud service such as AWS, Azure or Goog...

Machine Learning Platforms For Developers

Machine learning platforms are not the wave of the future; they are here now. Developers need to know how and when to harness their power. Working within the ML landscape while using the right tools like Filestack can make it easier for developers to create a productive algorithm that taps into that power. The following machine learning platforms and tools, listed in no particular order, are available now as resources to seamlessly integrate the power of ML into daily tasks.
1. H2O
H2O was designed for the Python, R, and Java programming languages by H2O.ai. By using these familiar languages, this open source software makes it easy for developers to apply both predictive analytics and machine learning to a variety of situations. Available on Mac, Windows, and Linux operating systems, H2O provides developers with the tools they need to analyze data sets in the Apache Hadoop file systems as well as those in the cloud.
2. Apache PredictionIO
Developer...