
Dremio December 2020 released!

This month’s release delivers several useful features: Apache Arrow Flight with Python, full support for CDP 7.1, security enhancements for Oracle connections, a new support bundle, and much more. This blog post highlights the following updates: Arrow Flight clients; the query support bundle; Kerberos support for Dremio-Oracle connections; and user/job metrics available in the UI. Continue reading >>>
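
To give a flavour of the new Arrow Flight support, here is a minimal Python sketch of querying Dremio over Flight with the pyarrow client; the host, credentials, and sample query below are placeholders to adapt to your own deployment.

```python
# Minimal Arrow Flight client sketch for Dremio (host, credentials, query are placeholders).
import pyarrow.flight as flight

client = flight.FlightClient("grpc+tcp://localhost:32010")

# Exchange basic credentials for a bearer token (hypothetical user/password).
token = client.authenticate_basic_token("dremio_user", "dremio_password")
options = flight.FlightCallOptions(headers=[token])

# Describe the query, fetch the flight info, then stream the results as Arrow data.
descriptor = flight.FlightDescriptor.for_command('SELECT 1 AS "one"')
info = client.get_flight_info(descriptor, options)
reader = client.do_get(info.endpoints[0].ticket, options)

table = reader.read_all()  # a pyarrow.Table
print(table.to_pandas())
```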

6 Data Integration Tools Vendors to Watch in 2021

Solutions Review’s Data Integration Tools Vendors to Watch is an annual listing of solution providers we believe are worth monitoring. Companies are commonly included if they demonstrate a product roadmap aligning with our meta-analysis of the marketplace. Other criteria include recent and significant funding, talent acquisition, a disruptive or innovative new technology or product, or inclusion in a major analyst publication. Data integration tools vendors are increasingly being disrupted by cloud connectivity, self-service, and the encroachment of data management functionality. As data volumes grow, we expect to see a continued push by providers in this space to adopt core capabilities of horizontal technology sectors. Organizations are keen on adopting these changes as well, and continue to allocate resources toward the providers that can not only connect data lakes and Hadoop to their analytic frameworks, but cleanse, prepare, and govern data. The next generation of tools will offe...

Technical Guide to Ocean Compute-to-Data

With the v2 Compute-to-Data release, Ocean Protocol provides a means to exchange data while preserving privacy. This guide explains Compute-to-Data without requiring deep technical know-how. Private data is data that people or organizations keep to themselves. It can mean any personal, personally identifiable, medical, lifestyle, financial, sensitive or regulated information.

Benefits of Private Data. Private data can help research, leading to life-altering innovations in science and technology. For example, more data improves the predictive accuracy of modern Artificial Intelligence (AI) models. Private data is often considered the most valuable data because it’s so hard to get at, and using it can lead to potentially big payoffs.

Risks of Private Data. Sharing or selling private data comes with risk. What if you don’t get hired because of your private medical history? What if you are persecuted for private lifestyle choices? Large organizations that have massive datasets know their d...

The Apache Spark 3.0 Preview is here!

Preview release of Spark 3.0. To enable wide-scale community testing of the upcoming Spark 3.0 release, the Apache Spark community has posted a preview release of Spark 3.0. This preview is not a stable release in terms of either API or functionality, but it is meant to give the community early access to try the code that will become Spark 3.0. If you would like to test the release, please download it and send feedback using either the mailing lists or JIRA. The Spark issue tracker already contains a list of features in 3.0.
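
For a quick smoke test of the preview, a PySpark session makes it easy to confirm the version and flip on one of the headline 3.0 features, adaptive query execution; the toy DataFrame below is just for illustration.

```python
# Quick smoke test of a Spark 3.0 preview build (toy data, for illustration only).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark3-preview-smoke-test")
    .config("spark.sql.adaptive.enabled", "true")  # adaptive query execution, new in 3.0
    .getOrCreate()
)

print(spark.version)  # expect a 3.0.0-preview version string

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "key"])
df.groupBy("key").count().show()

spark.stop()
```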

Neo4j Aura: A New Graph Database as a Service

Aura is an entirely new, built-from-the-ground-up, multi-tenant graph DBaaS based on Neo4j. It lets any developer take advantage of the best graph database in the world via a frictionless service in the cloud.  When we began building Neo4j all those years ago, we wanted to give developers a database that was very powerful, flexible… and accessible to all. We believed that open source was the best way to bring this product to developers worldwide. Since then, the vast majority of our paying customers have started out with data practitioners leading the way. Individual developers downloaded Neo4j, experimented with it, and realized graphs were an ideal way to model and traverse connected data. However, only a few of those developers had direct access to a budget to make the leap to our Enterprise Edition. Neo4j Aura bridges that gap for individuals, small teams and established startups. I believe this is the next logical step in Neo4j’s vision to help the world make sense of data....
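
Connecting to Aura looks like connecting to any other Neo4j deployment. Here is a minimal sketch with the official Python driver, assuming a hypothetical Aura URI and placeholder credentials.

```python
# Minimal connection sketch for a Neo4j Aura instance (URI and credentials are placeholders).
from neo4j import GraphDatabase

uri = "neo4j+s://<your-instance-id>.databases.neo4j.io"  # Aura uses the encrypted neo4j+s scheme
driver = GraphDatabase.driver(uri, auth=("neo4j", "<password>"))

with driver.session() as session:
    result = session.run("RETURN 'hello from Aura' AS greeting")
    print(result.single()["greeting"])

driver.close()
```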

Dremio 4.0 Data Lake Engine

Dremio’s Data Lake Engine delivers lightning-fast query speed and a self-service semantic layer operating directly against your data lake storage. No moving data to proprietary data warehouses or creating cubes, aggregation tables and BI extracts. Just flexibility and control for Data Architects, and self-service for Data Consumers. This release, also known as Dremio 4.0, dramatically accelerates query performance on S3 and ADLS, and provides deeper integration with the security services of AWS and Azure. In addition, this release simplifies the ability to query data across a broader range of data sources, including multiple lakes (with different Hive versions) and through community-developed connectors offered in Dremio Hub. Read full article >>>

What’s coming in PostgreSQL 12

Bruce Momjian prepared this slide deck describing the most significant improvements coming to PostgreSQL 12.
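
As one concrete example of what the deck covers, PostgreSQL 12 adds generated columns; the sketch below tries them from Python via psycopg2, with placeholder connection details and a made-up table.

```python
# Trying PostgreSQL 12 generated columns via psycopg2 (connection details are placeholders).
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="demo", user="demo", password="demo")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS items (
            price    numeric NOT NULL,
            quantity integer NOT NULL,
            total    numeric GENERATED ALWAYS AS (price * quantity) STORED
        )
    """)
    cur.execute("INSERT INTO items (price, quantity) VALUES (%s, %s)", (9.99, 3))
    cur.execute("SELECT price, quantity, total FROM items")
    print(cur.fetchall())  # the total column is computed and stored by the server
```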

What's the future of the pandas library?

Pandas is a powerful, open source Python library for data analysis, manipulation, and visualization. I've been teaching data scientists to use pandas since 2014, and in the years since, it has grown in popularity to an estimated 5 to 10 million users and become a "must-use" tool in the Python data science toolkit. I started using pandas around version 0.14.0, and I've followed the library as it has significantly matured to its current version, 0.23.4. But numerous data scientists have asked me questions like these over the years: "Is pandas reliable?" "Will it keep working in the future?" "Is it buggy? They haven't even released version 1.0!" Version numbers can be used to signal the maturity of a product, and so I understand why someone might be hesitant to rely on "pre-1.0" software. But in the world of open source, version numbers don't necessarily tell you anything about the maturity or reliability ...
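
For readers who haven’t tried it, here is a small taste of the kind of analysis pandas makes easy, along with the version check the discussion above revolves around; the tiny dataset is made up for illustration.

```python
# A small taste of pandas, with the version check discussed above (toy data).
import pandas as pd

print(pd.__version__)  # e.g. '0.23.4' at the time of writing

df = pd.DataFrame({
    "city": ["Oslo", "Bergen", "Oslo", "Bergen"],
    "temp_c": [4.0, 6.5, 3.2, 7.1],
})

# Group, aggregate, and sort: the bread-and-butter operations.
print(df.groupby("city")["temp_c"].mean().sort_values())
```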

What’s new in Hortonworks DataFlow 3.3?

With the upcoming HDP 3.1 release, we also bring about some exciting innovations to enhance our Kafka offering:
• New Hive Kafka Storage Handler (for SQL analytics) – view Kafka topics as tables and execute SQL via Hive, with full SQL support for joins, windowing, aggregations, etc. (a Python sketch follows below).
• New Druid Kafka Indexing Service (for OLAP analytics) – view Kafka topics as cubes and perform OLAP-style analytics on streaming events in Kafka using Druid.
HDF 3.3 includes the following major innovations and enhancements. Core HDF enhancements:
• Support for Kafka 2.0, the latest Kafka release in the Apache community, with many enhancements to security, reliability and performance.
• Support for Kafka 2.0 NiFi processors.
• NiFi connection load balancing – this feature allows bottleneck connections in the NiFi workflow to spread the queued-up flow files across the NiFi cluster and increase the processing speed, therefore lessening the effect of the bottleneck.
• MQTT performance improvements inc...
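
To make the Hive Kafka Storage Handler concrete, here is a hedged sketch that registers a Kafka topic as a Hive external table and queries it through PyHive; the table layout, topic name, brokers, and connection details are all placeholders.

```python
# Sketch: expose a Kafka topic as a Hive table via the Kafka Storage Handler (placeholders throughout).
from pyhive import hive

conn = hive.connect(host="hive-server", port=10000, username="analyst")
cur = conn.cursor()

# Map the topic to a table; the column layout must match the serialized events (assumed JSON here).
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS clicks (
        user_id STRING,
        page    STRING
    )
    STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
    TBLPROPERTIES (
        "kafka.topic" = "clicks",
        "kafka.bootstrap.servers" = "broker1:9092"
    )
""")

# Plain SQL over streaming events, including aggregations.
cur.execute("SELECT page, COUNT(*) FROM clicks GROUP BY page")
print(cur.fetchall())
```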

Dremio 3.0 adds new capabilities and security features, and dramatically improves performance

Here’s what’s NEW:
• Up to 100x performance improvement for a wide range of query workloads, using Apache Arrow’s new kernel – Gandiva. Gandiva performs just-in-time compilation of SQL queries to machine code to get the fastest possible performance (our blog post explains more about how; a small pyarrow illustration follows below).
• Support for Teradata, Azure Data Lake Store, AWS S3 GovCloud, and the latest version of Elasticsearch. Expect more soon! We’ve got a new connector framework that improves performance, stability, and development velocity for all data sources.
• Cluster Workload Manager, which lets you deploy diverse workloads on a single operational cluster while ensuring critical SLAs for performance and availability.
• More data catalog features, including wikis and tags for your data sets. That makes it even easier to discover, organize, curate, and share datasets from all your data sources.
• Improved security and governance controls, like end-to-end encryption over TLS, and integration with Apache Ranger, ...
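
Gandiva is also exposed through pyarrow’s optional bindings, so the expression-compilation idea can be seen in miniature. The sketch below is illustrative only, assumes a pyarrow build that ships the gandiva module, and says nothing about how Dremio wires it up internally.

```python
# Miniature Gandiva demo via pyarrow's optional bindings (illustrative; not Dremio's internals).
import pyarrow as pa
import pyarrow.gandiva as gandiva

batch = pa.record_batch([
    pa.array([1.0, 2.0, 3.0]),
    pa.array([10.0, 20.0, 30.0]),
], names=["a", "b"])

builder = gandiva.TreeExprBuilder()
node = builder.make_function(
    "add",
    [builder.make_field(batch.schema.field("a")),
     builder.make_field(batch.schema.field("b"))],
    pa.float64(),
)
expr = builder.make_expression(node, pa.field("a_plus_b", pa.float64()))

# The projector JIT-compiles the expression to machine code, then evaluates it.
projector = gandiva.make_projector(batch.schema, [expr], pa.default_memory_pool())
print(projector.evaluate(batch))
```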

Dremio 2.1 ships with many new features!

This is a major release that includes many new features, performance improvements, and hundreds of stability enhancements - see the highlights and more details below.
• Elasticsearch 6. Dremio now supports the latest versions of Elasticsearch. Enjoy full SQL support, including JOINs, window functions, and accelerated analytics through any BI tool, including Tableau and Power BI. We also added support for compressing Elasticsearch responses to minimize network traffic.
• Approximate count distinct acceleration. Dremio now supports accelerating count distinct queries based on an approximation algorithm (HyperLogLog), illustrated below. This provides a faster and more memory-efficient way of computing distinct counts and is especially useful in high-cardinality scenarios with very large datasets.
• Faster ORC performance. Data encoded in ORC is now significantly faster to access and more memory efficient for ORC managed in Hive sources.
• Support for AWS GovClou...
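
The HyperLogLog idea is easy to poke at from Python. The sketch below uses the third-party datasketch package purely to illustrate approximate distinct counting; it is not Dremio’s implementation.

```python
# Illustrating approximate count distinct with HyperLogLog (datasketch package; not Dremio's code).
from datasketch import HyperLogLog

hll = HyperLogLog()
exact = set()

for i in range(100_000):
    value = f"user-{i % 25_000}"  # 25,000 truly distinct values
    hll.update(value.encode("utf8"))
    exact.add(value)

print(f"exact:       {len(exact)}")
print(f"approximate: {hll.count():.0f}")  # close to 25,000, with far less memory
```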

SFIA7 - The seventh major version of the Skills Framework for the Information Age

First published in 2000, SFIA has evolved through successive updates as a result of expert input by its global users to ensure that, first and foremost, it remains relevant and useful to the needs of the industry and business.  SFIA 7, as with previous updates, is an evolution. It has been updated in response to many change requests: many of the existing skills have been updated and a few additional ones introduced but the key concepts and essential values of SFIA remain true, as they have done for nearly 20 years. The structure has remained the same – 7 levels of responsibility characterised by generic attributes, along with many professional skills and competencies described at one or more of those 7 levels.  The SFIA standard covers the full breadth of the skills and competencies related to information and communication technologies, digital transformation and software engineering. SFIA is also often applied to a range of other technical endeav...

Apache Hadoop 3.1 - a Giant Leap for Big Data

Use cases. When we are in the outdoors, many of us often feel the need for a camera that is intelligent enough to follow us, adjust to the terrain heights and visually navigate through the obstacles, while capturing panoramic videos. Here, I am talking about autonomous self-flying drones, very similar to cars on autopilot. The difference is that we are starting to see the proliferation of artificial intelligence into affordable, everyday use cases, compared to relatively expensive cars. These new use cases mean: (1) They will need parallel compute processing to crunch through insane amounts of data (visual or otherwise) in real time for inference and training of deep learning neural network algorithms; this helps them distinguish between objects and get better with more data. Think of a 100x leap in compute processing, due to the real-time nature of the use cases. (2) They will need the deep learning software frameworks, so that data scientists & data engi...

Tableau 10.5 with Hyper and server on Linux

We’re excited about the new Tableau 10.5, which adds Hyper as the data engine and brings Tableau Server to Linux. New features: https://www.tableau.com/products/new-features Hyper: https://www.tableau.com/products/technology

New token allocation algorithm in Cassandra 3.0

The central idea of the algorithm is to generate candidate tokens and figure out what the effect of adding each of them to the ring as part of the new node would be. The new token will become primary for part of the range of the next one in the ring, but it will also affect the replication of preceding ones (a toy model of this range split follows below). The algorithm is able to quickly assess the effects thanks to some observations which lead to a simplified but equivalent version of the replication topology:
• Replication is defined per datacentre, and replicas for data for this datacentre are only picked from local nodes. That is, no matter how we change nodes in other datacentres, this cannot affect what replicates where in the local one. Therefore, in analysing the effects of adding a new token to the ring, we can work with a local version of the ring that only contains the tokens belonging to local nodes.
• If there are no defined racks (or the datacentre is a single rack), data must be replicated in distinct nodes.
• If racks ...
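
To see why a candidate token only redraws a bounded part of the ring, here is a toy Python model of primary ranges on a single-datacentre, single-rack ring; the token values are made up, and real Cassandra layers replication and Murmur3 hashing on top of this.

```python
# Toy model: how a candidate token splits the primary range of its successor (made-up token values).
from bisect import insort

def primary_ranges(tokens):
    """Map each token to the (exclusive, inclusive] range it is primary for."""
    ring = sorted(tokens)
    return {tok: (ring[i - 1], tok) for i, tok in enumerate(ring)}  # i=0 wraps to the last token

ring = [100, 400, 700]
print(primary_ranges(ring))  # token 400 owns (100, 400]

# Adding candidate token 250 takes over part of 400's range, leaving the rest of the ring intact.
insort(ring, 250)
print(primary_ranges(ring))  # 250 now owns (100, 250]; 400 shrinks to (250, 400]
```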

HDFS Erasure Coding in Apache Hadoop

HDFS by default replicates each block three times. Replication provides a simple and robust form of redundancy to shield against most failure scenarios. It also eases scheduling compute tasks on locally stored data blocks by providing multiple replicas of each block to choose from. However, replication is expensive: the default 3x replication scheme incurs a 200% overhead in storage space and other resources (e.g., network bandwidth when writing the data). For datasets with relatively low I/O activity, the additional block replicas are rarely accessed during normal operations, but still consume the same amount of storage space. Therefore, a natural improvement is to use erasure coding (EC) in place of replication, which uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, EC reduces the storage cost by ~50% compared with 3x replication. Motivated by this substantial cost saving opportunity, engineers from Cloudera and Intel ...
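
The storage arithmetic is easy to verify. The sketch below compares 3x replication with a Reed-Solomon (6,3) layout, one common EC configuration, and reproduces the ~50% saving quoted above.

```python
# Storage overhead: 3x replication vs. Reed-Solomon (6,3) erasure coding.
def overhead(total_cells, data_cells):
    """Extra storage as a percentage of the raw data size."""
    return (total_cells / data_cells - 1) * 100

print(f"3x replication: {overhead(3, 1):.0f}% overhead")      # 200%
print(f"RS(6,3) coding: {overhead(6 + 3, 6):.0f}% overhead")  # 50%

# For 6 data blocks: replication stores 18 blocks, RS(6,3) stores 9 (6 data + 3 parity),
# roughly halving total storage while still tolerating 3 simultaneous failures.
print(f"blocks stored for 6 data blocks: replication={6 * 3}, ec={6 + 3}")
```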

HIVE 0.14 Cost Based Optimizer (CBO)

Analysts and data scientists, not to mention business executives, want Big Data not for the sake of the data itself, but for the ability to work with and learn from that data. As other users become more savvy, they also want more access. But too many inefficient queries can create a bottleneck in the system. The good news is that Apache™ Hive 0.14 (the standard SQL interface for processing, accessing and analyzing Apache Hadoop® data sets) is now powered by Apache Calcite. Calcite is an open source, enterprise-grade Cost-Based Logical Optimizer (CBO) and query execution framework. The main goal of a CBO is to generate efficient execution plans by examining the tables and conditions specified in the query, ultimately cutting down on query execution time and reducing resource utilization. Calcite has an efficient plan pruner that can select the cheapest query plan. All SQL queries are converted by Hive to a physical operator tree, optimized and converted to Tez/MapReduce jobs, then executed ...
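
One way to watch the optimizer at work is to compare plans with CBO toggled on and off via the hive.cbo.enable switch; the sketch below assumes a reachable HiveServer2 and a hypothetical sales schema.

```python
# Comparing Hive plans with and without the cost-based optimizer (hypothetical tables; PyHive).
from pyhive import hive

conn = hive.connect(host="hive-server", port=10000, username="analyst")
cur = conn.cursor()

query = """
    EXPLAIN
    SELECT c.region, SUM(s.amount)
    FROM sales s JOIN customers c ON s.customer_id = c.id
    GROUP BY c.region
"""

for enabled in ("false", "true"):
    cur.execute(f"SET hive.cbo.enable={enabled}")
    cur.execute(query)
    plan = "\n".join(row[0] for row in cur.fetchall())  # EXPLAIN returns one plan line per row
    print(f"--- hive.cbo.enable={enabled} ---\n{plan}\n")
```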