Posts

Showing posts with the label SQL

Top 9 Data Modeling Tools & Software 2021

Image
  Data modeling is the procedure of crafting a visual representation of an entire information system or portions of it in order to convey connections between data points and structures. The objective is to portray the types of data used and stored within the system, the ways the data can be organized and grouped, the relationships among these data types, and their attributes and formats. Data modeling uses abstraction to better understand and represent the nature of the flow of data within an enterprise-level information system.  The types of data models include: Conceptual data models. Logical data models. Physical data models. Database and information system design begins with the creation of these data models.  What is a Data Modeling Tool? A data modeling tool enables quick and efficient database design while minimizing human error. A data modeling software helps craft a high-performance database, generate reports that can be useful for stakeholders and create data de...

Announcing Databricks Serverless SQL

Image
Databricks SQL   already provides a first-class user experience for BI and SQL directly on the data lake, and today, we are excited to announce another step in making data and AI simple with Databricks Serverless SQL. This new capability for Databricks SQL provides instant compute to users for their BI and SQL workloads, with minimal management required and capacity optimizations that can lower overall cost by an average of 40%. This makes it even easier for organizations to expand adoption of the lakehouse for business analysts who are looking to access the rich, real-time datasets of the lakehouse with a simple and performant solution. Under the hood of this capability is an active server fleet, fully managed by Databricks, that can transfer compute capacity to user queries, typically in about 15 seconds. The best part? You only pay for Serverless SQL when users start running reports or queries. Organizations with business analysts who want to analyze data in the data lake with t...

Dremio December 2020 released!

Image
This month’s release delivers very useful features like Apache Arrow Flight with Python, full support for CDP 7.1, security enhancements for Oracle connections, a new support bundle and much more. This blog post highlights the following updates: Arrow Flight clients Query support bundle Kerberos support for Dremio-Oracle connections User/job metrics available in the UI Continue reading >>>

Dremio 4.8 is released

Image
Today we are excited to announce the release of Dremio 4.8! This month’s release delivers multiple features such as external query, a new authorization service API, AWS Edition enhancements and more. This blog post highlights the following updates: External query Default reflections Runtime filtering GA Documented JMX metrics and provided sample exporters Ability to customize projects in Dremio AWS Edition Support for Dremio AWS Edition deployments without public IP addresses Read full article >>>

Announcing the dbt IDE: orchestrate the entire analytics engineering workflow in your browser

Image
Today we released the dbt Integrated Developer Environment (IDE) into general availability in dbt Cloud. With the IDE, you can build, run, test, and version control dbt projects from your browser. There’s no wrestling with pip, homebrew, hidden files in your home directory, or coordinating upgrades across large teams. If you haven’t already, be sure to check out Tristan’s post on why we built the IDE and why we think it’s such a meaningful development in the analytics engineering space. Otherwise, read on to learn more about what you can do with the IDE and what’s next for dbt Cloud. Read full post >>>

Neo4j Backs Launch of GQL Project: First New ISO Database Language Since SQL

Image
Neo4j, the leader in graph databases, announced today that the international committees that develop the SQL standard have voted to initiate GQL (Graph Query Language) as a new database query language. Now to be codified as the international standard declarative query language for property graphs, GQL represents the culmination of years of effort by Neo4j and the broader database community.  English: GQL to incorporate and consider several graph database languages. Cypher: (:Neo4j)-[:BACKS]->(GQL:Project)<-[:STARTED]-(:ISO)-[:STANDARDIZED]->(SQL:Project) The initiative for GQL was first advanced in the GQL Manifesto in May 2018. A year later, the project was considered at an international gathering in June. Ten countries including the United States, Germany, UK, Korea, and China have now voted in favor, with seven countries promising active participation by national experts. It has been well over 30 years since ISO/IEC began the SQL project. SQL went on to become...

Distributed SQL System Review: Snowflake vs Splice Machine

Image
After many years of Big Data, NoSQL, and Schema-on-Read detours, there is a clear return to SQL as the lingua franca for data operations. Developers need the comprehensive expressiveness that SQL provides. A world without SQL ignores more than 40 years of database research and results in hard-coded spaghetti code in applications to handle functionality that SQL handles extremely efficiently such as joins, groupings, aggregations, and (most importantly) rollback when updates go wrong. Luckily, there is a modern architecture for SQL called Distributed SQL that no longer suffers from the challenges of traditional SQL systems (cost, scalability, performance, elasticity, and schema flexibility). The key attribute of Distributed SQL is that data is stored across many distributed storage locations and computation takes place across a cluster of networked servers. This yields unprecedented performance and scalability because it distributes work on each worker node in the cluster in parall...
Image
Here’s a curated list of resources for data engineers, with sections for algorithms and data structures, SQL, databases, programming, tools, distributed systems, and more. Useful articles The AI Hierarchy of Needs The Rise of Data Engineer The Downfall of the Data Engineer A Beginner’s Guide to Data Engineering Part I Part II Part III Functional Data Engineering — a modern paradigm for batch data processing How to become a Data Engineer (in Russian) Talks Data Engineering Principles - Build frameworks not pipelines by Gatis Seja Functional Data Engineering - A Set of Best Practices by Maxime Beauchemin Advanced Data Engineering Patterns with Apache Airflow by Maxime Beauchemin Creating a Data Engineering Culture by Jesse Anderson Algorithms & Data Structures Algorithmic Toolbox in Russian Data Structures in Russian Data Structures & Algorithms Specialization on Coursera Algorithms Specialization from Stanford on Coursera SQL Com...

Tuning Snowflake Performance Using the Query Cache

Image
In terms of performance tuning in Snowflake, there are very few options available. However, it is worth understanding how the Snowflake architecture includes various levels of caching to help speed your queries. This article provides an overview of the techniques used, and some best practice tips on how to maximise system performance using caching. Snowflake Database Architecture Before starting it’s worth considering the underlying Snowflake architecture, and explaining when Snowflake caches data. The diagram below illustrates the overall architecture which consists of three layers:- Service Layer:   Which accepts SQL requests from users, coordinates queries, managing transactions and results.  Logically, this can be assumed to hold the  result cache  – a cached copy of the results of every query executed. Compute Layer:   Which actually does the heavy lifting.  ...

Data scientist salaries and jobs in Europe - 2018 snapshot

Image
Glassdoor names “Data Scientist” as the best job in the United States for 2019 and LinkedIn ranks it number one among the top 10. Topping the list for four years in a row,  Data Scientist has a job score of 4.7, job satisfaction rating of 4.3 with 6,510 open positions paying a median base salary of $108,000 in the U.S. But what is the scenario for Data Scientists in Europe? What is the demand and supply? Which countries in EU are the best destinations for Data Scientists and what salaries can they expect? A recent report titled Data Science Salary Report 2019 Europe by Big Cloud  answers some of these critical questions.   First, a little flashback: According to a report by the European Commission in 2017, the number of data workers in Europe will increase up to 10.43 million, with a compound average growth rate of 14.1% by 2020. The EU forecasted to face a data skills gap corresponding to 769,000 unfilled positions by 2020 in the baseline scenario and being concen...

Apache Superset in the Production Environment

Image
Visualizing data helps in building a much deeper understanding of the data and quickens analytics around the data. There are several mature paid products available on the market. Recently, I explored an open source product name Apache Superset which I found a very upbeat product in this space. Some prominent features of Superset are: A rich set of data visualizations. An easy-to-use interface for exploring and visualizing data. Create and share dashboards. After reading about Superset, I wanted to try it, and as Superset is a Python programming language-based project we can easily install it using pip; but I decided to set it up as a container based on Docker. The Apache Superset GitHub Repo contains code for building and running Superset as a container. Since I want to run Superset in a completely distributed manner and with as little modification as possible in the code, I decided to modify the code so that it could run in multiple different modes. Below is a list of sp...

Actian VectorH architecture and Amazon Redshift benchmark papers

Actian Vector in Hadoop (VectorH for short) is a new SQL-on-Hadoop  system  built  on  top  of  the  fast  Vectorwise  analytical database system.  VectorH achieves fault tolerance and storage scalability by relying on HDFS, and extends the state-of-the-art in SQL-on-Hadoop systems by instrumenting  the  HDFS  replication  policy  to  optimize  read  locality. VectorH integrates with YARN for workload management, achieving a high degree of elasticity.  Even though HDFS is an append-only filesystem, and VectorH supports (update-averse) ordered tables, trickle updates are possible thanks to Positional Delta Trees (PDTs), a differential update structure that can be queried efficiently.  The paper describes the changes made to single-server Vectorwise to turn it into a Hadoop-based  MPP  system,  encompassing  workload  management, parallel  query  optimizat...

Debunking Myths About the VoltDB In-Memory Database

Myth #1: “VoltDB requires stored procedures.” This was true for 1.0, but no one seems to notice it’s been false since we shipped 1.1 in 2010. VoltDB supports unforeseen SQL without any stored procedure use. We have users in production who have never used a single stored procedure. Myth #2: “VoltDB doesn’t support ad-hoc SQL.” This is just a rephrasing of Myth #1 and is still false. Myth #3: “VoltDB is slow unless I use stored procedures.” Well, no. VoltDB can run faster with stored procedures, but it’s still fast if they are not used. In our internal benchmarks on pretty cheap single-socket hardware, we can run about 50k write statements per second, per host with full durability. Myth #4: “I have to know Java to use VoltDB.” As of VoltDB 3.0, released over a year ago, (we’re on V4.2 today), a user can build VoltDB apps and run the server without ever directly interacting with the Java CLI tools or any Java code. Myth #5: “VoltDB has garbage collection problems because it is wri...

Weak transaction isolation

Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They may cause customer data to be corrupted. Many popular relational databases, which are usually considered ACID use weak isolation, so they would not necessarily have prevented these bugs from occurring. What exactly the Isolation guarantee in the SQL standard means based on what they call “read phenomena”. There are three types of phenomena: Dirty reads – If another transaction writes, but does not commit, during your transaction, is it possible that you will see their data? Non-repeatable reads – If you read the same row twice, is it possible that you might get different data the second time? Phantom reads – If you read a collection of rows twice, is it possible that different rows will be returned the second time? In the SQL standard, there are four levels of transactional isolation based on which of these phenomena they prevent (from weakest to strongest): Read Uncommitted – A tra...

To Schema On Read or to Schema On Write, That is the Hadoop Data Lake Question

The Hadoop data lake concept can be summed up as, “Store it all in one place, figure out what to do with it later.” But while this might be the general idea of your Hadoop data lake, you won’t get any real value out of that data until you figure out a logical structure for it. And you’d better keep track of your metadata one way or another. It does no good to have a lake full of data, if you have no idea what lies under the shiny surface. At some point, you have to give that data a schema, especially if you want to query it with SQL or something like it. The eternal Hadoop question is whether to apply the brave new strategy of schema on read, or to stick with the tried and true method of schema on write. What is Schema on Write? Schema on write has been the standard for many years in relational databases. Before any data is written in the database, the structure of that data is strictly defined, and that metadata stored and tracked. Irrelevant data is discarded, data types, lengths and...

Immutability, MVCC, and Garbage Collection

Interesting article about Datomic and its immutability with regard to MVCC databases. https://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/