Posts

Showing posts with the label ETL

Emerging Architectures for Modern Data Infrastructure

As an industry, we’ve gotten exceptionally good at building large, complex software systems. We’re now starting to see the rise of massive, complex systems built around data – where the primary business value of the system comes from the analysis of data, rather than the software directly. We’re seeing fast-moving impacts of this trend across the industry, including the emergence of new roles, shifts in customer spending, and new startups providing infrastructure and tooling around data. In fact, many of today’s fastest-growing infrastructure startups build products to manage data. These systems enable data-driven decision making (analytic systems) and drive data-powered products, including with machine learning (operational systems). They range from the pipes that carry data, to storage solutions that house data, to SQL engines that analyze data, to dashboards that make data easy to understand – from data science and machine learning libraries, to automated data pipe...

Top 9 Data Modeling Tools & Software 2021

Data modeling is the procedure of crafting a visual representation of an entire information system, or portions of it, in order to convey the connections between data points and structures. The objective is to portray the types of data used and stored within the system, the ways the data can be organized and grouped, the relationships among these data types, and their attributes and formats. Data modeling uses abstraction to better understand and represent the nature of the flow of data within an enterprise-level information system. The types of data models include conceptual data models, logical data models, and physical data models. Database and information system design begins with the creation of these data models. What is a Data Modeling Tool? A data modeling tool enables quick and efficient database design while minimizing human error. Data modeling software helps craft a high-performance database, generate reports that can be useful for stakeholders and create data de...
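To make the distinction between a logical and a physical data model concrete, here is a minimal sketch using SQLAlchemy (an assumption; the excerpt names no specific tool). The entity classes and their relationship express the logical model, while emitting DDL against a concrete engine realizes the physical model. All table and column names are illustrative.

    from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, relationship

    Base = declarative_base()

    class Customer(Base):
        # Logical entity: a customer with a surrogate primary key.
        __tablename__ = "customers"
        id = Column(Integer, primary_key=True)
        name = Column(String(100), nullable=False)
        orders = relationship("Order", back_populates="customer")

    class Order(Base):
        # Logical entity: an order, related many-to-one to Customer.
        __tablename__ = "orders"
        id = Column(Integer, primary_key=True)
        customer_id = Column(Integer, ForeignKey("customers.id"), nullable=False)
        customer = relationship("Customer", back_populates="orders")

    # The physical model: concrete column types, keys, and constraints
    # generated for one particular database (here, in-memory SQLite).
    engine = create_engine("sqlite://", echo=True)
    Base.metadata.create_all(engine)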

The DataOps Landscape

Data has emerged as an essential foundational asset for all organizations. Data fuels significant initiatives such as digital transformation and the adoption of analytics, machine learning, and AI. Organizations that are able to tame, manage, and unlock their data assets stand to benefit in myriad ways, including improvements to decision-making and operational efficiency, better fraud prediction and prevention, better risk management and control, and more. In addition, data products and services can often lead to new or additional revenue. As companies increasingly depend on data to power essential products and services, they are investing in tools and processes to manage those operations. In this post, we describe these tools as well as the community of practitioners using them. One sign of the growing maturity of these tools and practices is that a community of engineers and developers is beginning to coalesce around the term “DataOps” (data operations). Our conver...

The State of Open-Source Data Integration and ETL

Open-source data integration is not new. It started 16 years ago with Talend, and since then the whole industry has changed. The likes of Snowflake, BigQuery, and Redshift have changed how data is hosted, managed, and accessed, while making it easier and a lot cheaper. But the data integration industry has evolved as well. On one hand, new open-source projects emerged, such as Singer.io in 2017. This enabled more data integration connectors to become accessible to more teams, even though it still required a significant amount of manual work. On the other hand, data integration was made accessible to more teams (analysts, scientists, business intelligence teams). Indeed, companies like Fivetran benefited from Snowflake’s rise, empowering non-engineering teams to set up and manage their data integration connectors by themselves, so th...
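For a sense of what Singer.io standardized, a tap is simply a program that writes newline-delimited JSON messages (SCHEMA, RECORD, STATE) to stdout. A minimal sketch with a hypothetical users stream follows; a real tap would pull these records from an API or database.

    import json
    import sys

    def emit(message):
        # Singer messages are newline-delimited JSON on stdout.
        sys.stdout.write(json.dumps(message) + "\n")

    # Declare the stream's shape before sending any records.
    emit({
        "type": "SCHEMA",
        "stream": "users",
        "schema": {
            "type": "object",
            "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
        },
        "key_properties": ["id"],
    })

    # Each record conforms to the schema declared above.
    emit({"type": "RECORD", "stream": "users", "record": {"id": 1, "email": "a@example.com"}})

    # State lets the next run resume incrementally instead of re-syncing everything.
    emit({"type": "STATE", "value": {"users": {"last_id": 1}}})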

ETL and How it Changed Over Time

Modern data and its usage have drastically changed compared to a decade ago, and traditional ETL processes leave a gap when processing modern data. The following are some of the main reasons for this: Modern data processes often include real-time streaming data, and organizations need real-time insights into processes. Systems need to perform ETL on data streams without using batch processing, and they should handle high data rates by scaling out. Some single-server databases are now replaced by distributed data platforms (e.g., Cassandra, MongoDB, Elasticsearch, SaaS apps), message brokers (e.g., Kafka, ActiveMQ, etc.), and several other types of endpoints. The system should have the ability to plug in additional sources or sinks on the go in a manageable way. Repeated data processing due to ad hoc architecture has to be eliminated. Change data capture technologies used with traditional ETL have to be integ...
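As a rough illustration of stream-based ETL rather than batch, here is a minimal consume-transform-produce loop. It assumes the kafka-python client, a broker on localhost, and hypothetical topic and field names.

    import json

    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer(
        "raw-events",  # hypothetical source topic
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Transform each event as it arrives instead of waiting for a batch window.
    for message in consumer:
        event = message.value
        event["amount_usd"] = round(event.get("amount_cents", 0) / 100, 2)
        producer.send("clean-events", event)  # hypothetical sink topic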

What’s Behind Lyft’s Choices in Big Data Tech

Lyft was a late entrant to the ride-sharing business model, at least compared to its competitor Uber, which pioneered the concept and remains the largest provider. That delay in starting out actually gave Lyft a bit of an advantage in architecting its big data infrastructure in the cloud, as it was able to sidestep some of the challenges that Uber faced in building out its on-prem system. Lyft and Uber, like many of the young Silicon Valley companies shaking up established business models, aren’t shy about sharing information about their computing infrastructure. They both share an ethos of openness regarding how they use and develop technology. That openness is also pervasive at Google, Facebook, Twitter, and other Valley outfits that created much of the big data ecosystem, most of which is, of course, open source. So when the folks at Lyft were blueprinting how to construct a system that could do all the things that a ride-sharing app has to do – tracking and connectin...

Getting started with Apache Airflow

In this post, I am going to discuss Apache Airflow, a workflow management system developed by Airbnb. Earlier I discussed writing basic ETL pipelines in Bonobo. Bonobo is cool for writing ETL pipelines, but the world is not all about writing ETL pipelines to automate things. There are other use cases in which you have to perform tasks in a certain order, once or periodically. For instance: monitoring cron jobs; transferring data from one place to another; automating your DevOps operations; periodically fetching data from websites and updating the database for your awesome price comparison system; data processing for recommendation-based systems; machine learning pipelines. The possibilities are endless. Before we move on to implement Airflow in our systems, let’s discuss what Airflow actually is and its terminology. What is Airflow? From the website: Airflow is a platform to programmatically author, schedule and monitor workflows....
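To make “programmatically author, schedule and monitor” concrete, here is a minimal DAG sketch, assuming Airflow 2.x; the bash commands are placeholders for real extract, transform, and load steps.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_etl",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",  # run once per day
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        transform = BashOperator(task_id="transform", bash_command="echo transforming")
        load = BashOperator(task_id="load", bash_command="echo loading")

        # >> declares ordering: extract, then transform, then load.
        extract >> transform >> load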

To Schema On Read or to Schema On Write, That is the Hadoop Data Lake Question

The Hadoop data lake concept can be summed up as, “Store it all in one place, figure out what to do with it later.” But while this might be the general idea of your Hadoop data lake, you won’t get any real value out of that data until you figure out a logical structure for it. And you’d better keep track of your metadata one way or another. It does no good to have a lake full of data if you have no idea what lies under the shiny surface. At some point, you have to give that data a schema, especially if you want to query it with SQL or something like it. The eternal Hadoop question is whether to apply the brave new strategy of schema on read, or to stick with the tried-and-true method of schema on write. What is Schema on Write? Schema on write has been the standard for many years in relational databases. Before any data is written to the database, the structure of that data is strictly defined, and that metadata is stored and tracked. Irrelevant data is discarded, data types, lengths and...
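To see schema on read in practice, here is a minimal PySpark sketch (an assumption; the post discusses Hadoop data lakes generally, and the path is hypothetical). The raw JSON files carry no declared structure; Spark infers a schema only at read time.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    # Structure is inferred by sampling the raw records as they are read,
    # not enforced when the files were originally written.
    df = spark.read.json("/data/lake/raw/events/")  # hypothetical path
    df.printSchema()

    # The late-bound schema can then be queried with SQL.
    df.createOrReplaceTempView("events")
    spark.sql("SELECT COUNT(*) FROM events").show()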