Scalable Log Analytics with Apache Spark: A Comprehensive Case Study
Introduction
One of the most popular and effective enterprise use cases for analytics today is log analytics. Almost every organization, small or large, has multiple systems and pieces of infrastructure running day in and day out. To keep their business running effectively, organizations need to know whether their infrastructure is performing to its maximum potential. This involves analyzing system and application logs, and perhaps even applying predictive analytics to log data. The amount of log data involved is typically massive, depending on the organizational infrastructure and the applications running on it. Gone are the days when we were limited to analyzing a sample of data on a single machine due to compute constraints.
Powered by distributed computing and open-source big data processing frameworks like Apache Spark, we can now perform scalable log analytics on potentially millions or even billions of log messages daily. The intent of this case-study-oriented tutorial is to take a hands-on approach and showcase how we can leverage Spark to perform log analytics at scale on semi-structured log data.
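To give a flavor of what that looks like before we dive in, here is a minimal PySpark sketch, not code from the tutorial itself, that reads raw log files into a DataFrame and counts requests by HTTP status code. The file path is a hypothetical placeholder, and the regex assumes logs in the Common Log Format (which the NASA dataset uses).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

# Build (or reuse) a local Spark session
spark = SparkSession.builder.appName("log-analytics-preview").getOrCreate()

# Hypothetical path: any directory of raw text log files works here
raw_logs = spark.read.text("data/nasa_logs/*.gz")

# Each row has a single 'value' column holding one raw log line.
# Assuming Common Log Format, extract the HTTP status code that
# follows the quoted request string, then count requests per status.
status_counts = (
    raw_logs
    .select(regexp_extract("value", r'"\s(\d{3})\s', 1).alias("status"))
    .groupBy("status")
    .count()
    .orderBy("count", ascending=False)
)

status_counts.show()
```

The rest of the tutorial builds this idea out properly: parsing every field of each log line, cleaning the data, and running richer analyses on the resulting DataFrame.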
We will cover the following major topics in this article:
- Main Objective — NASA Log Analytics
- Setting up Dependencies
- Loading and Viewing the NASA Log Dataset
- Data Wrangling
- Data Analysis on our Web Logs
While there are many excellent open-source frameworks and tools for log analytics, including Elasticsearch, the intent of this tutorial is to showcase how Spark can be leveraged for analyzing logs at scale. In the real world, you are free to choose your own toolbox when analyzing log data. Let’s get started!