Scalable Log Analytics with Apache Spark: A Comprehensive Case Study

Introduction

One of the most popular and effective enterprise use cases for analytics today is log analytics. Almost every organization, small or large, has multiple systems and pieces of infrastructure running day in and day out. To keep their business running effectively, organizations need to know whether their infrastructure is performing to its maximum potential. This involves analyzing system and application logs, and perhaps even applying predictive analytics to log data. The volume of log data is typically massive, depending on the type of organizational infrastructure and the applications running on it. Gone are the days when compute constraints limited us to analyzing a mere sample of the data on a single machine.


Powered by distributed computing, big data processing, and open-source analytics frameworks like Apache Spark, we can perform scalable log analytics on potentially millions or even billions of log messages daily. The intent of this case-study-oriented tutorial is to take a hands-on approach and showcase how we can leverage Spark to perform log analytics at scale on semi-structured log data.
We will cover the following major topics in this article:
  • Main Objective — NASA Log Analytics
  • Setting up Dependencies
  • Loading and Viewing the NASA Log Dataset
  • Data Wrangling
  • Data Analysis on our Web Logs
While there are many excellent open-source frameworks and tools for log analytics, including Elasticsearch, the intent of this tutorial is to showcase how Spark can be leveraged to analyze logs at scale. In the real world, you are free to choose your own toolbox when analyzing log data. Let's get started!
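
To give a concrete flavor of what this looks like in practice, here is a minimal PySpark sketch of the core idea: loading raw, semi-structured web server log lines and carving them into structured columns. This is only an illustrative preview, not the tutorial's actual code; the file path data/access.log, the app name, and the regular expressions are assumptions based on the Common Log Format, and we will build the real pipeline step by step on the NASA dataset in the sections that follow.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract

    # Spin up a local Spark session (on a cluster you would configure a master URL)
    spark = SparkSession.builder.appName('log-analytics-sketch').getOrCreate()

    # 'data/access.log' is a hypothetical path; spark.read.text loads each raw
    # log line into a DataFrame with a single string column named 'value'
    raw_logs = spark.read.text('data/access.log')

    # Extract fields from semi-structured Common Log Format lines using
    # illustrative regular expressions; unmatched lines yield empty strings
    logs_df = raw_logs.select(
        regexp_extract('value', r'^(\S+)', 1).alias('host'),
        regexp_extract('value', r'\[([^\]]+)\]', 1).alias('timestamp'),
        regexp_extract('value', r'"\S+\s+(\S+)\s+\S*"', 1).alias('endpoint'),
        regexp_extract('value', r'\s(\d{3})\s', 1).cast('integer').alias('status'),
    )

    logs_df.show(5, truncate=False)

Each regexp_extract call pulls one field out of the raw log line, which is exactly the kind of data wrangling we will walk through in detail on the NASA logs.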


