Splunking big data
Splunk is a big data tool. In this book, we will introduce the idea of using Splunk to solve problems that involve large amounts of data. When I worked on the IT security team, the problem was obvious: we needed to use security data to identify malicious activity. Defining the problem you are trying to solve will determine what kind of data you collect and how you analyze that data. Not every problem requires a big data solution. Sometimes, a traditional database solution might work just as well and at a lower cost. So, how do you know if you’re dealing with a big data problem? There are three V’s that help define big data:
- High Volume: A big data problem usually involves large volumes of data. Most times, the amount of data is greater than what can fit into traditional database solutions.
- High Velocity: Traditional database solutions are usually not able to handle the speed at which modern data enters a system. Imagine trying to store and manage data from user clicks on a website such as amazon.com in a traditional database. Databases are not designed to support that many operations.
- High Variety: A problem requiring analysis of big data involves a variety of data sources of varying formats. An IT security SIEM may have data being logged from multiple data sources, including firewall devices, email traces, DNS logs, and access logs. Each of these logs has a different format and correlating all the logs requires a heavy-duty system.
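To make the variety problem concrete, here is a minimal Python sketch that normalizes two differently formatted log lines into a common shape and groups them by source IP address. The sample lines, field names, and helper functions are all hypothetical and chosen only for illustration. This is not how Splunk itself extracts fields, and real firewall and access logs vary by vendor.

```python
import re
from collections import defaultdict

# Hypothetical sample lines; real log formats differ between products.
FIREWALL_LINE = "2024-01-05T10:15:00Z action=deny src=10.0.0.7 dst=203.0.113.9 dport=445"
ACCESS_LINE = '10.0.0.7 - alice [05/Jan/2024:10:15:02 +0000] "GET /admin HTTP/1.1" 403'

# Assumed access-log layout: ip, ident, user, [time], "request", status.
ACCESS_PATTERN = re.compile(
    r'(?P<src_ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3})'
)


def parse_firewall(line):
    """Parse a 'timestamp key=value ...' firewall line into a common event dict."""
    timestamp, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    return {"source": "firewall", "time": timestamp,
            "src_ip": fields["src"], "detail": fields["action"]}


def parse_access(line):
    """Parse a web access-log line into the same common event dict."""
    match = ACCESS_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognized access log line: {line!r}")
    return {"source": "access", "time": match["time"],
            "src_ip": match["src_ip"], "detail": match["status"]}


def correlate_by_ip(events):
    """Group normalized events by the source IP they share."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["src_ip"]].append(event)
    return dict(grouped)


events = [parse_firewall(FIREWALL_LINE), parse_access(ACCESS_LINE)]
by_ip = correlate_by_ip(events)
for ip, related in by_ip.items():
    print(ip, [(e["source"], e["detail"]) for e in related])
```

Even with only two formats, each source needs its own parser before the events can be compared. Multiply that by dozens of data sources and the need for a system built to handle variety becomes clear.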
Here are some cases that can be solved using big data:
- A retail company wants to determine how product placement in stores affects sales. For example, research may show that placing packs of Cheetos near the Point of Sale (POS) devices increases sales to customers with small children. Target, for instance, assigns a guest ID number to every customer and correlates this ID number with the customer’s credit card number and transactions.
- A rental company wants to measure the times of year that are busiest to ensure that there is a sufficient inventory of vehicles at different locations. In addition, they may find that a certain type of vehicle is more suitable for a particular area of town.
- A public school district wants to explore data pulled from multiple district schools to determine the effect of remote classes on certain demographics.
- An online shop wants to use customer traffic to determine the peak time for posting ads or giving discounts.
- An IT security team may use datasets containing firewall logs, DNS logs, and user access to hunt down a malicious actor on the network.
Now, let’s look at how big data is generated.
How is big data generated?
Infographics published by FinancesOnline (https://financesonline.com) indicated that humans created, captured, copied, and consumed about 74 zettabytes of data in 2021. That number is estimated to grow to 149 zettabytes in 2024.
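As a rough illustration of what these two estimates imply, the short Python sketch below computes the compound annual growth rate between them. The function name is our own, and the calculation assumes steady year-over-year growth, which the source figures do not claim.

```python
def compound_annual_growth_rate(start, end, years):
    """Return the constant yearly growth rate that turns start into end."""
    return (end / start) ** (1 / years) - 1


# 74 zettabytes in 2021 growing to an estimated 149 zettabytes in 2024.
rate = compound_annual_growth_rate(74, 149, 2024 - 2021)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption, the estimates correspond to data volume growing by roughly a quarter every year.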
The volume of data seen in the last few years can be attributed to increases in three types of data:
- Machine data: Data generated by machines such as operating systems and application logs
- Social data: Data generated by social media systems
- Transactional data: Data generated by e-commerce systems
We are surrounded by digital devices, and as the capacity and capabilities of these devices increase, the amount of data generated also increases. Modern devices such as phones, laptops, watches, smart speakers, cars, sensors, POS devices, and household appliances all generate large volumes of machine data in a wide variety of formats. Many times, this data stays untouched because the data owners do not have the ability, time, or money to analyze it.
The prevalence of smartphones is likely another contributor to the exponential increase in data. IBM’s Simon Personal Communicator, widely regarded as the first smartphone, was introduced in 1992 and had very limited capability. It cost a whopping $899 with a service contract. Out of the box, a user could use the Simon to make calls and send and receive emails, faxes, and pages. It also contained a notebook, address book, calendar, world clock, and scheduler features. IBM sold approximately 50,000 units (https://time.com/3137005/first-smartphone-ibm-simon/).
Figure 1.1 shows the first smartphone to have the functions of a phone and a Personal Digital Assistant (PDA):
Figure 1.1 – The IBM Simon Personal Communicator released in 1992
The IBM Simon Personal Communicator is archaic compared to the average cellphone today. Apple sold 230 million iPhones in 2020 (https://www.businessofapps.com/data/apple-statistics/). iPhone users generate data when they browse the web, listen to music and podcasts, stream television and movies, conduct business transactions, and post to and browse social media feeds. This is in addition to the features that were found in the IBM Simon, such as sending and receiving emails. Each of these applications generates volumes of data. Just one application such as Facebook running on an iPhone involves a variety of data, including posts, photos, videos, transactions from Facebook Marketplace, and much more. Figure 1.2 shows data from Our World in Data (https://ourworldindata.org/internet) that illustrates the rapid increase in users of social media:
Figure 1.2 – Number of people using social media platforms, 2005 to 2019
In the next section, we’ll explore how we can use Splunk to process all this data.
Understanding Splunk
Now that we understand what big data is, its applications, and how it is generated, let’s talk about Splunk Enterprise and how Splunk can be used to manage big data. For simplicity, we will refer to Splunk Enterprise as Splunk.
Splunk was founded in 2003 by Michael Baum, Rob Das, and Erik Swan. Splunk was designed to search, monitor, and analyze machine-generated data. Splunk can handle high-volume, high-variety data being generated at high velocity. This makes it a perfect tool for dealing with big data. Splunk works on various platforms, including Windows (32- and 64-bit), Linux (64-bit), and macOS. Splunk can be installed on physical devices, on virtual machines running under software such as VirtualBox and VMware, and on cloud instances from providers such as Amazon Web Services (AWS) and Microsoft Azure. Customers can also sign up for the Splunk Cloud Platform, which supplies the user with a hosted Splunk deployment. Using AWS instances and Splunk Cloud frees the user from having to deploy and maintain physical servers.
There is a free 60-day trial of Splunk that allows the user to index 500 MB of data daily. Once the user has used the product for 60 days, they can use a perpetual free license or purchase a Splunk license. The 60-day trial is a great way to get your feet wet. Traditionally, the paid version of Splunk was billed at a volume rate: the more data you index, the more you pay. However, new pricing models such as workload and ingest pricing have been introduced in recent years.
In addition to the core Splunk tool, there are various free and paid applications, including security solutions such as Splunk Enterprise Security, Splunk SOAR, and Splunk User Behavior Analytics (UBA), and observability solutions such as Splunk Observability Cloud.
Splunk was designed to index a variety of data. This is accomplished via pre-defined configurations that allow Splunk to recognize the format of different data sources. In addition, splunkbase.com is a constantly growing repository of 1,000+ apps and Technical Add-Ons (TAs) developed by Splunk, Splunk partners, and the Splunk community. One of the most important features of these TAs is the set of configurations they include for automatically extracting fields from raw data. Unlike traditional databases, Splunk can index large volumes of data. A dedicated Splunk Enterprise indexer can index over 20 MB of data per second, or about 1.7 TB per day. The amount of data that Splunk is capable of indexing can be increased with additional indexers. There are many use cases for which Splunk is a great solution.
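The daily figure follows directly from the per-second rate: 20 MB per second over 86,400 seconds is about 1.7 TB. The short Python sketch below reproduces that arithmetic. The function name is our own, the units are decimal (1 TB = 1,000,000 MB), and the multi-indexer case assumes capacity scales linearly with the number of indexers, which is a simplification rather than a documented guarantee.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_capacity_tb(mb_per_second, indexers=1):
    """Approximate daily indexing capacity in terabytes (decimal units).

    Assumes every indexer sustains the same rate and that capacity
    scales linearly with the number of indexers.
    """
    mb_per_day = mb_per_second * SECONDS_PER_DAY * indexers
    return mb_per_day / 1_000_000  # 1 TB = 1,000,000 MB


single = daily_capacity_tb(20)
cluster = daily_capacity_tb(20, indexers=4)
print(f"One indexer:   {single:.2f} TB/day")
print(f"Four indexers: {cluster:.2f} TB/day")
```

In practice, sustained throughput depends on hardware, search load, and the data itself, so treat these numbers as an upper-level estimate for capacity planning rather than a measured result.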
Table 1.1 highlights how Splunk improved processes at The University of Arizona, Honda, and Lenovo:
| Use Case | Company | Details |
| --- | --- | --- |
| Security | The University of Arizona | The University of Arizona used Splunk Remote Work Insights (RWI) to help with the challenges of remote learning during the pandemic (https://www.splunk.com/en_us/customers/success-stories/university-of-arizona.html) |
| IT Operations | Honda | Honda used predictive analytics to increase efficiency and solve problems before they became machine failures or interruptions in their production line (https://tinyurl.com/5n7f7naz) |
| DevOps | Lenovo | Lenovo reduced the amount of time spent in troubleshooting by 50% and maintained 100% uptime despite a 300% increase in web traffic (https://tinyurl.com/yactu398) |
Table 1.1 – Examples of success stories from Splunk customers
We will look at some of the major components of Splunk in the next section.