Loading complex log files using regex in Power BI
Log files are a very useful tool for developers and administrators of computer systems. They record what happened to the system, when it happened, and which user actually caused the event. Thanks to these files, you can find information about any system failure, which allows you to diagnose the causes of these failures more quickly.
Logs are often semi-structured data, information that cannot be persisted in a relational database in the format in which it was generated. In order to be analyzed with standard tools, this data must first be transformed into a more suitable format.
Because log data is not structured, it is difficult to import into Power BI as is, unless someone has developed a custom connector to do so. In these scenarios, using regexes in languages like Python or R can help us get the results we want.
Apache access logs
Let’s suppose your company has a website published through an Apache web server. Your manager asks you to carry out an analysis regarding which web pages of the site are the most clicked on. The only way to get this information is to analyze the access log file. This file records data about all requests made to the web server. Here is an example of an Apache access log:
Figure 6.29: An example of an Apache access log
As you can see, at first glance, there is a fairly organized structure to the information in this log. If no one has customized the output of the Apache log files, Apache uses the Common Log Format (CLF) by default. You can find a real example of an Apache access log in the apache_logs.txt file, which is inside the repository that comes with this book, in the 02-loading-complex-log-files-using-regex folder. You can also find it in the GitHub repository at http://bit.ly/apache-access-log (click on Download to view it).
If you go ahead and read the documentation of those log files, you will deduce that the information recorded in the access log follows the NCSA extended/combined log format. So, the data that is recorded is as follows:
- The remote host name (the IP address).
- The remote log name (if empty, you’ll find a dash; it is not used in the sample file).
- The remote user if the request was authenticated (if empty, you’ll find a dash).
- The datetime that the request was received, in the [18/Sep/2011:19:18:28 -0400] format.
- The first line of the request made to the server, between double quotes.
- The HTTP status code for the request.
- The size of the response in bytes, excluding the HTTP headers (could be a dash).
- The Referer HTTP request header, which contains the absolute or partial address of the page making the request.
- The User-Agent HTTP request header, which contains a string that identifies the application, operating system, vendor, and/or version of the requesting user agent.
Once you know both the nature of the information written in the log and the form in which it is written, you can take advantage of the powerful tools provided by regexes to better structure this information and import it into Power BI.
Importing Apache access logs in Power BI with Python
As mentioned earlier, you can find a real example of an Apache access log in the apache_logs.txt file, which is inside the repository that comes with this book, in the 02-loading-complex-log-files-using-regex folder. You will load the information in this file using a Python script, not a Power BI connector.
Compared to what you’ve learned before about regexes and Python, in the 01-apache-access-log-parser-python.py script (which you’ll find in the preceding folder), you will encounter the new constructs that we will highlight below.

To read a text file line by line in Python, you’ll use the open(file, mode) function and the readlines() method. Specifically, you’re going to open the apache_logs.txt file as read-only ('r') and read each of its lines to store them in a list.
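As a minimal sketch of this step (the file name and the sample line below are placeholders for illustration, not taken from the book’s repository):

```python
# Minimal sketch: read a log file line by line into a list.
# We first write a tiny one-line sample so the snippet is self-contained;
# in the chapter's script you would point open() at apache_logs.txt instead.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326\n')

with open('sample_log.txt', 'w') as f:
    f.write(sample)

with open('sample_log.txt', 'r') as f:   # mode 'r' = read-only
    access_log_lines = f.readlines()     # one string per line

print(len(access_log_lines))
```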
In regexes, it is possible to refer to groups identified by round brackets not only by a numerical index but also by a name, thanks to named capturing groups. The generic regex syntax used to assign a name to a group is usually (?<group-name>…); in Python, it is (?P<group-name>…):
- In Python, you can define a list of regex parts that can be merged (join) using a separator. In our case, the separator is defined by one or more space characters (\s+). Also note that in the various patterns below, the character class \S (uppercase) is used to denote a character that is not one of the spaces identified by \s (lowercase):

```python
regex_parts = [
    r'(?P<hostName>\S+)',
    r'\S+',
    r'(?P<userName>\S+)',
    r'\[(?P<requestDateTime>[\w:/]+\s[+\-]\d{4})\]',
    r'"(?P<requestContent>\S+\s?\S+?\s?\S+?)"',
    r'(?P<requestStatus>\d{3}|-)',
    r'(?P<responseSizeBytes>\d+|-)',
    r'"(?P<requestReferrer>[^"]*)"',
    r'"(?P<requestAgent>[^"]*)?"',
]

pattern = re.compile(r'\s+'.join(regex_parts) + r'$')
```
Note that, in this case, the re.compile() function is used since the match must be done many times, once for each line of the log; therefore, precompiling the regex can have computational advantages.
- Pattern matching is done for each line in the log:

```python
for line in access_log_lines:
    log_data.append(pattern.match(line).groupdict())
```

The groupdict() method returns a dictionary with the group names as the keys and the matched strings as the values for those keys. The dictionary for each line is appended to the log_data list.
We leave it up to the reader to interpret how each regex part goes about capturing the desired string.
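To see the whole pattern in action outside of Power BI, here is a condensed, self-contained sketch; the sample log line is an illustrative combined-format line, not one taken from the apache_logs.txt file:

```python
import re

# The same regex parts used in the chapter's script
regex_parts = [
    r'(?P<hostName>\S+)',
    r'\S+',
    r'(?P<userName>\S+)',
    r'\[(?P<requestDateTime>[\w:/]+\s[+\-]\d{4})\]',
    r'"(?P<requestContent>\S+\s?\S+?\s?\S+?)"',
    r'(?P<requestStatus>\d{3}|-)',
    r'(?P<responseSizeBytes>\d+|-)',
    r'"(?P<requestReferrer>[^"]*)"',
    r'"(?P<requestAgent>[^"]*)?"',
]
pattern = re.compile(r'\s+'.join(regex_parts) + r'$')

# A made-up combined-format line used only for illustration:
line = ('83.149.9.216 - - [17/May/2015:10:05:03 +0000] '
        '"GET /presentations/logstash HTTP/1.1" 200 9316 '
        '"http://semicomplete.com/" "Mozilla/5.0"')

row = pattern.match(line).groupdict()
print(row['hostName'], row['requestStatus'], row['responseSizeBytes'])
```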
Now that we’ve clarified a few points in the code, let’s import the log into Power BI:
- Open Power BI Desktop through the shortcut that activates the pbi_powerquery_env environment.
- Make sure Power BI Desktop is referencing the pbi_powerquery_env environment in Options.
- Go to Get Data, More… and select Python script.
- Copy and paste the script from the 01-apache-access-log-parser-python.py file into the Python script editor, making sure to change the path to the Apache log file. Then click on OK.
- Then, select the df dataframe from the Navigator window and click on Load:
Figure 6.30: Selecting the df dataframe returned by the Python script
- If you click on the Data icon, you can view the entire log loaded as a structured table:
Figure 6.31: The Apache access log is loaded in Power BI with Python
Awesome! As you can see, Python has done all the magic of transforming semi-structured data into table-structured data thanks to regexes! So now you can easily import what looked like a complex log file to manage into Power BI.
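Once the log is loaded, the “most clicked pages” question your manager asked at the start of this section becomes a simple aggregation. Here is a hedged sketch with a tiny made-up dataframe standing in for the df returned by the script; the column name matches the requestContent capturing group used earlier:

```python
import pandas as pd

# Tiny illustrative dataframe; in Power BI, df would come from the parsing script.
df = pd.DataFrame({
    'requestContent': [
        'GET /index.html HTTP/1.0',
        'GET /about.html HTTP/1.0',
        'GET /index.html HTTP/1.0',
    ]
})

# The requested page is the second token of the request line ("GET <page> <protocol>").
df['page'] = df['requestContent'].str.split().str[1]
top_pages = df['page'].value_counts()
print(top_pages.head())
```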
Importing Apache access logs in Power BI with R
In this section, you will load the information of the apache_logs.txt file again, but this time using an R script.
Compared to what you’ve learned previously about regexes in R, in the 02-apache-access-log-parser-r.R script (which you’ll find in the same preceding folder), you’ll encounter these new constructs:
- To read a text file line by line in R, you’ll use the read_lines() function from the readr package. Specifically, you’re going to read each line of the apache_logs.txt file in order to persist them to a vector.
- In order to take full advantage of named capturing groups in R, you need to install and use the features of a package called namedCapture. Thanks to this package, both regex syntaxes for named groups are allowed: the standard (?<group-name>…) regex syntax and the (?P<group-name>…) regex syntax.
- Just as we did in the Python script, in R you’ll also define a vector of regex parts, which you’ll merge with the paste(..., collapse = '...') function. Essentially, the regex components within the vector are combined using the text specified in the collapse argument. Specifically, this function will join the regex parts together through the \s+ separator. After merging all of the parts, the $ character is added to the end of the resulting string using the paste0(…) function. Remember that raw strings have a different syntax in R than in Python; in this case, we will use the r'{...}' syntax:

```r
regex_parts <- c(
    r'{(?P<hostName>\S+)}',
    r'{\S+}',
    r'{(?P<userName>\S+)}',
    r'{\[(?P<requestDateTime>[\w:/]+\s[+\-]\d{4})\]}',
    r'{"(?P<requestContent>\S+\s?\S+?\s?\S+?)"}',
    r'{(?P<requestStatus>\d{3}|-)}',
    r'{(?P<responseSizeBytes>\d+|-)}',
    r'{"(?P<requestReferrer>[^"]*)"}',
    r'{"(?P<requestAgent>[^"]*)?"}'
)

pattern <- paste0(
    paste(regex_parts, collapse = r'{\s+}'),
    '$'
)
```
- Pattern matching is done using the str_match_named() function of the namedCapture package over the whole log vector, using a single command:

```r
df <- as.data.frame(
    str_match_named(
        access_log_lines,
        pattern = pattern
    )
)
```
Again, we leave it to the reader to interpret how each individual regex part goes about capturing the desired string.
IMPORTANT NOTE
Toby Dylan Hocking, author of the namedCapture package, has developed a newer package called nc, which provides user-friendly functions for extracting a data table from non-tabular text data using regular expressions. For the sake of training on this very topic, we use the older namedCapture package in this case, because it avoids hiding the construction of the entire pattern used for matching.
Now that we’ve clarified a few points in the code, let’s import the log into Power BI:
- First, you need to install the namedCapture package. So, open RStudio and make sure that the engine being referenced is the latest one (CRAN R 4.2.2 in our case) in Global Options.
- Now, go to the console and enter and run the following code:

```r
install.packages("namedCapture")
```

- Open Power BI Desktop, go to Get Data, More… and select R script.
- Copy and paste the script from the 02-apache-access-log-parser-r.R file into the R script editor and then click on OK.
- Then, select the df dataframe from the Navigator window and click on Load:
Figure 6.32: Selecting the df dataframe returned by the R script
- If you click on the Data icon, you can view the entire log loaded as a structured table:
Figure 6.33: The Apache access log loaded in Power BI with R
Great job! You were able to import a semi-structured log file into Power BI even with R.