Extending Power BI with Python and R

You're reading from Extending Power BI with Python and R: Perform advanced analysis using the power of analytical languages

Product type: Paperback
Published in: Mar 2024
Publisher: Packt
ISBN-13: 9781837639533
Length: 814 pages
Edition: 2nd Edition
Author: Luca Zavarella
Table of Contents (27 chapters)

Preface
1. Where and How to Use R and Python Scripts in Power BI
2. Configuring R with Power BI
3. Configuring Python with Power BI
4. Solving Common Issues When Using Python and R in Power BI
5. Importing Unhandled Data Objects
6. Using Regular Expressions in Power BI
7. Anonymizing and Pseudonymizing Your Data in Power BI
8. Logging Data from Power BI to External Sources
9. Loading Large Datasets Beyond the Available RAM in Power BI
10. Boosting Data Loading Speed in Power BI with Parquet Format
11. Calling External APIs to Enrich Your Data
12. Calculating Columns Using Complex Algorithms: Distances
13. Calculating Columns Using Complex Algorithms: Fuzzy Matching
14. Calculating Columns Using Complex Algorithms: Optimization Problems
15. Adding Statistical Insights: Associations
16. Adding Statistical Insights: Outliers and Missing Values
17. Using Machine Learning without Premium or Embedded Capacity
18. Using SQL Server External Languages for Advanced Analytics and ML Integration in Power BI
19. Exploratory Data Analysis
20. Using the Grammar of Graphics in Python with plotnine
21. Advanced Visualizations
22. Interactive R Custom Visuals
23. Other Books You May Enjoy
24. Index
Appendix 1: Answers
Appendix 2: Glossary

Loading complex log files using regex in Power BI

Log files are a very useful tool for developers and administrators of computer systems. They record what happened to the system, when it happened, and which user caused the event. Thanks to these files, you can find information about any system failure and diagnose its causes more quickly.

Logs are often semi-structured data: information that cannot be persisted in a relational database in the format in which it was generated. To be analyzed with standard tools, this data must first be transformed into a more suitable format.

Because log data is not structured, it is difficult to import into Power BI as is, unless someone has developed a custom connector for it. In these scenarios, using a regex in a language like Python or R can help us get the results we want.

Apache access logs

Let’s suppose your company has a website published through an Apache web server. Your manager asks you to carry out an analysis regarding which web pages of the site are the most clicked on. The only way to get this information is to analyze the access log file. This file records data about all requests made to the web server. Here is an example of an Apache access log:


Figure 6.29: An example of an Apache access log

As you can see, at first glance, there is a fairly organized structure to the information in this log. If no one has customized the output of the Apache log files, Apache uses the Common Log Format (CLF) by default. You can find a real example of an Apache access log in the apache_logs.txt file, which is inside the repository that comes with this book, in the 02-loading-complex-log-files-using-regex folder. You can also find it in the GitHub repository at http://bit.ly/apache-access-log (click on Download to view it).

If you read the Apache documentation for these log files, you will deduce that the information recorded in this access log follows the NCSA extended/combined log format. The data that is recorded is as follows:

  1. The remote host name (the IP address).
  2. The remote log name (if empty, you’ll find a dash; it is not used in the sample file).
  3. The remote user if the request was authenticated (if empty, you’ll find a dash).
  4. The datetime that the request was received, in the [18/Sep/2011:19:18:28 -0400] format.
  5. The first line of the request made to the server between double quotes.
  6. The HTTP status code for the request.
  7. The size of the response in bytes, excluding the HTTP headers (could be a dash).
  8. The Referer HTTP request header, which contains the absolute or partial address of the page making the request.
  9. The User-Agent HTTP request header, which contains a string that identifies the application, operating system, vendor, and/or version of the requesting user agent.

Once you know both the nature of the information written in the log and the form in which it is written, you can take advantage of the powerful tools provided by regexes to better structure this information and import it into Power BI.

Importing Apache access logs in Power BI with Python

As mentioned earlier, you can find a real example of an Apache access log in the apache_logs.txt file, which is inside the repository that comes with this book, in the 02-loading-complex-log-files-using-regex folder. You will load the information in this file using a Python script, not a Power BI connector.

Compared to what you've learned before about regexes and Python, in the 01-apache-access-log-parser-python.py script (which you'll find in the preceding folder), you will encounter a few new constructs, which we highlight below.

To be able to read a text file line by line in Python, you'll use the open(file, mode) function and the readlines() method. Specifically, you're going to open the apache_logs.txt file as read-only ('r') and read each of its lines to store them in a list.
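As a minimal sketch of this step, the snippet below first writes a tiny two-line sample log to a temporary file (the two lines are invented for illustration; in the book's script, the path would point to apache_logs.txt) and then reads it back line by line:

```python
import os
import tempfile

# Two invented lines in the Apache combined log format, for illustration only
sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/4.08"\n'
    '10.0.0.5 - - [10/Oct/2000:13:56:01 -0700] '
    '"GET /logo.gif HTTP/1.0" 404 153 "-" "Mozilla/4.08"\n'
)

# Stand-in for the path to apache_logs.txt in the book's repository
log_path = os.path.join(tempfile.gettempdir(), 'apache_logs_sample.txt')
with open(log_path, 'w') as f:
    f.write(sample)

# Open the file as read-only ('r') and store each line in a list
with open(log_path, 'r') as f:
    access_log_lines = f.readlines()

print(len(access_log_lines))  # 2
```

Each element of access_log_lines is one raw log line (including its trailing newline), ready to be matched against the regex pattern built next.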

In regexes, it is possible to refer to groups delimited by round brackets not only by a numerical index but also by a name, thanks to named capturing groups. The generic regex syntax used to assign a name to a group is (?&lt;group-name&gt;…); in Python, it is (?P&lt;group-name&gt;…):

  • In Python, you can define a list of regex parts that can be merged (join) using a separator. In our case, the separator is defined by one or more space characters (\s+). Also note that in the various patterns below, the character class \S (uppercase) is used to denote a character that is not one of the spaces identified by \s (lowercase):
    regex_parts = [
        r'(?P<hostName>\S+)',
        r'\S+',
        r'(?P<userName>\S+)',
        r'\[(?P<requestDateTime>[\w:/]+\s[+\-]\d{4})\]',
        r'"(?P<requestContent>\S+\s?\S+?\s?\S+?)"',
        r'(?P<requestStatus>\d{3}|-)',
        r'(?P<responseSizeBytes>\d+|-)',
        r'"(?P<requestReferrer>[^"]*)"',
        r'"(?P<requestAgent>[^"]*)?"',
    ]
    pattern = re.compile(r'\s+'.join(regex_parts) + r'$')
    

    Note that, in this case, the re.compile() function is used since the match must be done many times on all lines of the log; therefore, precompiling the regex could have computational advantages.

  • Pattern matching is done for each line in the log:
    for line in access_log_lines:    
        log_data.append(pattern.match(line).groupdict())
    

    The groupdict() method returns a dictionary with the group names as the key and the matched strings as the value for that key. All the dictionaries for each line are appended to the log_data list.

We leave it up to the reader to interpret how each regex part goes about capturing the desired string.
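To make the flow above concrete, here is a self-contained sketch that applies the same pattern to a single log line. The line itself is invented for the example, but it follows the NCSA combined format described earlier:

```python
import re

# The same regex parts used in the book's script
regex_parts = [
    r'(?P<hostName>\S+)',
    r'\S+',
    r'(?P<userName>\S+)',
    r'\[(?P<requestDateTime>[\w:/]+\s[+\-]\d{4})\]',
    r'"(?P<requestContent>\S+\s?\S+?\s?\S+?)"',
    r'(?P<requestStatus>\d{3}|-)',
    r'(?P<responseSizeBytes>\d+|-)',
    r'"(?P<requestReferrer>[^"]*)"',
    r'"(?P<requestAgent>[^"]*)?"',
]
# Join the parts with one-or-more-spaces separators and anchor at end of line
pattern = re.compile(r'\s+'.join(regex_parts) + r'$')

# An invented line in the combined log format
line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I)"')

# groupdict() maps each group name to the substring it captured
fields = pattern.match(line).groupdict()

print(fields['hostName'])        # '127.0.0.1'
print(fields['userName'])        # 'frank'
print(fields['requestContent'])  # 'GET /apache_pb.gif HTTP/1.0'
print(fields['requestStatus'])   # '200'
```

Note that pattern.match(line) returns None for a line that does not match, so in a real script you may want to guard against that before calling groupdict().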

Now that we’ve clarified a few points in the code, let’s import the log into Power BI:

  1. Open Power BI Desktop through the shortcut that activates the pbi_powerquery_env environment.
  2. Make sure Power BI Desktop is referencing the pbi_powerquery_env environment in Options.
  3. Go to Get Data, choose More…, and select Python script.
  4. Copy and paste the script from the 01-apache-access-log-parser-python.py file into the Python script editor, making sure to change the path to the Apache log file. Then click on OK.
  5. Then, select the df dataframe from the Navigator window and click on Load:

Figure 6.30: Selecting the df dataframe returned by the Python script

  6. If you click on the Data icon, you can view the entire log loaded as a structured table:

Figure 6.31: The Apache access log is loaded in Power BI with Python

Awesome! As you can see, Python has done all the magic of transforming semi-structured data into table-structured data thanks to regexes! So now you can easily import into Power BI what looked like a complex log file to manage.

Importing Apache access logs in Power BI with R

In this section, you will load the information of the apache_logs.txt file, but this time, using an R script.

Compared to what you’ve learned previously about regexes in R, in the 02-apache-access-log-parser-r.R script (which you’ll find in the same preceding folder), you’ll encounter these new constructs:

  • To be able to read a text file line by line in R, you’ll use the read_lines() function from the readr package. Specifically, you’re going to read each line of the apache_logs.txt file in order to persist them to a vector.
  • In order to take full advantage of named capturing groups in R, you need to install and use the features of a package called namedCapture. Thanks to this package, both regex syntaxes for named groups are allowed: the standard (?<group-name>…) regex syntax and the (?P<group-name>…) regex syntax.
  • Just as we did in the Python script, in R, you’ll also define a vector of regex parts, which you’ll merge with the paste(..., collapse = '...') function. Essentially, the regex components within the vector are combined using the text specified in the collapse argument. Specifically, this function will join regex parts together through the \s+ separator. After merging all of the parts, the $ character is added to the end of the resulting string using the paste0(…) function. Remember that raw strings have a different syntax in R than in Python. In this case, we will use the r'{...}' syntax:
    regex_parts <- c(
        r'{(?P<hostName>\S+)}'
      , r'{\S+}'
      , r'{(?P<userName>\S+)}'
      , r'{\[(?P<requestDateTime>[\w:/]+\s[+\-]\d{4})\]}'
      , r'{"(?P<requestContent>\S+\s?\S+?\s?\S+?)"}'
      , r'{(?P<requestStatus>\d{3}|-)}'
      , r'{(?P<responseSizeBytes>\d+|-)}'
      , r'{"(?P<requestReferrer>[^"]*)"}'
      , r'{"(?P<requestAgent>[^"]*)?"}'
    )
    pattern <- paste0( paste(regex_parts, collapse = r'{\s+}'), '$' )
    
  • Pattern matching is done using the str_match_named() function of the namedCapture package over the whole log vector, using a single-line command:
    df <- as.data.frame( str_match_named( access_log_lines, pattern = pattern ) )
    

Again, we leave it to the reader to interpret how each individual regex part goes about capturing the desired string.

IMPORTANT NOTE

Toby Dylan Hocking, the author of the namedCapture package, has developed a newer package called nc, which offers user-friendly functions for extracting a data table from non-tabular text data using regular expressions. For the sake of training on this very topic, we use the older namedCapture package here, because it doesn't hide the construction of the entire pattern used for matching.

Now that we’ve clarified a few points in the code, let’s import the log into Power BI:

  1. First, you need to install the namedCapture package. So, open RStudio and make sure that the engine being referenced is the latest one (CRAN R 4.2.2 in our case) in Global Options.
  2. Now, go to the console and enter and run the following code:
    install.packages("namedCapture")
    
  3. Open Power BI Desktop, go to Get Data, More … and select the R script.
  4. Copy and paste the script from the 02-apache-access-log-parser-r.R file into the R script editor and then click on OK.
  5. Then, select the df dataframe from the Navigator window and click on Load:

Figure 6.32: Selecting the df dataframe returned by the R script

  6. If you click on the Data icon, you can view the entire log loaded as a structured table:

Figure 6.33: The Apache access log loaded in Power BI with R

Great job! You were able to import a semi-structured log file into Power BI with R as well.
