Examining big-text log file access
MonitorWare is a network monitoring solution for Windows machines. It has sample log files that show access to different systems. I downloaded the HTTP log file sample set from http://www.monitorware.com/en/logsamples/apache.php. The log file has entries for different HTTP requests made to a server.
The URL downloads the apache-samples.rar file. A .rar file is a compressed archive format, often used for very large files, although this example is only 20 KB. You need to extract the log file from the .rar file before running the following code.
How to do it...
We can use a similar script to load the file, then apply additional functions to pull out the records of interest. The code is:
import pyspark
if 'sc' not in globals():
    sc = pyspark.SparkContext()

textFile = sc.textFile("access_log")
print(textFile.count(), "access records")

gets = textFile.filter(lambda line: "GET" in line)
print(gets.count(), "GETs")

posts...
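The filter-and-count logic can be checked without a Spark context. The following is a minimal sketch in plain Python, not the recipe's own code. The three log lines are hypothetical samples written in the Apache common log style rather than taken from the downloaded file, and `count_matching` is an illustrative helper name.

```python
# Hypothetical sample lines in Apache common log style (NOT from the real file).
sample_lines = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "POST /login HTTP/1.1" 302 0',
    '10.0.0.1 - - [01/Jan/2024:00:00:03 +0000] "GET /missing HTTP/1.1" 404 128',
]


def count_matching(lines, token):
    """Count lines containing token (illustrative helper).

    Plays the same role as textFile.filter(lambda line: token in line).count()
    in the Spark script, but runs on an ordinary Python list.
    """
    return sum(1 for line in lines if token in line)


total = len(sample_lines)
get_count = count_matching(sample_lines, "GET")
post_count = count_matching(sample_lines, "POST")

print(total, "access records")
print(get_count, "GETs")
print(post_count, "POSTs")
```

Note that a plain substring test counts any line containing the letters GET, including one whose URL happens to contain them. Matching on `'"GET '` (opening quote and trailing space included) would be a stricter check, assuming the request field is quoted as in the samples above.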