Data munging
The arsenal of tools for data munging is huge, and while we will focus on Python, we want to mention some other useful tools as well. If they are available on your system and you expect to work a lot with data, they are worth learning.
One group of tools belongs to the UNIX tradition, which emphasizes text processing and as a consequence has, over the last four decades, developed many high-performance and battle-tested tools for dealing with text. Some common tools are: sed, grep, awk, sort, uniq, tr, cut, tail, and head. They do very elementary things, such as filtering out lines (grep) or columns (cut) from files, replacing text (sed, tr) or displaying only parts of files (head, tail).
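Since our focus is on Python, it may help to see how elementary these operations are when written out. The following is only a rough sketch: the function names grep, cut, and head are borrowed from the UNIX tools for illustration, the sample records are made up, and the real tools support many more options (regular expressions, byte ranges, and so on) than these few lines do.

```python
from itertools import islice


def grep(lines, needle):
    """Yield only the lines that contain `needle` (plain substring match, no regex)."""
    return (line for line in lines if needle in line)


def cut(lines, field, delimiter=" "):
    """Yield the 1-based `field` of each line, or "" if the line has too few fields."""
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        yield parts[field - 1] if field <= len(parts) else ""


def head(lines, n=10):
    """Yield at most the first `n` items."""
    return islice(lines, n)


# Made-up sample records (name, status, path), for illustration only.
sample = [
    "alice 200 /index.html\n",
    "bob 404 /missing\n",
    "carol 200 /about.html\n",
    "dave 200 /index.html\n",
]

# Chain the steps like a shell pipeline: filter lines, pick a column, keep the first two.
result = list(head(cut(grep(sample, " 200 "), field=1), n=2))
print(result)
```

Because each function consumes and produces an iterator, the steps compose much like commands joined by pipes on the command line, and no step needs to hold the whole input in memory.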
We want to demonstrate the power of these tools with just a single example.
Imagine you are handed the log files of a web server and you are interested in the distribution of the IP addresses.
Each line of the log file contains an entry in the common log server format (you can download this data set from...