As the field of data science is still fairly new (it used to be called operations research), the tools and frameworks are also fairly new. With that being said, the command line is almost 50 years old and still one of the most powerful tools used today. If you're familiar with interpreters, the command line will come easy to you. Think of it as a place to experiment and see your results in real time. Every command you enter is executed interactively, and when you call a bash script to run, it executes sequentially (unless you decide not to, more in later chapters). As we know, experimenting and exploring is most of what data science tries to accomplish (and it's the most fun!).
I was having a conversation with a newly-graduated data science student about parsing text and asked, "How would you take a small file and provide a word count on how many time the words appear?" By now everyone is familiar with the infamous Hadoop word-count example. It's considered the "Hello, World" of data science.
The answer I received was a little shocking but expected. The student instantly replied that they'd use Hadoop to read the file, tokenize the words to form a key/value pair, reduce all the keys and values that are grouped together, and add up the occurrences. The student isn't wrong, in fact, that's a perfectly acceptable answer. Especially if the file is too large for a single system (big data), you already have the code in place to scale.
With that being said, what if I told you there's a quicker way to obtain the results that doesn't require programming in Java and setting up a cluster or having Hadoop run locally? In fact, it would only take one line to complete the task? Check out the following code:
cat file.txt | tr '[:space:]' '[\n*]' | grep -v "^$" | sort | uniq -c | sort -bnr
(tr '[:space:]' '[\n*]' | grep -v "^$" | sort | uniq -c | sort -bnr )<file.txt
This may seem like a lot, especially if you've never used the command line before, so let's break it down. The cat command reads files sequentially and writes them to standard output. |, also known as pipe or the pipe operator, combines a sequence of commands chained together by their standard streams so that the output of each process (stdout) feeds directly as input (stdin) to the next one. tr (translate) reads the input from cat (via | ) and writes the result to standard output that replaces spaces with new lines. The grep command is very powerful and the most used for a lot of data parsing. grep is used to search plain-text data for lines that match a regular expression. In this example, grep trims out the empty lines. sort is used for, well, sorting! You'll notice a lot of the commands are named for what they actually do. The sort command prints the lines of its input or concatenation of files listed in its argument list in sorted order. uniq is a command that, when fed a text file, outputs the file with adjacent identical lines collapsed to one. It usually works well with the sort command. In this example, uniq -c is called to count occurrences. And finally, sort -bnr sorts in numeric reverse order and ignores whitespace.
Don't worry if the example looks foreign to you. The command line also comes with manual pages for each command. All you have to do is man the command to view the page. You can even man man to get an idea of what the man command does! Give it a whirl and man tr or man sort. Oh, you don't have the command line set up? It's easier than you think, and we can get you up in running in minutes, so let's get started.