You're reading from Hands-On Data Science with the Command Line: Automate everyday data science tasks using command-line tools

Product type Paperback

Published in Jan 2019

Publisher Packt

ISBN-13 9781789132984

Length 124 pages

Edition 1st Edition

Languages

Python

Tools

UNIX

Concepts

Data Science

Authors (3):

Jason Morris

Raymond Page

Chris McCubbin

Table of Contents (8) Chapters

Preface

1. Data Science at the Command Line and Setting It Up

2. Essential Commands

3. Shell Workflows, and Data Acquisition and Massaging

4. Bash Functions and Data Visualization

5. Loops, Functions, and String Processing

6. SQL, Math, and Wrapping it up

7. Other Books You May Enjoy

Simulating selects

In the previous sections, we saw how to SELECT data, INNER JOIN data, and even perform GROUP BY and ORDER BY operations on flat files or streams of data. To round out the commonly used operations, we can also create sub-selected tables of data by wrapping a set of calls into a stream and then processing that stream further. This is what we have been doing all along with the piping model: each stage's output acts as an intermediate table for the next stage. To illustrate, say we want to sub-select, out of the reviews grouped by reviewer, only those reviewers who have between 100 and 200 reviews (inclusive). We can take the command from the preceding example and pipe it through awk once more, filtering on the third column, which holds each reviewer's review count:

zcat amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz |
  cut -d$'\t' -f2,8 |
  awk '{sum[$1]+=$2; count[$1]+=1} END {for (i in sum) {print i, sum[i], count[i], sum[i]/count[i]}}' |
  sort -k3 -r -n |
  awk '$3 >= 100 && $3 <= 200' |
  head
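To try the same sub-select pattern without downloading the full dataset, the following is a minimal sketch that runs on a few made-up rows. The function names subselect_by_count and sample_reviews, the sample reviewer names, and the two-column "reviewer rating" input layout are all assumptions made for illustration; they do not come from the book. The final sort adds a tie-breaker on the reviewer column only so that the output order is predictable.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the book): group (key, value) rows read from
# stdin by key, then keep only the groups whose row count lies in [min, max].
# Output columns: key, sum, count, average.
subselect_by_count() {
  local min="$1" max="$2"
  # Stage 1: the "inner query" (GROUP BY key with SUM, COUNT, and AVG).
  awk '{ sum[$1] += $2; count[$1] += 1 }
       END { for (k in sum) print k, sum[k], count[k], sum[k] / count[k] }' |
    # Stage 2: the "outer query" (filter the grouped rows on the count column).
    awk -v min="$min" -v max="$max" '$3 >= min && $3 <= max' |
    # Stage 3: ORDER BY count descending, then key, for a stable order.
    sort -k3,3nr -k1,1
}

# Made-up sample data: reviewer id and star rating, one review per line.
sample_reviews() {
  printf '%s\n' \
    'alice 5' 'alice 3' \
    'bob 4' \
    'carol 2' 'carol 4' 'carol 3' \
    'dave 5' 'dave 5' 'dave 5' 'dave 1'
}

# Keep only the reviewers who have between 2 and 3 reviews.
result="$(sample_reviews | subselect_by_count 2 3)"
printf '%s\n' "$result"
```

With this sample input, the script prints "carol 9 3 3" followed by "alice 8 2 4": bob (1 review) and dave (4 reviews) fall outside the requested range. Passing the bounds in with awk -v keeps the filter reusable instead of hard-coding 100 and 200.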

...

Authors (3)

Jason Morris is a systems and research engineer with over 19 years of experience in system architecture, research engineering, and large data analysis. His primary focus is machine learning with TensorFlow, CUDA, and Apache Spark. Jason is also a speaker and a consultant for designing large-scale architectures, implementing best security practices on the cloud, creating near real-time image detection analytics with deep learning, and developing serverless architectures to aid in ETL. His most recent roles include solution architect, big data engineer, big data specialist, and instructor at Amazon Web Services. He is currently the Chief Technology Officer of Next Rev Technologies, and his favorite command-line program is netcat.

Raymond Page is a computer engineer specializing in site reliability. His experience with embedded development engendered a passion for removing the pervasive bloat from web technologies and cloud computing. His favorite command is cat.

Chris McCubbin is a data scientist and software developer with 20 years of experience in developing complex systems and analytics. He co-founded the successful big data security startup Sqrrl, since acquired by Amazon. He has also developed smart swarming systems for drones, social network analysis systems in MapReduce, and big data security analytic platforms using the Apache projects Accumulo and Spark. He has been using the Unix command line since his college days on IRIX platforms, and his favorite command-line program is find.
