You're reading from Hands-On Data Science with the Command Line: Automate everyday data science tasks using command-line tools

Product type Paperback

Published in Jan 2019

Publisher Packt

ISBN-13 9781789132984

Length 124 pages

Edition 1st Edition

Languages

Python

Tools

UNIX

Concepts

Data Science

Authors (3):

Jason Morris

Raymond Page

Chris McCubbin


Table of Contents (8) Chapters

Preface

1. Data Science at the Command Line and Setting It Up FREE CHAPTER

2. Essential Commands

3. Shell Workflows, and Data Acquisition and Massaging

4. Bash Functions and Data Visualization

5. Loops, Functions, and String Processing

6. SQL, Math, and Wrapping it up

7. Other Books You May Enjoy

cut and viewing data as columnar

The first thing you will likely need to do is partition the data in your files into rows and columns. The previous chapters covered some transformations that manipulate data one row at a time. For this chapter, we'll assume the rows of your data correspond to the lines in your files. If they don't, converting the data to one row per line may be the first step you want in your pipeline.

Given some rows of data in a file or stream, we would like to view those rows in a columnar fashion, as in a traditional database. We can do this with the help of the cut command. cut chops each line into columns at a delimiter and selects which of those columns are passed through to the output.
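As a minimal sketch of that idea, the snippet below splits a few colon-delimited lines into columns and keeps only some of them. The sample rows and variable names are made up for illustration and do not come from the book's dataset; only the standard -d (delimiter) and -f (field list) options of cut are used.

```shell
# Hypothetical sample data: three colon-delimited rows (user, uid, home).
rows='alice:1001:/home/alice
bob:1002:/home/bob
carol:1003:/home/carol'

# -d sets the delimiter; -f selects fields (columns), numbered from 1.
users=$(printf '%s\n' "$rows" | cut -d ':' -f 1)

# A comma-separated field list keeps several columns; cut joins them
# in the output with the same delimiter it split on.
users_and_homes=$(printf '%s\n' "$rows" | cut -d ':' -f 1,3)

printf '%s\n' "$users"
printf '%s\n' "$users_and_homes"
```

Fields are always emitted in their original order, so a list such as -f 3,1 selects the same columns as -f 1,3 rather than reordering them.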

If your data is a comma-separated or tab-separated file, cut is quite simple:

zcat amazon_reviews_us_Digital_Ebook_Purchase_v1_01...
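The command above is cut off in this excerpt, so what follows is a separate sketch rather than a reconstruction of it. It assumes a small, hypothetical gzip-compressed tab-separated file (sample_reviews.tsv.gz, created on the spot) and shows the general pattern of decompressing to a stream and passing it through cut. Here gzip -dc stands in for zcat, which behaves the same way on .gz files.

```shell
# Hypothetical stand-in for a compressed tab-separated file;
# the file name and its contents are invented for this example.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample_reviews.tsv.gz"
printf 'id\ttitle\trating\n1\tDune\t5\n2\tEmma\t4\n' | gzip -c > "$sample"

# Tab is cut's default delimiter, so -d can be omitted for TSV input.
# Decompress to stdout, keep columns 2 and 3, and show the first 3 lines.
columns=$(gzip -dc "$sample" | cut -f 2,3 | head -n 3)

printf '%s\n' "$columns"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

For a comma-separated file the same pipeline would only need -d ',' added to the cut call, although cut does not understand quoted fields that themselves contain commas.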

Authors (3)

Morris

Jason Morris is a systems and research engineer with over 19 years of experience in system architecture, research engineering, and large data analysis. His primary focus is machine learning with TensorFlow, CUDA, and Apache Spark. Jason is also a speaker and a consultant for designing large-scale architectures, implementing best security practices in the cloud, creating near real-time image detection analytics with deep learning, and developing serverless architectures to aid in ETL. His most recent roles include solution architect, big data engineer, big data specialist, and instructor at Amazon Web Services. He is currently the Chief Technology Officer of Next Rev Technologies, and his favorite command-line program is netcat.

McCubbin

Chris McCubbin is a data scientist and software developer with 20 years of experience in developing complex systems and analytics. He co-founded the successful big data security startup Sqrrl, since acquired by Amazon. He has also developed smart swarming systems for drones, social network analysis systems in MapReduce, and big data security analytic platforms using the Apache projects Accumulo and Spark. He has used the Unix command line since college, starting on IRIX platforms, and his favorite command-line program is find.

Page

Raymond Page is a computer engineer specializing in site reliability. His experience with embedded development engendered a passion for removing the pervasive bloat from web technologies and cloud computing. His favorite command is cat.
