Data science is the ‘sexiest job’ of 21st century’ (at least until Skynet takes over the world). But for all the emphasis on ‘Data’, it is the ‘Science’ that makes you - the practitioner - valuable. To practice high-quality science with data, first you need to make sure it is properly sourced, cleaned, formatted, and pre-processed.
This course teaches you the most essential basics of this invaluable component of the data science pipeline – data wrangling.
“Data is the new Oil” and it is ruling the modern way of life through incredibly smart tools and transformative technologies. But oil from the rig is far from being usable. It has to be refined through a complex processing network.
Similarly, data needs to be curated, massaged and refined to become fit for use in intelligent algorithms and consumer products. This is called “wrangling” and (according to CrowdFlower) all good data scientists spend almost 60-80% of their time on this, each day, every project.
It generally involves the following:
This course aims to teach you all the core ideas behind this process and to equip you with the knowledge of the most popular tools and techniques in the domain. As the programming framework, we have chosen Python, the most widely used language for data science. We work through real-life examples and at the end of this course, you will be confident to handle a myriad array of sources to extract, clean, transform, and format your data for further analysis or exciting machine learning model building.
The lessons start with a refresher on Python focusing mainly on advanced data structures, and then quickly jumping into NumPy and Panda libraries as fundamental tools for data wrangling.
It emphasizes why you should stay away from traditional ways of data cleaning, as done in other languages, and take advantage of specialized pre-built routines in Python.
Thereafter, it covers how using the same Python backend, one can extract and transform data from a diverse array of sources - internet, large database vaults, or Excel financial tables.
Further lessons teach how to handle missing or wrong data, and reformat based on the requirement from a downstream analytics tool.
The course emphasizes learning by real example and showcases the power of an inquisitive and imaginative mind primed for success.
First, let us be clear that there is no substitute for the data wrangling process itself. There is no short-cut either. Data wrangling must be performed before the modeling task but there is always the debate of doing this process using an enterprise tool or by directly using a programming language and associated frameworks. There are many commercial, enterprise-level tools for data formatting and pre-processing, which does not involve coding on the part of the user.
Common examples of such tools are:
At the end of the day, it really depends on the organizational approach whether to use any of these off-the-shelf tools or to have more flexibility, control, and power by using a programming language like Python to perform data wrangling. As the volume, velocity, and variety (three V’s of Big Data) of data undergo rapid changes, it is always a good idea to develop and nurture significant amount of in-house expertise in data wrangling. This is done using fundamental programming frameworks so that an organization is not betrothed to the whims and fancies of any particular enterprise platform as a basic task as data wrangling.
Some of the obvious advantages of using an open-source, free programming paradigm like Python for data wrangling are:
Here are five best practices that will help you out in your data wrangling journey with Python. And in the end, all you’ll have is clean and ready to use data for your business needs.
Though data wrangling is an important task, there are certain myths associated with data wrangling which developers should be cautious of. Myths such as:
Learn in detail about these misconceptions. We hope that these misconceptions would help you realize that data wrangling is not as difficult as it seems. Have fun wrangling data!
Dr. Tirthajyoti Sarkar works as a Sr. Principal Engineer in the semiconductor technology domain where he applies cutting-edge data science/machine learning techniques for design automation and predictive analytics.
Shubhadeep Roychowdhury works as a Sr. Software Engineer at a Paris based Cyber Security startup. He holds a Master Degree in Computer Science from West Bengal University Of Technology and certifications in Machine Learning from Stanford.
5 best practices to perform data wrangling with Python
4 misconceptions about data wrangling
Data cleaning is the worst part of data analysis, say data scientists