Wrangling data with pixiedust_rosie
Working in a controlled experiment is, most of the time, not the same as working in the real world. By this I mean that, during development, we usually pick (or, I should say, manufacture) a sample dataset that is designed to behave: it has the right format, it complies with the schema specification, no data is missing, and so on. The goal is to focus on verifying the hypotheses and building the algorithms, and not so much on data cleansing, which can be very painful and time-consuming. However, there is an undeniable benefit to getting data that is as close to the real thing as possible, as early as possible in the development process. To help with this task, I worked with two IBM colleagues, Jamie Jennings and Terry Antony, who volunteered to build an extension to PixieDust called pixiedust_rosie.
This Python package implements a simple wrangle_data() method to automate the cleansing of raw data. The pixiedust_rosie package currently supports CSV and JSON, but more formats...
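To make "automating the cleansing of raw data" a little more concrete, here is a minimal sketch of the kind of work such a tool takes off your hands: normalizing whitespace, flagging missing values, and guessing a type for each column of a CSV file. It uses only the Python standard library and is not how pixiedust_rosie or wrangle_data() is implemented. The sample data, the MISSING_TOKENS set, and the clean_rows() and infer_type() helpers are all assumptions made up for illustration.

```python
import csv
import io

# Hypothetical raw input: stray whitespace, an empty cell, and a
# placeholder token ("n/a") in what should be a numeric column.
RAW_CSV = """name, age, city
Alice, 34, Austin
Bob, , Boston
Carol, n/a, Chicago
"""

# Assumed set of tokens that should be treated as "no value".
MISSING_TOKENS = {"", "n/a", "na", "null", "none"}


def clean_rows(text):
    """Parse CSV text, strip whitespace, and map missing tokens to None."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for record in reader:
        cleaned = {}
        for key, value in record.items():
            value = (value or "").strip()
            is_missing = value.lower() in MISSING_TOKENS
            cleaned[key.strip()] = None if is_missing else value
        rows.append(cleaned)
    return rows


def infer_type(values):
    """Guess 'int', 'float', or 'str' from the non-missing values of a column."""
    present = [v for v in values if v is not None]
    if not present:
        return "str"  # nothing to go on, so fall back to the safest type
    for name, caster in (("int", int), ("float", float)):
        try:
            for v in present:
                caster(v)
            return name
        except ValueError:
            continue
    return "str"


rows = clean_rows(RAW_CSV)
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}

print(rows)
print(schema)
```

Even on three rows, the sketch has to make judgment calls (which tokens count as missing, what to do with an all-empty column) that a real dataset multiplies many times over, which is the argument for handing the job to a dedicated tool early in development.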