Version control for datasets
Since we took a quick little side journey to discuss data collection, I hope you’ll indulge me once more while we talk about keeping data in a version control system such as Git. A little earlier, we opened a data file and PyCharm immediately complained about its size. By modern standards, an 8 MB file isn’t very big. However, consider that most code files, PyCharm’s raison d’être, are on average well under 100 KB in size. If your code files are much larger than that, it’s a code smell and you should figure out what you’re doing wrong.
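To make that size comparison concrete, here is a minimal sketch that walks a project folder and lists any file larger than a chosen limit. The function name `find_large_files` and the 100 KB default are assumptions for illustration, taken from the rough figure above; they are not something PyCharm or Git provides.

```python
from pathlib import Path

# Assumed limit, based on the rough "well under 100 KB" figure for code files.
TYPICAL_CODE_FILE_LIMIT = 100 * 1024  # bytes


def find_large_files(root, limit=TYPICAL_CODE_FILE_LIMIT):
    """Return (path, size_in_bytes) pairs for files under root bigger than limit.

    Results are sorted from largest to smallest. Anything inside a .git
    folder is skipped, since that is Git's own storage, not project content.
    """
    root = Path(root)
    large = []
    for path in root.rglob("*"):
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                large.append((path, size))
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Calling `find_large_files(".")` from the project root before a commit gives a quick list of files that are likely to be data rather than code.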
With that 8 MB data file, we’re presenting PyCharm with something roughly 80 times the size of the code files it is used to. Git is also primarily used to deal with small files coming out of an IDE. I’m bringing this up because there is something of a reproducibility crisis in the data science and scientific computing community. This is when one data team can extract a specific insight from...