Reducing the size of your data
If you are working directly in Kaggle Notebooks, you will quickly find their limitations annoying and dealing with them a time sink. One of these limitations is the out-of-memory error that stops execution and forces you to restart the script from the beginning. This is quite common in many competitions. However, unlike deep learning competitions based on text or images, where you can retrieve the data from disk in small batches and process them incrementally, most algorithms that work with tabular data require holding all the data in memory.
The most common situation is when you have loaded the data from a CSV file using Pandas' read_csv, but the resulting DataFrame is too large to handle for feature engineering and machine learning in a Kaggle Notebook. The solution is to compress the size of the Pandas DataFrame you are using without losing any information (lossless compression). This can easily be achieved using the following script derived...
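The script itself is not reproduced in this excerpt, but the standard approach it refers to iterates over the DataFrame's numeric columns and downcasts each one to the smallest dtype that can still represent its values. The sketch below is a minimal, illustrative version of that idea, not the original script: the function name reduce_mem_usage and its verbose flag are assumptions made here for the example. Downcasting integer columns is strictly lossless; the float64-to-float32 step is included as an optional, commented trade-off because it technically reduces precision.

```python
import numpy as np
import pandas as pd


def reduce_mem_usage(df, downcast_floats=False, verbose=True):
    """Downcast numeric columns to the smallest dtype that can hold their values."""
    start_mem = df.memory_usage(deep=True).sum() / 1024 ** 2

    for col in df.columns:
        col_type = df[col].dtype

        # Integer columns: pick the narrowest signed integer type that fits
        # the observed min/max. This is lossless.
        if pd.api.types.is_integer_dtype(col_type):
            c_min, c_max = df[col].min(), df[col].max()
            if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)

        # Float columns: optionally convert float64 to float32. This roughly
        # halves memory but is not strictly lossless (precision is reduced).
        elif downcast_floats and pd.api.types.is_float_dtype(col_type):
            df[col] = df[col].astype(np.float32)

    end_mem = df.memory_usage(deep=True).sum() / 1024 ** 2
    if verbose:
        print(f"Memory usage reduced from {start_mem:.2f} MB to {end_mem:.2f} MB "
              f"({100 * (start_mem - end_mem) / start_mem:.1f}% reduction)")
    return df
```

In a notebook you would typically apply it right after loading the data, for example with df = reduce_mem_usage(pd.read_csv("train.csv")), where "train.csv" stands in for whatever competition file you are working with.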