Converting distributed CSV files to a TensorFlow dataset
If you are unsure of the data size, or whether it can all fit in the Python runtime's memory, then reading the data into a pandas DataFrame is not a viable option. In this case, we can use a TF dataset to stream the data directly from storage, without loading it all into memory at once.
Typically, when data is stored in a storage bucket as parts, the naming convention follows a general pattern. This pattern is similar to that of the Hadoop Distributed File System (HDFS), where the data is stored in parts and the complete dataset can be referenced via the wildcard symbol, *.
When storing distributed files in a Google Cloud Storage bucket, a common pattern for filenames is as follows:
<FILE_NAME>-<pattern>-001.csv … <FILE_NAME>-<pattern>-00n.csv
Alternatively, there is the following pattern:
<FILE_NAME>-<pattern>-aa.csv … <FILE_NAME>-<pattern>-zz.csv
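As a sketch of the idea (the directory, shard names, and column names below are hypothetical), the snippet creates a few local CSV shards following the first naming convention and matches them all with a single wildcard using Python's glob module. With TensorFlow installed, the same wildcard string can be passed to tf.data, as noted in the comments, so no single merged file ever has to exist.

```python
import csv
import glob
import os
import tempfile

# Create three CSV shards following the <FILE_NAME>-<pattern>-00n.csv
# convention (hypothetical names and columns, for illustration only).
tmp = tempfile.mkdtemp()
for i in range(1, 4):
    path = os.path.join(tmp, f"train-data-00{i}.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["feature", "label"])
        writer.writerow([i * 0.5, i % 2])

# A single wildcard refers to every shard, so the complete dataset is
# inferred from the pattern rather than assembled into one file.
shards = sorted(glob.glob(os.path.join(tmp, "train-data-*.csv")))
print(len(shards))  # 3

# With TensorFlow installed, the same pattern feeds a streaming dataset
# directly (a gs:// URI works the same way as a local path):
#   import tensorflow as tf
#   ds = tf.data.experimental.make_csv_dataset(
#       os.path.join(tmp, "train-data-*.csv"),
#       batch_size=2, label_name="label", num_epochs=1)
```

Because tf.data reads the matched shards lazily, batch by batch, the full dataset never has to fit in the Python runtime's memory.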