Automating schema inference
The spark-tensorflow-connector library, which integrates Spark with TensorFlow, supports automatic schema inference when reading TensorFlow records into Spark DataFrames. Schema inference is an expensive operation, since it requires an extra read pass over the data, so it is good practice to specify the schema explicitly; skipping the inference pass improves the overall performance of our pipeline.
The following Python code demonstrates how we can do this on some test data that we create for the example:
- Our first step is to define the schema of our data:
from pyspark.sql.types import *

path = "test-output.tfrecord"

fields = [StructField("id", IntegerType()),
          StructField("IntegerCol", IntegerType()),
          StructField("LongCol", LongType()),
          StructField("FloatCol", FloatType()),
          StructField("DoubleCol", DoubleType()),
          StructField("VectorCol", ArrayType(DoubleType(), True))]

schema = StructType(fields)
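With the schema in hand, we can pass it to the reader so the connector skips the inference pass. What follows is a minimal sketch, assuming a running SparkSession named spark, data already written to path, and the spark-tensorflow-connector JAR on the classpath (where it registers the tfrecords data source and accepts a recordType option):

# Reading without a schema triggers inference: an extra pass over the data
# just to derive column names and types.
inferred_df = spark.read.format("tfrecords") \
    .option("recordType", "Example") \
    .load(path)

# Passing the schema we defined above lets the connector skip that pass
# and load the records directly.
explicit_df = spark.read.format("tfrecords") \
    .option("recordType", "Example") \
    .schema(schema) \
    .load(path)

Both reads return the same records; the difference is that the first must scan the data once before loading it, while the second goes straight to loading, which is exactly the savings the explicit schema buys us.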