When we load data into BigQuery, each column of that data is stored separately. The values in each column are compressed, run-length encoded, and encrypted, and the corresponding data file is replicated. Each of these replicas is then stored in the underlying distributed filesystem, known as Colossus.
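To make the idea concrete, here is a minimal Python sketch of storing a table column by column and run-length encoding each column. It is only a toy illustration, not BigQuery's actual storage format: the helper names (`to_columns`, `rle_encode`, `rle_decode`) and the sample rows are hypothetical, and encryption and replication are left out.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def rle_encode(values):
    """Collapse consecutive repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    values = []
    for value, count in pairs:
        values.extend([value] * count)
    return values


# Hypothetical sample data: four rows with two columns each.
rows = [
    {"country": "US", "status": "ok"},
    {"country": "US", "status": "ok"},
    {"country": "US", "status": "error"},
    {"country": "DE", "status": "ok"},
]

# Each column becomes its own list, which is then encoded independently.
columns = to_columns(rows)
encoded = {name: rle_encode(values) for name, values in columns.items()}

for name, pairs in encoded.items():
    print(name, pairs)
```

Because all values of a column sit next to each other, repeated values form runs that encode compactly, which is one reason a column-oriented layout tends to compress well.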
This peculiar representation (columnar, compressed, and replicated) explains a couple of features of BigQuery that might otherwise seem odd:
- BigQuery does not support indices: This makes it very different from a traditional RDBMS, but it makes sense given that each column's data is effectively stored separately anyway, in a representation not that different from many indices.
- Queries cost more for each column they pull in: This also makes sense if you consider that each additional column requires access to a different file in the underlying file...