This article is an excerpt from the book, Data Modeling with Snowflake, by Serge Gershkovich. Discover how Snowflake's unique objects and features can be used to leverage universal modeling techniques through real-world examples and SQL recipes.
In the era of big data, the ability to efficiently manage and analyze semi-structured data is crucial for businesses. Snowflake, a leading cloud-based data platform, offers robust features to handle semi-structured data formats like JSON, Avro, and Parquet. This article explores the benefits of using the VARIANT data type in Snowflake and provides a hands-on guide to managing semi-structured data.
Semi-structured data formats are popular due to their flexibility when working with dynamically varying information. Unlike relational schemas, where a precise entity structure must be predefined, semi-structured data can adapt to include or omit attributes as needed, as long as they are properly nested within corresponding parent objects.
For example, consider the contact list on your phone. It contains a list of people and their contact details but does not capture those details uniformly. Some contacts may have multiple phone numbers, while others have only one. Some entries might include an email address and street address, while others have just a number and a vague description.
To handle this type of data, Snowflake uses the VARIANT data type, which allows semi-structured data to be stored as a column in a relational table. Snowflake optimizes how VARIANT data is stored internally, ensuring better compression and faster access. Semi-structured data can sit alongside relational data in the same table, and users can access it using basic extensions to standard SQL, achieving similar performance.
Another compelling reason to use the VARIANT data type is its adaptability to change. If columns are added or removed from semi-structured data, there is no need to modify ELT (extract, load, and transform) pipelines. The VARIANT data type does not require schema changes, and read operations will not fail for an attribute that no longer exists.
Let's delve into a practical example of working with semi-structured data in Snowflake. This example uses JSON data representing information about pirates, such as details about the crew, weapons, and their ship. All this information is stored in a single VARIANT data type column. In relational data, a row represents a single entity; in semi-structured data, a row can represent an entire file containing multiple entities.
Here is a sample SQL script to create a table with semi-structured data:
CREATE TABLE pirates_data (
id NUMBER AUTOINCREMENT PRIMARY KEY,
load_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
data VARIANT
);
```
In this example, the `AUTOINCREMENT` keyword generates a unique ID for each record inserted, and the `VARIANT` column stores the semi-structured JSON data.
To load semi-structured data into Snowflake, you can use the `COPY INTO` command. Here’s an example of how to load JSON data from an external stage into the `pirates_data` table:
COPY INTO pirates_data
FROM @my_stage/pirates_data.json
FILE_FORMAT = (TYPE = 'JSON');
```
Once the data is loaded, you can query it using standard SQL. For instance, to extract specific attributes from the JSON data, you can use the dot notation:
SELECT
data:id::NUMBER AS pirate_id,
data:crew AS crew,
data:weapons AS weapons
FROM pirates_data;
```
This query extracts the `id`, `crew`, and `weapons` fields from the JSON data stored in the `data` column.
Although semi-structured data offers flexibility, converting it into a relational format can provide better performance for certain queries. Snowflake allows you to transform VARIANT data into relational columns using the `FLATTEN` function.
Here's an example of how to flatten a JSON array into a relational table:
SELECT
value:id::NUMBER AS pirate_id,
value:name::STRING AS name,
value:rank::STRING AS rank
FROM pirates_data,
LATERAL FLATTEN(input => data:crew);
```
This query converts the `crew` array from the JSON data into individual rows in a relational format, making it easier to query and analyze.
One of the main advantages of using the VARIANT data type in Snowflake is the flexibility of schema-on-read. This approach allows you to ingest data without a predefined schema, and then define the schema at the time of reading the data. This contrasts with the traditional schema-on-write approach, where the schema must be defined before data ingestion.
Here’s an example demonstrating schema-on-read with semi-structured data in Snowflake:
SELECT
data:id::NUMBER AS pirate_id,
data:ship.name::STRING AS ship_name,
data:ship.type::STRING AS ship_type
FROM pirates_data;
```
In this query, the schema is defined at read time, allowing you to extract specific attributes from the nested JSON data.
Snowflake’s support for semi-structured data also extends to handling nested and repeated data structures. The FLATTEN function is particularly useful for working with such data, enabling you to transform nested arrays into a more manageable relational format.
Consider a JSON structure where each pirate has a nested array of previous voyages. To flatten this nested data, you can use the following query:
SELECT
data:id::NUMBER AS pirate_id,
value:date::DATE AS voyage_date,
value:destination::STRING AS voyage_destination
FROM pirates_data,
LATERAL FLATTEN(input => data:previous_voyages);
```
This query extracts the nested `previous_voyages` array and converts it into individual rows in a relational format.
When working with semi-structured data in Snowflake, it’s important to consider performance implications. While the VARIANT data type offers flexibility, it can also introduce overhead if not managed properly.
Handling semi-structured data in Snowflake using the VARIANT data type provides immense flexibility and performance benefits. Whether you are dealing with dynamically changing schemas or integrating semi-structured data with relational data, Snowflake’s capabilities can significantly enhance your data management and analytics workflows. By leveraging the techniques outlined in this article, you can efficiently model and transform semi-structured data, unlocking new insights and value for your organization.
For more detailed guidance and advanced techniques, refer to the book "Data Modeling with Snowflake," which provides comprehensive insights into modern data modeling practices and Snowflake’s powerful features.
Serge Gershkovich is a seasoned data architect with decades of experience designing and maintaining enterprise-scale data warehouse platforms and reporting solutions. He is a leading subject matter expert, speaker, content creator, and Snowflake Data Superhero. Serge earned a bachelor of science degree in information systems from the State University of New York (SUNY) Stony Brook. Throughout his career, Serge has worked in model-driven development from SAP BW/HANA to dashboard design to cost-effective cloud analytics with Snowflake. He currently serves as product success lead at SqlDBM, an online database modeling tool.