Importing files into Delta Lake is straightforward; data can be loaded either from files or from a streaming source. The steps for this recipe are as follows:
- In Databricks, open the data panel by clicking on the Data button, click on the Add Data button, and drag your file into the Upload section.
- Click on Create Table in Notebook. The code generated for you will start with this:
# File location and type
file_location = "/FileStore/tables/soilmoisture_dataset.csv"
file_type = "csv"
# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","
df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)

display(df)
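If your CSV file has a header row and you want Spark to infer the column types, you can flip the generated options before reviewing the data. A minimal sketch, reusing the variables from the generated code:
# Treat the first row as column names and let Spark infer column types
infer_schema = "true"
first_row_is_header = "true"

df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)

display(df)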
- Review the data and, when you are ready to save it to Delta Lake, uncomment the last line of the generated code:
# df.write.format("parquet").saveAsTable(permanent_table_name)
- Then, change "parquet" to "delta":
df.write.format("delta").saveAsTable(permanent_table_name)
- From here, query the data:
%sql
SELECT * FROM soilmoisture
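The same query can also be run from a Python cell; this sketch assumes the table was saved as soilmoisture:
# Equivalent query from a Python cell
soil_df = spark.sql("SELECT * FROM soilmoisture")
display(soil_df)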
- Alternatively, you can optimize how Delta Lake stores the table: OPTIMIZE compacts the underlying files, and ZORDER BY co-locates records by the given column (here deviceid), which speeds up queries that filter on it:
%sql
OPTIMIZE soilmoisture ZORDER BY (deviceid)
Delta Lake data can be updated, filtered, and aggregated. In addition, it can be turned into a Spark or Koalas DataFrame easily.
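For example, the saved table can be read back as a Spark DataFrame and converted to Koalas for a pandas-like API. This is a minimal sketch, assuming the table was saved as soilmoisture and the Koalas library is available on the cluster:
import databricks.koalas as ks

# Read the Delta table back as a Spark DataFrame
spark_df = spark.table("soilmoisture")

# Convert it to a Koalas DataFrame for a pandas-like API
koalas_df = spark_df.to_koalas()

# Or read the table directly into Koalas
koalas_df = ks.read_table("soilmoisture")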