Sunday 19 March 2023

Synapse - Using Spark SQL to time travel Delta Tables

Case
In a previous blog post we showed how to create and query Delta Tables with PySpark for a Lakehouse. However, most Data Warehouse people are more familiar with the SQL language. How can you query a Delta Table with the good old SQL language?
Using Spark SQL to time travel Delta Tables

Solution
In that previous blog post we also showed that you can query Delta Tables in, for example, a Serverless SQL pool by creating External Tables on top of them. This allows you to use TSQL to query Delta Tables, but it doesn't allow you to time travel: you always get the latest version of the data.
External Tables on Delta in Serverless SQL Pool

However, we can use Synapse Notebooks with Spark SQL, a language very similar to TSQL, to query Delta Tables. This allows you to time travel through the data in a familiar language.

1) Add Delta Table to Lake Database
To easily query Delta Tables, you first need to make them visible in Synapse by adding them to the Lake Database. We explained this in the previous blog post.
Adding Delta Table to Lake Database
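As a minimal sketch of that step (assuming the silver database already exists and using the example Data Lake path from this post), you can register the Delta Table in the Lake Database with a CREATE TABLE statement pointing at its location:
%%sql
CREATE TABLE IF NOT EXISTS silver.cities
USING DELTA
LOCATION 'abfss://mysource@mydatalake.dfs.core.windows.net/silver/cities'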

Once the Delta Table is available in the Lake Database, you can query it like a regular table. By default you will see the latest version of the data.
%%sql
SELECT * FROM silver.cities
The alternative is to use the entire path:
%%sql
SELECT * FROM delta.`abfss://mysource@mydatalake.dfs.core.windows.net/silver/cities`
2) Show historical versions
You can check which historical versions are available with the DESCRIBE HISTORY command.
%%sql
DESCRIBE HISTORY silver.cities
Show versions of the Delta Table

Besides showing the history, you can also retrieve various details of the Delta Table, such as where it is stored in your Data Lake, with the DESCRIBE EXTENDED command.
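For example, to see the details of the cities table from this post:
%%sql
DESCRIBE EXTENDED silver.cities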
See details of Delta Table

3) Show specific version by version number
The DESCRIBE HISTORY command returns a table with the various versions of your table. The first column shows the version number, which starts at 0 for the initial version of the table.

When you query a Delta Table, you can append VERSION AS OF X to the query, where you replace X with the version number. In this example we take version 2 (the third version of the table).
%%sql
SELECT * FROM silver.cities VERSION AS OF 2
Showing version 2 of the Delta Table

4) Show specific version by date
Time traveling with a specific version number is cumbersome, because you first need to determine which version you need. Luckily, you can also get the version that was active on a specific date by appending TIMESTAMP AS OF "2022-01-01" to the query.
%%sql
SELECT * FROM silver.cities TIMESTAMP AS OF "2022-01-01"
Showing version of a specific date
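Note that time travel should also work with the full path syntax shown earlier (assuming a recent Delta Lake version that supports this SQL syntax); a sketch using the example path from this post:
%%sql
SELECT * FROM delta.`abfss://mysource@mydatalake.dfs.core.windows.net/silver/cities` TIMESTAMP AS OF "2022-01-01"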

Conclusions
In this post you learned how to time travel a Delta Table with Spark SQL. These are the same options as with PySpark, but for some people a little bit more readable. In a next post we will discuss the Change Data Feed to get the data changes between versions.
