How can I query my lake house files and tables in Synapse without specifying the entire data lake path for each file or table?
Solution
In the Data tab (left menu) of the Synapse Workspace you can create a Lake database and then add your files and tables to it. By default there is already a Lake database called 'default', but it won't be visible until you add tables to it or create other databases.
1) Create database
Let's create some new databases: one each for Bronze, Silver and Gold. For the files we will also create a separate database called Ingest for easy querying. The documentation mentions either a CREATE DATABASE or a CREATE SCHEMA command; they do the same thing.
- Create a new notebook with either Spark SQL or PySpark as the language.
- Attach it to a Spark pool.
- In the first code block, add the code below and execute it. The IF NOT EXISTS and the COMMENT clauses are optional.
%%sql
CREATE DATABASE IF NOT EXISTS Ingest COMMENT 'Raw files';
CREATE DATABASE IF NOT EXISTS Bronze COMMENT 'Raw Layer';
CREATE DATABASE IF NOT EXISTS Silver COMMENT 'Validated Layer';
CREATE DATABASE IF NOT EXISTS Gold COMMENT 'Enriched Layer';
%%pyspark
spark.sql("CREATE DATABASE IF NOT EXISTS Ingest COMMENT 'Raw files'")
spark.sql("CREATE DATABASE IF NOT EXISTS Bronze COMMENT 'Raw Layer'")
spark.sql("CREATE DATABASE IF NOT EXISTS Silver COMMENT 'Validated Layer'")
spark.sql("CREATE DATABASE IF NOT EXISTS Gold COMMENT 'Enriched Layer'")
Now go to the Data tab and you will see a Lake database list with five databases (default plus the four you just created).
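If you prefer to verify this from the notebook itself, you can also list the databases with a quick query (a minimal sketch; spark is the session the Synapse notebook provides):

%%pyspark
# List all Lake databases visible to the Spark pool
spark.sql("SHOW DATABASES").show()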
2) Create table on Parquet file
For the raw files we will create a table based on the Parquet (or flat) files that your ingestion process landed in the data lake. If the file has a timestamp in its name, you can even use a wildcard in the path.
- Go back to your notebook and create a second code block.
- Add the code below and execute it. It will create a table in the Ingest Lake database.
- Then go to the Data tab and unfold the Ingest database and its tables (if you are too fast, you might have to refresh the tables list).
%%sql
CREATE TABLE IF NOT EXISTS Ingest.Cities
USING PARQUET
LOCATION 'abfss://mysource@mydatalake.dfs.core.windows.net/Ingestfolder/Cities*.parquet'
%%pyspark
spark.sql(f"CREATE TABLE IF NOT EXISTS Ingest.{table_name} USING PARQUET LOCATION '{parquet_path}'")
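The PySpark variant above expects the variables table_name and parquet_path to be filled in. Below is a minimal sketch of how you could loop over several source files at once; the Countries entry is a hypothetical example that just follows the same path pattern:

%%pyspark
# Hypothetical mapping of table names to their Parquet locations in the data lake
tables = {
    "Cities":    "abfss://mysource@mydatalake.dfs.core.windows.net/Ingestfolder/Cities*.parquet",
    "Countries": "abfss://mysource@mydatalake.dfs.core.windows.net/Ingestfolder/Countries*.parquet",
}

# Create a table in the Ingest Lake database for each entry
for table_name, parquet_path in tables.items():
    spark.sql(f"CREATE TABLE IF NOT EXISTS Ingest.{table_name} USING PARQUET LOCATION '{parquet_path}'")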
3) Create table on Delta Table
For the Bronze (or Silver or Gold) layer we will create a table based on an existing Delta table in the data lake.
- Go back to your notebook and create a third code block.
- Add the code below and execute it. It will create a table in the Bronze Lake database.
- Then go to the Data tab and unfold the Bronze (or Silver or Gold) database and its tables (if you are too fast, you might have to refresh the tables list).
%%sql
CREATE TABLE IF NOT EXISTS Bronze.Cities
USING DELTA
LOCATION 'abfss://mysource@mydatalake.dfs.core.windows.net/Bronze/Cities'
%%pyspark
spark.sql(f"CREATE TABLE IF NOT EXISTS Bronze.{table_name} USING DELTA LOCATION '{delta_table_path}'")
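To check that the new table really points at the Delta folder in the data lake, you can describe it (a minimal sketch):

%%pyspark
# Show the table's provider (delta) and its location in the data lake
spark.sql("DESCRIBE TABLE EXTENDED Bronze.Cities").show(truncate=False)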
4) Query the new tables
Now you can query the files and Delta tables like regular database tables, either in a notebook with a Spark pool running or in a SQL script. There is one difference between the two: in a SQL script you get the default dbo schema between the database name and the table name. This is mandatory in the SQL script, but not allowed in a notebook.
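For example, in a notebook you would query the Bronze table without a schema (the same query in a SQL script would reference Bronze.dbo.Cities instead):

%%pyspark
# Notebook style: no dbo schema between database and table name
spark.sql("SELECT * FROM Bronze.Cities LIMIT 10").show()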
5) Create views on new tables
You can also create views on those new tables in which you already add some business logic that is useful for other colleagues working with this data, or you can create views for dimensions and facts. You could even create views for dimensions and facts on your Silver tables and then serve them to Power BI.
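As a minimal sketch of such a view, assuming hypothetical column names (City, Country) in the Silver table; replace the SELECT with your own business logic:

%%pyspark
# Hypothetical dimension view on the Silver layer; the columns and filter are placeholders
spark.sql("""
CREATE VIEW IF NOT EXISTS Gold.DimCity AS
SELECT City
,      Country
FROM   Silver.Cities
WHERE  Country IS NOT NULL
""")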
Conclusion
In this post you learned how to make your data lake files and Delta Lake tables easier to query. It takes a few extra steps for each table, but after that querying is much easier. In a future post we will show you how to do this directly during the INGEST and DELTA steps, so you don't have any manual steps for each new file or table.
In the next post we will show you how to use time travel on those Delta tables with Spark SQL. That post will use this one as its starting point.
Special thanks to colleague Martijn Broeks for helping out.