Switching SAS Tokens Mid-Script With Spark Dataframes

aockenden
New Contributor III

Hey all, my team has settled on using directory-scoped SAS tokens to provision access to data in our Azure Gen2 Datalakes. However, we have encountered an issue when switching from a first SAS token (which is used to read a first parquet table in the datalake into a Spark DF) to a second SAS token (which is used to read a second parquet table in the datalake into a Spark DF). Code example is below:

 

import json
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("test-app").getOrCreate()

datalake_name = "STORAGE ACCOUNT NAME"
container_name = "CONTAINER NAME"

#################  Common Configs  ###########################
spark_session.conf.set(f"fs.azure.account.auth.type.{datalake_name}.dfs.core.windows.net", "SAS")
spark_session.conf.set(f"fs.azure.sas.token.provider.type.{datalake_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")

#################  Dataframe 1 Read  #########################
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "FIRST SAS TOKEN")
target_file_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_1/test_parquet_data"
df1 = spark_session.read.format("parquet").load(target_file_path)

#################  Dataframe 2 Read  #########################
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "SECOND SAS TOKEN")
target_file_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_2/test_parquet_data"
df2 = spark_session.read.format("parquet").load(target_file_path)

df3 = df2.join(df1)
df3.show()

 

 

Both SAS tokens are valid, unexpired, and carry the necessary permissions. The two tokens are directory-scoped on folder_1 and folder_2 in the datalake respectively. The last line of the code above (df3.show()) fails with a permission error.

I'm guessing this is because the data in the datalake is not actually retrieved into cluster memory until the .show() action is executed on the joined df3 object. By that point, the fs.azure.sas.fixed.token Spark config has been switched to a token which doesn't have permission to access the data in folder_1, so actually executing the join (in service of the .show() call) throws an access error.

If this explanation is correct, are there any workarounds? What is theoretically or structurally flawed about the approach used in the code above? How should things be done differently? Is there a way to configure the Spark session to accept multiple SAS tokens simultaneously?

Thanks for any help you can provide.

Alex

3 REPLIES

aockenden
New Contributor III

Bump

Kaniz
Community Manager

Hi @aockenden, The data in the Data Lake is not actually retrieved into cluster memory by the Spark dataframes until an action (like .show()) is executed. At this point, the fs.azure.sas.fixed.token Spark configuration setting has been switched to a token which doesn’t have permission to access the data in folder_1, hence the access error.

 

As for configuring the Spark session to accept multiple SAS tokens simultaneously, it is indeed possible to create as many SAS tokens as you would like by using different combinations of permissions and scopes. The SAS token is a string that you generate on the client side, for example by using one of the Azure Storage client libraries.

 

However, the challenge here is that the Spark configuration setting fs.azure.sas.fixed.token is global and applies to the entire Spark session. When you set this configuration a second time, it overwrites the first SAS token. This is why you’re seeing the permission error when trying to access data from folder_1 after setting the SAS token for folder_2.

 

One potential workaround could be to read all the necessary data into memory before switching the SAS token. However, this might not be feasible if the data is too large to fit into memory.
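For illustration, here is a minimal sketch of that workaround, reusing the variable names from the code in the original question: df1 is forced to materialize (here with DataFrame.localCheckpoint, which caches the data on the executors and truncates the lineage back to the source) while the first token is still active, so the later join no longer needs to re-read folder_1. Whether this is practical depends on the size of folder_1 relative to the cluster's resources.

#################  Dataframe 1 Read (materialized before the token switch)  #########################
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "FIRST SAS TOKEN")
df1 = spark_session.read.format("parquet").load(
    f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_1/test_parquet_data"
)
df1 = df1.localCheckpoint(eager=True)  # materializes df1 now, while the first token is still valid for folder_1

#################  Dataframe 2 Read  #########################
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "SECOND SAS TOKEN")
df2 = spark_session.read.format("parquet").load(
    f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_2/test_parquet_data"
)

df2.join(df1).show()  # folder_2 is read with the second token; df1 comes from the checkpointed copy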

 

Another approach could be to use an Azure service principal to connect to Azure Storage. Service principals are application identities in Azure Active Directory that can be used to automate access to Azure resources. This might provide a more flexible way to manage access to your Data Lake storage.
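For reference, below is a minimal sketch of the service-principal (OAuth) configuration for ABFS, using the same datalake_name and container_name variables as the original code. The client id, client secret and tenant id are placeholders for a hypothetical app registration (in practice the secret would come from a secret scope rather than being hard-coded), and the principal still needs RBAC roles or ACLs on the paths it reads.

# Hypothetical service principal credentials -- replace with your own app registration values
client_id = "APPLICATION (CLIENT) ID"
client_secret = "CLIENT SECRET"   # in practice, fetch this from a secret scope / key vault
tenant_id = "DIRECTORY (TENANT) ID"

spark_session.conf.set(f"fs.azure.account.auth.type.{datalake_name}.dfs.core.windows.net", "OAuth")
spark_session.conf.set(f"fs.azure.account.oauth.provider.type.{datalake_name}.dfs.core.windows.net",
                       "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark_session.conf.set(f"fs.azure.account.oauth2.client.id.{datalake_name}.dfs.core.windows.net", client_id)
spark_session.conf.set(f"fs.azure.account.oauth2.client.secret.{datalake_name}.dfs.core.windows.net", client_secret)
spark_session.conf.set(f"fs.azure.account.oauth2.client.endpoint.{datalake_name}.dfs.core.windows.net",
                       f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# A single identity now covers both folders, so no token switching is needed
df1 = spark_session.read.parquet(f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_1/test_parquet_data")
df2 = spark_session.read.parquet(f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_2/test_parquet_data")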

aockenden
New Contributor III

Thanks for the reply.

"One potential workaround could be to read all the necessary data into memory before switching the SAS token. However, this might not be feasible if the data is too large to fit into memory." - Of course we could do this, but this totally defeats the purpose / power of Spark which is designed to help deal with huge datasets WITHOUT needing enormous amounts of RAM on your machine / cluster to store it all in memory.

Regarding the service principals: since that is an Active-Directory-based access paradigm, I am guessing you are then beholden to either assigning ACLs to files and folders in the datalake or granting RBAC roles to control the level of access in the lake?
