Hey all, my team has settled on using directory-scoped SAS tokens to provision access to data in our Azure Data Lake Storage Gen2 accounts. However, we've hit an issue when switching from a first SAS token (used to read one parquet table in the datalake into a Spark DataFrame) to a second SAS token (used to read a second parquet table into another Spark DataFrame). A code example is below:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("test-app").getOrCreate()
datalake_name = "STORAGE ACCOUNT NAME"
container_name = "CONTAINER NAME"
################# Common Configs ###########################
spark_session.conf.set(f"fs.azure.account.auth.type.{datalake_name}.dfs.core.windows.net", "SAS")
spark_session.conf.set(f"fs.azure.sas.token.provider.type.{datalake_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
################ Dataframe 1 Read ##############################
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "FIRST SAS TOKEN")
target_file_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_1/test_parquet_data"
df1 = spark_session.read.format("parquet").load(target_file_path)
################# Dataframe 2 Read ###############################
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "SECOND SAS TOKEN")
target_file_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_2/test_parquet_data"
df2 = spark_session.read.format("parquet").load(target_file_path)
df3 = df2.join(df1)  # no join condition, so this is a cross join; it's only here to force both tables to be read
df3.show()
Both SAS tokens are valid: correct permission sets, unexpired, etc. The two tokens are directory-scoped to folder_1 and folder_2 in the datalake respectively. The last line of the code above (df3.show()) fails with a permission error.
I'm guessing this is down to lazy evaluation: the data isn't actually pulled from the datalake into cluster memory until .show() is executed on the joined df3 object. By that point the fs.azure.sas.fixed.token setting has been switched to the second token, which has no permission on folder_1, so actually executing the join (in service of .show()) throws an access error.
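If that theory is right, then forcing each DataFrame to fully materialize while its matching token is still the active conf ought to sidestep the problem. Here's the kind of workaround I'm considering; I'm not certain that cache() guarantees Spark won't go back to the source files later (e.g. if cached partitions are evicted), so this feels fragile:
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "FIRST SAS TOKEN")
df1 = spark_session.read.format("parquet").load(
    f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_1/test_parquet_data"
)
df1.cache()
df1.count()  # eager action: folder_1 is read while the first token is still active
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "SECOND SAS TOKEN")
df2 = spark_session.read.format("parquet").load(
    f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_2/test_parquet_data"
)
df2.cache()
df2.count()  # folder_2 is read while the second token is active
df3 = df2.join(df1)  # same cross join as above, now (hopefully) served from the cached data
df3.show()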
If this explanation is correct, are there any workarounds? What about the approach used in the code above is theoretically or structurally flawed, and how should it be done differently? Is there a way to configure the Spark session to accept multiple SAS tokens simultaneously?
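To make that last question concrete, what I'm imagining is something like the following. The container/directory-scoped config keys below are made up purely to illustrate the idea; as far as I know, no such keys actually exist in hadoop-azure:
# Hypothetical config keys, for illustration only - NOT real hadoop-azure settings
spark_session.conf.set(f"fs.azure.sas.fixed.token.folder_1.{container_name}.{datalake_name}.dfs.core.windows.net", "FIRST SAS TOKEN")
spark_session.conf.set(f"fs.azure.sas.fixed.token.folder_2.{container_name}.{datalake_name}.dfs.core.windows.net", "SECOND SAS TOKEN")
i.e. registering both tokens up front, scoped to their directories, so Spark can pick the right one at read time.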
Thanks for any help you can provide.
Alex