Switching SAS Tokens Mid-Script With Spark Dataframes

aockenden
New Contributor III

Hey all, my team has settled on using directory-scoped SAS tokens to provision access to data in our Azure Gen2 datalakes. However, we have run into an issue when switching from a first SAS token (used to read a first parquet table in the datalake into a Spark DataFrame) to a second SAS token (used to read a second parquet table into another Spark DataFrame). A code example is below:

 

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("test-app").getOrCreate()

datalake_name = "STORAGE ACCOUNT NAME"
container_name = "CONTAINER NAME"

#################  Common Configs  ###########################
spark_session.conf.set(f"fs.azure.account.auth.type.{datalake_name}.dfs.core.windows.net", "SAS")
spark_session.conf.set(f"fs.azure.sas.token.provider.type.{datalake_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")

#################  Dataframe 1 Read  #########################
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "FIRST SAS TOKEN")
target_file_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_1/test_parquet_data"
df1 = spark_session.read.format("parquet").load(target_file_path)

#################  Dataframe 2 Read  #########################
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "SECOND SAS TOKEN")
target_file_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_2/test_parquet_data"
df2 = spark_session.read.format("parquet").load(target_file_path)

df3 = df2.join(df1)  # no join condition: a cross join, for demonstration only
df3.show()

 

 

Both SAS tokens are valid: they are unexpired and carry the required permissions, and they are directory-scoped on folder_1 and folder_2 in the datalake respectively. The last line of the code above (df3.show()) fails with a permission error.

I'm guessing this is because Spark evaluates lazily: the data is not actually retrieved from the datalake into cluster memory until .show() is executed on the joined df3 object. By that point, the fs.azure.sas.fixed.token setting has been switched to a token that has no permission on folder_1, so actually executing the join (in service of .show()) throws an access error.
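If that diagnosis is right, one way to pin the first read down before the token changes is to force an eager action on df1. This is only a minimal sketch of the idea (cache() plus count()), not something we have validated at scale:

# Force df1 to materialize while the FIRST SAS token is still configured.
# cache() marks the DataFrame for storage; count() is an eager action that
# actually performs the read from the datalake.
df1 = spark_session.read.format("parquet").load(target_file_path)
df1.cache()
df1.count()

# Only now switch the token; the join can read df1 from the cache instead
# of going back to folder_1 in the datalake.
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "SECOND SAS TOKEN")

The caveat is that if cached partitions are evicted under memory pressure, Spark recomputes them from the source, and the permission error comes back.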

If this explanation is correct, are there any workarounds? What about the approach in the code above is theoretically or structurally flawed, and how should things be done differently? Is there a way to configure the Spark session to accept multiple SAS tokens simultaneously?

Thanks for any help you can provide.

Alex

2 REPLIES

aockenden
New Contributor III

Bump

Thanks for the reply.

"One potential workaround could be to read all the necessary data into memory before switching the SAS token. However, this might not be feasible if the data is too large to fit into memory." - Of course we could do this, but this totally defeats the purpose / power of Spark which is designed to help deal with huge datasets WITHOUT needing enormous amounts of RAM on your machine / cluster to store it all in memory.

Regarding the Service Principals: as that is an Active Directory-based access paradigm, I'm guessing we would then be beholden to either assigning ACLs to files and folders in the datalake, or granting RBAC roles to control the level of access in the lake?
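For reference, my understanding is that the service-principal route would look roughly like the following (standard hadoop-azure OAuth configs; the client ID, secret, and tenant values are placeholders):

# Sketch: authenticating via a service principal (OAuth) instead of SAS.
# Access is then governed by RBAC roles and/or ADLS ACLs, not by tokens.
spark_session.conf.set(f"fs.azure.account.auth.type.{datalake_name}.dfs.core.windows.net", "OAuth")
spark_session.conf.set(f"fs.azure.account.oauth.provider.type.{datalake_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark_session.conf.set(f"fs.azure.account.oauth2.client.id.{datalake_name}.dfs.core.windows.net", "APPLICATION (CLIENT) ID")
spark_session.conf.set(f"fs.azure.account.oauth2.client.secret.{datalake_name}.dfs.core.windows.net", "CLIENT SECRET")
spark_session.conf.set(f"fs.azure.account.oauth2.client.endpoint.{datalake_name}.dfs.core.windows.net", "https://login.microsoftonline.com/TENANT_ID/oauth2/token")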
