Switching SAS Tokens Mid-Script With Spark Dataframes

aockenden
New Contributor III

Hey all, my team has settled on using directory-scoped SAS tokens to provision access to data in our Azure Gen2 Datalakes. However, we have encountered an issue when switching from a first SAS token (which is used to read a first parquet table in the datalake into a Spark DF) to a second SAS token (which is used to read a second parquet table in the datalake into a Spark DF). Code example is below:

 

import json
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("test-app").getOrCreate()

datalake_name = "STORAGE ACCOUNT NAME"
container_name = "CONTAINER NAME"

#################  Common Configs  ###########################
spark_session.conf.set(f"fs.azure.account.auth.type.{datalake_name}.dfs.core.windows.net", "SAS")
spark_session.conf.set(f"fs.azure.sas.token.provider.type.{datalake_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")

#################  Dataframe 1 Read  #########################
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "FIRST SAS TOKEN")
target_file_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_1/test_parquet_data"
df1 = spark_session.read.format("parquet").load(target_file_path)

#################  Dataframe 2 Read  #########################
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "SECOND SAS TOKEN")
target_file_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_2/test_parquet_data"
df2 = spark_session.read.format("parquet").load(target_file_path)

df3 = df2.join(df1)
df3.show()

 

 

Both SAS tokens are valid, unexpired, and carry the necessary permissions. The two tokens are directory-scoped on folder_1 and folder_2 in the datalake respectively. The last line of the code above (df3.show()) fails with a permission error.

I'm guessing this is because the data in the datalake is not actually retrieved into cluster memory until the .show() action is executed on the joined df3 object. By that point, the fs.azure.sas.fixed.token Spark config has been switched to a token which doesn't have permission to access the data in folder_1, so actually executing the join (in service of the .show() call) throws an access error.

If this explanation is correct, are there any workarounds? What is theoretically or structurally flawed about the approach used in the code above? How should things be done differently? Is there a way to configure the Spark session to accept multiple SAS tokens simultaneously?

Thanks for any help you can provide.

Alex

3 REPLIES

aockenden
New Contributor III

Bump

Kaniz
Community Manager

Hi @aockenden, The data in the Data Lake is not actually retrieved into cluster memory by the Spark dataframes until an action (like .show()) is executed. At this point, the fs.azure.sas.fixed.token Spark configuration setting has been switched to a token which doesn’t have permission to access the data in folder_1, hence the access error.

 

As for configuring the Spark session to accept multiple SAS tokens simultaneously, it is indeed possible to create as many SAS tokens as you would like by using different combinations of permissions and scopes. The SAS token is a string that you generate on the client side, for example by using one of the Azure Storage client libraries.

 

However, the challenge here is that the Spark configuration setting fs.azure.sas.fixed.token is global and applies to the entire Spark session. When you set this configuration a second time, it overwrites the first SAS token. This is why you’re seeing the permission error when trying to access data from folder_1 after setting the SAS token for folder_2.

 

One potential workaround could be to read all the necessary data into memory before switching the SAS token. However, this might not be feasible if the data is too large to fit into memory.
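For illustration, here is a minimal sketch of that workaround, reusing the variable names from the code in the original question: df1 is forced to materialize (here with DataFrame.localCheckpoint, which caches the data on the executors and truncates the lineage back to the source) while the first token is still active, so the later join no longer needs to re-read folder_1. Whether this is practical depends on the size of folder_1 relative to the cluster's resources.

#################  Dataframe 1 Read (materialized before the token switch)  #########################
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "FIRST SAS TOKEN")
df1 = spark_session.read.format("parquet").load(
    f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_1/test_parquet_data"
)
df1 = df1.localCheckpoint(eager=True)  # materializes df1 now, while the first token is still valid for folder_1

#################  Dataframe 2 Read  #########################
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "SECOND SAS TOKEN")
df2 = spark_session.read.format("parquet").load(
    f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_2/test_parquet_data"
)

df2.join(df1).show()  # folder_2 is read with the second token; df1 comes from the checkpointed copy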

 

Another approach could be to use an Azure service principal to connect to Azure Storage. Service principals are application identities in Azure Active Directory that can be used to automate access to Azure resources. This might provide a more flexible way to manage access to your Data Lake storage.
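For reference, below is a minimal sketch of the service-principal (OAuth) configuration for ABFS, using the same datalake_name and container_name variables as the original code. The client id, client secret and tenant id are placeholders for a hypothetical app registration (in practice the secret would come from a secret scope rather than being hard-coded), and the principal still needs RBAC roles or ACLs on the paths it reads.

# Hypothetical service principal credentials -- replace with your own app registration values
client_id = "APPLICATION (CLIENT) ID"
client_secret = "CLIENT SECRET"   # in practice, fetch this from a secret scope / key vault
tenant_id = "DIRECTORY (TENANT) ID"

spark_session.conf.set(f"fs.azure.account.auth.type.{datalake_name}.dfs.core.windows.net", "OAuth")
spark_session.conf.set(f"fs.azure.account.oauth.provider.type.{datalake_name}.dfs.core.windows.net",
                       "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark_session.conf.set(f"fs.azure.account.oauth2.client.id.{datalake_name}.dfs.core.windows.net", client_id)
spark_session.conf.set(f"fs.azure.account.oauth2.client.secret.{datalake_name}.dfs.core.windows.net", client_secret)
spark_session.conf.set(f"fs.azure.account.oauth2.client.endpoint.{datalake_name}.dfs.core.windows.net",
                       f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# A single identity now covers both folders, so no token switching is needed
df1 = spark_session.read.parquet(f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_1/test_parquet_data")
df2 = spark_session.read.parquet(f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_2/test_parquet_data")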

aockenden
New Contributor III

Thanks for the reply.

"One potential workaround could be to read all the necessary data into memory before switching the SAS token. However, this might not be feasible if the data is too large to fit into memory." - Of course we could do this, but this totally defeats the purpose / power of Spark which is designed to help deal with huge datasets WITHOUT needing enormous amounts of RAM on your machine / cluster to store it all in memory.

Regarding the service principals: since that is an Active-Directory-based access paradigm, I am guessing you are then beholden to either assigning ACLs to files and folders in the datalake or granting RBAC roles to control the level of access in the lake?
