Construct DataFrame or RDD from S3 bucket with Delta tables

Ovi
New Contributor III

Hi all!

I have an S3 bucket containing Delta tables (parquet files/folders), each with a different schema. I need to create an RDD or DataFrame from all those Delta tables that contains the path, name, and schema of each one.

How could I do that?

Thank you!

PS: I need this so I can compare each table's Delta schema with the Avro schema of the same (or at least similar) tables from another S3 bucket.

5 REPLIES

Debayan
Esteemed Contributor III

Hi @Ovidiu Eremia, DataFrameReader options let you create a DataFrame from a Delta table pinned to a specific version of the table, for example in Python:

# Delta time travel: read the registered table as it was at the given timestamp.
df1 = spark.read.format('delta').option('timestampAsOf', '2019-01-01').table("people_10m")
display(df1)

Please refer to: https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel
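
If your Delta tables are only folders in S3 rather than tables registered in the metastore, the same reader can load them by path. A minimal sketch, using a hypothetical bucket path:

# Read a Delta table directly from its S3 path; the timestampAsOf option is only needed for time travel.
df1 = (spark.read.format("delta")
       .option("timestampAsOf", "2019-01-01")
       .load("s3a://my-bucket/tables/people"))  # hypothetical path
display(df1)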

Please let us know if this helps.

Kaniz
Community Manager

Hi @Ovidiu Eremia, we haven't heard from you since the last response from @Debayan Mukherjee, and I was checking back to see if you have a resolution yet.

If you have a solution, please share it with the community, as it can be helpful to others. Otherwise, we will follow up with more details and try to help.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

Ovi
New Contributor III

Thank you, @Debayan Mukherjee, but I think I was misunderstood. Let me give you more details:

  • I need to compare several Delta tables, each with a different schema, against their analogous Avro schemas.
  • I've managed to build a DataFrame with the Avro schemas using wholeTextFiles from the Spark RDD API, and I want to do something similar for the Delta schemas of those Delta parquet files.
  • Because those Delta tables have different schemas, I can't use the standard Spark methods, and I guess I need to loop in Scala through all those folders of parquet files and load each of them separately (see the sketch after this list).
  • But I wanted to know whether there is another method similar to wholeTextFiles for text files.
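
A minimal sketch of that loop (in Python here, to match the other code in this thread; the base path and the assumption that each top-level folder holds one Delta table are mine, so adjust to your layout):

# Hypothetical base path; each top-level folder is assumed to hold one Delta table.
base_path = "s3a://my-delta-bucket/tables/"
rows = []
for entry in dbutils.fs.ls(base_path):
    # dbutils.fs.ls marks folders with a trailing "/" in the entry name.
    if not entry.name.endswith("/"):
        continue
    table_path = entry.path
    table_name = entry.name.rstrip("/")
    # load() is lazy; .schema is taken from the Delta transaction log, so no data scan is triggered.
    schema_json = spark.read.format("delta").load(table_path).schema.json()
    rows.append((table_path, table_name, schema_json))

# One row per table: path, name, and Delta schema as JSON, ready to compare against the Avro schemas.
schemas_df = spark.createDataFrame(rows, ["path", "name", "delta_schema"])
display(schemas_df)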

Thank you,

Ovi

Anonymous
Not applicable

Hi @Ovidiu Eremia,

Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Hubert-Dudek
Esteemed Contributor III

You can mount the S3 bucket or read from it directly.

# Read the AWS credentials from a Databricks secret scope and pass them to the S3A connector.
access_key = dbutils.secrets.get(scope="aws", key="aws-access-key")
secret_key = dbutils.secrets.get(scope="aws", key="aws-secret-key")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)

# If you are using Auto Loader file notification mode to load files, provide the AWS Region ID.
aws_region = "aws-region-id"
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com")

# With the credentials set, paths in the bucket can be read directly via s3a:// URIs.
aws_bucket_name = "<aws-bucket-name>"
myRDD = sc.textFile("s3a://%s/.../..." % aws_bucket_name)
myRDD.count()

For mounting:

# Read the AWS credentials from a secret scope; the secret key must be URL-encoded for the mount URI.
access_key = dbutils.secrets.get(scope="aws", key="aws-access-key")
secret_key = dbutils.secrets.get(scope="aws", key="aws-secret-key")
encoded_secret_key = secret_key.replace("/", "%2F")
aws_bucket_name = "<aws-bucket-name>"
mount_name = "<mount-name>"

# Mount the bucket under /mnt/<mount-name> and list its contents.
dbutils.fs.mount(f"s3a://{access_key}:{encoded_secret_key}@{aws_bucket_name}", f"/mnt/{mount_name}")
display(dbutils.fs.ls(f"/mnt/{mount_name}"))
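
Once the bucket is mounted, each Delta folder can be read by path as well. A minimal sketch, assuming a hypothetical folder my_table under the mount:

# Read one Delta table folder from the mount point and inspect its schema.
delta_df = spark.read.format("delta").load(f"/mnt/{mount_name}/my_table")
delta_df.printSchema()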
