Data Engineering

Construct DataFrame or RDD from S3 bucket with Delta tables

Ovi
New Contributor III

Hi all!

I have an S3 bucket containing Delta tables (Parquet files/folders), each with a different schema. I need to create an RDD or DataFrame over all those Delta tables that contains the path, name, and schema of each.

How could I do that?

Thank you!

PS: I need this to be able to compare their Delta schema with the Avro schema of the same (or at least similar) tables from another S3 bucket.

4 REPLIES

Debayan
Esteemed Contributor III

Hi @Ovidiu Eremia, DataFrameReader options allow you to create a DataFrame from a Delta table that is fixed to a specific version of the table, for example in Python:

df1 = spark.read.format('delta').option('timestampAsOf', '2019-01-01').table("people_10m")
display(df1)

Please refer: https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel
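If the tables are not registered in the metastore, the same reader options work with a path-based load(); a minimal sketch, with the S3 path as a placeholder:

# Placeholder path to one Delta table folder in S3
delta_path = "s3a://<aws-bucket-name>/<table-folder>"
 
# Read the current version of the Delta table directly by path
df = spark.read.format('delta').load(delta_path)
 
# Or pin it to an earlier version with time travel
df_old = spark.read.format('delta').option('timestampAsOf', '2019-01-01').load(delta_path)
display(df_old)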

Please let us know if this helps.

Ovi
New Contributor III

Thank you @Debayan Mukherjee, but I think I was misunderstood. Let me give you more details:

  • I need to compare several Delta tables, each with a different schema, with their analogous Avro schemas.
  • I've managed to build a DataFrame with the Avro schemas using wholeTextFiles from the Spark RDD API, and I want to do something similar for the Delta schemas of those Delta Parquet files.
  • Because those Delta tables have different schemas, I can't read them with a single standard Spark call, and I guess I need to loop in Scala through all those folders of Parquet files and load each of them separately (see the sketch after this list).
  • But I wanted to know if there is another method similar to wholeTextFiles for text files.
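
Here is a rough sketch of the loop I have in mind (written in PySpark to match the other replies here; the root path and the one-folder-per-table layout are just placeholders for my real bucket):

# Hypothetical root path that holds one sub-folder per Delta table
root_path = "s3a://<aws-bucket-name>/delta-tables/"
 
rows = []
for folder in dbutils.fs.ls(root_path):
    # Treat a folder as a Delta table only if it contains a _delta_log directory
    has_delta_log = any(f.name.rstrip("/") == "_delta_log" for f in dbutils.fs.ls(folder.path))
    if not has_delta_log:
        continue
    # Load the table and keep its path, name and schema (as JSON) in one row
    schema_json = spark.read.format("delta").load(folder.path).schema.json()
    rows.append((folder.path, folder.name.rstrip("/"), schema_json))
 
# One row per Delta table: path, name, schema
schemas_df = spark.createDataFrame(rows, ["path", "name", "schema_json"])
display(schemas_df)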

Thank you,

Ovi

Anonymous
Not applicable

Hi @Ovidiu Eremia

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Hubert-Dudek
Esteemed Contributor III

You can mount the S3 bucket or read from it directly.

access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
 
# If you are using Auto Loader file notification mode to load files, provide the AWS Region ID.
aws_region = "aws-region-id"
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com")
 
aws_bucket_name = "<aws-bucket-name>"
myRDD = sc.textFile("s3a://%s/.../..." % aws_bucket_name)
myRDD.count()

To mount the bucket instead:

access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
encoded_secret_key = secret_key.replace("/", "%2F")
aws_bucket_name = "<aws-bucket-name>"
mount_name = "<mount-name>"
 
dbutils.fs.mount(f"s3a://{access_key}:{encoded_secret_key}@{aws_bucket_name}", f"/mnt/{mount_name}")
display(dbutils.fs.ls(f"/mnt/{mount_name}"))
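
Once mounted, the table folders show up under the mount point and you can read each Delta folder by path; a small sketch, with the table folder name as a placeholder:

# List the table folders under the mount
for f in dbutils.fs.ls(f"/mnt/{mount_name}"):
    print(f.path)
 
# Read one Delta table directly from the mounted path (placeholder folder name)
df = spark.read.format("delta").load(f"/mnt/{mount_name}/<table-folder>")
df.printSchema()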
