Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Construct DataFrame or RDD from S3 bucket with Delta tables

Ovi
New Contributor III

Hi all!

I have an S3 bucket containing Delta tables (Parquet files/folders), each with a different schema. I need to create an RDD or DataFrame from all of those Delta tables that contains the path, name, and schema of each one.

How could I do that?

Thank you!

PS: I need this so I can compare each table's Delta schema with the Avro schema of the same (or at least a similar) table in another S3 bucket.
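To make it concrete, the result I'm after would look roughly like this (just a sketch; the bucket path, table names and schemas below are made up):

# Made-up example of the output shape I need: one row per Delta table
expected = spark.createDataFrame(
    [
        ("s3a://my-bucket/tables/customers", "customers", "id INT, name STRING"),
        ("s3a://my-bucket/tables/orders", "orders", "order_id INT, customer_id INT, amount DOUBLE"),
    ],
    ["path", "name", "schema"],
)
display(expected)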

4 REPLIES

Debayan
Databricks Employee

Hi @Ovidiu Eremia, DataFrameReader options allow you to create a DataFrame from a Delta table that is pinned to a specific version of the table, for example in Python:

# Read the Delta table as it was at the given timestamp (time travel)
df1 = spark.read.format('delta').option('timestampAsOf', '2019-01-01').table("people_10m")
display(df1)

Please refer to https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel

Please let us know if this helps.

Ovi
New Contributor III

Thank you @Debayan Mukherjee, but I think my question was misunderstood. Let me give you more details:

  • I need to compare several Delta tables, each with a different schema, against their analogous Avro schemas.
  • I've managed to build a DataFrame with the Avro schemas using wholeTextFiles from the Spark RDD API, and I want to do something similar for the Delta schemas of those Delta Parquet files.
  • Because those Delta tables have different schemas, I can't use the standard Spark methods; I guess I need to loop (in Scala) through all those folders of Parquet files and load each of them separately, roughly as in the sketch after this list.
  • But I wanted to know if there is another method, similar to wholeTextFiles for text files.
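
Something along these lines is what I had in mind (a rough, untested sketch, in PySpark for brevity; the bucket path is a placeholder and I'm assuming every top-level folder in the bucket is a Delta table):

base_path = "s3a://my-delta-bucket/"  # placeholder for the real bucket path

rows = []
for folder in dbutils.fs.ls(base_path):
    if folder.isDir():
        # Read each Delta folder separately and keep its path, name and schema (as JSON)
        delta_schema = spark.read.format("delta").load(folder.path).schema
        rows.append((folder.path, folder.name.rstrip("/"), delta_schema.json()))

delta_schemas_df = spark.createDataFrame(rows, ["path", "name", "schema"])
display(delta_schemas_df)

That would give me one row per table, which I could then join with the Avro-schema DataFrame by name.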

Thank you,

Ovi

Anonymous
Not applicable

Hi @Ovidiu Eremia,

Hope all is well! Just checking in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark the most helpful answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Hubert-Dudek
Esteemed Contributor III

You can mount the S3 bucket or read from it directly.

# Read the AWS credentials from a Databricks secret scope and configure S3A access
access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)

# If you are using Auto Loader file notification mode to load files, provide the AWS Region ID.
aws_region = "aws-region-id"
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com")

# Bucket name placeholder (same as in the mount example below)
aws_bucket_name = "<aws-bucket-name>"
myRDD = sc.textFile("s3a://%s/.../..." % aws_bucket_name)
myRDD.count()

To mount the bucket:

access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
encoded_secret_key = secret_key.replace("/", "%2F")
aws_bucket_name = "<aws-bucket-name>"
mount_name = "<mount-name>"
 
dbutils.fs.mount(f"s3a://{access_key}:{encoded_secret_key}@{aws_bucket_name}", f"/mnt/{mount_name}")
display(dbutils.fs.ls(f"/mnt/{mount_name}"))
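
Once the bucket is mounted (or the s3a credentials above are set), you can read a single Delta folder either through the mount or directly over s3a, for example (the folder name below is a placeholder):

# "some_table" is a placeholder for one of the Delta folders in the bucket
df_via_mount = spark.read.format("delta").load(f"/mnt/{mount_name}/some_table")
df_via_s3a = spark.read.format("delta").load(f"s3a://{aws_bucket_name}/some_table")
df_via_mount.printSchema()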
