10-18-2022 09:31 AM
Hi all!
I have an S3 bucket containing Delta tables (Parquet files/folders), each with a different schema. I need to create an RDD or DataFrame from all those Delta tables that contains the path, name, and schema of each.
How could I do that?
Thank you!
PS: I need this to be able to compare their Delta schemas with the Avro schemas of the same (or at least similar) tables from another S3 bucket.
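To illustrate, the kind of result I'm after would look roughly like this (a sketch only; the bucket path is a placeholder, and it assumes each top-level folder under the prefix is a Delta table):
# Sketch of the desired result: one row per Delta table with its path, name, and schema.
# base_path is a placeholder; assumes each top-level folder is a Delta table.
base_path = "s3a://<my-bucket>/<prefix>/"
rows = []
for entry in dbutils.fs.ls(base_path):
    if entry.isDir():
        schema_json = spark.read.format("delta").load(entry.path).schema.json()
        rows.append((entry.path, entry.name.rstrip("/"), schema_json))
schemas_df = spark.createDataFrame(rows, ["path", "name", "schema"])
display(schemas_df)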
10-19-2022 12:34 AM
Hi @Ovidiu Eremia, DataFrameReader options allow you to create a DataFrame from a Delta table that is fixed to a specific version of the table, for example in Python:
df1 = spark.read.format('delta').option('timestampAsOf', '2019-01-01').table("people_10m")
display(df1)
Please refer to: https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel
Please let us know if this helps.
10-19-2022 01:50 AM
Hi @Ovidiu Eremia, we haven't heard from you since the last response from @Debayan Mukherjee, and I was checking back to see if you have a resolution yet.
If you have a solution, please share it with the community, as it can be helpful to others. Otherwise, we will respond with more details and try to help.
Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.
10-19-2022 02:00 AM
Thank you @Debayan Mukherjee, but I think I was misunderstood. Let me give you more details:
Thank you,
Ovi
11-27-2022 05:30 AM
Hi @Ovidiu Eremia,
Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!
12-05-2022 08:38 AM
You can mount the S3 bucket or read from it directly.
# Read AWS credentials from a Databricks secret scope
access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
# If you are using Auto Loader file notification mode to load files, provide the AWS Region ID.
aws_region = "aws-region-id"
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com")
# Read files from the bucket as an RDD (bucket name and path are placeholders)
aws_bucket_name = "<aws-bucket-name>"
myRDD = sc.textFile("s3a://%s/.../..." % aws_bucket_name)
myRDD.count()
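Once the credentials are set, you can also load a Delta table by path instead of reading raw text files, for example (the table path below is a placeholder):
# Load a Delta table directly from S3 by path (placeholder path)
df = spark.read.format("delta").load("s3a://%s/path/to/delta-table" % aws_bucket_name)
df.printSchema()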
For mounting:
# Read AWS credentials from a Databricks secret scope
access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
# URL-encode the secret key so it is safe to embed in the mount URI
encoded_secret_key = secret_key.replace("/", "%2F")
aws_bucket_name = "<aws-bucket-name>"
mount_name = "<mount-name>"
dbutils.fs.mount(f"s3a://{access_key}:{encoded_secret_key}@{aws_bucket_name}", f"/mnt/{mount_name}")
display(dbutils.fs.ls(f"/mnt/{mount_name}"))
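And to come back to the original question: once both buckets are reachable, you can compare a Delta table's schema with the schema inferred from the matching Avro files. A rough sketch (all paths are placeholders, and it assumes the Avro files can be read with Spark's built-in Avro source):
# Compare a Delta table's schema with the schema inferred from Avro files.
# Both paths below are placeholders.
delta_schema = spark.read.format("delta").load(f"/mnt/{mount_name}/path/to/delta-table").schema
avro_schema = spark.read.format("avro").load("s3a://<other-bucket>/path/to/avro-files").schema
# StructType equality is order-sensitive, so compare sorted (name, type) pairs instead.
delta_fields = sorted((f.name, f.dataType.simpleString()) for f in delta_schema.fields)
avro_fields = sorted((f.name, f.dataType.simpleString()) for f in avro_schema.fields)
print("Schemas match:", delta_fields == avro_fields)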