<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Construct Dataframe or RDD from S3 bucket with Delta tables in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/construct-dataframe-or-rdd-from-s3-bucket-with-delta-tables/m-p/26724#M18740</link>
    <description>&lt;P&gt;Hi @Ovidiu Eremia&amp;nbsp;, DataFrameReader options allow you to create a DataFrame from a Delta table that is fixed to a specific version of the table, for example in Python:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df1 = spark.read.format('delta').option('timestampAsOf', '2019-01-01').table("people_10m")
display(df1)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Please refer: &lt;A href="https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel" alt="https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel" target="_blank"&gt;https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please let us know if this helps. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 19 Oct 2022 07:34:13 GMT</pubDate>
    <dc:creator>Debayan</dc:creator>
    <dc:date>2022-10-19T07:34:13Z</dc:date>
    <item>
      <title>Construct Dataframe or RDD from S3 bucket with Delta tables</title>
      <link>https://community.databricks.com/t5/data-engineering/construct-dataframe-or-rdd-from-s3-bucket-with-delta-tables/m-p/26723#M18739</link>
      <description>&lt;P&gt;Hi all! &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have an S3 bucket with Delta parquet files/folders with different schemas each. I need to create an RDD or DataFrame from all those Delta Tables that should contain the path, name and different schema of each.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;How could I do that?&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;PS: I need this to be able to compare their Delta schema with the Avroschema of the same tables (or similar at least) from another S3 bucket.&lt;/P&gt;</description>
      <pubDate>Tue, 18 Oct 2022 16:31:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/construct-dataframe-or-rdd-from-s3-bucket-with-delta-tables/m-p/26723#M18739</guid>
      <dc:creator>Ovi</dc:creator>
      <dc:date>2022-10-18T16:31:21Z</dc:date>
    </item>
    <item>
      <title>Re: Construct Dataframe or RDD from S3 bucket with Delta tables</title>
      <link>https://community.databricks.com/t5/data-engineering/construct-dataframe-or-rdd-from-s3-bucket-with-delta-tables/m-p/26726#M18742</link>
      <description>&lt;P&gt;Thank you @Debayan Mukherjee&amp;nbsp;but I think I was misunderstood. Let me give you more details:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;I need to compare several Delta tables with different schema each with their analogue avro schemas&lt;/LI&gt;&lt;LI&gt;I've managed to build a dataframe with the avro schemas using wholeTextFiles from spark RDD and I want to do something similar for the Delta schemas of those Delta parquet files&lt;/LI&gt;&lt;LI&gt;Because those delta tables have different schemas I can't use the spark standard methods and I guess I need to do a loop in Scala through all those folders with parquet files and load each of them separately.&lt;/LI&gt;&lt;LI&gt;But I wanted to know if there would be another method similar to wholeTextFiles for text files.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you,&lt;/P&gt;&lt;P&gt;Ovi&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 19 Oct 2022 09:00:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/construct-dataframe-or-rdd-from-s3-bucket-with-delta-tables/m-p/26726#M18742</guid>
      <dc:creator>Ovi</dc:creator>
      <dc:date>2022-10-19T09:00:50Z</dc:date>
    </item>
    <item>
      <title>Re: Construct Dataframe or RDD from S3 bucket with Delta tables</title>
      <link>https://community.databricks.com/t5/data-engineering/construct-dataframe-or-rdd-from-s3-bucket-with-delta-tables/m-p/26727#M18743</link>
      <description>&lt;P&gt;Hi @Ovidiu Eremia&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or &lt;B&gt;mark an answer as best&lt;/B&gt;? Else please let us know if you need more help.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 27 Nov 2022 13:30:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/construct-dataframe-or-rdd-from-s3-bucket-with-delta-tables/m-p/26727#M18743</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-11-27T13:30:34Z</dc:date>
    </item>
    <item>
      <title>Re: Construct Dataframe or RDD from S3 bucket with Delta tables</title>
      <link>https://community.databricks.com/t5/data-engineering/construct-dataframe-or-rdd-from-s3-bucket-with-delta-tables/m-p/26728#M18744</link>
      <description>&lt;P&gt;You can mount S3 bucket or read directly from it.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
&amp;nbsp;
# If you are using Auto Loader file notification mode to load files, provide the AWS Region ID.
aws_region = "aws-region-id"
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com")
&amp;nbsp;
myRDD = sc.textFile("s3a://%s/.../..." % aws_bucket_name)
myRDD.count()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;for mount:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
encoded_secret_key = secret_key.replace("/", "%2F")
aws_bucket_name = "&amp;lt;aws-bucket-name&amp;gt;"
mount_name = "&amp;lt;mount-name&amp;gt;"
&amp;nbsp;
dbutils.fs.mount(f"s3a://{access_key}:{encoded_secret_key}@{aws_bucket_name}", f"/mnt/{mount_name}")
display(dbutils.fs.ls(f"/mnt/{mount_name}"))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 05 Dec 2022 16:38:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/construct-dataframe-or-rdd-from-s3-bucket-with-delta-tables/m-p/26728#M18744</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-12-05T16:38:19Z</dc:date>
    </item>
    <item>
      <title>Re: Construct Dataframe or RDD from S3 bucket with Delta tables</title>
      <link>https://community.databricks.com/t5/data-engineering/construct-dataframe-or-rdd-from-s3-bucket-with-delta-tables/m-p/26724#M18740</link>
      <description>&lt;P&gt;Hi @Ovidiu Eremia&amp;nbsp;, DataFrameReader options allow you to create a DataFrame from a Delta table that is fixed to a specific version of the table, for example in Python:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df1 = spark.read.format('delta').option('timestampAsOf', '2019-01-01').table("people_10m")
display(df1)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Please refer: &lt;A href="https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel" alt="https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel" target="_blank"&gt;https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please let us know if this helps. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 19 Oct 2022 07:34:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/construct-dataframe-or-rdd-from-s3-bucket-with-delta-tables/m-p/26724#M18740</guid>
      <dc:creator>Debayan</dc:creator>
      <dc:date>2022-10-19T07:34:13Z</dc:date>
    </item>
  </channel>
</rss>

