Using spark.read.json with a {} literal in my path
01-22-2025 10:42 AM
I am pulling data from an S3 bucket using spark.read.json like this
s3_uri = "s3://snowflake-genesys/v2.outbound.campaigns.{id}/2025-01-22/00/"
df = spark.read.json(s3_uri)
My S3 URL has a literal {id} in the file path. I have tried r"s3://snowflake-genesys/v2.outbound.campaigns.{id}/2025-01-22/00/" and f"s3://snowflake-genesys/v2.outbound.campaigns.{{id}}/2025-01-22/00/".
I can get dbutils.fs.ls(f"s3://snowflake-genesys/v2.outbound.campaigns.{{id}}/2025-01-22/00/") to return the files for me.
I can get it to work with a wildcard, but that's not optimal because I have other large folders between campaigns and /2025-01-22/. This returns way too much data:
"s3://snowflake-genesys/v2.outbound.campaigns.*/2025-01-22/00/". I have other folders like this: v2.outbound.campaigns.{id}.progress/2025-01-22/
How can I get Spark to read the files correctly? What am I doing wrong?
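One possible way to narrow the wildcard, sketched here as an assumption rather than a confirmed fix: replace each brace with the single-character wildcard ?, so the pattern matches v2.outbound.campaigns.{id} but not v2.outbound.campaigns.{id}.progress. The snippet uses Python's fnmatch only to approximate how * and ? behave. Spark's own path globbing is not identical, and ? would also match any other single character in those positions. The folder names are the two mentioned in the post.

```python
from fnmatch import fnmatchcase

# Folder names taken from the post; a real bucket may contain more.
folders = [
    "v2.outbound.campaigns.{id}",
    "v2.outbound.campaigns.{id}.progress",
]

broad = "v2.outbound.campaigns.*"      # the wildcard that returns too much
narrow = "v2.outbound.campaigns.?id?"  # '?' stands in for '{' and '}'

broad_hits = [f for f in folders if fnmatchcase(f, broad)]
narrow_hits = [f for f in folders if fnmatchcase(f, narrow)]

print(broad_hits)   # both folders
print(narrow_hits)  # only the folder without the .progress suffix
```

If this carries over to Spark, the path to try would be "s3://snowflake-genesys/v2.outbound.campaigns.?id?/2025-01-22/00/".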
2 REPLIES
01-22-2025 12:00 PM
Hi @johngabbradley,
Would the approach below work for you?
s3_uri = "s3://snowflake-genesys/v2.outbound.campaigns.{id}/2025-01-22/00/"
files = dbutils.fs.ls(s3_uri)
file_paths = [file.path for file in files]
df = spark.read.json(file_paths)
01-22-2025 12:06 PM
Thanks so much for responding. It is still bombing out:
Path does not exist: s3://snowflake-genesys/v2.outbound.campaigns.{id}/2025-01-22/00/002054-134158ad-1647-f75a-7cd9-b36910365e09.json. SQLSTATE: 42K03
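A likely explanation, not confirmed in this thread: Spark passes read paths through Hadoop-style glob expansion, where {a,b} means "a or b". Under that assumption {id} is read as the single alternative id, so Spark looks for v2.outbound.campaigns.id, and listing the files first does not help because every returned path still contains the braces. The sketch below backslash-escapes glob metacharacters before the path reaches spark.read.json. The helper name escape_glob is hypothetical, and whether the escape is honored may depend on the Spark or Databricks runtime and the S3 filesystem implementation.

```python
import re

# Characters that Hadoop-style path globbing treats as special (assumed set).
_GLOB_CHARS = re.compile(r"([{}\[\]*?\\])")


def escape_glob(path: str) -> str:
    """Backslash-escape glob metacharacters so they are matched literally."""
    return _GLOB_CHARS.sub(r"\\\1", path)


s3_uri = "s3://snowflake-genesys/v2.outbound.campaigns.{id}/2025-01-22/00/"
escaped_uri = escape_glob(s3_uri)
print(escaped_uri)
# s3://snowflake-genesys/v2.outbound.campaigns.\{id\}/2025-01-22/00/

# In a notebook where `spark` is already defined, the read would then be:
# df = spark.read.json(escaped_uri)
```

The same helper could be applied to each entry of file_paths from the earlier reply before passing the list to spark.read.json.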