Databricks Community

biafch · ‎08-12-2024

Hi,

I have a folder that contains subfolders that have json files.

My subfolders look like this:

2024-08-12T09:34:37:452Z
2024-08-12T09:25:45:185Z

I assign each subfolder name to a variable called FolderName and then try to read my json file like this:

df = spark.read.option("multiLine", "true").json(f"/mnt/middleware/changerequests/1. ingest/{FolderName}/changerequests.json")

However, when I try to read the file it gives me the error:

Unable to infer schema due to a colon in the file name. To fix the issue, either rename all files with a colon or specify a schema manually.
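
The error message itself points at a second option: supplying a schema so that Spark does not have to infer one. Below is a minimal sketch of that idea. The column names (`id`, `title`, `status`) and the helper name `read_json_with_schema` are hypothetical, since the thread never shows the contents of changerequests.json, and whether this avoids the colon problem on a particular storage backend has not been verified here.

```python
# Hypothetical schema: the real fields of changerequests.json are not shown
# in this thread, so replace these columns with the actual ones.
CHANGE_REQUEST_SCHEMA = "id STRING, title STRING, status STRING"


def read_json_with_schema(spark, path, schema=CHANGE_REQUEST_SCHEMA):
    """Read a multi-line JSON file with an explicit schema (no inference)."""
    return (
        spark.read
        .schema(schema)               # accepts a DDL string or a StructType
        .option("multiLine", "true")
        .json(path)
    )


# Usage in a notebook where `spark` and `FolderName` already exist:
# df = read_json_with_schema(
#     spark,
#     f"/mnt/middleware/changerequests/1. ingest/{FolderName}/changerequests.json",
# )
```

Passing `spark` in as a parameter keeps the helper independent of the notebook's global session.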

I tried to replace the colons with %3A but then I am getting the error:

Path does not exist: dbfs:/mnt/middleware/changerequests/1. ingest/2024-08-12T09%3A34%3A37%3A452Z/changerequests.json

Can anyone give me any suggestions please?

I am trying to read the

szymon_dybczak · ‎08-12-2024

Hi @biafch ,

I tried to replicate your example and it worked for me. However, this seems to be a common problem, and some object storage may not support colons in paths:
[HADOOP-14217] Object Storage: support colon in object path - ASF JIRA (apache.org)
Which object storage are you using? AWS, GCP or Azure?

folder_name = '2024-08-12T09:34:37:452Z'
file_path = f'/mnt/lakehouse/test/{folder_name}/search_console_data_0.json'

spark.read.option("multiline", "true").json(file_path)

biafch · ‎08-12-2024

Hello @szymon_dybczak

I use Azure object storage.

I have solved it now with a workaround. I was using Azure Data Factory to copy the files into the Azure storage folder, so instead of naming the folder with the format "yyyy-MM-ddTHH:mm:ss:fffK", I now name it with "yyyy-MM-ddTHH-mm-ss-fffK".

Then in Databricks I parse the folder names with Python's datetime.strptime, using the format "%Y-%m-%dT%H-%M-%S-%fZ", so that I can find the latest folder and read it with:

df = spark.read.option("multiLine", "true").json(f"/mnt/middleware/changerequests/1. ingest/{FolderName}/changerequests.json")

So not an ideal solution, unfortunately, but the workaround fixed it.
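
The folder selection described above can be sketched roughly as follows. The helper name `latest_folder` and the sample folder names are made up for illustration (they are hyphenated versions of the folders in the question); in a notebook the names would come from listing the ingest directory, which is not shown here.

```python
from datetime import datetime

# Matches hyphenated names such as 2024-08-12T09-34-37-452Z
FOLDER_FORMAT = "%Y-%m-%dT%H-%M-%S-%fZ"


def latest_folder(folder_names):
    """Return the folder name with the most recent timestamp."""
    # Directory listings often end names with "/", so strip it before parsing.
    cleaned = [name.rstrip("/") for name in folder_names]
    return max(cleaned, key=lambda name: datetime.strptime(name, FOLDER_FORMAT))


# Illustrative names only, not taken from a real listing.
folders = ["2024-08-12T09-25-45-185Z", "2024-08-12T09-34-37-452Z/"]
FolderName = latest_folder(folders)
path = f"/mnt/middleware/changerequests/1. ingest/{FolderName}/changerequests.json"
```

Because these names are fixed-width and zero-padded, plain string sorting would pick the same folder; parsing with strptime has the side benefit of raising a ValueError on any name that does not match the expected format.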

By the way, which object storage are you using? I am wondering why it works for you and not for me. In the Hadoop link you shared I can't find anything about it not working in Azure.

How to load a json file in pyspark with colon character in folder name
