
How to load a JSON file in PySpark with a colon character in the folder name

biafch
Contributor

Hi,


I have a folder containing subfolders, each of which holds JSON files.


My subfolders look like this:

  • 2024-08-12T09:34:37:452Z
  • 2024-08-12T09:25:45:185Z

I assign each subfolder name to a variable called FolderName and then try to read my JSON file like this:

  • df = spark.read.option("multiLine", "true").json(f"/mnt/middleware/changerequests/1. ingest/{FolderName}/changerequests.json")
However, when I try to read the file it gives me the error:
  • Unable to infer schema due to a colon in the file name. To fix the issue, either rename all files with a colon or specify a schema manually.
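
For reference, the error's second suggestion (specifying a schema manually) would look roughly like the sketch below; the field names are hypothetical, since the structure of changerequests.json isn't shown, and it may still fail if the storage layer itself rejects colons in paths:

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical fields -- replace with the real structure of changerequests.json
schema = StructType([
    StructField("id", StringType(), True),
    StructField("status", StringType(), True),
])

# Providing the schema up front skips inference, which is what the error complains about
df = (
    spark.read
    .schema(schema)
    .option("multiLine", "true")
    .json(f"/mnt/middleware/changerequests/1. ingest/{FolderName}/changerequests.json")
)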

I tried replacing the colons with %3A, but then I get this error:

  • Path does not exist: dbfs:/mnt/middleware/changerequests/1. ingest/2024-08-12T09%3A34%3A37%3A452Z/changerequests.json
 
Can anyone give me any suggestions please?
 


2 REPLIES

szymon_dybczak
Contributor III

Hi @biafch ,

I've tried to replicate your example and it worked for me. But it seems this is a common problem, and some object stores don't support colons in object paths:
[HADOOP-14217] Object Storage: support colon in object path - ASF JIRA (apache.org)
Which object storage are you using: AWS, GCP, or Azure?

Here's what I ran to test it:
folder_name = '2024-08-12T09:34:37:452Z'
file_path = f'/mnt/lakehouse/test/{folder_name}/search_console_data_0.json'

# Reading a multi-line JSON file from a folder whose name contains colons
spark.read.option("multiline", "true").json(file_path)

 

 

[screenshot: Slash_0-1723460041690.png, showing the read succeeding]

 

biafch
Contributor

Hello @szymon_dybczak,

I'm using object storage on Azure.


I have now solved it with a workaround. With Azure Data Factory I was copying the files into Azure storage; instead of naming each folder "yyyy-MM-ddTHH:mm:ss:fffK", I now name it "yyyy-MM-ddTHH-mm-ss-fffK" (hyphens instead of colons).

 

Then in Databricks I parse the folder names with Python's datetime.strptime using the format "%Y-%m-%dT%H-%M-%S-%fZ", so I can find the latest folder (a sketch of that lookup follows below) and read it with:

  • df = spark.read.option("multiLine", "true").json(f"/mnt/middleware/changerequests/1. ingest/{FolderName}/changerequests.json")
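
A minimal sketch of that latest-folder lookup (assuming the subfolders sit directly under the ingest path and this runs in a Databricks notebook where dbutils is available):

from datetime import datetime

# List the ingest subfolders; dbutils.fs.ls entries keep a trailing slash on directory names
base_path = "/mnt/middleware/changerequests/1. ingest"
folder_names = [f.name.strip("/") for f in dbutils.fs.ls(base_path)]

# Parse the colon-free timestamps and pick the newest folder
latest = max(folder_names, key=lambda n: datetime.strptime(n, "%Y-%m-%dT%H-%M-%S-%fZ"))

df = spark.read.option("multiLine", "true").json(f"{base_path}/{latest}/changerequests.json")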

Not an ideal solution, unfortunately, but the workaround fixed it for me.

 

Btw, which object storage are you using? I'm wondering why it works for you and not for me... In the Hadoop link you shared, I can't find anything saying it doesn't work on Azure.
