How to load a json file in pyspark with colon character in folder name

biafch — Mon, 12 Aug 2024 10:13:45 GMT

Hi,

I have a folder whose subfolders contain JSON files.

My subfolders look like this:

2024-08-12T09:34:37:452Z
2024-08-12T09:25:45:185Z

I assign one of these subfolder names to a variable called FolderName and then try to read my JSON file like this:

df = spark.read.option("multiLine", "true").json(f"/mnt/middleware/changerequests/1. ingest/{FolderName}/changerequests.json")

However, when I try to read the file it gives me the error:

Unable to infer schema due to a colon in the file name. To fix the issue, either rename all files with a colon or specify a schema manually.

I tried replacing the colons with %3A, but then I get the error:

Path does not exist: dbfs:/mnt/middleware/changerequests/1. ingest/2024-08-12T09%3A34%3A37%3A452Z/changerequests.json
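
The first error message names two options: rename the paths or specify a schema manually. A minimal sketch of the second option follows, assuming a notebook where spark is already defined. The field names in the schema string are hypothetical placeholders, and build_path and read_change_requests are illustrative helper names. The thread does not confirm that an explicit schema avoids the problem on every storage backend.

```python
# Hypothetical schema (DDL string): replace these fields with the real
# ones found in changerequests.json.
CHANGE_REQUEST_SCHEMA = "id STRING, title STRING, status STRING"

# Base path taken from the read call in the post.
BASE_PATH = "/mnt/middleware/changerequests/1. ingest"


def build_path(folder_name: str) -> str:
    """Return the full path of changerequests.json inside one subfolder."""
    return f"{BASE_PATH}/{folder_name}/changerequests.json"


def read_change_requests(spark, folder_name: str, schema: str = CHANGE_REQUEST_SCHEMA):
    """Read the JSON file with an explicit schema so Spark skips schema inference."""
    return (
        spark.read
        .schema(schema)
        .option("multiLine", "true")
        .json(build_path(folder_name))
    )
```

In a notebook this could be called as df = read_change_requests(spark, FolderName).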

Can anyone give me any suggestions please?

Re: How to load a json file in pyspark with colon character in folder name

szymon_dybczak — Mon, 12 Aug 2024 11:18:32 GMT

Hi @biafch ,

I've tried to replicate your example and it worked for me. It does seem to be a common problem, though, and some object storage may not support colons in object paths; see:
[HADOOP-14217] Object Storage: support colon in object path - ASF JIRA (apache.org)
Which object storage are you using? AWS, GCP or Azure?

folder_name = '2024-08-12T09:34:37:452Z'
file_path = f'/mnt/lakehouse/test/{folder_name}/search_console_data_0.json'
spark.read.option("multiline", "true").json(file_path)

Re: How to load a json file in pyspark with colon character in folder name

biafch — Mon, 12 Aug 2024 11:39:29 GMT

Hello @szymon_dybczak

I use Azure object storage.

I have solved it now with a workaround. I was using Azure Data Factory to copy the files to the Azure storage folder. Instead of naming the folder with the format "yyyy-MM-ddTHH:mm:ss:fffK", I now name it with "yyyy-MM-ddTHH-mm-ss-fffK".
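
To illustrate what the hyphen-based naming produces, here is a small Python sketch. It is not the Data Factory expression itself: colon_free_folder_name and replace_colons are hypothetical helper names, and the trailing "Z" assumes the timestamp is in UTC.

```python
from datetime import datetime


def colon_free_folder_name(ts: datetime) -> str:
    """Format a timestamp as e.g. 2024-08-12T09-34-37-452Z (hyphens, milliseconds).

    Assumes ts is a UTC timestamp, since a literal "Z" is appended.
    """
    millis = ts.microsecond // 1000  # keep three digits, like "fff"
    return ts.strftime("%Y-%m-%dT%H-%M-%S") + f"-{millis:03d}Z"


def replace_colons(folder_name: str) -> str:
    """Turn an existing colon-separated folder name into the hyphen form."""
    return folder_name.replace(":", "-")
```

For example, replace_colons("2024-08-12T09:34:37:452Z") gives "2024-08-12T09-34-37-452Z".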

Then in Databricks I use Python's datetime.strptime with the format "%Y-%m-%dT%H-%M-%S-%fZ" to parse the folder names, so that I can find the latest folder and read it with:

df = spark.read.option("multiLine", "true").json(f"/mnt/middleware/changerequests/1. ingest/{FolderName}/changerequests.json")
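
The "latest folder" step described above could look roughly like this. It is a sketch, not the poster's actual code: latest_folder is a hypothetical helper name, and the list of folder names is assumed to have been collected already (for example from a directory listing).

```python
from datetime import datetime

# Format string quoted in the post; %f accepts the three-digit milliseconds.
FOLDER_FORMAT = "%Y-%m-%dT%H-%M-%S-%fZ"


def latest_folder(folder_names):
    """Return the folder name whose parsed timestamp is the most recent."""
    return max(folder_names, key=lambda name: datetime.strptime(name, FOLDER_FORMAT))


folder_names = ["2024-08-12T09-25-45-185Z", "2024-08-12T09-34-37-452Z"]
FolderName = latest_folder(folder_names)  # "2024-08-12T09-34-37-452Z"
```

Because these names are zero-padded and fixed-width, plain string comparison would give the same order; parsing has the advantage of raising a ValueError on any name that does not match the expected format.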

So it's not an ideal solution, unfortunately, but the workaround fixed it.

Btw, which object storage are you using? I am wondering why it works for you and not for me... In the Hadoop link you shared I can't find anything saying it doesn't work on Azure.