<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to load a json file in pyspark with colon character in folder name in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-load-a-json-file-in-pyspark-with-colon-character-in/m-p/82733#M36732</link>
    <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I use object storage from Azure.&lt;/P&gt;&lt;P&gt;I have solved it with a workaround. I copy the files to the folder in Azure storage with Azure Data Factory. Instead of naming the folder with the format "yyyy-MM-ddTHH:mm:ss:fffK", I now name it with "yyyy-MM-ddTHH-mm-ss-fffK".&lt;/P&gt;&lt;P&gt;Then in Databricks I parse the folder names with Python's datetime.strptime and the format "%Y-%m-%dT%H-%M-%S-%fZ", so that I can find the latest folder and read it with:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.read.option("multiLine", "true").json(f"/mnt/middleware/changerequests/1. ingest/{FolderName}/changerequests.json")&lt;/LI-CODE&gt;&lt;P&gt;Not an ideal solution, unfortunately, but the workaround fixed it.&lt;/P&gt;&lt;P&gt;By the way, which object storage are you using? I am wondering why it works for you and not for me. I can't find anything in the Hadoop link you shared about this not working in Azure.&lt;/P&gt;</description>
    <pubDate>Mon, 12 Aug 2024 11:39:29 GMT</pubDate>
    <dc:creator>biafch</dc:creator>
    <dc:date>2024-08-12T11:39:29Z</dc:date>
    <item>
      <title>How to load a json file in pyspark with colon character in folder name</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-load-a-json-file-in-pyspark-with-colon-character-in/m-p/82722#M36726</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have a folder containing subfolders that hold JSON files.&lt;/P&gt;&lt;P&gt;My subfolders look like this:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;2024-08-12T09:34:37:452Z&lt;/LI&gt;&lt;LI&gt;2024-08-12T09:25:45:185Z&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I assign a subfolder name to a variable called FolderName and then try to read my JSON file like this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.read.option("multiLine", "true").json(f"/mnt/middleware/changerequests/1. ingest/{FolderName}/changerequests.json")&lt;/LI-CODE&gt;&lt;P&gt;However, when I try to read the file, I get this error:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Unable to infer schema due to a colon in the file name. To fix the issue, either rename all files with a colon or specify a schema manually.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I tried replacing the colons with %3A, but then I get this error:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Path does not exist: dbfs:/mnt/middleware/changerequests/1. ingest/2024-08-12T09%3A34%3A37%3A452Z/changerequests.json&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Can anyone give me any suggestions, please?&lt;/P&gt;&lt;P&gt;I am trying to read the&lt;/P&gt;</description>
      <pubDate>Mon, 12 Aug 2024 10:13:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-load-a-json-file-in-pyspark-with-colon-character-in/m-p/82722#M36726</guid>
      <dc:creator>biafch</dc:creator>
      <dc:date>2024-08-12T10:13:45Z</dc:date>
    </item>
    <item>
      <title>Re: How to load a json file in pyspark with colon character in folder name</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-load-a-json-file-in-pyspark-with-colon-character-in/m-p/82729#M36730</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/115205"&gt;@biafch&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I tried to replicate your example, and it worked for me. However, this seems to be a common problem, and some object storage may not support colons in paths:&lt;BR /&gt;&lt;A href="https://issues.apache.org/jira/browse/HADOOP-14217" target="_blank"&gt;[HADOOP-14217] Object Storage: support colon in object path - ASF JIRA (apache.org)&lt;/A&gt;&lt;BR /&gt;Which object storage are you using: AWS, GCP, or Azure?&lt;/P&gt;&lt;LI-CODE lang="python"&gt;folder_name = '2024-08-12T09:34:37:452Z'
file_path = f'/mnt/lakehouse/test/{folder_name}/search_console_data_0.json'

spark.read.option("multiline", "true").json(file_path)&lt;/LI-CODE&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Slash_0-1723460041690.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/10303i778BF58A81B9C25D/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="Slash_0-1723460041690.png" alt="Slash_0-1723460041690.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 12 Aug 2024 11:18:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-load-a-json-file-in-pyspark-with-colon-character-in/m-p/82729#M36730</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-08-12T11:18:32Z</dc:date>
    </item>
    <item>
      <title>Re: How to load a json file in pyspark with colon character in folder name</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-load-a-json-file-in-pyspark-with-colon-character-in/m-p/82733#M36732</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I use object storage from Azure.&lt;/P&gt;&lt;P&gt;I have solved it with a workaround. I copy the files to the folder in Azure storage with Azure Data Factory. Instead of naming the folder with the format "yyyy-MM-ddTHH:mm:ss:fffK", I now name it with "yyyy-MM-ddTHH-mm-ss-fffK".&lt;/P&gt;&lt;P&gt;Then in Databricks I parse the folder names with Python's datetime.strptime and the format "%Y-%m-%dT%H-%M-%S-%fZ", so that I can find the latest folder and read it with:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.read.option("multiLine", "true").json(f"/mnt/middleware/changerequests/1. ingest/{FolderName}/changerequests.json")&lt;/LI-CODE&gt;&lt;P&gt;Not an ideal solution, unfortunately, but the workaround fixed it.&lt;/P&gt;&lt;P&gt;By the way, which object storage are you using? I am wondering why it works for you and not for me. I can't find anything in the Hadoop link you shared about this not working in Azure.&lt;/P&gt;</description>
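The "find the latest folder" step in the reply above might look roughly like the sketch below. It is not the poster's actual code: the helper name latest_folder and the sample folder list are assumptions made for illustration, and only the strptime format string and the read call come from the post.

```python
from datetime import datetime

# Format from the post: colons replaced by hyphens, e.g. "2024-08-12T09-34-37-452Z".
FOLDER_FORMAT = "%Y-%m-%dT%H-%M-%S-%fZ"


def latest_folder(folder_names):
    """Return the folder name whose parsed timestamp is the most recent.

    Hypothetical helper, not part of the original post.
    """
    return max(folder_names, key=lambda name: datetime.strptime(name, FOLDER_FORMAT))


# Hypothetical folder names, modeled on the ones in the question.
folders = ["2024-08-12T09-25-45-185Z", "2024-08-12T09-34-37-452Z"]
FolderName = latest_folder(folders)

# The read itself is unchanged from the post. It needs a Spark session,
# so it is left commented out in this standalone sketch:
# df = spark.read.option("multiLine", "true").json(
#     f"/mnt/middleware/changerequests/1. ingest/{FolderName}/changerequests.json")
```

Parsing the names into datetime values, rather than comparing the raw strings, keeps the ordering tied to the timestamp itself and raises a ValueError for any folder that does not follow the expected naming.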
      <pubDate>Mon, 12 Aug 2024 11:39:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-load-a-json-file-in-pyspark-with-colon-character-in/m-p/82733#M36732</guid>
      <dc:creator>biafch</dc:creator>
      <dc:date>2024-08-12T11:39:29Z</dc:date>
    </item>
  </channel>
</rss>

