<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Parallel processing of json files in databricks pyspark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/parallel-processing-of-json-files-in-databricks-pyspark/m-p/34645#M25378</link>
    <description>&lt;P&gt;spark.read.json("/mnt/dbfs/&amp;lt;ENTER PATH OF JSON DIR HERE&amp;gt;/*.json")&lt;/P&gt;&lt;P&gt;You first have to mount your blob storage to Databricks; I assume that is already done.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/latest/sql-data-sources-json.html" alt="https://spark.apache.org/docs/latest/sql-data-sources-json.html" target="_blank"&gt;https://spark.apache.org/docs/latest/sql-data-sources-json.html&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 22 Nov 2021 09:54:07 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2021-11-22T09:54:07Z</dc:date>
    <item>
      <title>Parallel processing of json files in databricks pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/parallel-processing-of-json-files-in-databricks-pyspark/m-p/34642#M25375</link>
      <description>&lt;P&gt;How can we read files from Azure Blob Storage and process them in parallel in Databricks using PySpark?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;At the moment we read all 10 files into a DataFrame at once and then flatten it.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Sujata&lt;/P&gt;</description>
      <pubDate>Mon, 22 Nov 2021 07:34:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parallel-processing-of-json-files-in-databricks-pyspark/m-p/34642#M25375</guid>
      <dc:creator>AzureDatabricks</dc:creator>
      <dc:date>2021-11-22T07:34:20Z</dc:date>
    </item>
    <item>
      <title>Re: Parallel processing of json files in databricks pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/parallel-processing-of-json-files-in-databricks-pyspark/m-p/34643#M25376</link>
      <description>&lt;P&gt;If you use the Spark JSON reader, the files are read in parallel automatically.&lt;/P&gt;&lt;P&gt;Depending on the cluster size, you will be able to read more files in parallel.&lt;/P&gt;&lt;P&gt;Keep in mind that JSON files are usually small. Spark does not handle a large number of small files well, so performance may suffer.&lt;/P&gt;&lt;P&gt;Depending on the use case, it can be a good idea to do an initial conversion to Parquet/Delta Lake (which will take some time because of the many small files) and then keep appending new files to that table.&lt;/P&gt;&lt;P&gt;Your data jobs can then read the Parquet/Delta Lake table, which will be a lot faster.&lt;/P&gt;</description>
      <pubDate>Mon, 22 Nov 2021 07:49:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parallel-processing-of-json-files-in-databricks-pyspark/m-p/34643#M25376</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-11-22T07:49:18Z</dc:date>
    </item>
    <item>
      <title>Re: Parallel processing of json files in databricks pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/parallel-processing-of-json-files-in-databricks-pyspark/m-p/34644#M25377</link>
      <description>&lt;P&gt;Can you provide a sample for reading JSON files in parallel from blob storage? We are reading all the files from the directory one by one, and it takes a long time to load them into a DataFrame.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Mon, 22 Nov 2021 09:51:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parallel-processing-of-json-files-in-databricks-pyspark/m-p/34644#M25377</guid>
      <dc:creator>AzureDatabricks</dc:creator>
      <dc:date>2021-11-22T09:51:27Z</dc:date>
    </item>
    <item>
      <title>Re: Parallel processing of json files in databricks pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/parallel-processing-of-json-files-in-databricks-pyspark/m-p/34645#M25378</link>
      <description>&lt;P&gt;spark.read.json("/mnt/dbfs/&amp;lt;ENTER PATH OF JSON DIR HERE&amp;gt;/*.json")&lt;/P&gt;&lt;P&gt;You first have to mount your blob storage to Databricks; I assume that is already done.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/latest/sql-data-sources-json.html" alt="https://spark.apache.org/docs/latest/sql-data-sources-json.html" target="_blank"&gt;https://spark.apache.org/docs/latest/sql-data-sources-json.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Nov 2021 09:54:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parallel-processing-of-json-files-in-databricks-pyspark/m-p/34645#M25378</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-11-22T09:54:07Z</dc:date>
    </item>
    <item>
      <title>Re: Parallel processing of json files in databricks pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/parallel-processing-of-json-files-in-databricks-pyspark/m-p/34646#M25379</link>
      <description>&lt;P&gt;Thank you. We are already using a mount.&lt;/P&gt;</description>
      <pubDate>Mon, 22 Nov 2021 10:59:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parallel-processing-of-json-files-in-databricks-pyspark/m-p/34646#M25379</guid>
      <dc:creator>SailajaB</dc:creator>
      <dc:date>2021-11-22T10:59:03Z</dc:date>
    </item>
    <item>
      <title>Re: Parallel processing of json files in databricks pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/parallel-processing-of-json-files-in-databricks-pyspark/m-p/34647#M25380</link>
      <description>&lt;P&gt;Hi @Sailaja B,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Check the number of stages and tasks when you read the JSON files. How many do you see? Are your JSON files nested? How long does it take to read a single JSON file?&lt;/P&gt;</description>
      <pubDate>Mon, 22 Nov 2021 19:57:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parallel-processing-of-json-files-in-databricks-pyspark/m-p/34647#M25380</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2021-11-22T19:57:46Z</dc:date>
    </item>
  </channel>
</rss>

