<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: JSON validation is getting failed after writing Pyspark dataframe to json format in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28940#M20705</link>
    <description>&lt;P&gt;But if we use the output in another Azure resource, it will fail, right?&lt;/P&gt;</description>
    <pubDate>Thu, 10 Feb 2022 13:24:02 GMT</pubDate>
    <dc:creator>SailajaB</dc:creator>
    <dc:date>2022-02-10T13:24:02Z</dc:date>
    <item>
      <title>JSON validation is getting failed after writing Pyspark dataframe to json format</title>
      <link>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28936#M20701</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We have to convert a transformed dataframe to JSON format, so we used write with the json format on the final dataframe. But when we validate the output JSON, it is not in proper JSON format.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Could you please suggest how we can achieve this in Databricks PySpark?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Thu, 10 Feb 2022 06:39:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28936#M20701</guid>
      <dc:creator>SailajaB</dc:creator>
      <dc:date>2022-02-10T06:39:24Z</dc:date>
    </item>
    <item>
      <title>Re: JSON validation is getting failed after writing Pyspark dataframe to json format</title>
      <link>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28937#M20702</link>
      <description>&lt;P&gt;Could you please share:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;a sample of the dataframe and the improper JSON received&lt;/LI&gt;&lt;LI&gt;the code you use to convert the data to JSON format&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Thu, 10 Feb 2022 08:33:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28937#M20702</guid>
      <dc:creator>AmanSehgal</dc:creator>
      <dc:date>2022-02-10T08:33:26Z</dc:date>
    </item>
    <item>
      <title>Re: JSON validation is getting failed after writing Pyspark dataframe to json format</title>
      <link>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28938#M20703</link>
      <description>&lt;P&gt;Hi Melbourne,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you for the reply.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We are using the code below to convert to JSON:&lt;/P&gt;&lt;P&gt;df.coalesce(1).write.format("json").save(dataLocation)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We are receiving the output as below:&lt;/P&gt;&lt;P&gt;{"col1":"A","col2":"B"}&lt;/P&gt;&lt;P&gt;{"col1":"C","col2":"D"}&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We are expecting JSON in the format below:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;[{"col1":"A","col2":"B"},&lt;/P&gt;&lt;P&gt;{"col1":"C","col2":"D"}]&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Thu, 10 Feb 2022 09:37:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28938#M20703</guid>
      <dc:creator>SailajaB</dc:creator>
      <dc:date>2022-02-10T09:37:32Z</dc:date>
    </item>
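For anyone comparing the two shapes above side by side, here is a minimal pure-Python sketch (standard `json` module only; the inline two-record input is hypothetical, mirroring the thread's sample) that turns the JSON Lines output Spark produces into the bracketed array the poster expects:

```python
import json

# JSON Lines, as Spark's json writer emits it: one object per line
jsonl_text = '{"col1":"A","col2":"B"}\n{"col1":"C","col2":"D"}\n'

# Parse each non-empty line, collect into a list, then dump as one JSON array
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
as_json_array = json.dumps(records)
```

This is an in-memory sketch for small data; the later replies in the thread discuss approaches that avoid loading everything at once.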
    <item>
      <title>Re: JSON validation is getting failed after writing Pyspark dataframe to json format</title>
      <link>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28939#M20704</link>
      <description>&lt;P&gt;What you're seeing in the file is JSON Lines. The difference between that and JSON is the absence of enclosing square brackets and of commas between records.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You shouldn't face any problem reading the JSON data back using Spark.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Feb 2022 12:49:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28939#M20704</guid>
      <dc:creator>AmanSehgal</dc:creator>
      <dc:date>2022-02-10T12:49:57Z</dc:date>
    </item>
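To see concretely why a strict JSON validator flags such a file while line-by-line readers (like Spark's JSON source) accept it, a small sketch using only Python's standard `json` module; the two-record input is made up for illustration:

```python
import json

jsonl = '{"a": 1}\n{"a": 2}'

# A strict JSON parser rejects the file as a whole ("Extra data" error)...
try:
    json.loads(jsonl)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# ...but every individual line is a valid JSON document on its own.
rows = [json.loads(line) for line in jsonl.splitlines()]
```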
    <item>
      <title>Re: JSON validation is getting failed after writing Pyspark dataframe to json format</title>
      <link>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28940#M20705</link>
      <description>&lt;P&gt;But if we use the output in another Azure resource, it will fail, right?&lt;/P&gt;</description>
      <pubDate>Thu, 10 Feb 2022 13:24:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28940#M20705</guid>
      <dc:creator>SailajaB</dc:creator>
      <dc:date>2022-02-10T13:24:02Z</dc:date>
    </item>
    <item>
      <title>Re: JSON validation is getting failed after writing Pyspark dataframe to json format</title>
      <link>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28941#M20706</link>
      <description>&lt;P&gt;Could you provide an example where this could be an issue? There are libraries available that read JSON Lines files. You can use them, or you can add transformation logic to process the JSON files before the resource consumes them.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Feb 2022 14:06:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28941#M20706</guid>
      <dc:creator>AmanSehgal</dc:creator>
      <dc:date>2022-02-10T14:06:30Z</dc:date>
    </item>
    <item>
      <title>Re: JSON validation is getting failed after writing Pyspark dataframe to json format</title>
      <link>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28942#M20707</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Is there any way to produce proper JSON in Databricks itself?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Thu, 10 Feb 2022 14:37:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28942#M20707</guid>
      <dc:creator>SailajaB</dc:creator>
      <dc:date>2022-02-10T14:37:19Z</dc:date>
    </item>
    <item>
      <title>Re: JSON validation is getting failed after writing Pyspark dataframe to json format</title>
      <link>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28943#M20708</link>
      <description>&lt;P&gt;Convert your dataframe to pandas and write it to your storage using `.to_json(&amp;lt;path&amp;gt;, orient='records')`.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To get the desired output, set orient to records.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here is the AWS S3 equivalent code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import io

import boto3
from pyspark.sql.functions import lit

# Create a session using boto3
session = boto3.Session(
    aws_access_key_id='&amp;lt;key ID&amp;gt;',
    aws_secret_access_key='&amp;lt;secret_key&amp;gt;'
)

# Create an S3 resource from the session
s3 = session.resource('s3')

json_buffer = io.StringIO()

# Create a dataframe and convert it to pandas
df = spark.range(4).withColumn("organisation", lit("Databricks"))
df_p = df.toPandas()
df_p.to_json(json_buffer, orient='records')

# Create the S3 object (named s3_object to avoid shadowing the built-in `object`)
s3_object = s3.Object('&amp;lt;bucket-name&amp;gt;', '&amp;lt;JSON file name&amp;gt;')

# Put the JSON payload into the bucket
result = s3_object.put(Body=json_buffer.getvalue())&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Hope this helps.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Feb 2022 15:20:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28943#M20708</guid>
      <dc:creator>AmanSehgal</dc:creator>
      <dc:date>2022-02-10T15:20:05Z</dc:date>
    </item>
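The `.to_json(..., orient='records')` behaviour described above can be sketched without the boto3/S3 parts; this assumes only pandas, with an in-memory buffer standing in for the bucket and a hypothetical two-row frame mirroring the thread's sample data:

```python
import io
import json

import pandas as pd

# Hypothetical two-row frame, matching the sample records in the thread
df = pd.DataFrame({"col1": ["A", "C"], "col2": ["B", "D"]})

# orient='records' serialises the frame as one JSON array of row objects,
# i.e. the bracketed form the original poster was expecting
buf = io.StringIO()
df.to_json(buf, orient="records")

parsed = json.loads(buf.getvalue())
```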
    <item>
      <title>Re: JSON validation is getting failed after writing Pyspark dataframe to json format</title>
      <link>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28944#M20709</link>
      <description>&lt;P&gt;Thank you for the reply.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We tried converting the PySpark df to a pandas df to achieve the expected JSON format, but we stopped because of the issues below.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Our PySpark dataframe is very large, around 400 million+ rows, so the output should be split across multiple files. Since a PySpark df is distributed, we don't need to worry about the multiple-file logic, whereas a pandas df runs on a single node and would generate one huge output file.&lt;/LI&gt;&lt;LI&gt;Converting the PySpark df to a pandas df fails because our dataframe contains deeply nested attributes.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Feb 2022 05:06:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28944#M20709</guid>
      <dc:creator>SailajaB</dc:creator>
      <dc:date>2022-02-11T05:06:24Z</dc:date>
    </item>
    <item>
      <title>Re: JSON validation is getting failed after writing Pyspark dataframe to json format</title>
      <link>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28945#M20710</link>
      <description>&lt;P&gt;But by using coalesce(1) on your PySpark df, you're doing the same thing: it'll be processed on one node.&lt;/P&gt;</description>
      <pubDate>Fri, 11 Feb 2022 05:08:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28945#M20710</guid>
      <dc:creator>AmanSehgal</dc:creator>
      <dc:date>2022-02-11T05:08:28Z</dc:date>
    </item>
    <item>
      <title>Re: JSON validation is getting failed after writing Pyspark dataframe to json format</title>
      <link>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28946#M20711</link>
      <description>&lt;P&gt;Yes, sorry, my bad. We removed that part.&lt;/P&gt;</description>
      <pubDate>Fri, 11 Feb 2022 05:09:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28946#M20711</guid>
      <dc:creator>SailajaB</dc:creator>
      <dc:date>2022-02-11T05:09:36Z</dc:date>
    </item>
    <item>
      <title>Re: JSON validation is getting failed after writing Pyspark dataframe to json format</title>
      <link>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28947#M20712</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;Converting 400M+ rows into JSON is, in my opinion, not a good solution, as it'll take a lot of space for no reason.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Anyway, so you have JSON Lines in the file but you want a single JSON array in it. There's a simpler way to do this.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Let Spark write your 400M+ records into 'x' number of JSON files.&lt;/P&gt;&lt;P&gt;Since Databricks cells support shell commands, you can run the following script to convert JSONL files to JSON. Run it recursively or however you like.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Let's say your blob store location is mounted on DBFS in the mnt directory.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%sh
# Join all lines with commas, then wrap the result in square brackets
cat /dbfs/mnt/&amp;lt;path to JSONlines input file&amp;gt; | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/,/g' | sed 's/^/[/' | sed 's/$/]/' &amp;gt; /dbfs/mnt/&amp;lt;path to JSON output file&amp;gt;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The above command should convert each file in seconds.&lt;/P&gt;&lt;P&gt;Do share how it goes with this approach.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Credit: Medium &lt;A href="https://slotix.medium.com/this-command-converts-jsonl-to-json-it-takes-about-a-second-to-convert-30-mb-file-733a72877187" alt="https://slotix.medium.com/this-command-converts-jsonl-to-json-it-takes-about-a-second-to-convert-30-mb-file-733a72877187" target="_blank"&gt;post&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Feb 2022 06:25:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28947#M20712</guid>
      <dc:creator>AmanSehgal</dc:creator>
      <dc:date>2022-02-11T06:25:23Z</dc:date>
    </item>
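As an alternative to the sed pipeline above, the same per-file JSONL-to-JSON conversion can be sketched as a streaming Python function that never holds the whole file in memory; the throwaway temp files in the demo are stand-ins for the mounted DBFS paths:

```python
import json
import os
import tempfile

def jsonl_to_json_array(src_path, dst_path):
    # Stream the JSON Lines file record by record so memory use stays flat,
    # writing square brackets around the records and commas between them.
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.write("[")
        first = True
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if not first:
                dst.write(",")
            dst.write(line)
            first = False
        dst.write("]")

# Demo on a throwaway pair of files (stand-ins for the DBFS mount paths)
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "part-0000.json")
dst = os.path.join(tmp, "output.json")
with open(src, "w") as f:
    f.write('{"col1":"A","col2":"B"}\n{"col1":"C","col2":"D"}\n')
jsonl_to_json_array(src, dst)
with open(dst) as f:
    converted = json.load(f)
```

Like the shell one-liner, this would be run once per output part file Spark produced.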
    <item>
      <title>Re: JSON validation is getting failed after writing Pyspark dataframe to json format</title>
      <link>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28948#M20713</link>
      <description>&lt;P&gt;@Sailaja B​&amp;nbsp;- Does @Aman Sehgal​'s most recent answer help solve the problem? If it does, would you be happy to mark their answer as best?&lt;/P&gt;</description>
      <pubDate>Wed, 02 Mar 2022 17:01:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/json-validation-is-getting-failed-after-writing-pyspark/m-p/28948#M20713</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-03-02T17:01:40Z</dc:date>
    </item>
  </channel>
</rss>

