<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Read JSON files from the s3 bucket in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13489#M8162</link>
    <description>&lt;P&gt;Other ideas:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;validate the location and file existence, for example via "Data" in the left-hand menu in Databricks,&lt;/LI&gt;&lt;LI&gt;validate the S3 access rights (have your AWS admin check the policy attached to the user/role; maybe something is missing),&lt;/LI&gt;&lt;LI&gt;try reading it as a text file to check whether the content loads at all:&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;&lt;CODE&gt;spark.read.text(path_to_json)&lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Thu, 14 Oct 2021 10:55:02 GMT</pubDate>
    <dc:creator>Hubert-Dudek</dc:creator>
    <dc:date>2021-10-14T10:55:02Z</dc:date>
    <item>
      <title>Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13483#M8156</link>
      <description>&lt;P&gt;Hello guys, I'm trying to read JSON files from an S3 bucket, but no matter what I try I get "Query returned no result", or, if I don't specify a schema, "unable to infer a schema".&lt;/P&gt;&lt;P&gt;I also tried mounting the S3 bucket; that still doesn't work.&lt;/P&gt;&lt;P&gt;Here is some code that I tried:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df = spark.read.json('dbfs:/mnt/path_to_json', multiLine=True, schema=json_schema)
&amp;nbsp;
df = spark.read.option('multiline','true').format('json').load(path_to_json)
&amp;nbsp;
df = spark.read.json('s3a://path_to_json', multiLine=True)
&amp;nbsp;
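# Hedged sketch (an addition, not from the original post): a bare backslash is
# not a legal JSON escape, which is why such a record gets flagged as corrupt.
# Doubling the backslashes before parsing yields valid JSON again:
import json
bad = r'{"key1": "a\windows\path"}'                 # \w is not a valid JSON escape
try:
    json.loads(bad)                                 # raises json.JSONDecodeError
    parsed = None
except json.JSONDecodeError:
    parsed = json.loads(bad.replace("\\", "\\\\"))  # escape backslashes, then parse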
display(df)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The JSON file looks like this:&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;'key1' : 'value1',&lt;/P&gt;&lt;P&gt;'key2' : 'value2',&lt;/P&gt;&lt;P&gt;...&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Hope you guys can help me,&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;B&gt;**EDIT**: &lt;/B&gt;inside the JSON I have a string value that contains " \ ", which throws a corrupted-record error. Is there any way to overcome this without changing the value for that specific key?&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 08:59:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13483#M8156</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2021-10-14T08:59:31Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13484#M8157</link>
      <description>&lt;P&gt;Please try the code below and let me know if it helps.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%scala
val mdf = spark.read.option("multiline", "true").json("s3://&amp;lt;path-to-jsonfile&amp;gt;/sample.json")
mdf.show(false)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 10:26:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13484#M8157</guid>
      <dc:creator>Prabakar</dc:creator>
      <dc:date>2021-10-14T10:26:25Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13485#M8158</link>
      <description>&lt;P&gt;Thanks for your answer. I still get the unable-to-infer-a-schema error.&lt;/P&gt;&lt;P&gt;Error:&lt;/P&gt;&lt;P&gt;org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It must be specified manually.&lt;/P&gt;&lt;P&gt;Tried s3:// and s3a:// -- neither worked.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 10:31:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13485#M8158</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2021-10-14T10:31:10Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13486#M8159</link>
      <description>&lt;P&gt;Please verify the JSON in an online JSON validator. Try double quotes in the JSON -- I once had an issue with single quotes.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Your code examples are correct.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 10:34:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13486#M8159</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-10-14T10:34:40Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13487#M8160</link>
      <description>&lt;P&gt;Please refer to the &lt;A href="https://docs.databricks.com/data/data-sources/read-json.html#json-file" alt="https://docs.databricks.com/data/data-sources/read-json.html#json-file" target="_blank"&gt;doc&lt;/A&gt; on reading JSON files.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you are getting this error, the problem is likely with the JSON schema. Please validate it.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;As a test, create a simple JSON file (you can get one on the internet), upload it to your S3 bucket, and try to read that. If it works, then your JSON file's schema has to be checked.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Further, the methods that you tried should also work if the JSON format is valid.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 10:42:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13487#M8160</guid>
      <dc:creator>Prabakar</dc:creator>
      <dc:date>2021-10-14T10:42:37Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13488#M8161</link>
      <description>&lt;P&gt;The JSON is valid. When I wrote a JSON file to DBFS and then read it back, everything went fine:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;dbutils.fs.put("/tmp/test.json", """
{"string":"string1",
"int":1,
"array":[1,2,3],
"dict": {"key": "value1"}}
""", True)
&amp;nbsp;
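# Hedged sketch (an addition, not from the original post): the same validity
# check can be done with Python's json module; note that single-quoted JSON,
# as in the original question's example, is rejected outright:
import json
valid = json.loads('{"string": "string1", "int": 1, "array": [1, 2, 3]}')
try:
    json.loads("{'key1': 'value1'}")            # single quotes are invalid JSON
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False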
df = spark.read.json('/tmp/test.json')&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;but when I tried to read from the S3 bucket, or from the mount, it failed.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 10:44:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13488#M8161</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2021-10-14T10:44:13Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13489#M8162</link>
      <description>&lt;P&gt;Other ideas:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;validate the location and file existence, for example via "Data" in the left-hand menu in Databricks,&lt;/LI&gt;&lt;LI&gt;validate the S3 access rights (have your AWS admin check the policy attached to the user/role; maybe something is missing),&lt;/LI&gt;&lt;LI&gt;try reading it as a text file to check whether the content loads at all:&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;&lt;CODE&gt;spark.read.text(path_to_json)&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 14 Oct 2021 10:55:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13489#M8162</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-10-14T10:55:02Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13490#M8163</link>
      <description>&lt;P&gt;I wrote the real JSON into /tmp/test.json and tried to read it.&lt;/P&gt;&lt;P&gt;When I didn't define the schema I got this error:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the&lt;/P&gt;&lt;P&gt;referenced columns only include the internal corrupt record column&lt;/P&gt;&lt;P&gt;(named _corrupt_record by default). For example:&lt;/P&gt;&lt;P&gt;spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()&lt;/P&gt;&lt;P&gt;and spark.read.schema(schema).json(file).select("_corrupt_record").show().&lt;/P&gt;&lt;P&gt;Instead, you can cache or save the parsed results and then send the same query.&lt;/P&gt;&lt;P&gt;For example, val df = spark.read.schema(schema).json(file).cache() and then&lt;/P&gt;&lt;P&gt;df.filter($"_corrupt_record".isNotNull).count().;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;But when I defined the schema, I got a DataFrame with all columns null.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I do have access to the S3 bucket, since I have already read text files from there, and the JSON files do have data in them (about 800 KB).&lt;/P&gt;&lt;P&gt;Thanks a lot for your help&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 11:24:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13490#M8163</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2021-10-14T11:24:00Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13491#M8164</link>
      <description>&lt;P&gt;I think I found the problem: inside the JSON I have a string value that contains '\', and it throws a corrupted-record error. Any idea how to overcome this without changing all the JSON files?&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 11:51:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13491#M8164</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2021-10-14T11:51:59Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13492#M8165</link>
      <description>&lt;P&gt;Try experimenting with these options:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df = spark.read\
.option("mode", "PERMISSIVE")\
.option("columnNameOfCorruptRecord", "_corrupt_record")\
.json(...)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 11:59:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13492#M8165</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-10-14T11:59:55Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13493#M8166</link>
      <description>&lt;P&gt;Still not working -- same corrupted-record error. I uploaded the same JSON to the S3 bucket, just without the problematic value, and everything went well.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 12:36:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13493#M8166</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2021-10-14T12:36:50Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13494#M8167</link>
      <description>&lt;P&gt;So the last resort is just to replace '\' in the files, as you do. You can do that programmatically before reading the JSON.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 13:20:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13494#M8167</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-10-14T13:20:53Z</dc:date>
    </item>
  </channel>
</rss>

