<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Autoloader error &amp;quot;Failed to infer schema for format json from existing files in input&amp;quot; in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/autoloader-error-quot-failed-to-infer-schema-for-format-json/m-p/82352#M36620</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/109380"&gt;@hpant&lt;/a&gt;&amp;nbsp;would you consider testing the new &lt;A href="https://docs.databricks.com/en/semi-structured/variant.html" target="_self"&gt;VARIANT&lt;/A&gt; type for your JSON data? I appreciate it will require rewriting the next step in your pipeline, but it should be more robust to errors.&lt;/P&gt;
&lt;P&gt;Disclaimer: I haven't personally tested variant with Autoloader. It should speed up some reads, but I don't think it yet creates file statistics.&lt;/P&gt;
&lt;P&gt;Alternatively, you can partially define a schema and put the remainder of the columns in a 'rescued' column, but it is a hassle splitting it out afterwards.&lt;/P&gt;</description>
    <pubDate>Thu, 08 Aug 2024 09:29:59 GMT</pubDate>
    <dc:creator>holly</dc:creator>
    <dc:date>2024-08-08T09:29:59Z</dc:date>
    <item>
      <title>Autoloader error "Failed to infer schema for format json from existing files in input"</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-error-quot-failed-to-infer-schema-for-format-json/m-p/75569#M34988</link>
      <description>&lt;P&gt;I have two JSON files in a location in Azure Data Lake Storage Gen2, e.g.&amp;nbsp;&lt;SPAN&gt;'/mnt/abc/Testing/'. When I try to read the files using Auto Loader, I get this error:&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;"Failed to infer schema for format json from existing files in input path &lt;SPAN&gt;/mnt/abc/Testing/&lt;/SPAN&gt;. Please ensure you configured the options properly or explicitly specify the schema."&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 24 Jun 2024 10:30:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-error-quot-failed-to-infer-schema-for-format-json/m-p/75569#M34988</guid>
      <dc:creator>hpant</dc:creator>
      <dc:date>2024-06-24T10:30:44Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader error "Failed to infer schema for format json from existing files in input"</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-error-quot-failed-to-infer-schema-for-format-json/m-p/75706#M35035</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Thanks for your response. I have tried both without a schema and with a schema, and neither works.&lt;/P&gt;&lt;P&gt;1. Without Schema&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ss1.PNG" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8922i67CB84366C662E47/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="ss1.PNG" alt="ss1.PNG" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;2. With Schema: It takes forever to read the data, even with only two rows. (It ran for 1 hour before I interrupted it.)&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ss2.PNG" style="width: 986px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8923i23FE6B26D024FE32/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999" role="button" title="ss2.PNG" alt="ss2.PNG" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;For your reference, I have attached a sample JSON file:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ss3.PNG" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8924iE0CA0FC86F8CC93C/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999" role="button" title="ss3.PNG" alt="ss3.PNG" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jun 2024 13:28:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-error-quot-failed-to-infer-schema-for-format-json/m-p/75706#M35035</guid>
      <dc:creator>hpant</dc:creator>
      <dc:date>2024-06-25T13:28:28Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader error "Failed to infer schema for format json from existing files in input"</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-error-quot-failed-to-infer-schema-for-format-json/m-p/76299#M35189</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Hey&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9" target="_blank"&gt;@Kaniz_Fatma&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Any comment on the above message?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thanks&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 01 Jul 2024 12:34:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-error-quot-failed-to-infer-schema-for-format-json/m-p/76299#M35189</guid>
      <dc:creator>hpant</dc:creator>
      <dc:date>2024-07-01T12:34:05Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader error "Failed to infer schema for format json from existing files in input"</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-error-quot-failed-to-infer-schema-for-format-json/m-p/76554#M35263</link>
      <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;Given the issues you're facing with schema inference and long processing times, it might help to preprocess the JSON files so that they are properly formatted and manageable. Here's a step-by-step approach to split the JSON files into smaller chunks and then use Databricks Auto Loader to read them.&lt;/P&gt;&lt;H3&gt;Step 1: Split the JSON Files&lt;/H3&gt;&lt;P&gt;You can use a Python script to split the JSON files into smaller chunks. Here's an example of how you can do this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import json
import os

def split_json_file(input_path, output_dir, chunk_size=1):
    # Assumes the file holds a single top-level JSON array of records
    with open(input_path, 'r') as file:
        data = json.load(file)

    # Create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)

    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        output_path = f"{output_dir}/chunk_{i//chunk_size}.json"
        with open(output_path, 'w') as chunk_file:
            # indent=4 spreads each chunk over several lines, so it must be read with multiline=true
            json.dump(chunk, chunk_file, indent=4)

# Example usage
input_path = "/dbfs/mnt/abc/Testing/input.json"
output_dir = "/dbfs/mnt/abc/Testing/split"
split_json_file(input_path, output_dir, chunk_size=1)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3&gt;Step 2: Use Databricks Auto Loader to Read the Split JSON Files&lt;/H3&gt;&lt;P&gt;Now, you can use Databricks Auto Loader to read the split JSON files into a DataFrame.&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Ensure the directory structure is correct&lt;/STRONG&gt;: The split JSON files should be located in a directory that Auto Loader can access.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Update the path in your Auto Loader code&lt;/STRONG&gt;: Point to the directory where the split JSON files are stored.&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;H3&gt;Example Code to Read Split JSON Files&lt;/H3&gt;&lt;P&gt;Here's the updated code to read the split JSON files using Auto Loader:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType
from pyspark.sql import SparkSession

# Define the schema
schema = StructType([
    StructField("device_id", StringType(), True),
    StructField("documentTime", TimestampType(), True),
    StructField("radonShortTermAvg", DoubleType(), True),
    StructField("temp", DoubleType(), True),
    StructField("humidity", DoubleType(), True),
    StructField("co2", DoubleType(), True),
    StructField("voc", DoubleType(), True),
    StructField("pressure", DoubleType(), True),
    StructField("light", DoubleType(), True)
])

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Auto Loader Example") \
    .getOrCreate()

# Define paths
path = '/mnt/abc/Testing/split/'  # Update to the directory with split JSON files
checkpoint_path = '/mnt/deskoccupancy-historical/Testing/Autoloader/'
table_name = 'bronze.occupancy_table_bronze'

# Read stream using Auto Loader
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", checkpoint_path) \
    .option("cloudFiles.includeExistingFiles", "true") \
    .option("encoding", "utf-8") \
    .option("multiline", "true") \
    .schema(schema) \
    .load(path)

# Write stream to a Delta table
# toTable() targets a named table; start() would treat the name as a file path
df.writeStream \
    .format("delta") \
    .option("mergeSchema", "true") \
    .option("checkpointLocation", checkpoint_path) \
    .toTable(table_name) \
    .awaitTermination()&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;M.T&lt;/P&gt;</description>
      <pubDate>Tue, 02 Jul 2024 21:39:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-error-quot-failed-to-infer-schema-for-format-json/m-p/76554#M35263</guid>
      <dc:creator>mtajmouati</dc:creator>
      <dc:date>2024-07-02T21:39:57Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader error "Failed to infer schema for format json from existing files in input"</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-error-quot-failed-to-infer-schema-for-format-json/m-p/76768#M35312</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/109839"&gt;@mtajmouati&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Thanks for your response. Your solution might work, but the problem is that we are receiving data every hour, and Autoloader can handle new data with its checkpoint system. However, if we keep the function to split the JSON file, we need to add a mechanism that only splits files which have not been read before, which would defeat the purpose of the Autoloader checkpoint system. Is there any way to achieve this using only Autoloader?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Himanshu&lt;/P&gt;</description>
      <pubDate>Thu, 04 Jul 2024 13:22:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-error-quot-failed-to-infer-schema-for-format-json/m-p/76768#M35312</guid>
      <dc:creator>hpant</dc:creator>
      <dc:date>2024-07-04T13:22:43Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader error "Failed to infer schema for format json from existing files in input"</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-error-quot-failed-to-infer-schema-for-format-json/m-p/82352#M36620</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/109380"&gt;@hpant&lt;/a&gt;&amp;nbsp;would you consider testing the new &lt;A href="https://docs.databricks.com/en/semi-structured/variant.html" target="_self"&gt;VARIANT&lt;/A&gt; type for your JSON data? I appreciate it will require rewriting the next step in your pipeline, but it should be more robust to errors.&lt;/P&gt;
&lt;P&gt;Disclaimer: I haven't personally tested variant with Autoloader. It should speed up some reads, but I don't think it yet creates file statistics.&lt;/P&gt;
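&lt;P&gt;A minimal sketch of what the VARIANT route could look like with Auto Loader. It is untested here and assumes a Databricks Runtime that supports VARIANT. The function name, the column name "raw" and the paths in the usage comment are placeholders, and the singleVariantColumn reader option and the availableNow trigger should be checked against the documentation for your runtime version:&lt;/P&gt;

```python
def start_variant_ingest(spark, source_path, checkpoint_path, table_name):
    """Start an Auto Loader stream that lands each JSON record in one VARIANT column.

    `spark` is the active SparkSession (already defined in a Databricks notebook).
    """
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Assumption: this option requires a runtime with VARIANT support.
        # "raw" is a placeholder name for the single output column.
        .option("singleVariantColumn", "raw")
        .load(source_path)
    )
    return (
        df.writeStream
        .option("checkpointLocation", checkpoint_path)
        # Process the files that are currently available, then stop,
        # instead of running until interrupted.
        .trigger(availableNow=True)
        .toTable(table_name)
    )


# Hypothetical usage in a notebook, reusing paths and names from this thread:
# query = start_variant_ingest(
#     spark,
#     "/mnt/abc/Testing/",
#     "/mnt/deskoccupancy-historical/Testing/Autoloader/",
#     "bronze.occupancy_table_bronze",
# )
# query.awaitTermination()
```

&lt;P&gt;Because every record lands in a single column, the reader should not need to infer per-field types from the files; individual fields can then be extracted in the next pipeline step. The checkpoint still tracks which files have been processed, so hourly arrivals are only picked up once.&lt;/P&gt;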
&lt;P&gt;Alternatively, you can partially define a schema and put the remainder of the columns in a 'rescued' column, but it is a hassle splitting it out afterwards.&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2024 09:29:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-error-quot-failed-to-infer-schema-for-format-json/m-p/82352#M36620</guid>
      <dc:creator>holly</dc:creator>
      <dc:date>2024-08-08T09:29:59Z</dc:date>
    </item>
  </channel>
</rss>

