Thanks to everyone who joined the Data Ingestion Part 2 webinar on semi-structured data. You can access the on-demand recording here.
We received a number of great questions throughout the session, so we're sharing a subset of the Q&A in this Databricks Community post. Please feel free to ask follow-up questions or add comments as threads.
TOPIC: Data Ingestion with Auto Loader
Q: Is Auto Loader only for JSON files?
No. Auto Loader supports many file formats, including JSON, CSV, Parquet, Avro, text, binary files (BINARYFILE), and ORC. See all supported file formats in the docs [AWS] [Azure] [GCP].
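As a quick illustration, here is a minimal PySpark sketch of Auto Loader reading CSV files; swap the `cloudFiles.format` value for any supported format. It assumes a Databricks notebook where `spark` is available, and the bucket, schema/checkpoint locations, and table name are placeholders.

```python
# Minimal Auto Loader stream; paths and table names below are placeholders.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")          # also: json, parquet, avro, text, binaryFile, orc
    .option("cloudFiles.schemaLocation", "/tmp/schemas/raw_csv")
    .option("header", "true")
    .load("s3://my-bucket/raw/csv/")
)

(
    df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/raw_csv")
    .trigger(once=True)                          # run as an incremental batch; drop for continuous
    .toTable("bronze.raw_csv")
)
```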
Q: Can Auto Loader load Excel files?
Auto Loader doesn't currently load Excel files directly. We are adding other ingestion features soon that will support uploading Excel files. Contact us if you would like to know more.
Q: Is Auto Loader free of charge with Databricks?
Yes. There is no extra cost beyond standard Databricks usage.
Q: Does Auto Loader require a specific Databricks Runtime (DBR) version?
Yes, Auto Loader requires DBR 8.3 or above.
Q: Are there advantages to explicitly defining the schema when using Auto Loader?
Yes. Explicitly defining the schema is useful when you want full control over how the data is ingested. You can also define only specific columns (even nested columns) and let the rest be inferred; a sketch of both follows.
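In this sketch (the column names and paths are made up), the first reader pins the full schema explicitly, while the second pins only two columns via `cloudFiles.schemaHints` and lets Auto Loader infer the rest:

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

# Option A: fully explicit schema; nothing is inferred.
explicit_schema = StructType([
    StructField("device_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("reading", DoubleType()),
])

df_explicit = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(explicit_schema)
    .load("s3://my-bucket/raw/json/")
)

# Option B: pin only specific (even nested) columns and let the rest be inferred.
df_hinted = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/json_hints")
    .option("cloudFiles.schemaHints", "reading DOUBLE, payload.sensor_id STRING")
    .load("s3://my-bucket/raw/json/")
)
```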
Q: Can Auto Loader be used for calculating columns for a real time meter readings scenario?
Yes. Auto Loader reads your data and returns a streaming DataFrame, so you can apply any Spark feature, such as windowed aggregations, to compute calculated columns.
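For example, a hedged sketch of a 5-minute average per meter; the column names `meter_id`, `reading`, and `event_time` are assumptions about your data:

```python
from pyspark.sql import functions as F

readings = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/meter_readings")
    .load("s3://my-bucket/raw/meter-readings/")
)

# Tumbling 5-minute window average per meter, with a watermark for late events.
agg = (
    readings
    .withColumn("event_time", F.col("event_time").cast("timestamp"))
    .withColumn("reading", F.col("reading").cast("double"))
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "meter_id")
    .agg(F.avg("reading").alias("avg_reading"))
)

(
    agg.writeStream
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/meter_agg")
    .toTable("silver.meter_readings_5min")
)
```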
Q: Is there a best practice for handling a full refresh of data to understand the deleted records?
Please refer to the Delta change data feed documentation for best practices on handling changed and deleted records.
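As a starting point, a small sketch of enabling and reading the change data feed on a Delta table; the table name and starting version are illustrative:

```python
# Enable the change data feed on an existing Delta table.
spark.sql("""
  ALTER TABLE silver.customers
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read row-level changes recorded since a given table version.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 2)
    .table("silver.customers")
)

# _change_type distinguishes insert / update_preimage / update_postimage / delete,
# which is how deleted records show up after a full refresh.
changes.filter("_change_type = 'delete'").show()
```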
Q: What kind of eventing does Auto Loader support in Azure?
Auto Loader is designed to read files from cloud storage; on Azure, its file-notification mode discovers new files through storage events rather than directory listing. For message queues like Kafka, there are separate Structured Streaming connectors. Please refer to the Azure Databricks documentation.
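For reference, turning on file-notification mode looks roughly like this; the ADLS paths are placeholders, and the service principal and permissions needed for the event setup are omitted (see the docs):

```python
# File-notification mode: Auto Loader discovers new files via storage events
# instead of listing the directory. Additional Azure credentials/permissions
# are required for the event subscription; see the Azure Databricks docs.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.schemaLocation", "abfss://schemas@mystorage.dfs.core.windows.net/raw_json")
    .load("abfss://landing@mystorage.dfs.core.windows.net/raw/json/")
)
```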
Q: Do any Databricks features help with masking/encryption of data when we give users differentiated access (one having direct column access, another having masked column access)?
Read more on our blog about Databricks Unity Catalog: Fine-grained Governance for Data and AI on the Lakehouse.
TOPIC: Ingest JSON data with Auto Loader
Q: With nested JSON, will Delta Lake algorithms automatically be able to infer the structure or should it be handled via code as done in other clouds?
Auto Loader can infer the schema for nested JSON, and you can also use schema hints to give specific columns a defined data type.
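A small sketch of inferring a nested JSON schema and then flattening it; the `payload` structure here is hypothetical:

```python
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/events")
    .option("cloudFiles.inferColumnTypes", "true")   # infer real types instead of defaulting to strings
    .load("s3://my-bucket/raw/events/")
)

# Nested fields can be selected (and flattened) with dot notation.
flat = df.select("payload.order_id", "payload.user.id", "payload.amount")
```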
Q: Are JSON files created using spark supported by Auto Loader?
Yes, as long as the JSON data is written to cloud storage.
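For instance, a Spark job can write JSON to a landing path and Auto Loader will pick those files up like any other JSON source; all paths and columns here are invented for illustration:

```python
# Hypothetical upstream job: write a small DataFrame as JSON to cloud storage.
events_df = spark.createDataFrame(
    [("d-001", "2024-01-01T00:00:00Z", 3.2)],
    ["device_id", "event_time", "reading"],
)
events_df.write.mode("append").json("s3://my-bucket/landing/events/")

# Downstream: Auto Loader reads the Spark-written JSON files incrementally.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/events_from_spark")
    .load("s3://my-bucket/landing/events/")
)
```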
Q: If we are bringing all columns of nested JSON structure in the silver layer but only bringing selected columns in the gold layer, how can we use Auto Loader to add a new column (with historical load of data)?
You can select columns and apply any Spark ETL function before writing to a table. Simply add .option("mergeSchema", "true") on your writer to allow new columns to be added to the target schema.
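As a hedged sketch (the silver/gold table names and columns are assumptions), selecting columns, deriving a new one, and letting `mergeSchema` add it to the gold table might look like:

```python
from pyspark.sql import functions as F

# Read the silver table as a stream (assumes nested column `customer` and array column `items`).
silver = spark.readStream.table("silver.orders_nested")

# Keep only the columns gold needs and derive a new one; any Spark ETL works here.
gold = (
    silver
    .select("order_id", "customer.id", "items")
    .withColumn("item_count", F.size("items"))    # newly added column
)

(
    gold.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/gold_orders")
    .option("mergeSchema", "true")                # allows the new column to be added to the gold schema
    .toTable("gold.orders")
)
```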
Q: Can JSON schema generated outside of Databricks be used with Databricks without "inference"?
Yes, as long as it is valid JSON.
Q: My JSON files are delivered by the middleware tool that also does JSON schema validation. It uses JSON schema as per the http://json-schema.org/ notation. Is it possible to use schemas derived from API documentation or generated outside of Spark/ Databricks?
Yes. In this case, you could simply let Auto Loader infer the schema, since the validated data should be fairly uniform. If your cluster has access to the schema, you can also apply it programmatically, but inferring it is usually the easiest approach.
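If you do want to apply it programmatically, one option is to translate the external definition into a Spark DDL string (or a `StructType`) and pass it to the reader; the schema below is illustrative, not derived from a real json-schema document:

```python
# A DDL string translated (by hand or tooling) from the external schema definition.
ddl_schema = "order_id STRING, amount DOUBLE, created_at TIMESTAMP"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(ddl_schema)            # a DDL string works as well as a StructType
    .load("s3://my-bucket/raw/orders/")
)
```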
Add your follow-up questions to threads! You can also check out the Q&A from Data Ingestion Part 1 in this post.