Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
I have a simple job scheduled every 5 minutes. Basically it listens to cloud files on a storage account and writes them into a Delta table, extremely simple. The code is something like this:
df = (spark
.readStream
.format("cloudFiles")
.option('cloudFil...
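For context, the original snippet is truncated; a minimal sketch of such a job might look like the following, where the storage paths, checkpoint location, and target table name are placeholders rather than the poster's values.

# Minimal Auto Loader sketch, assuming a JSON source and a Delta target;
# paths and table name are placeholders, not the poster's actual values.
df = (spark
      .readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "abfss://landing@<storage-account>.dfs.core.windows.net/_schemas/events/")
      .load("abfss://landing@<storage-account>.dfs.core.windows.net/events/"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "abfss://landing@<storage-account>.dfs.core.windows.net/_checkpoints/events/")
   .trigger(availableNow=True)   # process what is available, then stop; fits a job scheduled every 5 minutes
   .toTable("bronze.events"))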
Following are the details of the requirement:
1. I am using a Databricks notebook to read data from a Kafka topic and write it into an ADLS Gen2 container, i.e., my landing layer.
2. I am using Spark code to read data from Kafka and write into landing...
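A rough sketch of that Kafka-to-landing step is below; the broker address, topic name, and ADLS paths are assumptions, not the poster's configuration.

# Hypothetical Kafka -> ADLS Gen2 landing sketch; broker, topic, and paths are placeholders.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "sensor-topic")
       .option("startingOffsets", "earliest")
       .load())

# Kafka key/value arrive as binary; cast to string before landing them as JSON text files.
landing = raw.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")

(landing.writeStream
        .format("json")
        .option("path", "abfss://landing@<storage-account>.dfs.core.windows.net/kafka/sensor-topic/")
        .option("checkpointLocation", "abfss://landing@<storage-account>.dfs.core.windows.net/_checkpoints/sensor-topic/")
        .start())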
Need some help in choosing where to do deduplication of data. I have sensor data in blob storage that I'm picking up with Databricks Auto Loader. The data and files can have duplicates in them. Which of the two options do I choose? Option 1: Cre...
@peter_mcnally You can use a watermark to handle late records and send only the latest records to the bronze table. This will ensure that you always have the latest information in your bronze table. This feature is explained in detail here - https://w...
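A minimal sketch of that watermark-plus-dedup pattern is below; the source path, the column names (sensor_id, a timestamp column event_time), and the 10-minute lateness window are assumptions.

# Sketch: dedup within a watermark before the bronze write; paths, columns, and window are placeholders.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/sensors/schema")
      .load("abfss://raw@<storage-account>.dfs.core.windows.net/sensors/"))

deduped = (df
           .withWatermark("event_time", "10 minutes")      # tolerate records arriving up to 10 minutes late
           .dropDuplicates(["sensor_id", "event_time"]))   # keep one row per key within the watermark

(deduped.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/sensors/bronze")
        .toTable("bronze.sensor_readings"))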
Hi @Ravi Vishwakarma, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answ...
I have a simple autoloader job which looks like this:
df_dwu_limit = spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "JSON") \
.schema(schemaFromJson) \
.load("abfss://synapse-usage@xxxxx.dfs.core.windows.net/synapse-us...
I am using Databricks Auto Loader to load JSON files from ADLS Gen2 incrementally in directory listing mode. All source filenames have a timestamp on them. The Auto Loader works perfectly for a couple of days with the below configuration and breaks the next day ...
Hi everyone, I'm seeing this issue as well - same configuration as the previous posts, using Auto Loader with incremental file listing turned on. The strange part is that it mostly works despite almost all of the files we're loading having colons incl...
Hello! I'm playing with Auto Loader schema inference on a big S3 repo with 300+ tables and large CSV files. I'm looking at Auto Loader with great attention, as it can be a great time saver in our ingestion process (data comes from a transactional DB gen...
By default, PySpark uses \ as the escape character. You can change it to ". Doc: https://docs.databricks.com/ingestion/auto-loader/options.html#csv-options
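As a hedged illustration of that option, a CSV Auto Loader read overriding the escape character could look like this; the path and schema location are placeholders.

# Sketch: override the CSV escape character for Auto Loader; path and schema location are placeholders.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/csv_schema")
      .option("header", "true")
      .option("escape", '"')   # use the double quote as the escape character instead of the default backslash
      .load("s3://<bucket>/exports/"))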
I have a Databricks Auto Loader notebook that reads JSON files from an input location and writes the flattened version of the JSON files to an output location. However, the notebook is behaving differently for two different but similar scenarios as descri...
I ran into an issue when I was trying to use Auto Loader to read JSON files from Azure ADLS Gen2. I am getting this issue for specific files only. I checked that the files are good and not corrupted. Following is the issue: java.lang.IllegalArgumentException:...
I got the issue resolved. The issue was that, by mistake, we had duplicate columns in the schema files. Because of that it was showing that error. However, the error message is totally misleading, which is why I wasn't able to rectify it earlier.
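If anyone else hits this, a quick way to spot duplicate field names in a schema before starting the stream is sketched below; the schema variable name is a placeholder.

# Sketch: detect duplicate column names in a schema before passing it to Auto Loader.
from collections import Counter

duplicates = [name for name, count in Counter(f.name for f in schemaFromJson.fields).items() if count > 1]
if duplicates:
    raise ValueError(f"Schema contains duplicate columns: {duplicates}")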
I'm using Auto Loader in a SQL notebook and I would like to configure file notification mode, but I don't know how to retrieve the client secret of the service principal from Azure Key Vault. Is there any example notebook somewhere? The notebook is p...
Hi @Magnus Johannesson, you must use the Secrets utility (dbutils.secrets) in a notebook or job to read a secret. https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-utils#dbutils-secrets Hope it helps!
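For illustration, a Python cell that reads the client secret from a Key Vault-backed scope and passes it to Auto Loader's file notification options might look like this; the scope, key, IDs, and paths are placeholders.

# Sketch: read the service principal secret and enable file notification mode; all identifiers are placeholders.
client_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.useNotifications", "true")
      .option("cloudFiles.clientId", "<application-id>")
      .option("cloudFiles.clientSecret", client_secret)
      .option("cloudFiles.tenantId", "<tenant-id>")
      .option("cloudFiles.subscriptionId", "<subscription-id>")
      .option("cloudFiles.resourceGroup", "<resource-group>")
      .load("abfss://landing@<storage-account>.dfs.core.windows.net/input/"))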
I want to set up an S3 stream using Databricks Auto Loader. I have managed to set up the stream, but my S3 bucket contains different types of JSON files. I want to filter them out, preferably in the stream itself rather than using a filter operation. A...
According to the docs you linked, the glob filter on the input path only works on directories, not on the files themselves. So if you want to filter on certain files in the directories concerned, you can include an additional filter through the pathGlobFilter o...
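A small sketch of that approach is below; the bucket, prefix, and glob pattern are placeholders.

# Sketch: restrict the stream to one family of JSON files with pathGlobFilter; names are placeholders.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
      .option("pathGlobFilter", "*_orders.json")   # only pick up files matching this pattern
      .load("s3://<bucket>/raw/"))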
Hello:
As you can see from the link below, it supports 7 file formats. I am dealing with geospatial shapefiles and I want to know if Auto Loader can support shapefiles. Any help on this is greatly appreciated.
Thanks.
https://docs.microsoft.com/...
You could try to use the binary file type. But the disadvantage of this is that the content of the shapefiles will be put into a column, which might not be what you want. If you absolutely want to use the Auto Loader, maybe some thinking outside the b...
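For what it's worth, ingesting shapefiles as binary with Auto Loader could look roughly like this; the path is a placeholder, and parsing the geometry would need a separate library downstream.

# Sketch: ingest shapefiles as raw bytes with Auto Loader; the path is a placeholder.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "binaryFile")
      .option("pathGlobFilter", "*.shp")   # only pick up the .shp members of each shapefile set
      .load("abfss://geo@<storage-account>.dfs.core.windows.net/shapefiles/"))

# Each row carries path, modificationTime, length, and the raw bytes in the `content` column.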