Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by Benji0934, New Contributor II

Auto Loader: Empty fields (discovery_time, commit_time, archive_time) in cloud_files_state

Hi! Why are the fields discovery_time, commit_time, and archive_time NULL in cloud_files_state? Do I need to configure anything when creating my Auto Loader? df = spark.readStream.format("cloudFiles") \ .option("cloudFiles.format", "json") \ ...

  • 2627 Views
  • 2 replies
  • 3 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III

Please be sure that the DBR version is 10.5 or higher. commit_time and archive_time can be null, but discovery_time is defined as NOT NULL in the table definition, so it is a bit strange. Please change the DBR version first.

  • 3 kudos
1 More Replies
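For reference, a minimal sketch of inspecting Auto Loader's ingestion state on DBR 10.5+, assuming a hypothetical checkpoint path:

# Requires DBR 10.5+; '/mnt/checkpoints/orders' is a hypothetical checkpoint path
state = spark.sql(
    "SELECT path, discovery_time, commit_time, archive_time "
    "FROM cloud_files_state('/mnt/checkpoints/orders')"
)
state.show(truncate=False)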
by Chris_Shehu, Valued Contributor III

Map("skipRows", "1") ignored during autoloader process. Something wrong with the format?

I've tried multiple variations of the following code. It seems like the map parameters are being completely ignored. CREATE LIVE TABLE a_raw2 TBLPROPERTIES ("quality" = "bronze") AS SELECT * FROM cloud_files("dbfs:/mnt/c-raw/a/c_medcheck_export*.csv"...

  • 1701 Views
  • 2 replies
  • 2 kudos
Latest Reply
jose_gonzalez
Databricks Employee

skipRows was added in DBR 11.1 -- what DBR is your DLT pipeline on?

  • 2 kudos
1 More Replies
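For reference, a hedged sketch of the same read expressed with the Python Auto Loader API on DBR 11.1+, where skipRows takes effect (the path comes from the post; the other options are assumptions):

# skipRows is honored on DBR 11.1+; schemaLocation value is an assumption
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("skipRows", "1")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/a_raw2/schema")
      .load("dbfs:/mnt/c-raw/a/"))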
by lawrence009, Contributor

Advice on efficiently cleansing and transforming delta table

I have a delta table that is being updated nightly using Auto Loader. After the merge, the job kicks off a second notebook to clean and rewrite certain values using a series of UPDATE statements, e.g., UPDATE foo SET field1 = some_value WHER...

  • 1350 Views
  • 2 replies
  • 3 kudos
Latest Reply
Jfoxyyc
Valued Contributor

I would partition the table by some sort of date that autoloader can use. You could then filter your update further and it'll automatically use partition pruning and only scan related files.

  • 3 kudos
1 More Replies
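A minimal sketch of the suggested approach, assuming a hypothetical ingest_date partition column on table foo:

spark.sql("""
    UPDATE foo
    SET field1 = 'some_value'                       -- hypothetical cleanup
    WHERE ingest_date = date_sub(current_date(), 1) -- partition filter enables pruning
""")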
by SudiptaBiswas, New Contributor III

databricks autoloader getting stuck in flattening json files for different scenarios similar in nature.

I have a databricks autoloader notebook that reads json files from an input location and writes the flattened version of json files to an output location. However, the notebook is behaving differently for two different but similar scenarios as descri...

  • 3025 Views
  • 3 replies
  • 3 kudos
Latest Reply
jose_gonzalez
Databricks Employee

Could you provide a code snippet? Also, do you see any error logs in the driver logs?

  • 3 kudos
2 More Replies
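For context, a hedged sketch of a typical Auto Loader flatten-and-write pattern; all paths and the nested 'payload' struct are hypothetical:

# Hypothetical paths and payload struct; 'payload.*' promotes nested fields to top level
raw = (spark.readStream.format("cloudFiles")
       .option("cloudFiles.format", "json")
       .option("cloudFiles.schemaLocation", "/mnt/checkpoints/flatten/schema")
       .load("/mnt/input/"))

flat = raw.select("id", "payload.*")

(flat.writeStream.format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/flatten")
 .start("/mnt/output/"))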
by hello_world, New Contributor III

What exact difference does Auto Loader make?

New to Databricks, and here is one thing that confuses me. Since Spark Streaming is already capable of incremental loading by checkpointing, what difference does enabling Auto Loader make?

  • 4015 Views
  • 3 replies
  • 2 kudos
Latest Reply
Meghala
Valued Contributor II

Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files i...

  • 2 kudos
2 More Replies
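A minimal sketch of the cloudFiles source the reply describes, with includeExistingFiles shown explicitly (paths are placeholders):

# cloudFiles discovers new files incrementally; includeExistingFiles controls the initial backfill
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.includeExistingFiles", "true")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/landing/schema")
      .load("/mnt/landing/"))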
by brickster_2018, Databricks Employee

Resolved! For the Autoloader, cloudFiles.includeExistingFiles option, is ordering respected?

If Yes, how is order ensured?  For example, let's say there are a number of CDC change files that are uploaded to a directory over time. If a table were to be created using the cloudFiles source, in what order would those files be processed?

  • 2375 Views
  • 3 replies
  • 0 kudos
Latest Reply
Hanish_Goel
New Contributor II

Hi, is there any new development in terms of ensuring the ordering of files in Auto Loader?

  • 0 kudos
2 More Replies
by avenu, New Contributor

AutoLoader - process multiple files

I need to process files of different schemas coming into different folders in ADLS using Auto Loader. Do I need to start a separate read stream for each file type / folder, or can this be handled using a single stream? When I tried using a single stream, ...

  • 2322 Views
  • 1 reply
  • 0 kudos
Latest Reply
Wassim
New Contributor III

As you are talking about different schemas, perhaps schemaEvolutionMode, inferColumnTypes, or schemaHints may help? Check out this, 32 min onward: https://youtu.be/8a38Fv9cpd8 Hope it helps; do let me know how you solve it if you can.

  • 0 kudos
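A hedged sketch of the options the reply mentions, applied to one stream per folder (option values here are illustrative, not a confirmed fix):

# Per-folder stream with schema inference aids; paths and hints are illustrative
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.inferColumnTypes", "true")
      .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
      .option("cloudFiles.schemaHints", "amount DECIMAL(18,2), event_ts TIMESTAMP")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/folder_a/schema")
      .load("abfss://container@account.dfs.core.windows.net/folder_a/"))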
by Gilg, Contributor II

Avro Deserialization from Event Hub capture and Autoloader

Hi All, I am getting data from Event Hub capture in Avro format and using Auto Loader to process it. I got to the point where I can read the Avro by casting the Body into a string. Now I want to deserialize the Body column so it will be in table forma...

  • 5867 Views
  • 4 replies
  • 5 kudos
Latest Reply
UmaMahesh1
Honored Contributor III

If you still want to go with the above approach and don't want to provide schema manually, then you can fetch a tiny batch with 1 record and build the schema into a variable using a .schema option. Once done, you can add a new Body column by providin...

  • 5 kudos
3 More Replies
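A hedged sketch of the tiny-batch approach from the reply, assuming a hypothetical capture path and a JSON payload inside Body:

from pyspark.sql.functions import col, from_json

# Build the payload schema from one sample record (path is hypothetical)
sample_json = (spark.read.format("avro")
               .load("/mnt/eh-capture/")
               .select(col("Body").cast("string").alias("json"))
               .first().json)
payload_schema = spark.read.json(spark.sparkContext.parallelize([sample_json])).schema

# Apply the inferred schema to the Auto Loader stream
stream = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "avro")
          .load("/mnt/eh-capture/")
          .withColumn("payload", from_json(col("Body").cast("string"), payload_schema)))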
by Chris_Konsur, New Contributor III

Resolved! Configure Autoloader with the file notification mode for production

I configured ADLS Gen2 standard storage and successfully configured Auto Loader with the file notification mode. In this document, https://docs.databricks.com/ingestion/auto-loader/file-notification-mode.html: "ADLS Gen2 provides different event notificati...

  • 3180 Views
  • 2 replies
  • 0 kudos
Latest Reply
Ryan_Chynoweth
Esteemed Contributor

Hi, @Chris Konsur. You do not need to do anything with the FlushWithClose event REST API; that is just the event type that we listen to. As for the backfill setting, it is for handling late data or late events that are being triggered. This setting largely de...

  • 0 kudos
1 More Replies
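A hedged sketch of enabling file notification mode on Azure with a service principal (all identifiers and the secret scope below are placeholders):

# File notification mode on ADLS Gen2; every identifier here is a placeholder
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.useNotifications", "true")
      .option("cloudFiles.clientId", "<sp-client-id>")
      .option("cloudFiles.clientSecret", dbutils.secrets.get("kv-scope", "sp-secret"))
      .option("cloudFiles.tenantId", "<tenant-id>")
      .option("cloudFiles.subscriptionId", "<subscription-id>")
      .option("cloudFiles.resourceGroup", "<resource-group>")
      .load("abfss://container@account.dfs.core.windows.net/landing/"))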
by Magnus, Contributor

Resolved! How to retrieve Auto Loader client secret from Azure Key Vault?

I'm using Auto Loader in a SQL notebook and I would like to configure file notification mode, but I don't know how to retrieve the client secret of the service principal from Azure Key Vault. Is there any example notebook somewhere? The notebook is p...

  • 3431 Views
  • 3 replies
  • 10 kudos
Latest Reply
Geeta1
Valued Contributor

Hi @Magnus Johannesson, you must use the Secrets utility (dbutils.secrets) in a notebook or job to read a secret: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-utils#dbutils-secrets. Hope it helps!

  • 10 kudos
2 More Replies
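A minimal sketch, assuming a hypothetical Key Vault-backed scope: fetch the secret with dbutils.secrets in a Python cell and pass it to the Auto Loader options:

# Scope and key names are placeholders for a Key Vault-backed secret scope
client_secret = dbutils.secrets.get(scope="kv-scope", key="autoloader-sp-secret")

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.useNotifications", "true")
      .option("cloudFiles.clientSecret", client_secret)  # remaining SP options omitted for brevity
      .load("abfss://container@account.dfs.core.windows.net/landing/"))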
by Anonymous, New Contributor III

Resolved! Override and Merge mode write using AutoLoader in Databricks

We are reading files using Autoloader in Databricks. Source system is giving full snapshot of complete data in files. So we want to read the data and write in delta table in override mode so all old data is replaced by the new data. Similarly for oth...

  • 9635 Views
  • 5 replies
  • 5 kudos
Latest Reply
-werners-
Esteemed Contributor III

@Ranjeet Jaiswal, afaik merge is supported: https://docs.databricks.com/_static/notebooks/merge-in-streaming.html (the linked notebook does some aggregation, but that can be omitted of course). The interesting part here is outputMode("update") and the foreachBat...

  • 5 kudos
4 More Replies
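A hedged sketch of the foreachBatch pattern for a full-snapshot overwrite (table and paths are placeholders):

# Each micro-batch carries a full snapshot, so overwrite the target table
def overwrite_snapshot(batch_df, batch_id):
    batch_df.write.format("delta").mode("overwrite").saveAsTable("bronze.snapshot")

(spark.readStream.format("cloudFiles")
 .option("cloudFiles.format", "csv")
 .option("cloudFiles.schemaLocation", "/mnt/checkpoints/snapshot/schema")
 .load("/mnt/landing/snapshots/")
 .writeStream
 .foreachBatch(overwrite_snapshot)
 .option("checkpointLocation", "/mnt/checkpoints/snapshot")
 .trigger(availableNow=True)
 .start())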
by Chris_Konsur, New Contributor III

Schema supported by Autoloader

We do not want to use schema inference with schema evolution in Auto Loader. Instead, we want to apply our own schema and use the merge option. Our schema is very complex, with multiple nested levels. When I apply this schema to Auto Loader, it r...

  • 1319 Views
  • 0 replies
  • 2 kudos
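A minimal sketch of supplying an explicit nested schema to Auto Loader instead of relying on inference (the schema below is a stand-in, not the poster's real one):

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Stand-in for the real, deeply nested schema
explicit_schema = StructType([
    StructField("id", StringType()),
    StructField("detail", StructType([
        StructField("items", ArrayType(StructType([
            StructField("name", StringType()),
        ]))),
    ])),
])

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .schema(explicit_schema)  # disables inference; schemaLocation not required
      .load("/mnt/landing/"))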
by RK_AV, New Contributor III

Autoloader cluster

I wanted to setup Autoloader to process files from Azure Data Lake (Blob) automatically whenever new files arrive. For this to work, I wanted to know if AutoLoader requires that the cluster is on all the time.

  • 4907 Views
  • 5 replies
  • 8 kudos
Latest Reply
asif5494
New Contributor III

@Kaniz Fatma, if my cluster is not active and I have uploaded 50 files to the storage location, then where will this Auto Loader list out these 50 files? Will it use any checkpoint location? If yes, then how can I set the checkpoint location in Cloud ...

  • 8 kudos
4 More Replies
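A hedged sketch of running Auto Loader on a schedule instead of an always-on cluster, using a checkpoint location and an availableNow trigger (paths and table name are placeholders):

# The checkpoint tracks which files were already ingested across job runs
(spark.readStream.format("cloudFiles")
 .option("cloudFiles.format", "json")
 .option("cloudFiles.schemaLocation", "/mnt/checkpoints/files/schema")
 .load("/mnt/landing/")
 .writeStream
 .option("checkpointLocation", "/mnt/checkpoints/files")
 .trigger(availableNow=True)  # process the backlog, then stop, so the cluster can shut down
 .toTable("bronze.files"))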
by -werners-, Esteemed Contributor III

Autoloader: how to avoid overlap in files

I'm thinking of using Auto Loader to process files being put on our data lake. Let's say, e.g., every 15 minutes a parquet file is written. These files however contain overlapping data. Now, every 2 hours I want to process the new data (Auto Loader) and...

  • 2785 Views
  • 2 replies
  • 17 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III

What about foreachBatch and then MERGE? Alternatively, run another process that cleans the updates using a window function, as you said.

  • 17 kudos
1 More Replies
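A hedged sketch of the foreachBatch + MERGE suggestion, deduplicating the overlap inside each batch (the key column and all names are hypothetical):

from delta.tables import DeltaTable

# Deduplicate the overlapping batch on a hypothetical business key, then upsert
def merge_batch(batch_df, batch_id):
    deduped = batch_df.dropDuplicates(["id"])
    (DeltaTable.forName(spark, "silver.events").alias("t")
     .merge(deduped.alias("s"), "t.id = s.id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

(spark.readStream.format("cloudFiles")
 .option("cloudFiles.format", "parquet")
 .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events/schema")
 .load("/mnt/landing/events/")
 .writeStream
 .foreachBatch(merge_batch)
 .option("checkpointLocation", "/mnt/checkpoints/events")
 .trigger(availableNow=True)  # run from a job scheduled every 2 hours
 .start())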
by Taha_Hussain, Databricks Employee

Ask your technical questions at Databricks Office Hours

Ask your technical questions at Databricks Office Hours. October 26, 11:00 AM - 12:00 PM PT: Register Here. November 9, 8:00 AM - 9:00 AM GMT: Register Here (NEW EMEA Office Hours). Databricks Office Hours connects you directly with experts to answer all...

  • 1622 Views
  • 0 replies
  • 8 kudos