
AutoLoader - Write To Console (Notebook Cell) Long Running Issue

ChristianRRL
Valued Contributor III

Hi there,

I am likely misunderstanding how to use AutoLoader properly while developing/testing. I am trying to write a simple AutoLoader notebook cell to *read* the contents of a path containing JSON files and *write* them to the console (i.e. the notebook cell) so I can visualize the results. I kicked this off yesterday before logging off, and when I logged back in this morning I realized the cell had been running for nearly 16 hours!

Can I get some assistance to understand what I'm doing wrong? I don't want to set up a permanent or long-running data stream right now. At this time I only have a file path with a very small number of files (fewer than 10, with a few more occasionally added manually), and I want to be able to easily view the contents of those files without needing a permanent or long-running stream.
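For reference, the cell (shown in the screenshot below) is roughly along these lines; the path and schema location here are illustrative placeholders, not my exact values:

```python
# Rough sketch of the Auto Loader cell described above.
# NOTE: the input path and schema location are placeholders, not real values.
df = (
    spark.readStream
    .format("cloudFiles")                                        # Auto Loader source
    .option("cloudFiles.format", "json")                         # input files are JSON
    .option("cloudFiles.schemaLocation", "/tmp/_schemas/demo")   # placeholder
    .load("/tmp/landing/json/")                                  # placeholder input path
)

# Write each micro-batch to the console; without any trigger setting,
# this query keeps running and keeps polling the path for new files.
query = (
    df.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
```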

[Attached screenshot: ChristianRRL_0-1754403001614.png]


3 REPLIES

SP_6721
Honored Contributor

Hi @ChristianRRL ,

It looks like spark.readStream with Auto Loader creates a continuous streaming job by default, which means it keeps running while waiting for new files.

To avoid this, you can control the behaviour using trigger(availableNow=True), which processes all the data available when the query starts and then stops, though it may break the work into multiple micro-batches.
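For example, something along these lines (a sketch only; the paths are the same placeholders as in the original cell):

```python
# Same Auto Loader read, but with a one-shot trigger.
# availableNow=True processes the files present when the query starts,
# then stops instead of waiting for new arrivals.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/_schemas/demo")   # placeholder
    .load("/tmp/landing/json/")                                  # placeholder
)

query = (
    df.writeStream
    .format("console")
    .outputMode("append")
    .trigger(availableNow=True)   # drain the available backlog, then finish
    .start()
)

query.awaitTermination()          # returns once the backlog has been processed
```

With the one-shot trigger, awaitTermination() only blocks until the existing files have been processed, so the cell finishes on its own instead of running indefinitely.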

ChristianRRL
Valued Contributor III

Fantastic! This is a great step forward; just one more thing. The trigger(availableNow=True) worked as you said, but I'm still not seeing the data displayed in the notebook cell. Is there something else I'm missing?

[Attached screenshot: ChristianRRL_0-1754407753844.png]

szymon_dybczak
Esteemed Contributor III

Hi @ChristianRRL ,

This is expected behavior. Under the hood, Auto Loader uses Spark Structured Streaming, and in Structured Streaming you can't use display the way you would with a batch query.

It would be beneficial to familiarize yourself with the Structured Streaming concepts. It's a whole different world from the traditional batch approach, hence your confusion:

https://spark.apache.org/docs/latest/streaming/index.html
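If the goal is simply to eyeball the files in a notebook cell, one common pattern (just a sketch with placeholder path and table names, not the only approach) is to let the availableNow query write into a table and then read that table back with a normal batch query:

```python
# Sketch: land the JSON files in a table with a one-shot trigger,
# then inspect the table as an ordinary batch read.
# All paths and the table name below are placeholders.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/_schemas/demo")    # placeholder
    .load("/tmp/landing/json/")                                   # placeholder
)

(
    df.writeStream
    .trigger(availableNow=True)                                   # one-shot run
    .option("checkpointLocation", "/tmp/_checkpoints/demo")       # placeholder
    .toTable("demo_json_raw")                                     # hypothetical table name
    .awaitTermination()
)

# Once the streaming query has finished, this is a plain batch read,
# so it shows up in the notebook cell like any other table.
display(spark.table("demo_json_raw"))
```

The streaming part finishes on its own thanks to the one-shot trigger, and the final display call is an ordinary batch query against the resulting table.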