Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Autoloader Console Output Issue

ChristianRRL
Valued Contributor III

In reference to my prior post: Re: Autoloader Error Loading and Displaying - Databricks Community - 122579

I am attempting to output results to the console (notebook cell), but am not seeing anything other than the dataframe schema. Is this expected? I am starting to use Autoloader and I'd like an easy, straightforward way to debug the data, and using .trigger(availableNow=True) seems to be the simplest.

ChristianRRL_2-1754599656677.png
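
For readers who cannot see the screenshot, the pattern being described looks roughly like the sketch below. It is not the code from the image: the function name, the JSON file format, and the path arguments are assumptions made for illustration, and `spark` is the session that a Databricks notebook provides.

```python
def start_console_preview(spark, source_path, schema_path):
    """Start an Auto Loader stream that writes to the console sink.

    Hypothetical helper: source_path and schema_path are placeholder
    locations, and the JSON format is an assumption.
    """
    stream_df = (
        spark.readStream.format("cloudFiles")              # Auto Loader source
        .option("cloudFiles.format", "json")               # assumed file format
        .option("cloudFiles.schemaLocation", schema_path)  # where the inferred schema is stored
        .load(source_path)
    )
    return (
        stream_df.writeStream.format("console")  # console sink
        .outputMode("append")
        .trigger(availableNow=True)               # process what is available, then stop
        .start()
    )
```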

 

4 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @ChristianRRL ,

Did you run this code before? Maybe all your source files have already been recorded in the checkpoint. Try uploading a new JSON file and running it again. Also, you can check the driver logs. Sometimes you can find error messages there.

 

In the example I shared, there's no checkpoint because I'm simulating a first run with a fresh file. Additionally, the data is not being written to any specific location or managed table. I am able to view the data once it's appended to a raw table (not shown in the picture), but I'm trying to figure out whether there's a simple way to simulate a run and display the results without actually writing data out anywhere.

I'm down to check out the driver logs. Where/how can I access them?

szymon_dybczak
Esteemed Contributor III

Oh, I didn't notice that you don't have a checkpoint. So I guess that's the reason for your issue. You must specify the checkpointLocation option before you run a streaming query. As I replied in a different topic, Auto Loader is built on Spark Structured Streaming under the hood, and I strongly recommend that you read the Spark Structured Streaming overview. It should clarify a lot of concepts for you, such as how streaming queries and checkpoints work, and much more.

szymon_dybczak_0-1754639384027.png
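
As a rough illustration of the checkpointLocation option mentioned above, the sketch below sets it on the stream writer. The function name and the default sink are assumptions for illustration only, and `stream_df` stands for any streaming DataFrame, such as one read with Auto Loader.

```python
def start_with_checkpoint(stream_df, checkpoint_path, sink_format="console"):
    """Start a streaming query with an explicit checkpoint location.

    Hypothetical helper: checkpoint_path is a placeholder directory that
    the query uses to track which input has already been processed.
    """
    return (
        stream_df.writeStream.format(sink_format)
        .option("checkpointLocation", checkpoint_path)  # progress is stored here
        .trigger(availableNow=True)                      # process available files, then stop
        .start()
    )
```

Because the checkpoint records which files were already processed, rerunning with the same path should only pick up new files. Pointing at a fresh path is one way to simulate a first run.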

 



PS. To find driver logs go to Compute -> click on your cluster -> Driver logs

Quick couple of follow-ups.

Respectfully (no negative tone I promise), I have browsed through Structured Streaming Programming Guide - Spark 4.0.0 Documentation and other documentation. I'm not an expert, and am learning as I go, but at least when using .format("console"), it doesn't seem like a checkpoint is needed.

ChristianRRL_1-1754673768194.png

I tried running the notebook cell both with & without the checkpoint, and I'm getting the same results (no output on notebook cell).

ChristianRRL_2-1754674146043.png

One thing I stumbled into, however: it seems like the console sink may not work quite how I would've hoped. For example, the Spark guide shows a standalone Python file being executed, whereas I'm trying to run a simple notebook cell. If that is the cause, I am not sure why there wouldn't be a simple way to test this in a notebook.

ChristianRRL_0-1754673726836.png

The only approach that has worked for me so far is @lingareddy_Alva 's 2nd suggestion, "Use Memory Sink for Testing", here:

Although I was hoping that the first suggestion would work as it's more concise and intuitive. Please let me know if I'm missing anything!
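
For reference, a memory-sink test can look roughly like the sketch below. It is not necessarily the code from the linked suggestion: the function name, the query name, and the row limit are assumptions for illustration. One plausible reason the console sink shows nothing in the cell is that it prints to the driver's standard output, which may end up in the driver logs rather than in the notebook output. That explanation is an assumption and is not confirmed in this thread.

```python
def preview_with_memory_sink(spark, stream_df, query_name="autoloader_preview", limit=20):
    """Run a streaming DataFrame into the memory sink and return a preview.

    Hypothetical helper: the memory sink keeps all output in driver memory,
    so this is only suitable for small test inputs.
    """
    query = (
        stream_df.writeStream.format("memory")  # in-memory table for debugging
        .queryName(query_name)                   # the table is registered under this name
        .outputMode("append")
        .trigger(availableNow=True)              # process available files, then stop
        .start()
    )
    query.awaitTermination()                     # wait until the run has finished
    return spark.table(query_name).limit(limit)  # ordinary batch DataFrame
```

In a notebook, the returned DataFrame can then be shown with `display(...)` or `.show()`, since it is a regular batch DataFrame rather than a stream.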