Databricks Community

drewster · 02-22-2022

I am running a massive history of about 250 GB (~6 million phone call transcriptions, JSON read in as raw text) through a raw -> bronze pipeline in Azure Databricks using PySpark. The source is mounted storage that continuously has files added, and we do n...

drewster · 06-08-2022

I just tested it out and my stream initialization times seem to have gone down. Can someone explain the backfill interval? Based on the documentation located here, it sounds like the backfill is almost like a full directory listing at a given interval to...
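
For readers following the backfill question, here is a minimal sketch of where that setting lives. It assumes the Auto Loader options `cloudFiles.format`, `cloudFiles.useNotifications`, and `cloudFiles.backfillInterval`. The helper name `autoloader_options` and the `"1 day"` value are hypothetical illustrations, not the configuration used in this thread. The options are built as a plain dict so the snippet runs without a cluster.

```python
def autoloader_options(backfill_interval: str = "1 day") -> dict:
    """Build Auto Loader (cloudFiles) reader options as a plain dict.

    Hypothetical helper for illustration; the values are assumptions.
    """
    return {
        # The transcriptions are described as JSON read in as raw text.
        "cloudFiles.format": "text",
        # File notification mode (queue-based) instead of directory listing.
        "cloudFiles.useNotifications": "true",
        # How often to run a backfill alongside notifications.
        "cloudFiles.backfillInterval": backfill_interval,
    }


options = autoloader_options("1 day")
for key, value in options.items():
    print(f"{key} = {value}")
```

On a Databricks cluster the dict would typically be passed along the lines of `spark.readStream.format("cloudFiles").options(**options).load(rawPath)`. Whether a daily interval suits a given workload is something to verify against the current Auto Loader documentation.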

drewster · 05-18-2022

@Dan Zafar @Kaniz Fatma I will be trying the recommendation by @Dan Zafar today.

drewster · 02-24-2022

UPDATE @Joseph Kambourakis: It seems that ADLS Gen2 Premium storage does not support Queue storage, so the Autoloader fails. My Cloud Engineer stood up a standard tier storage in ADLS Gen2 and I was able to connect to it and ...

drewster · 02-24-2022

Hello again @Joseph Kambourakis, I've been working with my Cloud Engineer and the service principal and permissions are all set up. My new configuration looks like this:

def read_stream_raw(spark: SparkSession, rawPath: str) -> DataFrame:
    """Rea...

drewster · 02-23-2022

Thank you for this. I am working with others now to make sure I have the correct permissions to configure this based on the article and the Azure documentation. Once implemented and tested I will respond.

Spark streaming autoloader slow second batch - checkpoint issues?