Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hi, I'm doing something simple in a Databricks notebook:
spark.sparkContext.setCheckpointDir("/tmp/")
import pyspark.pandas as ps

sql = ("""select
    field1, field2
from table
where date >= '2021-01-01'""")
df = ps.sql(sql)
df.spark.checkpoint()
That...
I have a streaming notebook which fetches messages from a Confluent Kafka topic and loads them into ADLS. It is a streaming notebook with the trigger set to continuous processing. Before loading the messages (which are in Avro format), I'm flattening out the...
The best approach is not to depend on Kafka's commit mechanism! We can store the processing result and the message offset in an external data store within the same database transaction. So, if the database transaction fails, both the commit and the processing will fail and ...
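For illustration, here is a minimal sketch of that pattern in a PySpark foreachBatch sink, assuming a Kafka source and using a sqlite3 database as a stand-in for the external store; the broker, topic name, paths, and table names are all hypothetical:

# Sketch only: persist the processed rows and their Kafka offsets in one transaction.
import sqlite3

def write_batch(batch_df, batch_id):
    rows = (batch_df
            .selectExpr("partition", "offset", "CAST(value AS STRING) AS value")
            .collect())
    conn = sqlite3.connect("/dbfs/tmp/offsets.db")      # hypothetical external store
    try:
        cur = conn.cursor()
        cur.execute("CREATE TABLE IF NOT EXISTS results (value TEXT)")
        cur.execute("CREATE TABLE IF NOT EXISTS offsets (kafka_partition INT PRIMARY KEY, kafka_offset INT)")
        for r in rows:
            cur.execute("INSERT INTO results VALUES (?)", (r["value"],))
            cur.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)", (r["partition"], r["offset"]))
        conn.commit()       # results and offsets succeed or fail together
    except Exception:
        conn.rollback()     # neither is persisted, so the batch can be retried safely
        raise
    finally:
        conn.close()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "host1:9092")   # hypothetical broker
       .option("subscribe", "events")                     # hypothetical topic
       .load())

(raw.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints/kafka_to_db")
    .start())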
Tree-based estimators in pyspark.ml have an argument called checkpointInterval: checkpointInterval = Param(parent='undefined', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will ...
@Federico Trifoglio: If sc.getCheckpointDir() returns None, it means that no checkpoint directory is set in the SparkContext. In this case, the checkpointInterval argument will indeed be ignored. To set a checkpoint directory, you can use the SparkC...
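For reference, a minimal sketch of how the two fit together, assuming a training DataFrame named train with the usual features/label columns (all of these names are placeholders):

from pyspark.ml.classification import RandomForestClassifier

# A checkpoint directory must be set for checkpointInterval to have any effect
spark.sparkContext.setCheckpointDir("dbfs:/tmp/ml_checkpoints")

rf = RandomForestClassifier(
    featuresCol="features",
    labelCol="label",
    checkpointInterval=10,   # checkpoint the cached intermediate state every 10 iterations
)
model = rf.fit(train)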
Hello, we are experiencing an error with one Structured Streaming job: basically the checkpoint gets corrupted and we are unable to continue with the execution. I've checked the errors, and this happens when it triggers an autocompact,...
Hi @Martin Riccardi, could you share the following, please:
1) What's your source?
2) What's your sink?
3) Could you share your readStream() and writeStream() code?
4) The full error stack trace
5) Did you stop and re-run your query after weeks of not being acti...
New to Databricks, and here is one thing that confuses me. Since Spark Streaming is already capable of incremental loading via checkpointing, what difference does enabling Auto Loader make?
Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files i...
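As a rough sketch of what that looks like in a notebook (the paths, storage account, and table name below are placeholders, and the file format is assumed to be JSON):

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "dbfs:/tmp/schemas/events")
      .load("abfss://landing@myaccount.dfs.core.windows.net/events/"))

(df.writeStream
   .option("checkpointLocation", "dbfs:/tmp/checkpoints/events")
   .trigger(availableNow=True)    # process all files available now, then stop
   .toTable("bronze_events"))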
I have several users doing data analysis in Databricks Spark notebooks, and everything is smooth. Now I want to make sure that the checkpoint dir is configured at cluster start, so that every user doesn't have to set it in the notebook (ending up in a lot o...
@Alejandro Martinez, for streaming jobs there is one, but for the others I couldn't find any. Here is the Spark conf (Configuration - Spark 3.2.1 Documentation, apache.org): spark.sql.streaming.checkpointLocation
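So one option is to put that key in the cluster's Spark config (or set it once per session) so users get a default checkpoint root without touching their notebooks; the paths below are placeholders:

# Default checkpoint root for Structured Streaming queries
spark.conf.set("spark.sql.streaming.checkpointLocation", "dbfs:/checkpoints/streaming")

# Checkpoint directory for RDD/DataFrame .checkpoint() calls
spark.sparkContext.setCheckpointDir("dbfs:/checkpoints/rdd")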
Use case: read data from a source table using Spark Structured Streaming (round the clock). Apply transformation logic, etc., and finally merge the dataframe into the target table. If there is any failure during transformation or merge, the Databricks job should...
I have a Spark Structured Streaming job which reads from 2 Delta tables as streams, processes the data, and then writes to a 3rd Delta table. The job is run with the Databricks service on GCP. Sometimes the job fails with the following exception...
You can remove that folder so it will be recreated automatically. Additionally, every new job run should have a new (or just empty) checkpoint location. You can add this in your code before starting the stream:
dbutils.fs.rm(checkpoint_path, True)
Additionally, you...
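For example, something along these lines before (re)starting the query; checkpoint_path and the target path are placeholders, and removing the checkpoint means the stream will reprocess from the source's starting position:

checkpoint_path = "dbfs:/checkpoints/my_stream"

# Drop the corrupted checkpoint so the next run starts from a clean state
dbutils.fs.rm(checkpoint_path, True)

(df.writeStream
   .format("delta")
   .option("checkpointLocation", checkpoint_path)
   .start("dbfs:/delta/target_table"))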
I've seen .cache() and .checkpoint() used similarly in some workflows I've come across. What's the difference, and when should I use one over the other?
Caching is more useful than checkpointing when you have a lot of available memory to store your RDDs or DataFrames, even if they are massive. Caching will maintain the result of your transformations so that those transformations will not have to be recomp...
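A minimal side-by-side sketch (the checkpoint directory and the example DataFrame are arbitrary):

from pyspark.sql.functions import rand

spark.sparkContext.setCheckpointDir("dbfs:/tmp/checkpoints")

df = spark.range(0, 10_000_000).withColumn("x", (rand() * 100).cast("int"))

cached = df.cache()            # keeps the computed result in memory/disk; lineage is preserved
cached.count()                 # an action materializes the cache

checkpointed = df.checkpoint() # writes the data to the checkpoint dir and truncates lineage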
Writing statistics in a checkpoint has a cost, which is usually visible only for very large tables. However, it is worth mentioning that these statistics are very useful for data skipping, which speeds up subsequent operations. In Databricks Runti...
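If I understand the post correctly, this behaviour can be tuned per table with the Delta checkpoint statistics properties; a sketch, assuming a Delta table named events:

spark.sql("""
  ALTER TABLE events SET TBLPROPERTIES (
    'delta.checkpoint.writeStatsAsStruct' = 'true',
    'delta.checkpoint.writeStatsAsJson'  = 'false'
  )
""")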
If the read stream definition has something similar to:
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", "earliest")
resettin...