Databricks Community

Mado · 10-20-2022

Hi,

I am practicing with Databricks. In the sample notebooks, I have seen writeStream used both with and without the ".start()" method. Samples are below:

Without .start()

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", source_format)
    .option("cloudFiles.schemaLocation", checkpoint_directory)
    .load(data_source)
    .writeStream
    .option("checkpointLocation", checkpoint_directory)
    .option("mergeSchema", "true")
    .table(table_name)
)

With .start()

(myDF
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .start(path)
)

With .start() and no path argument

query = (streaming_df.writeStream
                         .foreachBatch(streaming_merge.upsert_to_delta)
                         .outputMode("update")
                         .option("checkpointLocation", f"{DA.paths.checkpoints}/recordings")
                         .trigger(availableNow=True)
                         .start())
query.awaitTermination()

1) I don't understand when I should or shouldn't use the ".start()" method. I would appreciate any guidance on this.

2) If I don't pass "path" to "start()", where will the data files be written?

Thanks for your help.