Hi @Erik_L, To keep the Delta Live Tables pipeline compute running between Workflow runs, using a long-running Databricks Job instead of a triggered Databricks Workflow is a solid approach. A long-running job keeps a persistent Spark context active, so the required data transformation and merge tasks can run continuously.
Here's a step-by-step guide on setting up a long-running Databricks Job for these operations:
1. Create a Databricks Job: You can create a Databricks Job through the Databricks UI or the Databricks CLI. Make sure you choose the "Continuous" trigger type so the job runs as a long-running job.
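If you prefer to set this up programmatically, here is a minimal sketch using the Databricks SDK for Python; the job name, task key, cluster id, and notebook path below are placeholders rather than values from your setup:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
created = w.jobs.create(
    name="continuous-transform-and-merge",           # placeholder job name
    tasks=[
        jobs.Task(
            task_key="transform_and_merge",
            existing_cluster_id="<cluster-id>",       # placeholder cluster id
            notebook_task=jobs.NotebookTask(notebook_path="<notebook-path>"),
        )
    ],
    # The continuous trigger is what keeps the job (and its compute) running
    continuous=jobs.Continuous(pause_status=jobs.PauseStatus.UNPAUSED),
)
print(created.job_id)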
2. Set Up Dependencies: Make sure all dependencies, such as Python libraries or packages, are installed on the cluster attached to the long-running job.
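For notebook-scoped Python dependencies, one option is to install them in the first cell of the job's notebook so they are available every time the job starts; the package name here is just a placeholder:

# First cell of the job's notebook: install any extra Python packages the
# transformations need. "%pip" is a Databricks notebook magic command.
%pip install some-transform-library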
3. Define a Continuous Query: Inside your Databricks Job, define a continuous query that handles the required data transformations and merging. Continuous queries are Structured Streaming queries designed to run indefinitely, writing the output of each micro-batch to a specified destination. Here's a simple example to get you started:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Define the schema of the source data
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", StringType(), True)
])
# Read the source data as a stream (the format and path are placeholders)
source_df = spark.readStream.schema(schema).format("json").load("/path/to/source")
# Write the output continuously to a Delta table (table name and paths are placeholders)
(source_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/path/to/checkpoints/target_table")
    .outputMode("append")
    .toTable("target_table"))
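If the merge step needs to upsert each micro-batch into an existing Delta table rather than simply appending, a common pattern is foreachBatch combined with a Delta Lake MERGE. The sketch below assumes the delta-spark Python API, and the table name "target_table" and join key "col1" are placeholders, not details from your pipeline:

from delta.tables import DeltaTable

def merge_batch(batch_df, batch_id):
    # Upsert the micro-batch into the target Delta table; adjust the table name
    # and merge condition to match your actual schema.
    target = DeltaTable.forName(spark, "target_table")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.col1 = s.col1")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

# Run the merge for every micro-batch of the stream defined above
(source_df.writeStream
    .foreachBatch(merge_batch)
    .option("checkpointLocation", "/path/to/checkpoints/merge")
    .start())

The checkpointLocation is what lets the stream resume where it left off if the job or cluster restarts.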
4. Start the Job: Launch your Databricks Job to start the continuous query. This keeps the Spark context and all the necessary resources available so the job can run without interruption.
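If the job was created with the continuous trigger unpaused, as in the earlier sketch, it starts on its own. To confirm from code, assuming the Databricks SDK for Python and the job_id returned at creation time:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
job = w.jobs.get(job_id=job_id)   # job_id is a placeholder for the id from jobs.create
print(job.settings.continuous)    # the continuous trigger configuration
# List the currently active runs of the job
for run in w.jobs.list_runs(job_id=job_id, active_only=True):
    print(run.run_id, run.state.life_cycle_state)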
By implementing a long-running Databricks Job for your data transformation and merge tasks, you can keep the Delta Live Tables pipeline compute available without interruption between Workflow runs.
I hope this helps, and if you have further questions or need more guidance, feel free to ask!