Zero-downtime deployment of Spark Structured Streaming in a Databricks job with Terraform

MarsSu
New Contributor II

I would like to ask how to implement zero-downtime deployment of Spark Structured Streaming on Databricks job compute with Terraform.

We regularly upgrade our Spark application code version, but currently every deployment cancels the original job and creates a new one, which causes roughly a five-minute interruption.

Given this scenario, is there a way to achieve zero downtime when deploying a new version? If you have any ideas, please share them with me. I would appreciate it, thank you.


5 Replies

Anonymous
Not applicable

@Mars Su​:

Yes, you can implement zero-downtime deployment of Spark Structured Streaming on Databricks job compute using Terraform. One way to achieve this is with Databricks job clusters, which let you create a cluster dedicated to running a single job. Here's how you can implement zero-downtime deployment using Terraform:

  1. Create a new job cluster for the new version of your Spark application code. In Terraform this is the new_cluster (or job_cluster) block of the databricks_job resource. You can specify the Spark version to use, as well as any other configuration your application needs.
  2. Once the new cluster is defined, deploy the new version of your Spark application code with the databricks_job resource, which lets you specify which cluster the job runs on.
  3. Once the new job is running on the new cluster, gradually drain traffic from the old job to the new one, for example by slowly reducing the rate of data sent to the old job while increasing it on the new job.
  4. Once all traffic has been redirected to the new job, you can safely terminate the old job and delete its cluster.

By following these steps, you can achieve zero-downtime deployment of your Spark Structured Streaming job in Databricks using Terraform. Note that you should thoroughly test the new job before switching all traffic to it, to make sure it works correctly and does not cause any issues in production.
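
To make steps 1 and 2 concrete, a minimal sketch of the blue/green job pair might look like the following. The notebook paths, node type, and Spark version are illustrative assumptions, and newer versions of the Databricks provider may prefer task blocks inside databricks_job:

# Minimal sketch of a blue/green pair of streaming jobs. All paths and
# cluster settings here are illustrative assumptions.
variable "deployments" {
  type = map(string)
  default = {
    blue  = "/Repos/streaming-app-v1/main" # current version
    green = "/Repos/streaming-app-v2/main" # new version
  }
}

resource "databricks_job" "streaming" {
  for_each = var.deployments

  name                = "structured-streaming-${each.key}"
  max_concurrent_runs = 1 # a streaming job should only have one active run

  new_cluster {
    num_workers   = 2
    spark_version = "13.3.x-scala2.12"
    node_type_id  = "i3.xlarge"

    spark_conf = {
      # Separate checkpoint root per deployment; see the checkpoint
      # discussion further down in this thread.
      "spark.sql.streaming.checkpointLocation" = "/${each.key}/checkpoints"
    }
  }

  notebook_task {
    notebook_path = each.value
  }
}

Cutting over then amounts to pointing downstream consumers at the green job's output and destroying the blue resources once they are drained.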

MarsSu
New Contributor II

@Suteja Kanuri​ 

Thanks for replying to my questions.

So based on your scenario, we would have two Spark jobs running at the same time, right? Like a blue/green deployment.

However, I would like to know: if we want to achieve this, do we need to split the checkpoint locations of the two Structured Streaming jobs and store them independently?

Anonymous
Not applicable

@Mars Su​:

Yes, in a blue-green deployment scenario, both the blue and green versions of the Spark Structured Streaming job would be running at the same time, with traffic gradually shifted from the blue to the green version.

Regarding the checkpoint location, it is generally recommended to use separate checkpoint locations for each version of the job in order to avoid potential conflicts or data corruption. This is because the checkpoint location stores the state of the streaming query, including the current offset, which is used to resume the query in case of failures or restarts.

To achieve this in Terraform, you can define two separate checkpoint locations for the blue and green versions of the job, and set them via the spark.sql.streaming.checkpointLocation key in the spark_conf block of each job. For example:

# Blue job: runs the current production version of the streaming code
resource "databricks_job" "blue_job" {
  # ...

  new_cluster {
    # ...

    spark_conf = {
      # Default checkpoint root for streaming queries started by this job
      "spark.sql.streaming.checkpointLocation" = "/blue/checkpoints"
    }
  }

  # ...
}

# Green job: runs the new version, with its own checkpoint root
resource "databricks_job" "green_job" {
  # ...

  new_cluster {
    # ...

    spark_conf = {
      "spark.sql.streaming.checkpointLocation" = "/green/checkpoints"
    }
  }

  # ...
}

In this example, the blue job uses the checkpoint location /blue/checkpoints, while the green job uses /green/checkpoints. Note that you would also need to ensure that any output or intermediate data is written to separate locations for the blue and green versions of the job, to avoid conflicts or data corruption.
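
If the jobs are notebook-based, one way to keep the output locations separate as well is to pass per-color paths in as notebook parameters. This is only a sketch extending the green job above; the parameter names and paths are illustrative assumptions:

# Hypothetical sketch: hand each job color-specific paths so the blue and
# green versions never write to the same output or checkpoint locations.
resource "databricks_job" "green_job" {
  # ...

  notebook_task {
    notebook_path = "/Repos/streaming-app/main" # illustrative path

    base_parameters = {
      "checkpoint_path" = "/green/checkpoints"
      "output_path"     = "/green/output"
    }
  }
}

Inside the notebook, the streaming query would read these parameters (for example via dbutils.widgets.get) and use them for its checkpointLocation and output path options.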

MarsSu
New Contributor II

@Suteja Kanuri​ 

Thank you for the reply. That's a great solution and suggestion.

It is very helpful for me.

Anonymous
Not applicable

@Mars Su​: Glad it helped you! Please consider marking it as the best answer.
