
Databricks connect, set spark config

mrkure — Mon, 27 Jan 2025 18:39:12 GMT

Hi,

I am using Databricks Connect to run computations on a Databricks cluster. I need to set some Spark configurations, namely spark.files.ignoreCorruptFiles. In my experience, setting a Spark configuration for the current session in Databricks Connect has no effect. I also cannot configure the cluster itself, as it is a shared cluster. Is there any solution?

Re: Databricks connect, set spark config

Walter_C — Mon, 27 Jan 2025 20:57:46 GMT

Have you tried setting it up in your code as:

    from pyspark.sql import SparkSession

    # Create a Spark session
    spark = SparkSession.builder \
        .appName("YourAppName") \
        .config("spark.files.ignoreCorruptFiles", "true") \
        .getOrCreate()

    # Your Spark code here
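With Databricks Connect the session is usually already attached to a running cluster, so a runtime option can also be set on the existing session with spark.conf.set and read back with spark.conf.get. The following is only a minimal sketch: it assumes `spark` is an active session, the helper name `apply_and_verify` is hypothetical, and whether a shared cluster allows a given key to be changed is not confirmed here.

```python
def apply_and_verify(spark, settings):
    """Set each runtime option on the session and return the values read back."""
    effective = {}
    for key, value in settings.items():
        spark.conf.set(key, value)            # per-session runtime setting
        effective[key] = spark.conf.get(key)  # what the session reports afterwards
    return effective
```

For example, `apply_and_verify(spark, {"spark.sql.files.ignoreCorruptFiles": "false"})` returns the value the session reports for that key, which makes it easy to see whether the setting was accepted.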

Re: Databricks connect, set spark config

mrkure — Tue, 28 Jan 2025 16:21:49 GMT

Yes, I did. This time I tried it both in Databricks Connect and in a Databricks notebook, and the behaviour is the same. A small note: I set the option to false, because I want the code to fail if any file cannot be loaded.

The following code prints false for the check and ends with an error, as expected.

    print(spark.conf.get("spark.sql.files.ignoreCorruptFiles"))
    paths = ["path_to_corrupted_file"]
    # Reader format is not stated; load() uses the session's default data source
    df = spark.read.load(paths)

The following code also prints false for the check, but df is created successfully with only one file loaded. I expected this to end with an error as well, but it seems that some fault tolerance is still in effect.

    print(spark.conf.get("spark.sql.files.ignoreCorruptFiles"))
    paths = ["path_to_corrupted_file", "path_to_normal_file"]
    # Reader format is not stated; load() uses the session's default data source
    df = spark.read.load(paths)

It is quite possible that I do not understand the behaviour of this setting correctly, as I expected the second case to end with an error too.
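One possible way to get fail-on-any-bad-file behaviour, independent of how the multi-path read treats the setting, is to probe each path on its own before the combined read. This is only a sketch under assumptions: `check_paths` and `read_one` are hypothetical names, and `read_one` is a callable supplied by the caller that wraps whichever reader matches the file format. Since Spark reads are lazy, that callable may need to force an action (for example a count) so that a corrupted file actually raises.

```python
def check_paths(paths, read_one):
    """Try each path separately; raise if any of them cannot be read."""
    failures = {}
    for path in paths:
        try:
            read_one(path)  # caller-supplied read, e.g. a Spark read plus an action
        except Exception as exc:  # broad on purpose: any read error counts
            failures[path] = repr(exc)
    if failures:
        raise RuntimeError(f"unreadable files: {failures}")
    return list(paths)
```

If every path passes, the returned list can be handed to the combined read. The cost is one extra read per file, so this suits a moderate number of files better than a very large one.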