<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Change spark configs in Serverless compute clusters in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/change-spark-configs-in-serverless-compute-clusters/m-p/105627#M42211</link>
    <description>&lt;P&gt;Is there anything I can do to increase the memory? Or do you know of a way I could keep it from running out of memory? Here is the code block:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from datetime import datetime, timezone

dt = datetime.strptime(input_date, "%Y/%m/%d")
buffer_sec = 6

timestamp_start_ms = int((dt.replace(tzinfo=timezone.utc).timestamp() - buffer_sec) * 1000)
timestamp_end_ms = int((timestamp_start_ms + (24 * 3600 * 1000)) + buffer_sec * 2 * 1000)
interpolated_filtered = f"SELECT * FROM `catalog`.default.events \
WHERE timestamp &amp;gt;= {timestamp_start_ms} AND timestamp &amp;lt;= {timestamp_end_ms} ORDER BY timestamp ASC"
interpolated_df = spark.sql(interpolated_filtered).toPandas()&lt;/LI-CODE&gt;</description>
    <pubDate>Tue, 14 Jan 2025 17:39:30 GMT</pubDate>
    <dc:creator>ls</dc:creator>
    <dc:date>2025-01-14T17:39:30Z</dc:date>
    <item>
      <title>Change spark configs in Serverless compute clusters</title>
      <link>https://community.databricks.com/t5/data-engineering/change-spark-configs-in-serverless-compute-clusters/m-p/105512#M42163</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Howdy!&lt;BR /&gt;I wanted to know how I can change some spark configs on Serverless compute. I have a base.yml file and tried placing:&amp;nbsp;&lt;BR /&gt;&lt;SPAN class=""&gt;spark_conf:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;- spark.driver.maxResultSize:&lt;/SPAN&gt; &lt;SPAN class=""&gt;"16g"&lt;/SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;but I still get this error:&lt;BR /&gt;[&lt;/SPAN&gt;&lt;A class="" href="https://docs.databricks.com/error-messages/error-classes.html#config_not_available" target="_blank" rel="noopener noreferrer"&gt;CONFIG_NOT_AVAILABLE&lt;/A&gt;&lt;SPAN&gt;]&lt;/SPAN&gt;&lt;SPAN&gt; Configuration spark.driver.maxResultSize is not available. SQLSTATE: 42K0I&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;and trying to change the config within the notebook is not allowed either.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 13 Jan 2025 22:39:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/change-spark-configs-in-serverless-compute-clusters/m-p/105512#M42163</guid>
      <dc:creator>ls</dc:creator>
      <dc:date>2025-01-13T22:39:32Z</dc:date>
    </item>
    <item>
      <title>Re: Change spark configs in Serverless compute clusters</title>
      <link>https://community.databricks.com/t5/data-engineering/change-spark-configs-in-serverless-compute-clusters/m-p/105513#M42164</link>
      <description>&lt;P&gt;Spark configs are limited in Serverless; these are the supported configs you can set:&amp;nbsp;&lt;A href="https://docs.databricks.com/en/release-notes/serverless/index.html#supported-spark-configuration-parameters" target="_blank"&gt;https://docs.databricks.com/en/release-notes/serverless/index.html#supported-spark-configuration-parameters&lt;/A&gt;&amp;nbsp;&lt;/P&gt;</description>
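A minimal, hypothetical sketch of the behaviour described above. The supported key names here are assumed examples from the linked docs page, and the mock only imitates what `spark.conf.set()` does on serverless compute; it is not the Databricks API:

```python
# Mock of spark.conf.set() semantics on serverless compute (assumption:
# only keys on the documented supported list can be set; anything else
# fails with CONFIG_NOT_AVAILABLE, as in the error from the original post).
SUPPORTED = {
    "spark.sql.session.timeZone",    # assumed example from the docs page
    "spark.sql.shuffle.partitions",  # assumed example from the docs page
}

def set_conf(conf: dict, key: str, value: str) -> None:
    """Set a config, rejecting unsupported keys like serverless does."""
    if key not in SUPPORTED:
        raise ValueError(
            f"[CONFIG_NOT_AVAILABLE] Configuration {key} is not available. SQLSTATE: 42K0I"
        )
    conf[key] = value

conf = {}
set_conf(conf, "spark.sql.session.timeZone", "UTC")  # accepted
try:
    set_conf(conf, "spark.driver.maxResultSize", "16g")  # rejected, as reported above
except ValueError as err:
    print(err)
```

In a real serverless notebook the equivalent call is `spark.conf.set(key, value)`; since `spark.driver.maxResultSize` is not on the supported list, the bundle-level `spark_conf` entry fails for the same reason.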
      <pubDate>Mon, 13 Jan 2025 22:47:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/change-spark-configs-in-serverless-compute-clusters/m-p/105513#M42164</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-01-13T22:47:00Z</dc:date>
    </item>
    <item>
      <title>Re: Change spark configs in Serverless compute clusters</title>
      <link>https://community.databricks.com/t5/data-engineering/change-spark-configs-in-serverless-compute-clusters/m-p/105627#M42211</link>
      <description>&lt;P&gt;Is there anything I can do to increase the memory? Or do you know of a way I could keep it from running out of memory? Here is the code block:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from datetime import datetime, timezone

dt = datetime.strptime(input_date, "%Y/%m/%d")
buffer_sec = 6

timestamp_start_ms = int((dt.replace(tzinfo=timezone.utc).timestamp() - buffer_sec) * 1000)
timestamp_end_ms = int((timestamp_start_ms + (24 * 3600 * 1000)) + buffer_sec * 2 * 1000)
interpolated_filtered = f"SELECT * FROM `catalog`.default.events \
WHERE timestamp &amp;gt;= {timestamp_start_ms} AND timestamp &amp;lt;= {timestamp_end_ms} ORDER BY timestamp ASC"
interpolated_df = spark.sql(interpolated_filtered).toPandas()&lt;/LI-CODE&gt;</description>
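With its two imports and an assumed example value for `input_date` (which comes from elsewhere in the original job), the timestamp arithmetic above runs standalone, which makes it easy to sanity-check the window bounds:

```python
from datetime import datetime, timezone

input_date = "2025/01/14"  # assumed example value
buffer_sec = 6

dt = datetime.strptime(input_date, "%Y/%m/%d")
# start of the day in epoch milliseconds, padded back by buffer_sec
timestamp_start_ms = int((dt.replace(tzinfo=timezone.utc).timestamp() - buffer_sec) * 1000)
# one full day later, padded forward so the window covers 24 h plus 2 * buffer_sec
timestamp_end_ms = int(timestamp_start_ms + (24 * 3600 * 1000) + buffer_sec * 2 * 1000)

print(timestamp_start_ms, timestamp_end_ms)
```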
      <pubDate>Tue, 14 Jan 2025 17:39:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/change-spark-configs-in-serverless-compute-clusters/m-p/105627#M42211</guid>
      <dc:creator>ls</dc:creator>
      <dc:date>2025-01-14T17:39:30Z</dc:date>
    </item>
    <item>
      <title>Re: Change spark configs in Serverless compute clusters</title>
      <link>https://community.databricks.com/t5/data-engineering/change-spark-configs-in-serverless-compute-clusters/m-p/105631#M42212</link>
      <description>&lt;P class="_1t7bu9h1 paragraph"&gt;To address the memory issue in your Serverless compute environment, you can consider the following strategies:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Optimize the Query&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;&lt;STRONG&gt;Filter Early&lt;/STRONG&gt;: Ensure that you are filtering the data as early as possible in your query to reduce the amount of data being processed. For example, if you can add more specific conditions to your &lt;CODE&gt;WHERE&lt;/CODE&gt; clause, it will help in reducing the data size.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Limit Columns&lt;/STRONG&gt;: Select only the necessary columns instead of using &lt;CODE&gt;SELECT *&lt;/CODE&gt;. This reduces the amount of data being transferred and processed.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Use Spark DataFrame Operations&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;Instead of converting the entire result to a Pandas DataFrame using &lt;CODE&gt;toPandas()&lt;/CODE&gt;, try to perform as many operations as possible using Spark DataFrame operations. Spark DataFrames are distributed and can handle larger datasets more efficiently than Pandas DataFrames.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Use Delta Tables&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;If you are working with large datasets, consider using Delta tables. Delta tables provide optimized storage and query performance, which can help in managing memory usage more efficiently.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;</description>
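The first two points can be sketched in plain Python by rebuilding the query from the original post with an explicit column list instead of `SELECT *`. The column names and timestamp bounds here are hypothetical placeholders; substitute the ones the downstream code actually needs:

```python
# Narrowed version of the original query: explicit columns, same window.
columns = ["timestamp", "value"]    # hypothetical; list only what you need
timestamp_start_ms = 1736812794000  # example bounds
timestamp_end_ms = 1736899206000

query = (
    f"SELECT {', '.join(columns)} "
    f"FROM `catalog`.default.events "
    f"WHERE timestamp >= {timestamp_start_ms} AND timestamp <= {timestamp_end_ms} "
    f"ORDER BY timestamp ASC"
)
print(query)
```

On the Spark side, keeping the result as a Spark DataFrame (`spark.sql(query)`) and aggregating, filtering, or sampling before any `toPandas()` call avoids collecting the whole day of events onto the driver, which is what exhausts memory.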
      <pubDate>Tue, 14 Jan 2025 17:46:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/change-spark-configs-in-serverless-compute-clusters/m-p/105631#M42212</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-01-14T17:46:03Z</dc:date>
    </item>
  </channel>
</rss>

