05-22-2024 11:41 PM
Hello,
Since last night, none of our ETL jobs in Databricks have been running, although we have not made any code changes.
The identical jobs (deployed with Databricks asset bundles) run on an all-purpose cluster, but fail on a job cluster. We have not changed anything in the cluster configuration. The Databricks runtime version is also identical (14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)). We have also compared the code and double-checked the configurations.
What could be the reason for the jobs failing without us having made any changes? Have there been changes to Databricks that cause this?
Error messages:
[NOT_COLUMN] Argument `col` should be a Column, got Column.
[SESSION_ALREADY_EXIST] Cannot start a remote Spark session because there is a regular Spark session already running.
Does anyone else have problems with jobs?
Best regards
Robin
05-23-2024 07:16 AM
Hi Robin
Do you use Databricks Connect to create the Spark session?
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
05-23-2024 07:17 AM
Hi @RobinK, I’m sorry to hear that you’re experiencing issues with your ETL jobs in Databricks. Let’s try to address the error messages you’re encountering:
1. [NOT_COLUMN] Argument `col` should be a Column, got Column: This error typically occurs when a DataFra... You might want to check your DataFrame operations to ensure that they are receiving the correct input types. For instance, when using the withColumn method, the second argument should be a Column. If you’re trying to add a new column with a literal value, you can use the lit function from pyspark...
2. [SESSION_ALREADY_EXIST] Cannot start a remote Spark session because there is a regular Spark session... You might want to ensure that a Spark session is active on your cluster before you attempt to run yo... If you’re using Databricks Connect, you might need to import DatabricksSession instead of SparkSessi...
3. As for the issue of jobs failing on the job cluster but not on the all-purpose cluster, it could be due to a variety of reasons. It might be helpful to check if there have been any changes in the environment, such as updates to D... You could also consider optimizing your job performance, for instance, by using a compute-optimized ...
If the issue persists, I would recommend reaching out to us for further assistance. I hope this helps! 😊
05-23-2024 07:37 AM
@Kaniz_Fatma: We use exactly your second solution, and we get the same issue:
from databricks.connect import DatabricksSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType
spark = DatabricksSession.builder.getOrCreate()
schema = StructType([StructField('category', StringType(), True), StructField('weight', DoubleType(), True)])
data_source = "abfss://......_index_v01_??????_????????.csv"
df = (spark.read.format("csv")
      .options(**{'header': 'true'})
      .schema(schema)
      .load(data_source))
[SESSION_ALREADY_EXIST] Cannot start a remote Spark session because there is a regular Spark session already running.
File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     45 start = time.perf_counter()
     46 try:
---> 47     res = func(*args, **kwargs)
     48     logger.log_success(
     49         module_name, class_name, function_name, time.perf_counter() - start, signature
     50     )
     51     return res
File /databricks/spark/python/pyspark/sql/readwriter.py:150, in DataFrameReader.schema(self, schema)
    117 """Specifies the input schema.
    118
    119 Some data sources (e.g. JSON) can infer the input schema automatically from data.
   (...)
    146 |-- col1: double (nullable = true)
    147 """
    148 from pyspark.sql import SparkSession
--> 150 spark = SparkSession._getActiveSessionOrCreate()
    151 if isinstance(schema, StructType):
    152     jschema = spark._jsparkSession.parseDataType(schema.json())
File /databricks/spark/python/pyspark/sql/session.py:1265, in SparkSession._getActiveSessionOrCreate(**static_conf)
   1263 for k, v in static_conf.items():
   1264     builder = builder.config(k, v)
-> 1265 spark = builder.getOrCreate()
   1266 return spark
File /databricks/spark/python/pyspark/sql/session.py:521, in SparkSession.Builder.getOrCreate(self)
    519     return RemoteSparkSession.builder.config(map=opts).getOrCreate()
    520 else:
--> 521     raise PySparkRuntimeError(
    522         error_class="SESSION_ALREADY_EXIST",
    523         message_parameters={},
    524     )
    526 session = SparkSession._instantiatedSession
    527 if session is None or session._sc._jsc is None:
05-23-2024 07:39 AM
switching from
05-23-2024 07:46 AM
Yes, I did the same. However, this means we have to change the code whenever we move from the local (VS Code) implementation to Databricks runs (Jobs/Workflows).
@Kaniz_Fatma : Could you check this new issue?
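One way to avoid keeping two versions of the session code might be to pick the session type at runtime. The following is only a minimal sketch, not a confirmed fix: it assumes that the `DATABRICKS_RUNTIME_VERSION` environment variable is set on clusters and absent on a local machine, and the helper names `running_on_databricks` and `get_spark` are hypothetical.

```python
import os


def running_on_databricks(environ=None):
    """Return True when the code appears to run on a Databricks cluster.

    Assumption: clusters set DATABRICKS_RUNTIME_VERSION, local machines do not.
    """
    env = os.environ if environ is None else environ
    return "DATABRICKS_RUNTIME_VERSION" in env


def get_spark(environ=None):
    """Hypothetical helper: reuse the cluster session, or connect remotely."""
    if running_on_databricks(environ):
        # On a job or all-purpose cluster a regular session already exists,
        # so reuse it instead of starting a remote (Connect) session.
        from pyspark.sql import SparkSession

        return SparkSession.builder.getOrCreate()

    # Local development (e.g. VS Code): go through Databricks Connect.
    from databricks.connect import DatabricksSession

    return DatabricksSession.builder.getOrCreate()
```

With a helper like this, notebooks and job code would call `spark = get_spark()` in both environments. The imports are kept inside the function so that neither package has to be importable in the environment where it is not used.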
05-23-2024 07:55 AM
This Notebook can be used to recreate the issue:
import pandas as pd
from databricks.connect import DatabricksSession
from pyspark.sql.functions import current_timestamp
spark = DatabricksSession.builder.getOrCreate()
# Create a pandas DataFrame
data = {
"Name": ["John", "Alice", "Bob"],
"Age": [25, 30, 35],
"City": ["New York", "San Francisco", "Los Angeles"],
}
df = pd.DataFrame(data)
# Convert pandas DataFrame to Spark DataFrame
spark_df = spark.createDataFrame(df)
spark_df = spark_df.withColumn("_loaded_at", current_timestamp())
spark_df.show()
I used Databricks Runtime 14.3 LTS with single user access mode.
06-19-2024 02:21 AM
I think Databricks fixed the issue. It now works with 14.3 LTS and single user access mode.
05-23-2024 09:49 PM
@ha2983 I can confirm that I can recreate the issue with your notebook.
In my case, the error [NOT_COLUMN] Argument `col` should be a Column, got Column. occurs when calling
>>> [NOT_COLUMN_OR_STR] Argument `col` should be a Column or str, got Column.
On a shared cluster the code above works.
@dbruehlmeier We are also using VS Code for local development and create our Spark session like this:
05-24-2024 12:44 AM
Update:
Removing the following code from all of our notebooks fixed the error:
05-24-2024 04:15 AM
We are experiencing the exact same issues, but we do not even create the Spark session explicitly. Are there any other fixes for this?
05-25-2024 01:12 PM - edited 05-25-2024 01:12 PM
Hello,
We are also experiencing the same error message: [NOT_COLUMN] Argument `col` should be a Column, got Column.
This occurs when a workflow is run as a task from another workflow, but not when the same workflow is run on its own (that is, not triggered by another workflow). The problem seems to be connected to the Databricks Runtime: in 14.3 LTS the workflow fails with this error. As a temporary workaround we switched the job clusters to Runtime 13.3 LTS, which seems to be working.
Any update on this bug is highly appreciated as it affects our production environment.
Best regards
Markus
05-29-2024 02:16 AM - edited 05-29-2024 02:21 AM
We just had the exact same issue and it broke all our jobs in production. Any update on this bug would be appreciated. We had failures in Databricks Runtime 15.1 and fixed them by moving all the jobs' clusters to 15.2.
05-29-2024 02:51 PM
I do not believe this is solved, similar to a comment over here:
https://community.databricks.com/t5/data-engineering/databrickssession-broken-for-15-1/td-p/70585
We are also seeing this error in 14.3 LTS from a simple example:
from pyspark.sql.functions import col
df = spark.table('things')
things = df.select(col('thing_id')).collect()
[NOT_COLUMN_OR_STR] Argument `col` should be a Column or str, got Column.
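The wording "should be a Column, got Column" is easier to read once you remember that two different classes can share the same name. The sketch below is plain Python and does not use PySpark at all; it assumes (this is not confirmed anywhere in this thread) that the error comes from a type check against one `Column` class while the argument is an instance of another. All names in it are hypothetical.

```python
def make_column_class():
    """Create a fresh class named "Column" on every call."""

    class Column:
        def __init__(self, name):
            self.name = name

    return Column


# Two unrelated classes that are both called "Column". They stand in for
# (assumed) regular-session and remote-session column types.
RegularColumn = make_column_class()
RemoteColumn = make_column_class()


def select(col):
    """Hypothetical API that validates its argument against one class only."""
    if not isinstance(col, RegularColumn):
        raise TypeError(
            f"[NOT_COLUMN] Argument `col` should be a Column, "
            f"got {type(col).__name__}."
        )
    return col.name


def error_for(col):
    """Return the error message raised by select(), or None on success."""
    try:
        select(col)
    except TypeError as exc:
        return str(exc)
    return None


print(error_for(RegularColumn("thing_id")))
print(error_for(RemoteColumn("thing_id")))
```

The first call passes the check; the second fails with a message that names the same class twice, because only the short class name is printed and not the module it came from.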