03-08-2015 04:13 PM
If you see this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: ...
The above error can be triggered when you initialize a variable on the driver (master), but then try to use it on one of the workers. In that case, Spark will try to serialize the object to send it over to the worker, and fail if the object is not serializable. Consider the following code snippet:
NotSerializable notSerializable = new NotSerializable();
JavaRDD<String> rdd = sc.textFile("/tmp/myfile");
rdd.map(s -> notSerializable.doSomething(s)).collect();
This will trigger that error. Here are some ideas to fix it:
- Make the class serializable.
- Declare the instance only within the lambda function passed in map.
- Make the NotSerializable object a static and create it once per machine.
- Call rdd.foreachPartition and create the NotSerializable object there, like this:
rdd.foreachPartition(iter -> {
  NotSerializable notSerializable = new NotSerializable();
  // ...Now process iter
});
10-31-2015 10:45 AM
I cannot make the class serializable, and I don't want to create the instance in the lambda function again and again. So,
1. How can I make the NotSerializable object static and create it once per machine?
2. If I call rdd.foreachPartition, how can I get return values?
05-22-2015 03:55 PM
Also note that in Databricks Cloud, variables in a cell may be broadcast so they can be accessed from the worker nodes. If you never need to use that variable in a transformation on a worker node, another fix is to declare the variable @transient in a Scala notebook:
@transient val myNonSerializableObjectThatIDoNotUseInATransformation = ....
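For example, a minimal sketch (the NonSerializableClient class and its lookup method are illustrative placeholders, not from the original post): the annotation keeps the value out of serialization, which is safe as long as the value is only ever used on the driver.

// Hypothetical non-serializable helper, used only on the driver.
class NonSerializableClient {
  def lookup(key: String): String = key.toUpperCase
}

// @transient keeps this field out of serialization: even if a later
// transformation causes the surrounding notebook wrapper to be serialized,
// this value is skipped instead of triggering NotSerializableException.
@transient val client = new NonSerializableClient()

// Fine: runs on the driver only.
println(client.lookup("spark"))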
11-02-2015 11:14 AM
Hi,
You can use the Singleton Pattern to create an object once per machine. This is explained very well on Wikipedia:
https://en.wikipedia.org/wiki/Singleton_pattern
If you want to return values, you can use the mapPartitions transformation instead of the foreachPartition action.
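For illustration, here is a minimal Scala sketch of both suggestions (the NotSerializable and NotSerializableHolder names are placeholders, and the singleton is assumed to be defined at top level, e.g. in an attached JAR, rather than inside a notebook cell wrapper): the holder object is initialized lazily once per JVM, i.e. once per executor, and mapPartitions lets each partition return values instead of only producing side effects.

// Hypothetical non-serializable class, for illustration only.
class NotSerializable {
  def doSomething(s: String): String = s.reverse
}

// A Scala object is initialized lazily, once per JVM, so each executor
// builds its own instance the first time one of its tasks touches it.
object NotSerializableHolder {
  lazy val instance = new NotSerializable()
}

val rdd = sc.textFile("/tmp/myfile")

// mapPartitions is a transformation: it returns a new RDD whose results
// can be collected back to the driver.
val results = rdd.mapPartitions { iter =>
  val helper = NotSerializableHolder.instance // resolved on the executor, never serialized
  iter.map(helper.doSomething)
}.collect()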
11-03-2015 02:19 PM
Said differently/functionally: mapPartitions() returns a value and does not have side effects; foreachPartition() does not return a value, but (typically) does have side effects.
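A compact sketch of that distinction (the numbers and partition count here are arbitrary, chosen only for the example):

val nums = sc.parallelize(1 to 100, 4)

// mapPartitions: a transformation that returns a value (one sum per partition).
val partitionSums: Array[Int] = nums.mapPartitions(iter => Iterator(iter.sum)).collect()

// foreachPartition: an action that returns Unit and exists only for its
// side effects, e.g. logging or writing each partition to an external system.
nums.foreachPartition(iter => println(s"partition with ${iter.size} elements"))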
05-10-2019 12:35 PM
@cfregly @Vida Ha
I'm having trouble with the same "task not serializable" error when calling foreachPartition.
My code looks like:
import java.sql.DriverManager
import org.apache.spark.sql.Row

myDF
  .foreachPartition { (rddpartition: Iterator[Row]) =>
    val url = "jdbc:sqlserver://<myurl>"
    val un = dbutils.secrets.get("my-secret-scope", "my-username")
    val pw = dbutils.secrets.get("my-secret-scope", "my-password")
    val connection = DriverManager.getConnection(url, un, pw)
    val statement = connection.createStatement()
    rddpartition.foreach { (row: Row) =>
      // the s"" interpolator is required for ${row.get(...)} to be substituted
      statement.addBatch(s"INSERT INTO dbo.Table(Field1, Field2) VALUES (${row.get(0)}, ${row.get(1)})")
    }
    statement.executeBatch()
    connection.close()
  }
The above code results in the error "org.apache.spark.SparkException: Task not serializable"
When I modify the code to use the username and password as strings instead as shown below, it works just fine. Are you aware of a way to get around the serializability of dbutils.secrets.get?
DriverManager.getConnection(url, "<username string>", "<password string>")
09-29-2021 01:15 PM
@Nick Studenski , Can you try declaring the un and pw variables outside the scope of foreachPartition? Do it before, so that you are just passing plain string variables into that function rather than the dbutils object.
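A minimal sketch of that suggestion (reusing the placeholder URL, scope, and key names from the original snippet): the secrets are resolved into plain strings on the driver, and only those strings are captured by the closure, so the non-serializable dbutils object never has to be shipped to the workers.

import java.sql.DriverManager
import org.apache.spark.sql.Row

// Resolve the secrets on the driver; un and pw are plain (serializable) Strings.
val url = "jdbc:sqlserver://<myurl>"
val un = dbutils.secrets.get("my-secret-scope", "my-username")
val pw = dbutils.secrets.get("my-secret-scope", "my-password")

myDF.foreachPartition { (rddpartition: Iterator[Row]) =>
  // Only url, un, and pw are captured here, not dbutils itself.
  val connection = DriverManager.getConnection(url, un, pw)
  val statement = connection.createStatement()
  rddpartition.foreach { (row: Row) =>
    statement.addBatch(s"INSERT INTO dbo.Table(Field1, Field2) VALUES (${row.get(0)}, ${row.get(1)})")
  }
  statement.executeBatch()
  connection.close()
}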
06-01-2022 09:03 PM
Hi @Nick Studenski , could you share how you solved your problem?