Hi there,
I need some help with this example. We're trying to fit a linear regression model in parallel across thousands of symbols per date. When we run the code below, we get a PicklingError.
Any suggestions would be much appreciated!
K
Error:
PicklingError: Could not serialize object: RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Code:
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create an RDD with your data
data_rdd = spark.sparkContext.parallelize([
    ("symbol1", 1, 2, 3),
    ("symbol2", 4, 5, 6),
    ("symbol3", 7, 8, 9)
])
# Convert the RDD to a DataFrame
data_df = data_rdd.toDF(["Symbol", "Feature1", "Feature2", "Feature3"])
# Define the features column
assembler = VectorAssembler(inputCols=["Feature1", "Feature2", "Feature3"], outputCol="features")
# Fit models on each partition and collect the weights
def fit_model(partition):
    # Create a new linear regression model
    model = LinearRegression(featuresCol="features", labelCol="Symbol")
    # Create an empty list to store the weights
    weights = []
    # Convert the partition iterator to a list
    data_list = list(partition)
    # Convert the list to a DataFrame
    data_partition_df = spark.createDataFrame(data_list, data_df.columns)
    # Perform vector assembly
    data_partition_df = assembler.transform(data_partition_df)
    # Fit the model on the partition data
    fitted_model = model.fit(data_partition_df)
    # Get the model weights
    weights = [fitted_model.coefficients[i] for i in range(len(fitted_model.coefficients))]
    # Yield the weights
    yield weights
# Fit models on each partition and collect the weights
partition_weights = data_df.rdd.mapPartitions(fit_model).collect()
# Create a DataFrame with the collected weights
weights_df = spark.createDataFrame(partition_weights, ["Weight1", "Weight2", "Weight3"])
# Show the weights DataFrame
weights_df.show()
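To make the goal concrete, here is the shape of the per-symbol fit we're after, sketched with plain pandas/numpy so it runs locally. The `Label` column and toy values are made up for illustration, and numpy least squares stands in for pyspark.ml; as I understand it, a function like this (one pandas DataFrame in, one out, no SparkSession inside) is what `df.groupBy("Symbol").applyInPandas(func, schema=...)` expects, which would avoid referencing the SparkContext on the workers.

```python
import numpy as np
import pandas as pd

# Toy data: a few rows per symbol. Column names match the example above,
# plus a hypothetical numeric "Label" column to regress on.
df = pd.DataFrame({
    "Symbol":   ["A", "A", "A", "B", "B", "B"],
    "Feature1": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "Feature2": [2.0, 1.0, 3.0, 1.0, 3.0, 2.0],
    "Label":    [4.0, 5.0, 9.0, 4.0, 9.0, 8.0],
})

def fit_group(group: pd.DataFrame) -> pd.DataFrame:
    # Plain numpy least squares: no SparkSession or SparkContext in here,
    # so the function would be safe to ship to executors.
    X = np.column_stack([group[["Feature1", "Feature2"]].to_numpy(),
                         np.ones(len(group))])
    y = group["Label"].to_numpy()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.DataFrame({"Symbol":    [group["Symbol"].iloc[0]],
                         "Weight1":   [coef[0]],
                         "Weight2":   [coef[1]],
                         "Intercept": [coef[2]]})

# Local stand-in for df.groupBy("Symbol").applyInPandas(fit_group, schema=...)
weights = pd.concat([fit_group(g) for _, g in df.groupby("Symbol")],
                    ignore_index=True)
print(weights)
```

Is something along these lines the right direction, or is there a way to keep the pyspark.ml LinearRegression per group?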