I'm using Databricks Runtime 13.3. I have a function that I call via applyInPandas, and I need to inspect the attributes of the df DataFrame that this function receives. My sample code looks like:
def train_model(df):
    # Copy the input pandas DataFrame
    train = df.copy()
    # Use 'age' to create a new column, for example: age groups
    train['age_group'] = train['age'].apply(lambda x: 'child' if x < 18 else 'adult' if x < 60 else 'senior')
    # Drop the original 'age' column
    train = train.drop(columns=['age'])
    return train
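To see the attributes of df inside the function, one option is to print them from within train_model itself. A minimal sketch (with made-up sample data, since I can't share the real code) — note that on Databricks, print output from inside applyInPandas goes to the executors' stdout logs (visible via the Spark UI), not the notebook cell output:

```python
import pandas as pd

def train_model(df):
    # Diagnostic prints: these show the attributes of the pandas
    # DataFrame that applyInPandas hands to the function
    print("columns:", list(df.columns))
    print("dtypes:", df.dtypes.to_dict())
    print("shape:", df.shape)
    train = df.copy()
    train['age_group'] = train['age'].apply(
        lambda x: 'child' if x < 18 else 'adult' if x < 60 else 'senior')
    return train.drop(columns=['age'])

# Hypothetical sample input to exercise the function locally
sample = pd.DataFrame({'id': [1, 2, 3], 'age': [10, 30, 70]})
out = train_model(sample)
```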
My applyInPandas call looks like:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the output schema after transformation
output_schema = StructType([
    StructField("id", IntegerType()),
    StructField("age_group", StringType()),
    # Add other fields present in the input df if needed
])

# applyInPandas takes a plain Python function that receives each group
# as a pandas DataFrame, so no pandas_udf decorator is needed here
def apply_train_model(df):
    return train_model(df)

# Then apply it per group on a Spark DataFrame
result = spark_df.groupby("some_grouping_column").applyInPandas(apply_train_model, schema=output_schema)
Kindly note that this is sample code taken from the internet, not the actual code: I don't have Databricks locally, I use Databricks on my client's system, and I'm not able to share client code.
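Since I can't run Spark locally, one way I could debug the per-group logic without Databricks is to emulate applyInPandas with a plain pandas groupby().apply(), which also calls the function once per group with a pandas DataFrame. A sketch under that assumption, with hypothetical data:

```python
import pandas as pd

def train_model(df):
    train = df.copy()
    train['age_group'] = train['age'].apply(
        lambda x: 'child' if x < 18 else 'adult' if x < 60 else 'senior')
    return train.drop(columns=['age'])

# Hypothetical stand-in for spark_df, with a grouping column
pdf = pd.DataFrame({
    'group': ['a', 'a', 'b'],
    'id': [1, 2, 3],
    'age': [5, 40, 65],
})

# pandas groupby().apply() mirrors applyInPandas' per-group semantics:
# train_model receives one pandas DataFrame per group value
result = (pdf.groupby('group', group_keys=False)
             .apply(train_model)
             .reset_index(drop=True))
```

This only checks the pandas-level logic; the schema enforcement that applyInPandas performs against output_schema still has to be verified on the cluster.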