
How do I display output from the applyInPandas function?

DbricksLearner1
New Contributor

I'm using Databricks Runtime 13.3. I have a function that I'm calling via applyInPandas, and I need to see the attributes of the df DataFrame used inside that function. My sample code looks like:

```python
def train_model(df):
    # Copy input DataFrame
    train = df.copy()

    # Use 'age' to create a new column, for example: age groups
    train['age_group'] = train['age'].apply(lambda x: 'child' if x < 18 else 'adult' if x < 60 else 'senior')

    # Drop the original 'age' column
    train = train.drop(columns=['age'])

    return train
```


My applyInPandas call looks like:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the output schema after transformation
output_schema = StructType([
    StructField("id", IntegerType()),
    StructField("age_group", StringType()),
    # Add other fields present in the input df if needed
])

# applyInPandas expects a plain Python function that maps a pandas DataFrame
# to a pandas DataFrame, not a @pandas_udf-decorated function
def apply_train_model(df):
    return train_model(df)

# Then apply it on a Spark DataFrame, grouped by a key column
result = spark_df.groupby("some_grouping_column").applyInPandas(apply_train_model, schema=output_schema)
```


Kindly note that this is sample code taken from the internet, not the actual code: I don't have Databricks locally, I'm using Databricks on my client's system, and I can't share client code.

1 REPLY

BigRoux
Databricks Employee

Here are some ideas/approaches to consider:


To inspect the attributes of the DataFrame passed to a function used with applyInPandas on a Databricks Runtime 13.3 cluster, you can use a few debugging techniques to explore its structure and content. Here are some suggested steps:
1. Check Input DataFrame Attributes: Before performing transformations, use standard Pandas functionality to inspect the DataFrame being passed to your function. Add debugging code inside the function to print relevant details, such as its columns, data types, and the first few rows. For example:

```python
def train_model(df):
    # Print attributes for debugging
    print("Columns:", df.columns)
    print("Data types:\n", df.dtypes)
    print("First few rows:\n", df.head())

    # Your existing transformations
    train = df.copy()
    train['age_group'] = train['age'].apply(lambda x: 'child' if x < 18 else 'adult' if x < 60 else 'senior')
    train = train.drop(columns=['age'])
    return train
```
2. Inspect Attributes Using Pandas UDF Logs: If you are running the function on a cluster, you can write logs from within it that describe the DataFrame's attributes and collect them for inspection. Use the logging module or simple print statements; because the function executes on the workers, the output typically appears in the executor logs (viewable from the cluster's Spark UI) rather than in the notebook cell output. Example:

```python
import logging

logging.basicConfig(level=logging.INFO)

def train_model(df):
    # df.info() prints its summary directly and returns None, so call it on its own
    df.info()
    logging.info(f"First few rows:\n{df.head()}")
    ...
```
3. Enable Debug Logging for Spark Execution: Use the cluster's Spark logging features to track the execution of your applyInPandas function. You might need to enable additional logging on your cluster or workspace; this can help debug issues related to the structure of the DataFrame (see the first sketch after this list).
4. Adapt the UDF for Databricks Runtime 13.3: Confirm compatibility with your runtime, since support for Python scalar UDFs and Pandas UDFs varies by runtime version and cluster access mode. For grouped data, make sure that group keys and input data are carefully structured to avoid runtime errors.
5. Validate Column Attributes Before Grouping: In grouped execution, inaccuracies in column attributes can cause errors. As part of preprocessing, verify that the grouping column exists, and confirm that its data type matches what your grouped operation expects (see the second sketch after this list).
6. Troubleshoot Schema Mismatches: Verify that the schema defined in your output_schema exactly matches the structure of the DataFrame returned by your function. If additional columns are expected, update your schema definition accordingly. For example:

```python
output_schema = StructType([
    StructField("id", IntegerType()),
    StructField("age_group", StringType()),
    # Include other fields expected in the output DataFrame
])
```
7. Interactive Debugging: For iterative development, manually apply the transformations to a sample Pandas DataFrame to confirm correctness before deploying the function. Load a small representative sample of your Spark DataFrame using .toPandas() and test your transformations locally (see the third sketch after this list).
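
Following up on step 3, here is a minimal sketch for raising Spark's log verbosity from a notebook; setLogLevel is a standard Spark API, and spark is the session object Databricks predefines in notebooks:

```python
# Increase Spark log verbosity for the current session.
# Valid levels include "ALL", "DEBUG", "INFO", "WARN", and "ERROR".
spark.sparkContext.setLogLevel("DEBUG")
```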
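
For step 5, a small pre-flight check along these lines can catch a missing or mistyped grouping column before applyInPandas runs (some_grouping_column is the placeholder name from the sample above):

```python
group_key = "some_grouping_column"  # placeholder grouping column from the sample

# Fail fast if the grouping column is absent
assert group_key in spark_df.columns, f"Column '{group_key}' not found in spark_df"

# Confirm the column's Spark data type matches what the grouped operation expects
print("Grouping column type:", dict(spark_df.dtypes)[group_key])
```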
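
And for step 7, one way to sketch the local iteration loop: pull a small sample to the driver, run the plain-Pandas train_model directly so its output shows up in the notebook, and only wire it into applyInPandas once it behaves as expected. This assumes the sampled rows contain the columns the function needs (here, age):

```python
# Pull a small, driver-sized sample of the Spark DataFrame into Pandas
sample_pdf = spark_df.limit(100).toPandas()

# Run the plain-Pandas function directly; print()/head() output appears in the notebook
transformed = train_model(sample_pdf)
print(transformed.head())
print(transformed.dtypes)
```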
Since you mentioned you are working from sample code, these debugging strategies should help you validate your data transformations and inspect your DataFrame's attributes effectively.
 
Cheers, Lou.
