08-21-2025 01:44 AM
I have found a workaround for this issue. Basically, I create a dummy DataFrame and then check whether the DataFrame I want to validate has the same type as that dummy DataFrame.
from pyspark.sql import DataFrame, SparkSession


def get_dummy_df() -> DataFrame:
    """
    Generates a dummy DataFrame with a range of integers.

    This function creates a DataFrame containing integers starting from 0 up to
    (but not including) 2, using the current Spark session.

    Returns:
        DataFrame: A Spark DataFrame containing a single column with the values [0, 1].
    """
    spark_session = SparkSession.builder.appName("dummy_df").getOrCreate()
    return spark_session.range(0, 2)
def is_spark_df(df_to_check: DataFrame) -> bool:
    """
    Checks if the provided object is a Spark DataFrame.

    This function compares the type of the provided object with a dummy DataFrame created
    using the `get_dummy_df()` function. This is necessary because in Databricks, depending
    on the cluster configuration, the DataFrame type can vary. If you import
    `pyspark.sql.dataframe`, your type check may fail because Databricks can provide
    `pyspark.sql.connect.dataframe`.

    Parameters:
        df_to_check (DataFrame): The object to check.

    Returns:
        bool: True if the object is a Spark DataFrame, False otherwise.

    For more information on this issue, please see:
    https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/td-p/71055
    """
    return type(df_to_check) == type(get_dummy_df())
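For example, the check can be used like this (a minimal sketch that assumes the two functions above are already defined in the notebook; `my_df` is only an illustrative name, not from the original code):

# Hypothetical usage of is_spark_df; my_df is an illustrative name.
my_df = SparkSession.builder.getOrCreate().range(0, 10)

print(is_spark_df(my_df))   # True on both classic and Spark Connect clusters
print(is_spark_df([0, 1]))  # False: a plain Python list is not a DataFrame

Because the dummy DataFrame is built from the same active session as the object being checked, both resolve to the same class (either pyspark.sql.dataframe.DataFrame or pyspark.sql.connect.dataframe.DataFrame), so the comparison holds regardless of the cluster configuration.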
Regards,
Gleydson C.