Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Prakash Hinduja ~ How do I create an empty DataFrame in Databricks - are there multiple ways?

prakashhinduja2
New Contributor

Hello, I'm Prakash Hinduja, an Indian-born financial advisor and consultant based in Geneva, Switzerland (Swiss). My career is focused on guiding high-net-worth individuals and business leaders through the intricate world of global investment and wealth management. Leveraging my strong background in international finance, I craft bespoke strategies that have led clients to affectionately call me the Prakash Hinduja net worth booster.

I'm trying to create an empty DataFrame in Databricks and was wondering if there are multiple ways to do it, especially with or without a predefined schema. What approaches have worked best for you? Appreciate any tips!

Regards

Prakash Hinduja, Geneva, Switzerland (Swiss)

 

2 REPLIES

ilir_nuredini
Honored Contributor

Hello,

There are a few ways to define an empty Spark DataFrame; here are some of them:

1. Create an empty DataFrame with a schema

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])

empty_df = spark.createDataFrame([], schema)

2. Create an empty DataFrame without specifying any columns

empty_df_without_cols = spark.createDataFrame([], StructType([]))

3. Create an empty RDD, then convert it to a DataFrame (just FYI, this option won't work in the Free Edition because of the serverless compute)

schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])

emptyRDD = spark.sparkContext.emptyRDD()
empty_df1 = emptyRDD.toDF(schema)

Hope that helps.

Best, Ilir

ManojkMohan
Contributor III

Best Practices from Experience:

  • Use a predefined schema if you know your column types upfront; it prevents errors when appending new data.
  • For ad-hoc exploration, createDataFrame([], StructType([])) works fine (toDF needs at least one row of data for type inference).
  • Always check printSchema(); it helps avoid silent type issues later in transformations (see the snippet below).
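A minimal sketch of that check (assuming a SparkSession named spark, as Databricks notebooks provide):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])

empty_df = spark.createDataFrame([], schema)

# Verify the schema before any transformations run against it
empty_df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)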

Possible Scenarios:

1. Without a predefined schema (completely empty)
Pros:
Quick and simple.
Useful for placeholder DataFrames.
Cons:
Columns and types aren't defined, so adding data later can be cumbersome.

2. With a predefined schema
This is the more common and safer approach, especially if you plan to append data later.
Pros:
Ensures consistent column types.
Easy to append rows later using unionByName (see the sketch below).
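A minimal sketch of that append pattern (assuming a SparkSession named spark; the example row is made up):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])

empty_df = spark.createDataFrame([], schema)
new_rows = spark.createDataFrame([('Alice', 30)], schema)

# unionByName matches columns by name rather than position,
# so it stays correct even if column order differs
combined_df = empty_df.unionByName(new_rows)
combined_df.show()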

3. Using spark.createDataFrame with an empty RDD
This is essentially the same as the above, but it is sometimes preferred in pure Spark setups.
Pros:
Works well in Spark-heavy pipelines.

4. Using toDF on an empty RDD
If you only want to define column names, note that Spark cannot infer types from an empty RDD, so in practice every column is declared as StringType (see the sketch below)
Pros:
Lightweight if you don't care about strict types.
Cons:
All columns end up as StringType, so type conversions may be needed later.
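Since type inference fails on truly empty data, a DDL schema string with every column declared as STRING is the closest working equivalent. A minimal sketch (assuming a SparkSession named spark; the column names are made up):

# Names only, all columns as strings, no StructType boilerplate
empty_strings_df = spark.createDataFrame([], "name STRING, age STRING")
empty_strings_df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: string (nullable = true)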
