Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Loading data from a DataFrame into a Spark SQL table using .saveAsTable() is not working

Neeraj_432
Visitor

Hi, I am loading DataFrame data into a Spark SQL table using the .saveAsTable() option. The schema matches, but the column names in the SQL table are different. Is it necessary to keep the same column names in the source and the target? How is this handled in practice: by renaming the columns in the DataFrame, or by using an INSERT INTO statement in SQL?

 

sales_df_cleaned.createOrReplaceTempView("sales")
spark.sql("select count(*) from sales").show()
spark.sql("""
    INSERT INTO dev.spark_db.tbl_instax_sales
    SELECT * FROM sales
""")
 
or, alternatively, renaming the columns in the DataFrame and loading them into the table:
sales_df_cleaned.write.mode("overwrite").saveAsTable("dev.spark_db.tbl_instax_sales")
1 ACCEPTED SOLUTION


iyashk-DB
Databricks Employee

For INSERT INTO ... SELECT in Databricks SQL, mapping is by position unless you use the BY NAME clause or an explicit column list; BY NAME matches columns by name (including nested structs) and ignores order.
Ref Doc - https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-dml-insert-into

INSERT INTO dev.spark_db.tbl_instax_sales BY NAME
SELECT
  src_col_a AS target_col_a,
  src_col_b AS target_col_b,
  src_col_c AS target_col_c
FROM sales;
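
The explicit column list mentioned above is the other way to avoid relying on table order; a minimal sketch wrapped in spark.sql() and reusing the same hypothetical column names (the SELECT output is mapped positionally onto the listed target columns):

spark.sql("""
    INSERT INTO dev.spark_db.tbl_instax_sales (target_col_a, target_col_b, target_col_c)
    SELECT src_col_a, src_col_b, src_col_c
    FROM sales
""")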

For df.write.saveAsTable append/overwrite, treat it as schema-on-write: ensure the DataFrame's columns (names and types) align with the target to avoid analysis errors; use append to add and overwrite to replace data in the table.

# Look up the target table's column names to know what to rename to
target_cols = spark.table("dev.spark_db.tbl_instax_sales").columns

# Rename the source columns to the target names
sales_df_aligned = sales_df_cleaned.selectExpr(
    "src_col_a as target_col_a",
    "src_col_b as target_col_b",
    "src_col_c as target_col_c"
)

# Append adds new rows to the existing table
sales_df_aligned.write.mode("append").saveAsTable("dev.spark_db.tbl_instax_sales")

# Overwrite replaces the table's existing data
sales_df_aligned.write.mode("overwrite").saveAsTable("dev.spark_db.tbl_instax_sales")

There is no schema evolution clause for INSERT INTO; if you need schema evolution, use DataFrame write options or MERGE with schema evolution instead.
Ref Doc - https://docs.databricks.com/aws/en/delta/update-schema
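
For example, a minimal sketch of the DataFrame write option (assuming the target is a Delta table and sales_df_aligned now carries an extra column that should be added to it):

# mergeSchema lets the append add columns that do not yet exist in the Delta table
(sales_df_aligned.write
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("dev.spark_db.tbl_instax_sales"))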


3 REPLIES


So which one should we prefer for large datasets: renaming the column names in the DataFrame and loading the data into the Spark table using .saveAsTable(), or creating a temp view over the DataFrame and loading the view's data into the Spark SQL table?

iyashk-DB
Databricks Employee
Databricks Employee

If your pipeline is mostly PySpark/Scala, rename the columns in the DataFrame to match the target and use df.write.saveAsTable. If your pipeline is mostly SQL (e.g., on SQL Warehouses), use INSERT ... BY NAME from a temp view (or table).
Performance is broadly similar for both paths on large datasets. The difference is that INSERT does not handle schema evolution; if you need to add new columns, the PySpark write path gives you that benefit.
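
For the PySpark path on a wide table, a minimal sketch of the rename step (assuming the DataFrame's columns are already in the same positional order as the target, since toDF renames purely by position):

# Rename the DataFrame's columns positionally to the target table's names
target_cols = spark.table("dev.spark_db.tbl_instax_sales").columns
sales_df_aligned = sales_df_cleaned.toDF(*target_cols)

sales_df_aligned.write.mode("append").saveAsTable("dev.spark_db.tbl_instax_sales")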