- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-21-2025 06:44 PM
hi , i am loading dataframe data into spark sql table using .saveastable() option.. scema is matching..but column names are diffirent in sql table. is it necessary to maintain the same column names in source and target ? how to handle it in real time, either modifying column names in dataframe or using insert into option in sql..
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-22-2025 12:36 AM
For INSERT INTO … SELECT in Databricks SQL, mapping is by position unless you use the BY NAME clause or an explicit column list; BY NAME matches columns by name (including nested structs) and ignores order.
Ref Doc - https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-dml-insert-into
INSERT INTO dev.spark_db.tbl_instax_sales BY NAME
SELECT
src_col_a AS target_col_a,
src_col_b AS target_col_b,
src_col_c AS target_col_c
FROM sales;
For df.write.saveAsTable append/overwrite, treat it as schema-on-write: ensure the DataFrame’s columns (names and types) align with the target to avoid analysis errors; use append to add and overwrite to replace data in the table.
target_cols = spark.table("dev.spark_db.tbl_instax_sales").columns
sales_df_aligned = sales_df_cleaned.selectExpr(
"src_col_a as target_col_a",
"src_col_b as target_col_b",
"src_col_c as target_col_c"
)
sales_df_aligned.write.mode("append").saveAsTable("dev.spark_db.tbl_instax_sales")
sales_df_aligned.write.mode("overwrite").saveAsTable("dev.spark_db.tbl_instax_sales")
There is no schema evolution clause for INSERT INTO; if you need schema evolution, use DataFrame write options or MERGE with schema evolution instead.
Ref Doc - https://docs.databricks.com/aws/en/delta/update-schema
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-22-2025 02:37 AM
so which one do we need to prefer for large datasets , either renaming the column names in dataframe and loading data to spark table using .saveAstable() , or
creating temp view for dataframe , and loading view data into spark SQL table ..
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-22-2025 03:02 AM
If your pipeline is mostly PySpark/Scala, rename columns in the DataFrame to match the target and use df.write.saveAsTable. If your pipeline is mostly SQL (e.g., on SQL Warehouses), use INSERT … BY NAME from a temp view (or table).
Performance is broadly similar for both paths on large datasets. But it is just that the INSERT doesn’t handle schema evolution; for adding new columns, with pyspark way you get that benefit.