Updating tables from SQL Server to Databricks
07-20-2023 03:24 PM
Hi,
I have SQL Server tables that are the primary location where all live transactions happen. Currently I read them through PySpark as DataFrames and overwrite them every day to keep the latest copy in Databricks. The problem is that it takes a long time to copy all the tables from SQL Server to Databricks, and the tables don't have a primary key like an id, although some have timestamp columns. The question is: how can I read only the changes and write them to my Databricks tables efficiently, without rewriting the whole table again and again?
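
For context, the daily full refresh described above probably looks roughly like the sketch below. The JDBC URL, credentials, and the table names `dbo.transactions` / `bronze.transactions` are placeholders, not the actual names from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details for SQL Server.
jdbc_url = "jdbc:sqlserver://<host>:1433;databaseName=<db>"
connection_props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# Read the entire SQL Server table ...
df = spark.read.jdbc(url=jdbc_url, table="dbo.transactions",
                     properties=connection_props)

# ... and overwrite the Databricks copy every day.
df.write.mode("overwrite").saveAsTable("bronze.transactions")
```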
07-22-2023 04:46 PM
Do you have an audit column in the SQL Server tables that can be used for incremental loading?
Is the SQL Server table a full load each time?
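
If such an audit column exists, the filter can be pushed down to SQL Server so only new or changed rows cross the wire. A hedged sketch, assuming a `last_modified` audit column and the same placeholder names as above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://<host>:1433;databaseName=<db>"  # placeholder

# Highest watermark already present in the Databricks copy.
last_loaded = spark.sql(
    "SELECT MAX(last_modified) AS wm FROM bronze.transactions"
).collect()[0]["wm"]

# Wrap a filtered query as the JDBC "dbtable" so the predicate runs on SQL Server.
incremental_query = (
    f"(SELECT * FROM dbo.transactions "
    f"WHERE last_modified > '{last_loaded}') AS src"
)

changes_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", incremental_query)
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)
```

Note that this watermark approach only works if the audit column is reliably increasing for new or updated rows.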
07-22-2023 09:09 PM - edited 07-22-2023 09:10 PM
For the first question, some tables have pkid or timestamp columns, but some don't.
For the second question, some tables only get appended to, while others are truncated and recreated (on SQL Server); on Databricks we currently overwrite all of them every day.
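
Given that mix, one hedged sketch of how the landed changes could be applied: tables with a `pkid` can be upserted with a Delta MERGE, append-only tables can simply be appended, and tables with neither a key nor a timestamp can keep the daily overwrite. This continues the previous sketch and reuses `changes_df` from it; all table and column names are assumptions.

```python
from delta.tables import DeltaTable

# Tables with a primary key (pkid): upsert only the changed rows.
target = DeltaTable.forName(spark, "bronze.transactions")
(
    target.alias("t")
    .merge(changes_df.alias("s"), "t.pkid = s.pkid")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Append-only tables: append the new rows instead of rewriting everything.
changes_df.write.mode("append").saveAsTable("bronze.append_only_table")

# Tables with no key and no timestamp: the daily full overwrite remains the fallback.
```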
07-22-2023 09:37 PM
Hi @YS1
Hope you are well. Just wanted to see if you were able to find an answer to your question, and if so, would you mind marking it as best? It would be really helpful for the other members too.
Cheers!

