Export data from databricks to prem
01-23-2025 06:05 AM
Hello everyone
I need to export some data to SQL Server Management Studio on-premises.
I need to verify that the new data in Databricks is aligned with the older data that we have on-premises.
Is it possible to export the data as an Excel sheet or a .csv file?
Regards,
WYO
01-23-2025 10:53 AM - edited 01-23-2025 11:10 AM
Hi @WYO
If the goal is to compare against on-prem data, I would recommend the efficient approach below. All you need is to establish a connection to the on-prem database using a JDBC configuration like this:
jdbcHostname = "OnPremServerName"
jdbcPort = "portnumber"
jdbcDatabase = "DatabaseName"
dbServerUserName = "serviceaccountuser"
dbServerPassword = dbutils.secrets.get("scope",key="serviceaccountuserpwd")
# Service account credentials string for the on prem sql server
properties = {"user" : dbServerUserName,"password" : dbServerPassword }
# connection string for the on prem sql server
url = "jdbc:sqlserver://{0}:{1};database={2};trustServerCertificate=true;loginTimeout=30;hostNameInCertificate=*.database.windows.net;".format(jdbcHostname,jdbcPort,jdbcDatabase)
# query a table from the on prem sql server
query = "(SELECT * FROM [dbo].[table]) datasource"
# read to a dataframe from the on prem sql server
df_onprem = spark.read.jdbc(url=url, table=query, properties=properties)
# view data
display(df_onprem)

# read the Databricks Delta table to compare against
df_deltatable = spark.sql("select * from catalog.schema.table")
display(df_deltatable)

Step 3 - Compare both DataFrames
# Compare DataFrames for metadata/data mismatches
def compare_dataframes(df1, df2):
    # schema comparison
    if df1.schema != df2.schema:
        print("Schemas are different.")
    else:
        print("Matching schemas")
    # record count check
    if df1.count() == df2.count():
        print("Record counts match")
    else:
        print("Record counts differ")
    # data comparison
    if df1.exceptAll(df2).isEmpty() and df2.exceptAll(df1).isEmpty():
        print("DataFrames are identical.")
    else:
        print("DataFrames are not identical.")
        # Show differences
        print("Rows in df1 but not in df2:")
        df1.exceptAll(df2).show()
        print("Rows in df2 but not in df1:")
        df2.exceptAll(df1).show()

compare_dataframes(df_onprem, df_deltatable)
My response is not getting through with formatted code (attachments for your reference).
Let me know if you need anything else; otherwise, please mark this as a solution.
01-23-2025 09:59 PM
Yes, you can export data from Databricks to an Excel sheet or a .csv file. Use the toPandas() method to convert your Spark DataFrame to a pandas DataFrame, then save it using to_csv() for a .csv file or a library like openpyxl for Excel. Once exported, you can compare the file with your on-premises SQL Server data to verify alignment.
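A minimal sketch of that export path, assuming you already have a Spark DataFrame in a Databricks notebook; the sample pandas frame below stands in for the result of toPandas() so the snippet runs anywhere:

```python
import pandas as pd

# Stand-in for df_spark.toPandas(), where df_spark would be your
# Spark DataFrame (e.g. spark.sql("SELECT * FROM catalog.schema.table"))
pdf = pd.DataFrame({"id": [1, 2], "name": ["alice", "bob"]})

# Write a .csv file; index=False keeps the file aligned with the table layout
pdf.to_csv("/tmp/export.csv", index=False)

# Write an Excel sheet instead (requires the openpyxl package):
# pdf.to_excel("/tmp/export.xlsx", index=False)
```

Note that toPandas() collects the whole DataFrame to the driver, so this only suits tables small enough to fit in driver memory.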
01-24-2025 01:22 AM
You can compare your Databricks data with on-prem SQL Server data in two ways:
First, establish a connection between SQL Server and Databricks using volumes; with volumes you can mount the SQL Server data into Databricks Unity Catalog.
1. Compare the data you have in Databricks against the SQL Server data (the volume data) within Databricks itself.
2. Write the Databricks data into a volume (for SQL Server) and then compare the data on the SQL Server side.
Either way, it is best practice to use volumes instead of JDBC/ODBC, DBFS mounts (deprecated), or manually exporting tables, which are not recommended.
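The write-to-a-volume step in option 2 could be sketched as below. The catalog/schema/volume names are placeholders, and a local directory stands in for the /Volumes path so the snippet runs outside a Databricks workspace:

```python
import os
import pandas as pd

# Hypothetical Unity Catalog volume path; on Databricks this would look like
# "/Volumes/<catalog>/<schema>/<volume>/exports". A local directory stands
# in here so the sketch is runnable anywhere.
volume_dir = "/tmp/demo_volume/exports"
os.makedirs(volume_dir, exist_ok=True)

# Stand-in for the Databricks table you want to hand over for comparison,
# e.g. spark.sql("SELECT * FROM catalog.schema.table").toPandas()
pdf = pd.DataFrame({"id": [1, 2, 3], "value": ["x", "y", "z"]})

# Land the data as a .csv file in the volume for the SQL Server side to pick up
pdf.to_csv(os.path.join(volume_dir, "databricks_data.csv"), index=False)
```

From there, the on-prem side can bulk-load the landed .csv (for example with BULK INSERT) and run the comparison in SQL Server.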
If you found this helpful, please mark it as a solution.
Regards,
Avinash N

