Databricks Notebook DataFrame loading duplicate data in SQL table
10-21-2022 06:09 AM
Hi, I am trying to load data from a data lake into a SQL table using a "SourceDataFrame.write" operation in an Apache Spark notebook.
This seems to load duplicates at random times. The logs don't give much information, and I am not sure what else to look for. How can I investigate and find the root cause of this? Please let me know what more information I can provide for anyone to help.
Thanks!
10-24-2022 05:22 AM
Can you elaborate a bit more on this notebook?
And also, what Databricks Runtime version are you on?
10-26-2022 03:13 AM
Hi @Werner Stinckens, this is an Apache Spark notebook which reads the contents of a file stored in Azure Blob Storage and loads it into an on-prem SQL table.
The Databricks Runtime is 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12) with a Standard_DS3_v2 worker/driver type.
The notebook reads the file content using the code below:
// Read the pipe-delimited source file with an explicit schema and no header row
val SourceDataFrame = spark
  .read
  .option("header", "false")
  .option("delimiter", "|")
  .schema(SourceSchemaStruct)
  .csv(SourceFilename)
Then it writes the DataFrame into the target table in overwrite mode:
// Overwrite the target SQL table over JDBC
SourceDataFrame2
  .write
  .format("jdbc")
  .mode("overwrite")
  .option("driver", driverClass)
  .option("url", jdbcUrl)
  .option("dbtable", TargetTable)
  .option("user", jdbcUsername)
  .option("password", jdbcPassword)
  .save()
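A quick way to check whether the duplicates are already present in the source file is to count rows per key before the write (a sketch; "id" here is a placeholder for whatever key column(s) SourceSchemaStruct actually defines):
// Diagnostic sketch: count rows per key in the source DataFrame
// ("id" is a placeholder for your real key column(s))
SourceDataFrame
  .groupBy("id")
  .count()
  .filter("count > 1")
  .show()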
11-03-2022 02:21 AM
Can you add the truncate option?
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
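With overwrite mode, the truncate option tells Spark to truncate the existing table instead of dropping and recreating it. A minimal sketch based on your write snippet above:
// Sketch of the write above with truncate enabled: on overwrite, Spark
// truncates the existing table rather than dropping and recreating it
SourceDataFrame2
  .write
  .format("jdbc")
  .mode("overwrite")
  .option("truncate", "true")
  .option("driver", driverClass)
  .option("url", jdbcUrl)
  .option("dbtable", TargetTable)
  .option("user", jdbcUsername)
  .option("password", jdbcPassword)
  .save()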

