Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Databricks Notebook dataframe loading duplicate data in SQL table

Priya_Mani
New Contributor II

Hi, I am trying to load data from a data lake into a SQL table using a "SourceDataFrame.write" operation in an Apache Spark notebook.

This intermittently loads duplicate rows. The logs don't give much information, and I am not sure what else to look for. How can I investigate and find the root cause of this? Please let me know what more information I can provide for anyone to help.

Thanks!

3 REPLIES

-werners-
Esteemed Contributor III

Can you elaborate a bit more on this notebook?

Also, which Databricks Runtime version are you using?

Hi @Werner Stinckens​, this is an Apache Spark notebook that reads the contents of a file stored in Azure Blob Storage and loads it into an on-premises SQL table.

The Databricks Runtime is 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12) with Standard_DS3_v2 worker and driver node types.

The notebook reads the file content using the code below:

val SourceDataFrame = spark
  .read
  .option("header", "false")
  .option("delimiter", "|")
  .schema(SourceSchemaStruct)
  .csv(SourceFilename)
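One quick check worth doing at this point (a hypothetical sketch, not from the original notebook, assuming the `SourceDataFrame` read above) is to compare total vs. distinct row counts right after the read. If the source file itself contains repeated rows, the target table will too, regardless of how the write behaves:

```scala
// Hypothetical sanity check on the freshly read DataFrame:
// if duplicates already exist in the source file, the problem is
// upstream of the JDBC write.
val totalRows    = SourceDataFrame.count()
val distinctRows = SourceDataFrame.distinct().count()
println(s"rows: $totalRows, distinct: $distinctRows, duplicates: ${totalRows - distinctRows}")
```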

Then it writes the DataFrame to the target table in overwrite mode:

SourceDataFrame2
      .write
      .format("jdbc")
      .mode("overwrite")
      .option("driver", driverClass)
      .option("url", jdbcUrl)
      .option("dbtable", TargetTable)
      .option("user", jdbcUsername)
      .option("password", jdbcPassword)
      .save()
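For what it's worth, one common cause of intermittent duplicates with this pattern is that Spark's JDBC sink is not transactional across tasks: if a write stage partially fails and is retried, the retry can write rows on top of what the failed attempt already committed. A hedged sketch of two mitigations (reusing the same connection variables as the snippet above; `dropDuplicates()` and the `truncate` option are suggestions, not from the original notebook):

```scala
// Sketch of a more defensive version of the write above.
//  1. dropDuplicates() removes exact duplicate rows before loading,
//     in case the source file repeats rows.
//  2. The JDBC "truncate" option makes overwrite issue TRUNCATE TABLE
//     instead of DROP/CREATE, which keeps the table definition stable
//     and makes a reload after a failed attempt start from empty.
SourceDataFrame2
  .dropDuplicates()
  .write
  .format("jdbc")
  .mode("overwrite")
  .option("truncate", "true")
  .option("driver", driverClass)
  .option("url", jdbcUrl)
  .option("dbtable", TargetTable)
  .option("user", jdbcUsername)
  .option("password", jdbcPassword)
  .save()
```

It may also be worth checking whether the job (or the orchestrator that triggers the notebook) retries on failure, since a retried run of a non-idempotent load is another way duplicates appear "at random times".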

-werners-
Esteemed Contributor III
