Databricks Notebook dataframe loading duplicate data in SQL table

Priya_Mani — Fri, 21 Oct 2022 13:09:45 GMT

Hi, I am trying to load data from a data lake into a SQL table using the "SourceDataFrame.write" operation in an Apache Spark notebook.

The load seems to produce duplicate rows at random times. The logs don't give much information, and I am not sure what else to look for. How can I investigate and find the root cause of this? Please let me know what additional information I can provide to help.

Thanks!

Re: Databricks Notebook dataframe loading duplicate data in SQL table

-werners- — Mon, 24 Oct 2022 12:22:00 GMT

Can you elaborate a bit more on this notebook?

Also, which Databricks runtime version are you using?

Re: Databricks Notebook dataframe loading duplicate data in SQL table

Priya_Mani — Wed, 26 Oct 2022 10:13:41 GMT

Hi @Werner Stinckens, this is an Apache Spark notebook that reads the contents of a file stored in Azure Blob storage and loads them into an on-prem SQL table.

The Databricks Runtime is 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12), with Standard_DS3_v2 as the worker and driver type.

The notebook reads the file content using the code below:

val SourceDataFrame = spark
      .read
      .option("header", "false")
      .option("delimiter", "|")
      .schema(SourceSchemaStruct)
      .csv(SourceFilename)

Then it writes the DataFrame to the target table in overwrite mode:

SourceDataFrame2
      .write
      .format("jdbc")
      .mode("overwrite")
      .option("driver", driverClass)
      .option("url", jdbcUrl)
      .option("dbtable", TargetTable)
      .option("user", jdbcUsername)
      .option("password", jdbcPassword)
      .save()
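
One way to narrow this down, as a rough sketch: check whether the DataFrame already contains duplicate rows before the write, so that duplicates coming from the source file can be told apart from duplicates introduced during the JDBC write. The helper name `duplicateRows` is hypothetical and not part of the original notebook. It assumes simple column names (no dots) and that no column is itself named `count`.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper (not from the original notebook): returns every distinct
// row that occurs more than once in df, together with its occurrence count.
// Assumes plain column names and no existing column called "count".
def duplicateRows(df: DataFrame): DataFrame =
  df.groupBy(df.columns.map(col): _*) // group on all columns, i.e. on whole rows
    .count()                          // adds a "count" column per distinct row
    .filter(col("count") > 1)         // keep only rows seen more than once
```

For example, `duplicateRows(SourceDataFrame2).show(20, truncate = false)` lists any repeated rows, and comparing `SourceDataFrame2.count()` with a `SELECT COUNT(*)` on the target table after the load shows whether extra rows appeared on the write side.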

Re: Databricks Notebook dataframe loading duplicate data in SQL table

-werners- — Thu, 03 Nov 2022 09:21:14 GMT

Can you add the truncate option to the JDBC write?

https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
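
For illustration, a minimal sketch of the same write with the `truncate` option set. Per the Spark JDBC documentation linked above, when overwrite mode is used this option makes Spark truncate the existing table instead of dropping and recreating it. The wrapper function `overwriteWithTruncate` is hypothetical; its parameters mirror the variables in the original post. Whether this resolves the duplicates is not confirmed in the thread.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical wrapper around the write from the original post, with the
// JDBC "truncate" option added. Parameters mirror the notebook's variables.
def overwriteWithTruncate(
    df: DataFrame,
    driverClass: String,
    jdbcUrl: String,
    targetTable: String,
    jdbcUsername: String,
    jdbcPassword: String): Unit = {
  df.write
    .format("jdbc")
    .mode(SaveMode.Overwrite)
    // In overwrite mode, truncate=true asks Spark to TRUNCATE the existing
    // table rather than DROP and re-CREATE it (the default is false).
    .option("truncate", "true")
    .option("driver", driverClass)
    .option("url", jdbcUrl)
    .option("dbtable", targetTable)
    .option("user", jdbcUsername)
    .option("password", jdbcPassword)
    .save()
}
```

In the notebook this would be called as `overwriteWithTruncate(SourceDataFrame2, driverClass, jdbcUrl, TargetTable, jdbcUsername, jdbcPassword)`.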