Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Databricks Notebook dataframe loading duplicate data in SQL table

Priya_Mani
New Contributor II

Hi, I am trying to load data from a data lake into a SQL table using a "SourceDataFrame.write" operation in an Apache Spark notebook.

This intermittently loads duplicate rows. The logs don't give much information, and I am not sure what else to look for. How can I investigate and find the root cause of this? Please let me know what more information I can provide for anyone to help.

Thanks!

3 REPLIES

-werners-
Esteemed Contributor III

Can you elaborate a bit more on this notebook?

Also, which Databricks Runtime version are you using?

Hi @Werner Stinckens​, this is an Apache Spark notebook that reads the contents of a file stored in Azure Blob Storage and loads it into an on-premises SQL table.

The Databricks Runtime is 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12) with Standard_DS3_v2 worker and driver node types.

The notebook reads the file content using the code below:

val SourceDataFrame = spark
  .read
  .option("header", "false")
  .option("delimiter", "|")
  .schema(SourceSchemaStruct)
  .csv(SourceFilename)
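One quick check worth doing at this point (a hypothetical sketch, not from the original notebook, assuming the `SourceDataFrame` read above) is to compare total vs. distinct row counts right after the read. If the source file itself contains repeated rows, the target table will too, regardless of how the write behaves:

```scala
// Hypothetical sanity check on the freshly read DataFrame:
// if duplicates already exist in the source file, the problem is
// upstream of the JDBC write.
val totalRows    = SourceDataFrame.count()
val distinctRows = SourceDataFrame.distinct().count()
println(s"rows: $totalRows, distinct: $distinctRows, duplicates: ${totalRows - distinctRows}")
```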

Then it writes the DataFrame to the target table in overwrite mode:

SourceDataFrame2
      .write
      .format("jdbc")
      .mode("overwrite")
      .option("driver", driverClass)
      .option("url", jdbcUrl)
      .option("dbtable", TargetTable)
      .option("user", jdbcUsername)
      .option("password", jdbcPassword)
      .save()
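For what it's worth, one common cause of intermittent duplicates with this pattern is that Spark's JDBC sink is not transactional across tasks: if a write stage partially fails and is retried, the retry can write rows on top of what the failed attempt already committed. A hedged sketch of two mitigations (reusing the same connection variables as the snippet above; `dropDuplicates()` and the `truncate` option are suggestions, not from the original notebook):

```scala
// Sketch of a more defensive version of the write above.
//  1. dropDuplicates() removes exact duplicate rows before loading,
//     in case the source file repeats rows.
//  2. The JDBC "truncate" option makes overwrite issue TRUNCATE TABLE
//     instead of DROP/CREATE, which keeps the table definition stable
//     and makes a reload after a failed attempt start from empty.
SourceDataFrame2
  .dropDuplicates()
  .write
  .format("jdbc")
  .mode("overwrite")
  .option("truncate", "true")
  .option("driver", driverClass)
  .option("url", jdbcUrl)
  .option("dbtable", TargetTable)
  .option("user", jdbcUsername)
  .option("password", jdbcPassword)
  .save()
```

It may also be worth checking whether the job (or the orchestrator that triggers the notebook) retries on failure, since a retried run of a non-idempotent load is another way duplicates appear "at random times".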

-werners-
Esteemed Contributor III
