<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Databricks Notebook dataframe loading duplicate data in SQL table in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-notebook-dataframe-loading-duplicate-data-in-sql/m-p/26260#M18368</link>
    <description>&lt;P&gt;Hi @Werner Stinckens, this is an Apache Spark notebook that reads the contents of a file stored in Azure Blob Storage and loads them into an on-prem SQL table.&lt;/P&gt;&lt;P&gt;The Databricks Runtime is 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12) with a Standard_DS3_v2 worker and driver type.&lt;/P&gt;&lt;P&gt;The notebook reads the file content using the code below:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;val SourceDataFrame = spark
            .read
            .option("header","false")
            .option("delimiter", "|")
            .schema(SourceSchemaStruct)
            .csv(SourceFilename)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Then it writes the DataFrame to a table in overwrite mode:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SourceDataFrame2
      .write
      .format("jdbc")
      .mode("overwrite")
      .option("driver", driverClass)
      .option("url", jdbcUrl)
      .option("dbtable", TargetTable)
      .option("user", jdbcUsername)
      .option("password", jdbcPassword)
      .save()&lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Wed, 26 Oct 2022 10:13:41 GMT</pubDate>
    <dc:creator>Priya_Mani</dc:creator>
    <dc:date>2022-10-26T10:13:41Z</dc:date>
    <item>
      <title>Databricks Notebook dataframe loading duplicate data in SQL table</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-notebook-dataframe-loading-duplicate-data-in-sql/m-p/26257#M18365</link>
      <description>&lt;P&gt;Hi, I am trying to load data from a data lake into a SQL table using the "SourceDataFrame.write" operation in a notebook using Apache Spark.&lt;/P&gt;&lt;P&gt;This seems to load duplicates at random times. The logs don't give much information, and I am not sure what else to look for. How can I investigate and find the root cause of this? Please let me know what other information I can provide to help.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 21 Oct 2022 13:09:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-notebook-dataframe-loading-duplicate-data-in-sql/m-p/26257#M18365</guid>
      <dc:creator>Priya_Mani</dc:creator>
      <dc:date>2022-10-21T13:09:45Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Notebook dataframe loading duplicate data in SQL table</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-notebook-dataframe-loading-duplicate-data-in-sql/m-p/26258#M18366</link>
      <description>&lt;P&gt;Can you elaborate a bit more on this notebook?&lt;/P&gt;&lt;P&gt;Also, which Databricks Runtime version are you using?&lt;/P&gt;</description>
      <pubDate>Mon, 24 Oct 2022 12:22:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-notebook-dataframe-loading-duplicate-data-in-sql/m-p/26258#M18366</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-10-24T12:22:00Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Notebook dataframe loading duplicate data in SQL table</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-notebook-dataframe-loading-duplicate-data-in-sql/m-p/26260#M18368</link>
      <description>&lt;P&gt;Hi @Werner Stinckens, this is an Apache Spark notebook that reads the contents of a file stored in Azure Blob Storage and loads them into an on-prem SQL table.&lt;/P&gt;&lt;P&gt;The Databricks Runtime is 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12) with a Standard_DS3_v2 worker and driver type.&lt;/P&gt;&lt;P&gt;The notebook reads the file content using the code below:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;val SourceDataFrame = spark
            .read
            .option("header","false")
            .option("delimiter", "|")
            .schema(SourceSchemaStruct)
            .csv(SourceFilename)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Then it writes the DataFrame to a table in overwrite mode:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SourceDataFrame2
      .write
      .format("jdbc")
      .mode("overwrite")
      .option("driver", driverClass)
      .option("url", jdbcUrl)
      .option("dbtable", TargetTable)
      .option("user", jdbcUsername)
      .option("password", jdbcPassword)
      .save()&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 26 Oct 2022 10:13:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-notebook-dataframe-loading-duplicate-data-in-sql/m-p/26260#M18368</guid>
      <dc:creator>Priya_Mani</dc:creator>
      <dc:date>2022-10-26T10:13:41Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Notebook dataframe loading duplicate data in SQL table</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-notebook-dataframe-loading-duplicate-data-in-sql/m-p/26261#M18369</link>
      <description>&lt;P&gt;Can you add the truncate option to the JDBC write?&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html" target="test_blank"&gt;https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2022 09:21:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-notebook-dataframe-loading-duplicate-data-in-sql/m-p/26261#M18369</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-11-03T09:21:14Z</dc:date>
    </item>
  </channel>
</rss>

