<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: ETL in Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/etl-in-databricks/m-p/27029#M18946</link>
    <description>&lt;P&gt;Hi @Kris Koirala,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Just checking whether you still have any follow-up questions. Please let us know.&lt;/P&gt;</description>
    <pubDate>Mon, 11 Apr 2022 20:55:55 GMT</pubDate>
    <dc:creator>jose_gonzalez</dc:creator>
    <dc:date>2022-04-11T20:55:55Z</dc:date>
    <item>
      <title>ETL in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/etl-in-databricks/m-p/27026#M18943</link>
      <description>&lt;P&gt;I use Azure Databricks for ETL. I read and write data to and from raw/stage/curated folders, writing each dataframe to a path (e.g. /mnt/datalake/curated/....). In the final step I read the data from that path into a dataframe and write it to the Azure SQL DB/Azure Synapse DB.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have seen people also create databases/tables within Databricks itself, using a script something like this (&lt;B&gt;CREATE&lt;/B&gt; &lt;B&gt;TABLE&lt;/B&gt; &lt;B&gt;default&lt;/B&gt;.People &lt;B&gt;USING&lt;/B&gt; DELTA &lt;B&gt;LOCATION&lt;/B&gt; '/tmp/delta/People'), and read/write data from there.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;In my case I don't create databases and tables within Databricks. What are the pros and cons of each approach? When should I use one or the other? Any insights or documentation would be helpful. Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Feb 2022 21:03:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/etl-in-databricks/m-p/27026#M18943</guid>
      <dc:creator>KKo</dc:creator>
      <dc:date>2022-02-25T21:03:13Z</dc:date>
    </item>
    <item>
      <title>Re: ETL in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/etl-in-databricks/m-p/27027#M18944</link>
      <description>&lt;P&gt;When writing to another DB (a classic RDBMS or a multi-node one like Synapse, Snowflake, ...), you actually have to copy the data from your storage (data lake etc.) to the database.&lt;/P&gt;&lt;P&gt;After the copy you can work with the database as you wish.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;When you create a table in Databricks (Spark) over an existing location, you actually create a semantic view on top of the data in your storage, so no copy is necessary, unlike the DB scenario.&lt;/P&gt;&lt;P&gt;These 'tables' can then be queried using Spark or a SQL tool such as Databricks SQL, Azure Synapse Serverless, Presto, Trino, Dremio etc.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The pros and cons depend on your use case. If you already have a DB and you really have to have the data in it, then you go for the copy scenario.&lt;/P&gt;&lt;P&gt;Databases can also be highly optimized (using indexes, materialized views, statistics, caching etc.).&lt;/P&gt;&lt;P&gt;BUT the main con is that you have to copy the data, and databases often do not support semi-structured or unstructured data (JSON etc.).&lt;/P&gt;&lt;P&gt;Also, a database has to be online to run queries and often has to be managed, which can be pricey.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;That is the main advantage of using 'big data SQL tools' directly on top of your data lake: no extra data movement is needed, and there are tons of options in different price ranges. There is (almost) no system management, and you pay for what you use.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We use a mix of both: some data has to be copied to a physical database, other data is served by e.g. Azure Synapse Serverless or Databricks SQL.&lt;/P&gt;</description>
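      <!--
The copy scenario described in this reply could look roughly like the sketch
below. It is only an illustration, not the poster's actual code: the server,
database, table, and credential values are hypothetical, and it assumes the
curated data was written in Delta format. The helper only assembles the
standard Spark JDBC writer options (url, dbtable, user, password); the Spark
calls are left as comments because they need a running cluster.

```python
def azure_sql_jdbc_options(server: str, database: str, table: str,
                           user: str, password: str) -> dict:
    """Collect Spark JDBC writer options for an Azure SQL database."""
    return {
        # Usual JDBC URL shape for Azure SQL; the server name is a placeholder.
        "url": f"jdbc:sqlserver://{server}.database.windows.net:1433;database={database}",
        "dbtable": table,
        "user": user,
        "password": password,
    }


if __name__ == "__main__":
    # All values below are hypothetical; real credentials would normally
    # come from a secret store rather than being written in a notebook.
    options = azure_sql_jdbc_options(
        "myserver", "mydb", "dbo.people", "etl_user", "not-a-real-password"
    )
    print(options["url"])

    # On a Databricks cluster (assumption: the path holds Delta files):
    # df = spark.read.format("delta").load("/mnt/datalake/curated/people")
    # df.write.format("jdbc").options(**options).mode("append").save()
```

This is the step that physically moves data out of the lake; the table
registration approach discussed in the thread avoids it.
      -->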
      <pubDate>Sat, 26 Feb 2022 09:29:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/etl-in-databricks/m-p/27027#M18944</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-02-26T09:29:37Z</dc:date>
    </item>
    <item>
      <title>Re: ETL in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/etl-in-databricks/m-p/27028#M18945</link>
      <description>&lt;P&gt;As @Werner Stinckens said. Additionally, you wrote ("I write dataframe to a path"), so you can just register that path as a table. That way you can always easily preview the data and keep it nicely organized in your Databricks data section (Hive metastore). All your scripts and data can stay as they are; registration is just, as you mentioned, CREATE TABLE ... USING DELTA LOCATION ...&lt;/P&gt;</description>
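      <!--
Registering an existing path as a table, as suggested in this reply, might be
sketched as follows. The table name and path are hypothetical, and the sketch
assumes the files at that path are already in Delta format. The helper only
builds the CREATE TABLE statement; running it needs a Spark session, so that
call is shown as a comment.

```python
def register_delta_table_sql(table_name: str, location: str) -> str:
    """Build a statement that registers existing Delta files as a table."""
    # Keep the sketch simple: refuse paths that would break the SQL literal.
    if "'" in location:
        raise ValueError("location must not contain single quotes")
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING DELTA LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    statement = register_delta_table_sql(
        "default.people", "/mnt/datalake/curated/people"
    )
    print(statement)

    # On a Databricks cluster:
    # spark.sql(statement)
    # spark.table("default.people").show(5)
```

No data is copied by this statement; the metastore entry simply points at the
files the existing ETL scripts already write.
      -->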
      <pubDate>Sat, 26 Feb 2022 17:52:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/etl-in-databricks/m-p/27028#M18945</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-02-26T17:52:16Z</dc:date>
    </item>
    <item>
      <title>Re: ETL in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/etl-in-databricks/m-p/27029#M18946</link>
      <description>&lt;P&gt;Hi @Kris Koirala,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Just checking whether you still have any follow-up questions. Please let us know.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Apr 2022 20:55:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/etl-in-databricks/m-p/27029#M18946</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-04-11T20:55:55Z</dc:date>
    </item>
  </channel>
</rss>

