<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Redefine ETL strategy with pypskar approach in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/redefine-etl-strategy-with-pypskar-approach/m-p/66352#M7082</link>
    <description>&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;According to&amp;nbsp;&lt;A href="https://docs.databricks.com/en/tables/partitions.html" target="_self"&gt;When to partition tables on Databricks&lt;/A&gt;:&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN&gt;Databricks recommends you do not partition tables that contain less than a terabyte of data.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;If you do proceed with partitions, please ensure that every partition contains at least a gigabyte of data.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN&gt;Instead of partitions, take a look at :&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://docs.databricks.com/en/delta/index.html" target="_self"&gt;Managing files and indexing data with Delta Lake&lt;/A&gt;:&lt;/SPAN&gt; &lt;A href="https://docs.databricks.com/en/sql/language-manual/delta-optimize.html" target="_self"&gt;OPTIMIZE&lt;/A&gt;, &lt;A href="https://docs.databricks.com/en/delta/data-skipping.html#what-is-z-ordering" target="_self"&gt;ZORDER&lt;/A&gt;, &lt;A href="https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-aux-analyze-table.html" target="_self"&gt;ANALYZE&lt;/A&gt; and &lt;A href="https://docs.databricks.com/en/sql/language-manual/delta-vacuum.html" target="_self"&gt;VACUUM&lt;/A&gt; are commands that can be very useful when you need to improve processing performance&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://docs.databricks.com/en/optimizations/predictive-optimization.html" target="_self"&gt;Predictive optimization for Delta Lake&lt;/A&gt;&amp;nbsp;: executes OPTIMIZE and VACUUM automatically for you (Unity Catalog + managed tables are the prerequisites for this feature as of now)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN&gt;It is quite possible that you do not need to rewrite the whole dataset at all; instead, you can use the &lt;A href="https://docs.databricks.com/en/delta/merge.html" target="_self"&gt;MERGE&lt;/A&gt; operation.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;When running processes that may touch the same partitions, make the separation between them as explicit as possible in the operation's condition (see &lt;A href="https://docs.databricks.com/en/optimizations/isolation-level.html#concurrentappendexception" target="_self"&gt;ConcurrentAppendException&lt;/A&gt;).&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;You can find &lt;A href="https://www.databricks.com/resources/demos/library?itm_data=demo_center#q=Delta&amp;amp;type=2031" target="_self"&gt;here&lt;/A&gt; a repository of demos that you can install in your workspace and that may contain useful hints (the one on &lt;A href="https://www.databricks.com/resources/demos/tutorials/data-science-and-ai/delta-lake?itm_data=demo_center" target="_self"&gt;Delta Lake&lt;/A&gt; is probably the most relevant for you at this stage; click the "View the Notebooks" button to access the code, and run the pip command to play with the content).&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Hope it helps,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Best,&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 16 Apr 2024 14:08:58 GMT</pubDate>
    <dc:creator>artsheiko</dc:creator>
    <dc:date>2024-04-16T14:08:58Z</dc:date>
    <item>
      <title>Redefine ETL strategy with pypskar approach</title>
      <link>https://community.databricks.com/t5/get-started-discussions/redefine-etl-strategy-with-pypskar-approach/m-p/65706#M7080</link>
      <description>&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Hey everyone!&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;I&lt;/SPAN&gt;&lt;SPAN&gt;'ve some previous experience with Data Engineering, but totally new in Databricks and Delta Tables.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Starting this thread hoping to ask some questions and asking for help on how to design a process.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;So I have essentially 2 delta tables (same structure):&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;1) customer_stg (id int, segment string, name string, status string), partitioned by segment, located in my delta lake in /DATA/STG/CUSTOMER&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;2) customer_prep (id int, segment string, name string, status string), partitioned by segment, located in my delta lake in /DATA/PREP/CUSTOMER&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;segment can take values like TL, TI, TO, PA&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;So I have 4 ETL processes that load data into the same staging, and I have to deal with this scenario, and so that'&lt;/SPAN&gt;&lt;SPAN&gt;s why I have &lt;/SPAN&gt;&lt;SPAN&gt;both&lt;/SPAN&gt;&lt;SPAN&gt; tables partitioned &lt;/SPAN&gt;&lt;SPAN&gt;by&lt;/SPAN&gt; &lt;SPAN&gt;"segment"&lt;/SPAN&gt;&lt;SPAN&gt; (also &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; selecting &lt;/SPAN&gt;&lt;SPAN&gt;data&lt;/SPAN&gt; &lt;SPAN&gt;on&lt;/SPAN&gt;&lt;SPAN&gt; a different scenario).&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Also: they are running &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt; &lt;SPAN&gt;sequence&lt;/SPAN&gt;&lt;SPAN&gt;, but &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt;&lt;SPAN&gt; future they might run &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt;&lt;SPAN&gt; parallel &lt;/SPAN&gt;&lt;SPAN&gt;-&lt;/SPAN&gt; &lt;SPAN&gt;and&lt;/SPAN&gt;&lt;SPAN&gt; this &lt;/SPAN&gt;&lt;SPAN&gt;is&lt;/SPAN&gt;&lt;SPAN&gt; important 
&lt;/SPAN&gt;&lt;SPAN&gt;as&lt;/SPAN&gt;&lt;SPAN&gt; you&lt;/SPAN&gt;&lt;SPAN&gt;'ll see.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;So, what I have so far:&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;1) Extract info for TL segment and save in DeltaLake&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;2) Read this info into a dataframe (tl_sgmnt_df) that has just 4 columns: ID, SEGMENT, NAME, STATUS (essentially same structure as staging/prepared tables)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;tl_sgmnt_df = orig_sgmnt_tl_data_df\&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .withColumn("ID", F.col("orig_sgmnt_tl_data_df.ID") )\&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .withColumn("SEGMENT", F.col("orig_sgmnt_tl_data_df.SEGMENT") )\&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .withColumn("NAME", F.col("orig_sgmnt_tl_data_df.NAME")\&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .withColumn("STATUS", F.col("N"))&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;3) Load Staging and Prepared delta tables into a dataframe each (customer_stg_delta_df, customer_prep_delta_df)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;customer_stg_delta_df = spark.read.format("delta").load(dl_url + "/DATA/STG/CUSTOMER")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;customer_prep_delta_df = spark.read.format("delta").load(dl_url + "/DATA/PREP/CUSTOMER")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;4) Run a left_anti join between dataframe and staging table, by grabbing ALL 
data in the dataframe (that has essentially all "old" info and "new" info).&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;new_and_oldstg_df = tl_sgmnt_df.alias('&lt;/SPAN&gt;&lt;SPAN&gt;tl_sgmnt_df&lt;/SPAN&gt;&lt;SPAN&gt;').join(customer_stg_delta_df.alias('&lt;/SPAN&gt;&lt;SPAN&gt;customer_stg_delta_df&lt;/SPAN&gt;&lt;SPAN&gt;'),(F.col('&lt;/SPAN&gt;&lt;SPAN&gt;tl_sgmnt_df.id&lt;/SPAN&gt;&lt;SPAN&gt;') == F.col('&lt;/SPAN&gt;&lt;SPAN&gt;customer_stg_delta_df.id&lt;/SPAN&gt;&lt;SPAN&gt;')),"left_anti" )&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;5) Next I overwrite staging table with the most recent data and old one (for all segments).&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;INSERT INTO PREP.CUST_PREP SELECT * FROM STG.CUST_STG&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;new_and_oldstg_df.write.format("delta").mode("overwrite").partitionBy("SEGMENT").save(dl_url + "/DATA/STG/CUSTOMER")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;6) Next I append all info into the prepared table.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;Next, TI ETL pipeline runs and does exactly the same thing.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;TO and PA segments pipelines run and also do the same thing.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;However, like I said in the future they might run in parallel so I'&lt;/SPAN&gt;&lt;SPAN&gt;d &lt;/SPAN&gt;&lt;SPAN&gt;like&lt;/SPAN&gt; &lt;SPAN&gt;to&lt;/SPAN&gt;&lt;SPAN&gt; transform this processes more &lt;/SPAN&gt;&lt;SPAN&gt;robust&lt;/SPAN&gt; &lt;SPAN&gt;and&lt;/SPAN&gt; &lt;SPAN&gt;"safer"&lt;/SPAN&gt; &lt;SPAN&gt;by&lt;/SPAN&gt;&lt;SPAN&gt; working just just &lt;/SPAN&gt;&lt;SPAN&gt;with&lt;/SPAN&gt;&lt;SPAN&gt; each &lt;/SPAN&gt;&lt;SPAN&gt;partition&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;and&lt;/SPAN&gt;&lt;SPAN&gt; this &lt;/SPAN&gt;&lt;SPAN&gt;is&lt;/SPAN&gt;&lt;SPAN&gt; 
my &lt;/SPAN&gt;&lt;SPAN&gt;first&lt;/SPAN&gt;&lt;SPAN&gt; challenge.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Could you please help me &lt;/SPAN&gt;&lt;SPAN&gt;on&lt;/SPAN&gt;&lt;SPAN&gt; this?&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;I &lt;/SPAN&gt;&lt;SPAN&gt;first&lt;/SPAN&gt;&lt;SPAN&gt; need &lt;/SPAN&gt;&lt;SPAN&gt;to&lt;/SPAN&gt; &lt;SPAN&gt;delete&lt;/SPAN&gt;&lt;SPAN&gt; all info &lt;/SPAN&gt;&lt;SPAN&gt;from&lt;/SPAN&gt;&lt;SPAN&gt; staging &lt;/SPAN&gt;&lt;SPAN&gt;table&lt;/SPAN&gt;&lt;SPAN&gt; that &lt;/SPAN&gt;&lt;SPAN&gt;is&lt;/SPAN&gt;&lt;SPAN&gt; already &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt;&lt;SPAN&gt; prepared &lt;/SPAN&gt;&lt;SPAN&gt;table&lt;/SPAN&gt; &lt;SPAN&gt;with&lt;/SPAN&gt; &lt;SPAN&gt;status&lt;/SPAN&gt; &lt;SPAN&gt;"R"&lt;/SPAN&gt;&lt;SPAN&gt;:&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;DELETE&lt;/SPAN&gt; &lt;SPAN&gt;FROM&lt;/SPAN&gt;&lt;SPAN&gt; STG.CUST_STG &lt;/SPAN&gt;&lt;SPAN&gt;where&lt;/SPAN&gt;&lt;SPAN&gt; ID &lt;/SPAN&gt;&lt;SPAN&gt;IN&lt;/SPAN&gt;&lt;SPAN&gt; (&lt;/SPAN&gt;&lt;SPAN&gt;SELECT&lt;/SPAN&gt;&lt;SPAN&gt; ID &lt;/SPAN&gt;&lt;SPAN&gt;FROM&lt;/SPAN&gt;&lt;SPAN&gt; PREP.CUST_PREP &lt;/SPAN&gt;&lt;SPAN&gt;WHERE&lt;/SPAN&gt;&lt;SPAN&gt; SEGMENT &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;'TL'&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;Then&lt;/SPAN&gt;&lt;SPAN&gt;, based &lt;/SPAN&gt;&lt;SPAN&gt;on&lt;/SPAN&gt;&lt;SPAN&gt; the info loaded do a &lt;/SPAN&gt;&lt;SPAN&gt;"union"&lt;/SPAN&gt; &lt;SPAN&gt;between&lt;/SPAN&gt;&lt;SPAN&gt; the dataframe &lt;/SPAN&gt;&lt;SPAN&gt;and&lt;/SPAN&gt;&lt;SPAN&gt; staging &lt;/SPAN&gt;&lt;SPAN&gt;table&lt;/SPAN&gt;&lt;SPAN&gt; (making sure I&lt;/SPAN&gt;&lt;SPAN&gt;'m saving data that is not in prep table and at the same time having the new info:&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;tl_sgmnt_df.createOrReplaceTempView("tl_sgmnt_v")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;INSERT INTO 
STG.CUST_STG SELECT * FROM tl_sgmnt_v&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;After this, I insert all data into the prep table:&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;INSERT INTO STG.CUST_STG SELECT * FROM tl_sgmnt_v&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;But I honestly don'&lt;/SPAN&gt;&lt;SPAN&gt;t know &lt;/SPAN&gt;&lt;SPAN&gt;if&lt;/SPAN&gt;&lt;SPAN&gt; this works &lt;/SPAN&gt;&lt;SPAN&gt;or&lt;/SPAN&gt; &lt;SPAN&gt;not&lt;/SPAN&gt;&lt;SPAN&gt; (I presume pyspark will &lt;/SPAN&gt;&lt;SPAN&gt;delete&lt;/SPAN&gt;&lt;SPAN&gt; info &lt;/SPAN&gt;&lt;SPAN&gt;from&lt;/SPAN&gt;&lt;SPAN&gt; delta &lt;/SPAN&gt;&lt;SPAN&gt;table&lt;/SPAN&gt;&lt;SPAN&gt; but &lt;/SPAN&gt;&lt;SPAN&gt;after&lt;/SPAN&gt;&lt;SPAN&gt; that it might re&lt;/SPAN&gt;&lt;SPAN&gt;-&lt;/SPAN&gt;&lt;SPAN&gt;write the entire files again), so I believe it&lt;/SPAN&gt;&lt;SPAN&gt;'s not efficient. Am I right? Of course I would need to replicate this to the other ETL processes.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;So I'&lt;/SPAN&gt;&lt;SPAN&gt;d essentially &lt;/SPAN&gt;&lt;SPAN&gt;like&lt;/SPAN&gt; &lt;SPAN&gt;to&lt;/SPAN&gt;&lt;SPAN&gt; change steps &lt;/SPAN&gt;&lt;SPAN&gt;4&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;5&lt;/SPAN&gt; &lt;SPAN&gt;and&lt;/SPAN&gt; &lt;SPAN&gt;6&lt;/SPAN&gt; &lt;SPAN&gt;to&lt;/SPAN&gt;&lt;SPAN&gt; work &lt;/SPAN&gt;&lt;SPAN&gt;with&lt;/SPAN&gt;&lt;SPAN&gt; partitions, &lt;/SPAN&gt;&lt;SPAN&gt;and&lt;/SPAN&gt; &lt;SPAN&gt;not&lt;/SPAN&gt; &lt;SPAN&gt;with&lt;/SPAN&gt;&lt;SPAN&gt; the entire &lt;/SPAN&gt;&lt;SPAN&gt;table&lt;/SPAN&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Also, I believe this &lt;/SPAN&gt;&lt;SPAN&gt;is&lt;/SPAN&gt;&lt;SPAN&gt; a very &lt;/SPAN&gt;&lt;SPAN&gt;basic&lt;/SPAN&gt;&lt;SPAN&gt; problem, I&lt;/SPAN&gt;&lt;SPAN&gt;'m just new into pyspark and there'&lt;/SPAN&gt;&lt;SPAN&gt;s maybe something I&lt;/SPAN&gt;&lt;SPAN&gt;'m missing here. 
I also noticed that there'&lt;/SPAN&gt;&lt;SPAN&gt;s a &lt;/SPAN&gt;&lt;SPAN&gt;DELETE&lt;/SPAN&gt; &lt;SPAN&gt;partition&lt;/SPAN&gt;&lt;SPAN&gt; command, but I want &lt;/SPAN&gt;&lt;SPAN&gt;to&lt;/SPAN&gt; &lt;SPAN&gt;keep&lt;/SPAN&gt;&lt;SPAN&gt; info &lt;/SPAN&gt;&lt;SPAN&gt;on&lt;/SPAN&gt;&lt;SPAN&gt; that &lt;/SPAN&gt;&lt;SPAN&gt;partition&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;If&lt;/SPAN&gt;&lt;SPAN&gt; I&lt;/SPAN&gt;&lt;SPAN&gt;'m not clear, please let me know and I'&lt;/SPAN&gt;&lt;SPAN&gt;ll &lt;/SPAN&gt;&lt;SPAN&gt;try&lt;/SPAN&gt; &lt;SPAN&gt;to&lt;/SPAN&gt;&lt;SPAN&gt; explain better!&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;Thank you &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; your help &lt;/SPAN&gt;&lt;SPAN&gt;and&lt;/SPAN&gt; &lt;SPAN&gt;time&lt;/SPAN&gt;&lt;SPAN&gt;!&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Sat, 06 Apr 2024 20:01:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/redefine-etl-strategy-with-pypskar-approach/m-p/65706#M7080</guid>
      <dc:creator>databird</dc:creator>
      <dc:date>2024-04-06T20:01:57Z</dc:date>
    </item>
    <item>
      <title>Re: Redefine ETL strategy with pypskar approach</title>
      <link>https://community.databricks.com/t5/get-started-discussions/redefine-etl-strategy-with-pypskar-approach/m-p/65707#M7081</link>
      <description>&lt;P&gt;Just a note:&lt;BR /&gt;I'm constantly having error when trying to post with correct code formatted text. My apologize, but I can't at all post correctly.&lt;/P&gt;</description>
      <pubDate>Sat, 06 Apr 2024 20:04:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/redefine-etl-strategy-with-pypskar-approach/m-p/65707#M7081</guid>
      <dc:creator>databird</dc:creator>
      <dc:date>2024-04-06T20:04:31Z</dc:date>
    </item>
    <item>
      <title>Re: Redefine ETL strategy with pypskar approach</title>
      <link>https://community.databricks.com/t5/get-started-discussions/redefine-etl-strategy-with-pypskar-approach/m-p/66352#M7082</link>
      <description>&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;According to&amp;nbsp;&lt;A href="https://docs.databricks.com/en/tables/partitions.html" target="_self"&gt;When to partition tables on Databricks&lt;/A&gt;:&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN&gt;Databricks recommends you do not partition tables that contain less than a terabyte of data.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;If you do proceed with partitions, please ensure that every partition contains at least a gigabyte of data.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN&gt;Instead of partitions, take a look at :&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://docs.databricks.com/en/delta/index.html" target="_self"&gt;Managing files and indexing data with Delta Lake&lt;/A&gt;:&lt;/SPAN&gt; &lt;A href="https://docs.databricks.com/en/sql/language-manual/delta-optimize.html" target="_self"&gt;OPTIMIZE&lt;/A&gt;, &lt;A href="https://docs.databricks.com/en/delta/data-skipping.html#what-is-z-ordering" target="_self"&gt;ZORDER&lt;/A&gt;, &lt;A href="https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-aux-analyze-table.html" target="_self"&gt;ANALYZE&lt;/A&gt; and &lt;A href="https://docs.databricks.com/en/sql/language-manual/delta-vacuum.html" target="_self"&gt;VACUUM&lt;/A&gt; are commands that can be very useful when you need to improve processing performance&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://docs.databricks.com/en/optimizations/predictive-optimization.html" target="_self"&gt;Predictive optimization for Delta Lake&lt;/A&gt;&amp;nbsp;: executes OPTIMIZE and VACUUM automatically for you (Unity Catalog + managed tables are the prerequisites for this feature as of now)&lt;/LI&gt;
&lt;/UL&gt;
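As a sketch, the maintenance commands above can be kept together as plain SQL statements; the table name prep.cust_prep is borrowed from this thread, and on Databricks each statement would be run with spark.sql(...):

```python
# Sketch of the Delta Lake maintenance commands mentioned above, built as SQL
# strings so the sequence is explicit. Here we only build and print the text;
# executing them requires a Spark session with Delta Lake.
TABLE = "prep.cust_prep"  # hypothetical table name, taken from this thread

maintenance = [
    # Compact small files and co-locate related data by the join key
    f"OPTIMIZE {TABLE} ZORDER BY (id)",
    # Refresh table statistics used by the query optimizer
    f"ANALYZE TABLE {TABLE} COMPUTE STATISTICS",
    # Remove data files no longer referenced by the table (default retention)
    f"VACUUM {TABLE}",
]

for stmt in maintenance:
    print(stmt)  # with a live session: spark.sql(stmt)
```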
&lt;P&gt;&lt;SPAN&gt;It is quite possible that you do not need to rewrite the whole dataset at all; instead, you can use the &lt;A href="https://docs.databricks.com/en/delta/merge.html" target="_self"&gt;MERGE&lt;/A&gt; operation.&lt;/SPAN&gt;&lt;/P&gt;
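A minimal sketch of what that MERGE could look like, assuming the table and temp-view names used in this thread (prep.cust_prep, tl_sgmnt_v), followed by a pure-Python illustration of the upsert semantics:

```python
# Hypothetical MERGE replacing the delete + overwrite + insert sequence for one
# segment's batch; on Databricks it would be executed with spark.sql(merge_sql).
merge_sql = """
MERGE INTO prep.cust_prep AS t
USING tl_sgmnt_v AS s
  ON t.id = s.id AND t.segment = s.segment
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""

# Pure-Python picture of the same semantics: upsert keyed by id.
target = {1: ("TL", "Alice", "R"), 2: ("TL", "Bob", "N")}   # existing rows
source = {2: ("TL", "Bob", "R"), 3: ("TL", "Carol", "N")}   # new batch
for key, row in source.items():
    target[key] = row  # matched -> update, not matched -> insert
```

Only the rows touched by the batch are rewritten, rather than the whole table.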
&lt;P&gt;&lt;SPAN&gt;When running processes that may touch the same partitions, make the separation between them as explicit as possible in the operation's condition (see &lt;A href="https://docs.databricks.com/en/optimizations/isolation-level.html#concurrentappendexception" target="_self"&gt;ConcurrentAppendException&lt;/A&gt;).&lt;/SPAN&gt;&lt;/P&gt;
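For example, if the four segment pipelines from this thread ever run in parallel, pinning each MERGE to its own segment value in the ON clause makes the writers' scopes explicitly disjoint. A sketch, with hypothetical source-view names (batch_tl, batch_ti, ...):

```python
# Sketch: build a partition-scoped MERGE per pipeline so concurrent writers
# never conflict on the same partition (avoiding ConcurrentAppendException).
def merge_for_segment(segment: str) -> str:
    # Hypothetical helper; table/segment names are taken from this thread.
    return (
        "MERGE INTO prep.cust_prep AS t "
        f"USING batch_{segment.lower()} AS s "
        f"ON t.segment = '{segment}' AND s.segment = '{segment}' AND t.id = s.id "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )

# One statement per pipeline; each reads and writes only its own partition.
statements = [merge_for_segment(s) for s in ("TL", "TI", "TO", "PA")]
```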
&lt;P&gt;&lt;SPAN&gt;You can find &lt;A href="https://www.databricks.com/resources/demos/library?itm_data=demo_center#q=Delta&amp;amp;type=2031" target="_self"&gt;here&lt;/A&gt; a repository of demos that you can install in your workspace and that may contain useful hints (the one on &lt;A href="https://www.databricks.com/resources/demos/tutorials/data-science-and-ai/delta-lake?itm_data=demo_center" target="_self"&gt;Delta Lake&lt;/A&gt; is probably the most relevant for you at this stage; click the "View the Notebooks" button to access the code, and run the pip command to play with the content).&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Hope it helps,&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Best,&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 16 Apr 2024 14:08:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/redefine-etl-strategy-with-pypskar-approach/m-p/66352#M7082</guid>
      <dc:creator>artsheiko</dc:creator>
      <dc:date>2024-04-16T14:08:58Z</dc:date>
    </item>
    <item>
      <title>Re: Redefine ETL strategy with pypskar approach</title>
      <link>https://community.databricks.com/t5/get-started-discussions/redefine-etl-strategy-with-pypskar-approach/m-p/66372#M7083</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/23235"&gt;@artsheiko&lt;/a&gt;&amp;nbsp;!&amp;nbsp;&lt;/P&gt;&lt;P&gt;Well I definitely do not&amp;nbsp;meet all the requirements to build partitioned tables.&lt;/P&gt;&lt;P&gt;The biggest table I have so far has a miserable size of ~60MB just for one partition, and It will increase its space and records, but not enough to reach 1TB or even one 1GB (and this is probably the biggest source I'll have so far).&lt;/P&gt;&lt;P&gt;So, I'll need to review the approach to not have partitions.&lt;/P&gt;&lt;P&gt;Working with the Merge statement, seems the approach to follow.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Just a question about the demos you shared:&lt;BR /&gt;They're only available on a databricks environment, right?&lt;/P&gt;&lt;P&gt;Thanks for your help!&lt;/P&gt;</description>
      <pubDate>Tue, 16 Apr 2024 16:16:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/redefine-etl-strategy-with-pypskar-approach/m-p/66372#M7083</guid>
      <dc:creator>databird</dc:creator>
      <dc:date>2024-04-16T16:16:45Z</dc:date>
    </item>
    <item>
      <title>Re: Redefine ETL strategy with pypskar approach</title>
      <link>https://community.databricks.com/t5/get-started-discussions/redefine-etl-strategy-with-pypskar-approach/m-p/66374#M7084</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/103265"&gt;@databird&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;You can review the code of each demo by opening the content via "View the Notebooks" or by exploring the following repo :&amp;nbsp;&lt;A href="https://github.com/databricks-demos" target="_blank"&gt;https://github.com/databricks-demos&lt;/A&gt;&amp;nbsp;(you can try to search for "merge" to see all the occurrences, for example)&lt;/P&gt;
&lt;P&gt;To install a demo, you do indeed need a workspace - the installation process may bring not only notebooks but also workflows, DLT pipelines and possibly dashboards. Another reason is that each demo is an independent asset, so it has to operate on top of some demo data, which is also generated during installation.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Apr 2024 16:21:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/redefine-etl-strategy-with-pypskar-approach/m-p/66374#M7084</guid>
      <dc:creator>artsheiko</dc:creator>
      <dc:date>2024-04-16T16:21:37Z</dc:date>
    </item>
  </channel>
</rss>

