<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: From a noob Databrickser... concerning Python programming in databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32077#M23374</link>
    <description>&lt;P&gt;I am assuming the schema of all these files is the same.&lt;/P&gt;&lt;P&gt;If so, how to process it depends on what you're comfortable with.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The steps that come to mind are:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;In the landing zone, have a folder structure per client.&lt;/LI&gt;&lt;LI&gt;Read all the Parquet contract files into Delta; input_file_name() may be useful to know which file you're processing.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;(Contracts are per client, with a type and start and end dates.)&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Create a column for the client name.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Perform aggregations:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;how many different contracts did the client have&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;--group by clientname and a count&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;of which type&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;--group by clientname and distinct count on type&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;when was the date of the first contract&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;--group by clientname and min date&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;and the last contract&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;--group by clientname and max date&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;how long have we been working with him&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;--difference between min and max&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 12 Sep 2022 03:39:13 GMT</pubDate>
    <dc:creator>PriyaAnanthram</dc:creator>
    <dc:date>2022-09-12T03:39:13Z</dc:date>
    <item>
      <title>From a noob Databrickser... concerning Python programming in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32074#M23371</link>
      <description>&lt;P&gt;The following...&lt;/P&gt;&lt;P&gt;We've got clients working with us on contracts. Each client has several contracts of a certain type, with start and end dates.&lt;/P&gt;&lt;P&gt;If I need aggregated info per client in one record, like:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;how many different contracts did the client have&lt;/LI&gt;&lt;LI&gt;of which type&lt;/LI&gt;&lt;LI&gt;when was the date of the first contract&lt;/LI&gt;&lt;LI&gt;and the last contract&lt;/LI&gt;&lt;LI&gt;how long have we been working with him.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The information from the source database is uploaded to our DWH in Parquet files.&lt;/P&gt;&lt;P&gt;Should/can I use Python on Parquet to aggregate this data? Loop over the source tables and create a table with the aggregated data?&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2022 13:54:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32074#M23371</guid>
      <dc:creator>MaverickF14</dc:creator>
      <dc:date>2022-09-09T13:54:04Z</dc:date>
    </item>
    <item>
      <title>Re: From a noob Databrickser... concerning Python programming in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32075#M23372</link>
      <description>&lt;P&gt;There are many ways to do this, and Python is one. If you have Parquet files, you can also easily write SQL against them. Something like:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;select count(*)
from parquet.`path to parquet directory`&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;You don't need to make tables out of the Parquet files, but you can.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You can use regular Python on Databricks, but it won't be distributed, so make sure to use just a single-node cluster. You can use PySpark too.&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2022 16:09:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32075#M23372</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-09-09T16:09:23Z</dc:date>
    </item>
    <item>
      <title>Re: From a noob Databrickser... concerning Python programming in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32076#M23373</link>
      <description>&lt;P&gt;As @Joseph Kambourakis said, there are plenty of ways to do this. You can write pure Python or SQL. For me, it's easier to write SQL, so I would first load this data into a Delta table and then write pure SQL.&lt;/P&gt;</description>
      <pubDate>Sun, 11 Sep 2022 07:28:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32076#M23373</guid>
      <dc:creator>BilalAslamDbrx</dc:creator>
      <dc:date>2022-09-11T07:28:09Z</dc:date>
    </item>
    <item>
      <title>Re: From a noob Databrickser... concerning Python programming in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32077#M23374</link>
      <description>&lt;P&gt;I am assuming the schema of all these files is the same.&lt;/P&gt;&lt;P&gt;If so, how to process it depends on what you're comfortable with.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The steps that come to mind are:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;In the landing zone, have a folder structure per client.&lt;/LI&gt;&lt;LI&gt;Read all the Parquet contract files into Delta; input_file_name() may be useful to know which file you're processing.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;(Contracts are per client, with a type and start and end dates.)&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Create a column for the client name.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Perform aggregations:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;how many different contracts did the client have&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;--group by clientname and a count&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;of which type&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;--group by clientname and distinct count on type&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;when was the date of the first contract&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;--group by clientname and min date&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;and the last contract&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;--group by clientname and max date&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;how long have we been working with him&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;--difference between min and max&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
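<!--
The aggregations listed in this reply can be sketched in plain Python on a handful of made-up rows. This is only an illustration of the grouping logic, not the poster's code: the sample data, the summarize function and the field names are all hypothetical. It also assumes that "first contract" means the earliest start date, "last contract" the latest start date, and the working period the span from the earliest start to the latest end. On Databricks the same grouping would more typically be written in SQL or PySpark against the Parquet or Delta data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (client_name, contract_type, start_date, end_date).
# In the thread these would come from the Parquet contract files.
contracts = [
    ("acme", "support", date(2019, 1, 1), date(2019, 12, 31)),
    ("acme", "license", date(2020, 3, 1), date(2021, 2, 28)),
    ("acme", "support", date(2021, 6, 1), date(2022, 5, 31)),
    ("globex", "license", date(2021, 1, 15), date(2021, 7, 14)),
]


def summarize(rows):
    """Return one summary record per client, keyed by client name."""
    # Group the contract rows by client name.
    grouped = defaultdict(list)
    for client, ctype, start, end in rows:
        grouped[client].append((ctype, start, end))

    summary = {}
    for client, items in grouped.items():
        starts = [start for _, start, _ in items]
        ends = [end for _, _, end in items]
        summary[client] = {
            # count of contracts per client
            "contract_count": len(items),
            # distinct contract types per client
            "contract_types": sorted({ctype for ctype, _, _ in items}),
            # assumption: first and last contract are taken by start date
            "first_contract": min(starts),
            "last_contract": max(starts),
            # assumption: working period runs from earliest start to latest end
            "days_as_client": (max(ends) - min(starts)).days,
        }
    return summary


summary = summarize(contracts)
for client, record in sorted(summary.items()):
    print(client, record)
```

Each client ends up with a single record, which matches the "aggregated info per client in one record" asked for in the original question.
-->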
      <pubDate>Mon, 12 Sep 2022 03:39:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32077#M23374</guid>
      <dc:creator>PriyaAnanthram</dc:creator>
      <dc:date>2022-09-12T03:39:13Z</dc:date>
    </item>
    <item>
      <title>Re: From a noob Databrickser... concerning Python programming in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32078#M23375</link>
      <description>&lt;P&gt;Hey there @Blake Bleeker&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 24 Sep 2022 08:17:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32078#M23375</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-09-24T08:17:07Z</dc:date>
    </item>
    <item>
      <title>Re: From a noob Databrickser... concerning Python programming in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32079#M23376</link>
      <description>&lt;P&gt;This is probably the easiest option if it's something that's going to be used repeatedly. Alternatively, you could create a temporary view if it's a one-time thing.&lt;/P&gt;</description>
      <pubDate>Sun, 25 Sep 2022 20:08:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32079#M23376</guid>
      <dc:creator>Chris_Shehu</dc:creator>
      <dc:date>2022-09-25T20:08:42Z</dc:date>
    </item>
    <item>
      <title>Re: From a noob Databrickser... concerning Python programming in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32080#M23377</link>
      <description>&lt;P&gt;Yeah, &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks for all the help! &lt;/P&gt;</description>
      <pubDate>Mon, 03 Oct 2022 07:54:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/from-a-noob-databrickser-concerning-python-programming-in/m-p/32080#M23377</guid>
      <dc:creator>MaverickF14</dc:creator>
      <dc:date>2022-10-03T07:54:47Z</dc:date>
    </item>
  </channel>
</rss>

