topic Re: From a noob Databrickser... concerning Python programming in databricks in Data Engineering

From a noob Databrickser... concerning Python programming in databricks

MaverickF14 — Fri, 09 Sep 2022 13:54:04 GMT

The following...

We've got clients working with us under contracts. Each client has several contracts of a certain type, with start and end dates.

If I need aggregated info per client in one record like:

how many different contracts the client had
of which type
the date of the first contract
and of the last contract
how long we have been working with them.

The information from the source database is uploaded to our DWH as Parquet files.

Should/can I use Python on Parquet to aggregate this data? Would I loop over the source tables and create a table with the aggregated data?

Re: From a noob Databrickser... concerning Python programming in databricks

Anonymous — Fri, 09 Sep 2022 16:09:23 GMT

There are many ways to do this, and Python is one. If you have Parquet files, you can also easily write SQL against them. Something like:

select count(*) 
from parquet.`path to parquet directory`

You don't need to make tables out of the parquet files, but you can.

You can use regular Python on Databricks, but it won't be distributed, so make sure to just use a single-node cluster. You can use PySpark too, which does distribute the work across the cluster.
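
As a rough illustration of the regular-Python route, here is a minimal sketch that aggregates a few in-memory contract records per client. The field names (client, type, start, end) and the sample rows are made up for illustration; your real column names come from the Parquet files. It also assumes "first" means the earliest start date and "last" means the latest end date.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows; field names and values are assumptions.
contracts = [
    {"client": "A", "type": "support", "start": date(2019, 1, 1), "end": date(2019, 12, 31)},
    {"client": "A", "type": "license", "start": date(2020, 3, 1), "end": date(2021, 2, 28)},
    {"client": "B", "type": "support", "start": date(2021, 6, 1), "end": date(2022, 5, 31)},
]


def summarize(rows):
    """Return one summary dict per client, keyed by client."""
    # Group the contract rows by client first.
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["client"]].append(row)

    summary = {}
    for client, items in grouped.items():
        first_start = min(r["start"] for r in items)
        last_end = max(r["end"] for r in items)
        summary[client] = {
            "contract_count": len(items),
            "contract_types": sorted({r["type"] for r in items}),
            "first_start": first_start,
            "last_end": last_end,
            # Days between the earliest start and the latest end.
            "days_as_client": (last_end - first_start).days,
        }
    return summary


result = summarize(contracts)
```

This runs on the driver only, which is fine for small data; for larger volumes the same grouping is better expressed in SQL or PySpark.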

Re: From a noob Databrickser... concerning Python programming in databricks

BilalAslamDbrx — Sun, 11 Sep 2022 07:28:09 GMT

Like @Joseph Kambourakis said, there are plenty of ways to do this. You can write pure Python or SQL. For me, it's easier to write SQL, so I would first load this data into a Delta table and then write pure SQL.

Re: From a noob Databrickser... concerning Python programming in databricks

PriyaAnanthram — Mon, 12 Sep 2022 03:39:13 GMT

I am assuming the schema of all these files is the same.

If so, how you process it depends on what you're comfortable with.

The steps that come to mind are:

In the landing zone, have a folder structure per client.
Read all the Parquet contract files into Delta. input_file_name() may be useful to know which file you're processing.

(Contracts are per client, with a type and a start and end date)

Create a column for client name

Perform aggregations

how many different contracts did the client have

--group by clientname and a count

of which type

--group by clientname and distinct count on type

when was the date of the first contract

--group by clientname and min date

and the last contract

--group by clientname and max date

how long have we been working with them.

--difference between min and max

Re: From a noob Databrickser... concerning Python programming in databricks

Anonymous — Sat, 24 Sep 2022 08:17:07 GMT

Hey there @Blake Bleeker

Hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? If not, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Re: From a noob Databrickser... concerning Python programming in databricks

Chris_Shehu — Sun, 25 Sep 2022 20:08:42 GMT

This is probably the easiest option if it's something that's going to be used repeatedly. Alternatively, a temporary view might be enough if it's a one-time thing.

Re: From a noob Databrickser... concerning Python programming in databricks

MaverickF14 — Mon, 03 Oct 2022 07:54:47 GMT

Yeah,

Thanks for all the help!