09-09-2022 06:54 AM
The following...
We've got clients working with us under contracts. Each client has several contracts of a certain type, each with a start and end date.
I need aggregated info per client in one record (e.g. a contract count and the overall date range).
The information from the source database is uploaded into our DWH as Parquet files.
Should/can I use Python on the Parquet files to aggregate this data, looping over the source tables and creating a table with the aggregated data?
09-09-2022 09:09 AM
There are many ways to do this, and Python is one of them. If you have Parquet files, you can also write SQL against them easily. Something such as:
select count(*)
from parquet.`path to parquet directory`
You don't need to make tables out of the parquet files, but you can.
You can use regular Python on Databricks, but it won't be distributed, so make sure to use a single-node cluster. You can use PySpark too.
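For example, a minimal PySpark sketch of the same idea (the path and column names here are placeholders, not your actual schema):

from pyspark.sql import functions as F

# On Databricks, `spark` is the built-in SparkSession
# Read the Parquet files straight from the DWH path (placeholder path)
contracts = spark.read.parquet("/path/to/parquet/directory")

# Simple aggregation per client; the column name is assumed for illustration
contracts.groupBy("clientname").agg(F.count("*").alias("contract_count")).show()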
09-11-2022 12:28 AM
Like @Joseph Kambourakis said, there are plenty of ways to do this. You can write pure Python or SQL. For me, it's easier to write SQL, so I would first load this data into a Delta table and then write pure SQL.
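A rough sketch of that approach (path and table names are placeholders):

# Load the Parquet files once into a Delta table (placeholder names)
spark.read.parquet("/path/to/parquet/directory") \
    .write.format("delta").mode("overwrite").saveAsTable("contracts")

# From then on, aggregate with plain SQL against the Delta table
spark.sql("SELECT clientname, COUNT(*) AS contract_count FROM contracts GROUP BY clientname").show()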
09-25-2022 01:08 PM
This is probably the easiest option if it's something that's going to be used repeatedly. Alternatively, maybe create a temporary view if it's a one-time thing.
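For the one-time case, something along these lines (view and column names are illustrative only):

# Register the Parquet data as a temporary view; it only lives for the current Spark session
spark.read.parquet("/path/to/parquet/directory").createOrReplaceTempView("contracts_tmp")
spark.sql("SELECT clientname, COUNT(*) AS contract_count FROM contracts_tmp GROUP BY clientname").show()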
09-11-2022 08:39 PM
I am assuming the schema of all these files is the same.
If so, how to process it depends on what you're comfortable with.
The steps that come to mind are:
(Contracts are per client, with a type and a start and end date)
Perform aggregations, for example (a rough sketch follows below):
--group by clientname and a count
--group by clientname and a distinct count on type
--group by clientname and the min date
--group by clientname and the max date
--difference between the min and max dates
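A hedged sketch of those aggregations in PySpark (assuming columns named clientname, type, start_date and end_date, which may differ from the real schema):

from pyspark.sql import functions as F

contracts = spark.read.parquet("/path/to/parquet/directory")  # placeholder path

summary = contracts.groupBy("clientname").agg(
    F.count("*").alias("contract_count"),             # count per client
    F.countDistinct("type").alias("distinct_types"),   # distinct count on type
    F.min("start_date").alias("min_start_date"),       # earliest start date
    F.max("end_date").alias("max_end_date"),           # latest end date
)

# Difference between the latest and earliest dates, in days
summary = summary.withColumn(
    "date_range_days",
    F.datediff(F.col("max_end_date"), F.col("min_start_date")),
)
summary.show()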
09-24-2022 01:17 AM
Hey there @Blake Bleeker
Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!
10-03-2022 12:54 AM
Yeah,
Thanks for all the help!