
From a noob Databrickser... concerning Python programming in databricks

MaverickF14
New Contributor II

The following...

We've got clients working with us under contracts. Each client has several contracts of a certain type, with start and end dates.

If I need aggregated info per client in one record, like:

  • how many different contracts did the client have
  • of which type
  • when was the date of the first contract
  • and the last contract
  • how long have we been working with them

The information from the source database is uploaded to our DWH as Parquet files.

Should/can I use Python on Parquet to aggregate this data, looping over the source tables and creating a table with the aggregated data?

6 REPLIES

Anonymous
Not applicable

There are many ways to do this, and Python is one. If you have Parquet files, you can also write SQL easily against them, something such as:

select count(*) 
from parquet.`path to parquet directory`

You don't need to make tables out of the Parquet files, but you can.

You can use regular Python on Databricks, but it won't be distributed, so make sure to just use a single-node cluster. You can use PySpark too.
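
For the PySpark route, a minimal sketch (the path is a placeholder, not the actual DWH location):

# Count the rows across the Parquet files with PySpark; `spark` is predefined in Databricks notebooks.
df = spark.read.parquet("/mnt/dwh/contracts")   # placeholder path
print(df.count())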

BilalAslamDbrx
Databricks Employee

Like @Joseph Kambourakis said, there are plenty of ways to do this. You can write pure Python or SQL. For me, it's easier to write SQL, so I would first load this data into a Delta table and then write pure SQL.

This is probably the easiest option if it's something that's going to be used repeatedly. Alternatively, maybe create a temporary view if it's a one-time thing.
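
A rough sketch of that approach (the path, the table/view names, and columns like client_name, start_date and end_date are assumptions for illustration):

df = spark.read.parquet("/mnt/dwh/contracts")   # placeholder path

# Option 1: persist as a Delta table if it will be queried repeatedly
df.write.format("delta").mode("overwrite").saveAsTable("contracts")

# Option 2: a temporary view if it's a one-time thing
df.createOrReplaceTempView("contracts_vw")

summary = spark.sql("""
    SELECT client_name,
           COUNT(*)        AS num_contracts,
           MIN(start_date) AS first_contract,
           MAX(end_date)   AS last_contract
    FROM contracts_vw
    GROUP BY client_name
""")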

PriyaAnanthram
Contributor III

I am assuming the schema of all these files is the same.

If so, how to process it depends on what you're comfortable with.

The steps that come to mind are:

  • In the landing zone, have a folder structure per client
  • Read all the Parquet contract files into Delta; input_file_name() may be useful to know which file you're processing (see the sketch after these steps)

(Contracts are per client, with a type and start and end dates)

  • Create a column for client name
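
A minimal ingest sketch along these lines (the paths, folder layout, and the client-name extraction are assumptions for illustration):

from pyspark.sql import functions as F

# Read every client's contract Parquet files; record the source file per row.
raw = (spark.read.parquet("/mnt/landing/contracts/*/")          # assumed: one folder per client
            .withColumn("source_file", F.input_file_name()))

# Derive the client name from the folder path (adjust to the real layout).
contracts = raw.withColumn("client_name",
                           F.regexp_extract("source_file", r"/contracts/([^/]+)/", 1))

contracts.write.format("delta").mode("overwrite").saveAsTable("contracts_delta")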

Perform aggregations (a combined PySpark sketch follows this list):

  • how many different contracts did the client have

--group by client name and count

  • of which type

--group by client name and a distinct count on type

  • when was the date of the first contract

--group by client name and min date

  • and the last contract

--group by client name and max date

  • how long have we been working with them

--difference between the min and max dates
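
Putting those together, a rough PySpark sketch (the table and column names, such as contracts_delta, client_name, contract_type, start_date and end_date, are assumptions rather than the actual schema):

from pyspark.sql import functions as F

contracts = spark.table("contracts_delta")   # the Delta table from the ingest step above

summary = (contracts
    .groupBy("client_name")
    .agg(F.count("*").alias("num_contracts"),                    # how many contracts
         F.countDistinct("contract_type").alias("num_types"),    # how many distinct types
         F.min("start_date").alias("first_contract"),            # date of the first contract
         F.max("end_date").alias("last_contract"))               # date of the last contract
    .withColumn("days_working_together",
                F.datediff("last_contract", "first_contract")))  # difference between first and last

summary.write.format("delta").mode("overwrite").saveAsTable("client_contract_summary")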

Anonymous
Not applicable

Hey there @Blake Bleeker

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. Would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

MaverickF14
New Contributor II

Yeah,

Thanks for all the help!
