cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Best way to parse Google Analytics data in Databricks notebook

AnaMocanu
Contributor

I managed to extract the Google Analytics data via lakehouse federation and the Big Query connection but the events table values are in a weird JSON format

{"v":[{"v":{"f":[{"v":"ga_session_number"},{"v":{"f":[{"v":null},{"v":"2"},{"v":null},{"v":null}]}}]}},{"v":{"f":[{"v":"blabla"},{"v":{"f":[{"v":null},{"v":"1"},{"v":null},{"v":null}]}}]}},{"v":{"f":[{"v":"ga_session_id"},{"v":{"f":[{"v":null},{"v":"XXXX"},{"v":null},{"v":null}]}}]}}]}

Does anyone have a good technique for parsing this data, or do I need to manually parse all these columns manually?

Many thanks!

Ana

2 ACCEPTED SOLUTIONS

Accepted Solutions

daniel_sahal
Esteemed Contributor

@AnaMocanu 
I was using this function, with a little modifications on my end:
https://gist.github.com/shreyasms17/96f74e45d862f8f1dce0532442cc95b2

Maybe this will be helpful for you 🙂

View solution in original post

Thank you @daniel_sahal 

I decided to go with parsing the data from the json format, as I don't need too many columns and the elements in the list that I need will stay the same.

For example, when you're picking the first element in the list

df = df.withColumn('device_category', get_json_object(col("device"), "$.v.f[0].v")).alias("device_category")

 

View solution in original post

2 REPLIES 2

daniel_sahal
Esteemed Contributor

@AnaMocanu 
I was using this function, with a little modifications on my end:
https://gist.github.com/shreyasms17/96f74e45d862f8f1dce0532442cc95b2

Maybe this will be helpful for you 🙂

Thank you @daniel_sahal 

I decided to go with parsing the data from the json format, as I don't need too many columns and the elements in the list that I need will stay the same.

For example, when you're picking the first element in the list

df = df.withColumn('device_category', get_json_object(col("device"), "$.v.f[0].v")).alias("device_category")

 

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now