cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Best way to parse Google Analytics data in Databricks notebook

AnaMocanu
New Contributor III

I managed to extract the Google Analytics data via lakehouse federation and the Big Query connection but the events table values are in a weird JSON format

{"v":[{"v":{"f":[{"v":"ga_session_number"},{"v":{"f":[{"v":null},{"v":"2"},{"v":null},{"v":null}]}}]}},{"v":{"f":[{"v":"blabla"},{"v":{"f":[{"v":null},{"v":"1"},{"v":null},{"v":null}]}}]}},{"v":{"f":[{"v":"ga_session_id"},{"v":{"f":[{"v":null},{"v":"XXXX"},{"v":null},{"v":null}]}}]}}]}

Does anyone have a good technique for parsing this data, or do I need to manually parse all these columns manually?

Many thanks!

Ana

2 ACCEPTED SOLUTIONS

Accepted Solutions

daniel_sahal
Esteemed Contributor

@AnaMocanu 
I was using this function, with a little modifications on my end:
https://gist.github.com/shreyasms17/96f74e45d862f8f1dce0532442cc95b2

Maybe this will be helpful for you 🙂

View solution in original post

Thank you @daniel_sahal 

I decided to go with parsing the data from the json format, as I don't need too many columns and the elements in the list that I need will stay the same.

For example, when you're picking the first element in the list

df = df.withColumn('device_category', get_json_object(col("device"), "$.v.f[0].v")).alias("device_category")

 

View solution in original post

2 REPLIES 2

daniel_sahal
Esteemed Contributor

@AnaMocanu 
I was using this function, with a little modifications on my end:
https://gist.github.com/shreyasms17/96f74e45d862f8f1dce0532442cc95b2

Maybe this will be helpful for you 🙂

Thank you @daniel_sahal 

I decided to go with parsing the data from the json format, as I don't need too many columns and the elements in the list that I need will stay the same.

For example, when you're picking the first element in the list

df = df.withColumn('device_category', get_json_object(col("device"), "$.v.f[0].v")).alias("device_category")

 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group