How to convert records in Azure Databricks delta table to a nested JSON structure?

sujai_sparks
New Contributor III

Let's say I have a Delta table in Azure Databricks that stores staff details (denormalized).

[Screenshot: staff_details table in the Databricks notebook]

I want to export the data in JSON format and save it as a single file in a storage location. I need help with the Databricks SQL query to group/construct the data into the JSON structure shown below.

Here is the sample code and desired output:

Delta Table schema:

%sql

create table if not exists staff_details (
  department_id int comment 'id of the department',
  department_name string comment 'name of the department',
  employee_id int comment 'employee id of the staff',
  first_name string comment 'first name of the staff',
  last_name string comment 'last name of the staff'
)
using delta

Script to populate the delta table:

%sql

insert into staff_details (department_id, department_name, employee_id, first_name, last_name)
values (1, 'Dept-A', 101, 'Guru', 'Datt'), (1, 'Dept-A', 102, 'Tom', 'Cruise'), (2, 'Dept-B', 201, 'Angelina', 'Jolie')

Show records:

%sql
select * from staff_details order by department_id, employee_id

Desired output:

{
    "staff_details":[
        {
            "department_id":1,
            "department_name": "Dept-A",
            "staff_members": [
                {
                    "employee_id":101,
                    "first_name":"Guru",
                    "last_name":"Datt"
                },
                {
                    "employee_id":102,
                    "first_name":"Tom",
                    "last_name":"Cruise"
                }
            ]
        },
        {
            "department_id":2,
            "department_name": "Dept-B",
            "staff_members": [
                {
                    "employee_id":201,
                    "first_name":"Angelina",
                    "last_name":"Jolie"
                }
            ]
        }
    ]
}

I tried using the to_json() function, and also manual string concatenation with group by etc., but none of that works well.

Please help.


14 REPLIES

Ryan_Chynoweth
Honored Contributor III

Why do you want to convert the records to JSON? Does it have to be SQL?

You could read the table with the DataFrame APIs and write a JSON file out. Please note that "coalesce" is only there so that it produces a single file. It would be something like this:

df = spark.read.table("YOUR TABLE")

# coalesce(1) collapses the output into a single file instead of one file per partition
df.coalesce(1).write.format('json').save("/path/to/file")

Also, can you show your example with the to_json() function?

Anonymous
Not applicable

repartition(1) is slightly better than coalesce for getting down to one file.
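
For example, a minimal variation of the snippet above (same placeholder table name and output path):

df = spark.read.table("YOUR TABLE")

# repartition(1) shuffles down to a single partition just before the write, so
# upstream stages keep their parallelism; coalesce(1) can collapse the whole
# plan into one task
df.repartition(1).write.format('json').save("/path/to/file")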

Ryan_Chynoweth
Honored Contributor III

Good call. Thanks @Joseph Kambourakis

I had tried this, but it creates the output in the Delta format, i.e., the JSON output file contains one row for each record in the table.

[Screenshot: JSON output file with one row per record]

And the same thing happens if I use to_json as shown below. Even with the examples in the Databricks docs, I'm unable to construct a proper query:

[Screenshot: to_json attempt and its output]

Lastly, the reason I need the JSON output as a file is for file-based integration with other systems.

Hope that clarifies!

That is actually not a delta format. Spark writes data this way. The file you have highlighted is a JSON file.

Why do you need to have a json file?

On Microsoft SQL Server, the following T-SQL query would produce the desired output, but I'm unable to replicate it in Databricks SQL 😞

SELECT DISTINCT department_id, department_name 
	,(
		SELECT employee_id
			,first_name
			,last_name
		FROM staff_details sdi
		WHERE sdi.department_id = sdo.department_id
		FOR JSON PATH
		) AS staff_members
FROM staff_details sdo
ORDER BY sdo.department_id
FOR JSON PATH, ROOT ('staff_details');

Anonymous
Not applicable

Delta is the default format, but if you write with the JSON format it will produce a JSON file. The success and committed files are just the way that Spark ensures you don't get partial writes.
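
For context, a quick way to see those files (a sketch; the table is the one from this thread and the path is just an example):

df = spark.read.table("staff_details")
df.coalesce(1).write.format('json').mode('overwrite').save("/path/to/file")

# The save target is a directory: the data sits in the part-*.json file, while
# _SUCCESS / _committed_* / _started_* are only Spark's commit markers
display(dbutils.fs.ls("/path/to/file"))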

I see. Here is a good example from another community post.

@Ryan Chynoweth, regarding your question "Why do you need to have a json file?": I need the JSON output as a file for file-based integration with downstream systems.

NateAnth
Valued Contributor

If you want to do this in SQL, here you go:

%sql
select department_id, department_name, collect_list(struct(employee_id, first_name, last_name)) as staff_members
from staff_details
group by department_id, department_name

 Below is how to do the same in Pyspark:

from pyspark.sql.functions import col, collect_list, struct

df = spark.read.table("staff_details")

# One row per department, with the staff rolled up into an array of structs
df1 = df.groupby("department_id", "department_name").agg(
    collect_list(struct(col("employee_id"), col("first_name"), col("last_name"))).alias("staff_members")
)

Both give you an identical DataFrame structure to write out to JSON however you would like:

[Screenshot: grouped DataFrame with department_id, department_name, staff_members]
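
For example, one way to land that grouped DataFrame as JSON (a sketch; the output path is just an example):

# Writes one JSON object per department, each carrying its staff_members array
df1.coalesce(1).write.format('json').mode('overwrite').save("/tmp/staff_by_department")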

@Nathan Anthony, thank you, thank you so much!

Thank you @Ryan Chynoweth and @Joseph Kambourakis as well! You are all a great help to the community; there is a lot to learn here!

How do I mark this question as answered?

sujai_sparks
New Contributor III

Final SQL:

%sql
select
  collect_list(
    struct(department_id, department_name, staff_members)
  )
from
  (
    select
      department_id,
      department_name,
      collect_list(struct(employee_id, first_name, last_name)) as staff_members
    from
      staff_details
    group by
      department_id,
      department_name
  ) vt

And the output:

[Screenshot: final JSON output]

Tons of thanks to everyone!
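
If the file itself also needs the staff_details root key from the desired output, one possible follow-up in PySpark (a sketch; the output path and the single-file write via dbutils.fs.put are assumptions):

nested = spark.sql("""
  select collect_list(struct(department_id, department_name, staff_members)) as staff_details
  from (
    select department_id, department_name,
      collect_list(struct(employee_id, first_name, last_name)) as staff_members
    from staff_details
    group by department_id, department_name
  ) vt
""")

# The query returns a single row; toJSON() renders it as one JSON document
# shaped like {"staff_details": [...]}
json_doc = nested.toJSON().first()
dbutils.fs.put("/tmp/staff_details.json", json_doc, True)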

NateAnth
Valued Contributor

Glad it worked for you!!
