Databricks Community

Spoon_Man · ‎07-22-2024

I'm new to databricks and I have a source data model that stores the data as Name-Value pairs (i.e. normalised) in two columns in the table.

EntityID	Name	Value
1	Field1	SomeValue1
1	Field2	SomeValue2
1	Field3	SomeValue3
2	Field1	SomeValue1
2	Field3	SomeValue3

The definition of the NAME fields are stored in another table (in this case holedetails) which describes the data type.

Previously, I would use the pivot command to flatten data from this type of data model, but now with the struct datatype in databricks, I would like to take advantage of this as it seems more efficient and a much easier format to query.

I would like transform the two columns into a struct column using a statement like below. I can prepare the data to be passed to the named_struct as a string, but it fails to parse as the named_struct expects 2 or more parameters and I'm building this dynamically.

        SELECT a.HOLEID
              ,a.PROJECTCODE
              ,named_struct(concat('(', concat_ws(',', collect_list(concat('"',NAME,'","',VALUE_MOD,'"'))),')'))
            from
      (
        SELECT a.HOLEID
              ,a.PROJECTCODE
              ,a.NAME 
              ,a.VALUE
              ,case b.DATATYPE
                  when 'Text' then VALUE
                  when 'Numeric' then concat("cast('",VALUE,"' as Float)")
              end as VALUE_MOD
        FROM rtx.raw_acq_aar.holedetails a
        join rtx.raw_acq_aar.holecode b
        on a.NAME = b.NAME
        where a.PROJECTCODE = 'SANDSTONE'
      ) a
      group by a.HOLEID
            ,a.PROJECTCODE

I get the following error when I run the above statement as the named_struct function is expected 2 or more arguments.

[WRONG_NUM_ARGS.WITHOUT_SUGGESTION] The `named_struct` requires 2n (n > 0) parameters but the actual number is 1. Please, refer to 'https://spark.apache.org/docs/latest/sql-ref-functions.html' for a fix. SQLSTATE: 42605

When I run something like below it works, suggesting I've prepared the dynamic string correctly, but I want to run this statement across millions of rows.

select named_struct("HoleStatus","CompletedReclaimed","FieldDataCheckedBy","Justin Ward","WaterTabIntersected","Wet","D_SurveyedBy","Joe Bloggs","CasingCemented","No","WatertableDepth_m","cast('71' as Float)","ApprovalType","DMIRS PoW","D_SurveyMethod","GPS","D_StateName","WA","Geologist","Justin Ward","D_Commodity","Lithium","TargetDepth_m","cast('150' as Float)","ApprovalID","115501","PlugDepth_m","cast('0.4' as Float)","D_Company","Australia","DrillCompany","Ranger","CasingLeft","Yes","Cemented","No","D_OrigHoleID","SAND_2023_RC_07","CasingDepth_m","cast('6' as Float)","CasingSize","150 mm","PrecollarDepth_m","cast('10' as Float)","HolePurpose","Geological","DepthBedrock_m","cast('0' as Float)","Organisation","RTX","D_Country","AU")

Any help would be greatly appreciated.

Spoon_Man · ‎07-22-2024

Having thought this through some more, I'll probably just create a struct field for each data type. I can use from_json to create the struct field.

  select HOLEID,
        PROJECTCODE,
        from_json(vf_string, 'MAP<STRING, STRING>') as vf_string
  from
    (
        SELECT HOLEID,
              PROJECTCODE,
              concat('{', concat_ws(',', collect_list(concat('"',a.NAME,'":"',a.VALUE,'"'))),'}') as vf_string
        FROM rtx.raw_acq_aar.holedetails a 
        join rtx.raw_acq_aar.holecode b
        on a.NAME = b.NAME
        where b.DATATYPE = 'Text'
        GROUP BY HOLEID, PROJECTCODE
    ) a

szymon_dybczak · ‎07-22-2024

Your first approach didn't work, because named_struct needs it's arguments on odd postition to be foldable.
So you can think of it in following way, at compile time compiler needs to "see" this value.
That's why even if you prepared proper expression it didnt work

Databricks Community

Converting an Entity Attribute Value Data Model to Struct datatype

Photos

Connect with Databricks Users in Your Area

Databricks Learning Festival (Virtual): 15 January - 31 January 2025

Share Your Feedback in Our Community Survey

Databricks Named a Leader in the 2024 Gartner® Magic Quadrant™ for Cloud Database Management Systems

Milestone: DatabricksTV Reaches 100 Videos!

Announcing the new Meta Llama 3.3 model on Databricks