cancel
Showing results for 
Search instead for 
Did you mean: 
Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.
cancel
Showing results for 
Search instead for 
Did you mean: 

Converting an Entity Attribute Value Data Model to Struct datatype

Spoon_Man
New Contributor II

I'm new to databricks and I have a source data model that stores the data as Name-Value pairs (i.e. normalised) in two columns in the table.

EntityIDNameValue
1Field1SomeValue1
1Field2SomeValue2
1Field3SomeValue3
2Field1SomeValue1
2Field3SomeValue3

The definition of the NAME fields are stored in another table (in this case holedetails) which describes the data type.

Previously, I would use the pivot command to flatten data from this type of data model, but now with the struct datatype in databricks, I would like to take advantage of this as it seems more efficient and a much easier format to query.

I would like transform the two columns into a struct column using a statement like below. I can prepare the data to be passed to the named_struct as a string, but it fails to parse as the named_struct expects 2 or more parameters and I'm building this dynamically.

 

 

 

 

        SELECT a.HOLEID
              ,a.PROJECTCODE
              ,named_struct(concat('(', concat_ws(',', collect_list(concat('"',NAME,'","',VALUE_MOD,'"'))),')'))
            from
      (
        SELECT a.HOLEID
              ,a.PROJECTCODE
              ,a.NAME 
              ,a.VALUE
              ,case b.DATATYPE
                  when 'Text' then VALUE
                  when 'Numeric' then concat("cast('",VALUE,"' as Float)")
              end as VALUE_MOD
        FROM rtx.raw_acq_aar.holedetails a
        join rtx.raw_acq_aar.holecode b
        on a.NAME = b.NAME
        where a.PROJECTCODE = 'SANDSTONE'
      ) a
      group by a.HOLEID
            ,a.PROJECTCODE

 

 

 

 

I get the following error when I run the above statement as the named_struct function is expected 2 or more arguments.

[WRONG_NUM_ARGS.WITHOUT_SUGGESTION] The `named_struct` requires 2n (n > 0) parameters but the actual number is 1. Please, refer to 'https://spark.apache.org/docs/latest/sql-ref-functions.html' for a fix. SQLSTATE: 42605 

When I run something like below it works, suggesting I've prepared the dynamic string correctly, but I want to run this statement across millions of rows.

 

 

 

 

select named_struct("HoleStatus","CompletedReclaimed","FieldDataCheckedBy","Justin Ward","WaterTabIntersected","Wet","D_SurveyedBy","Joe Bloggs","CasingCemented","No","WatertableDepth_m","cast('71' as Float)","ApprovalType","DMIRS PoW","D_SurveyMethod","GPS","D_StateName","WA","Geologist","Justin Ward","D_Commodity","Lithium","TargetDepth_m","cast('150' as Float)","ApprovalID","115501","PlugDepth_m","cast('0.4' as Float)","D_Company","Australia","DrillCompany","Ranger","CasingLeft","Yes","Cemented","No","D_OrigHoleID","SAND_2023_RC_07","CasingDepth_m","cast('6' as Float)","CasingSize","150 mm","PrecollarDepth_m","cast('10' as Float)","HolePurpose","Geological","DepthBedrock_m","cast('0' as Float)","Organisation","RTX","D_Country","AU")

 

 

 

 

Any help would be greatly appreciated.

2 REPLIES 2

Spoon_Man
New Contributor II

Having thought this through some more, I'll probably just create a struct field for each data type. I can use from_json to create the struct field.

  select HOLEID,
        PROJECTCODE,
        from_json(vf_string, 'MAP<STRING, STRING>') as vf_string
  from
    (
        SELECT HOLEID,
              PROJECTCODE,
              concat('{', concat_ws(',', collect_list(concat('"',a.NAME,'":"',a.VALUE,'"'))),'}') as vf_string
        FROM rtx.raw_acq_aar.holedetails a 
        join rtx.raw_acq_aar.holecode b
        on a.NAME = b.NAME
        where b.DATATYPE = 'Text'
        GROUP BY HOLEID, PROJECTCODE
    ) a 

 

szymon_dybczak
Contributor III

Your first approach didn't work, because named_struct needs it's arguments on odd postition to be foldable.
So you can think of it in following way, at compile time compiler needs to "see" this value. 
That's why even if you prepared proper expression it didnt work

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group