Converting an Entity Attribute Value Data Model to Struct datatype
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-22-2024 12:17 AM - edited 07-22-2024 01:28 AM
I'm new to databricks and I have a source data model that stores the data as Name-Value pairs (i.e. normalised) in two columns in the table.
EntityID | Name | Value |
1 | Field1 | SomeValue1 |
1 | Field2 | SomeValue2 |
1 | Field3 | SomeValue3 |
2 | Field1 | SomeValue1 |
2 | Field3 | SomeValue3 |
The definition of the NAME fields are stored in another table (in this case holedetails) which describes the data type.
Previously, I would use the pivot command to flatten data from this type of data model, but now with the struct datatype in databricks, I would like to take advantage of this as it seems more efficient and a much easier format to query.
I would like transform the two columns into a struct column using a statement like below. I can prepare the data to be passed to the named_struct as a string, but it fails to parse as the named_struct expects 2 or more parameters and I'm building this dynamically.
SELECT a.HOLEID
,a.PROJECTCODE
,named_struct(concat('(', concat_ws(',', collect_list(concat('"',NAME,'","',VALUE_MOD,'"'))),')'))
from
(
SELECT a.HOLEID
,a.PROJECTCODE
,a.NAME
,a.VALUE
,case b.DATATYPE
when 'Text' then VALUE
when 'Numeric' then concat("cast('",VALUE,"' as Float)")
end as VALUE_MOD
FROM rtx.raw_acq_aar.holedetails a
join rtx.raw_acq_aar.holecode b
on a.NAME = b.NAME
where a.PROJECTCODE = 'SANDSTONE'
) a
group by a.HOLEID
,a.PROJECTCODE
I get the following error when I run the above statement as the named_struct function is expected 2 or more arguments.
[WRONG_NUM_ARGS.WITHOUT_SUGGESTION] The `named_struct` requires 2n (n > 0) parameters but the actual number is 1. Please, refer to 'https://spark.apache.org/docs/latest/sql-ref-functions.html' for a fix. SQLSTATE: 42605
When I run something like below it works, suggesting I've prepared the dynamic string correctly, but I want to run this statement across millions of rows.
select named_struct("HoleStatus","CompletedReclaimed","FieldDataCheckedBy","Justin Ward","WaterTabIntersected","Wet","D_SurveyedBy","Joe Bloggs","CasingCemented","No","WatertableDepth_m","cast('71' as Float)","ApprovalType","DMIRS PoW","D_SurveyMethod","GPS","D_StateName","WA","Geologist","Justin Ward","D_Commodity","Lithium","TargetDepth_m","cast('150' as Float)","ApprovalID","115501","PlugDepth_m","cast('0.4' as Float)","D_Company","Australia","DrillCompany","Ranger","CasingLeft","Yes","Cemented","No","D_OrigHoleID","SAND_2023_RC_07","CasingDepth_m","cast('6' as Float)","CasingSize","150 mm","PrecollarDepth_m","cast('10' as Float)","HolePurpose","Geological","DepthBedrock_m","cast('0' as Float)","Organisation","RTX","D_Country","AU")
Any help would be greatly appreciated.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-22-2024 05:28 AM
Having thought this through some more, I'll probably just create a struct field for each data type. I can use from_json to create the struct field.
select HOLEID,
PROJECTCODE,
from_json(vf_string, 'MAP<STRING, STRING>') as vf_string
from
(
SELECT HOLEID,
PROJECTCODE,
concat('{', concat_ws(',', collect_list(concat('"',a.NAME,'":"',a.VALUE,'"'))),'}') as vf_string
FROM rtx.raw_acq_aar.holedetails a
join rtx.raw_acq_aar.holecode b
on a.NAME = b.NAME
where b.DATATYPE = 'Text'
GROUP BY HOLEID, PROJECTCODE
) a
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-22-2024 06:03 AM
Your first approach didn't work, because named_struct needs it's arguments on odd postition to be foldable.
So you can think of it in following way, at compile time compiler needs to "see" this value.
That's why even if you prepared proper expression it didnt work