cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

split parse_url output for the information

turagittech
Contributor

Hi All,

I have data in blobs which I am loading from blob store to Databricks delta tables. One of the blob types contains urls. From the Urls I want to extract knowledge from the path and query parts I can get those out easily with parse url. the problem is with extracting the bits from that. As I am writing this I am thinking I'll need a couple of udfs as things like split while useful; are then still hard and that the part I am getting stuck with I can do the split part but the only reliable one is the -1 value, I'd like all of them. Same as the output of the query is as you would expect still not clean and ends up with label=value pairs.

If anyone has any tips, best functions etc that would be awesome.

I assume I'm not the first who has chomped urls for the information. I am using sql to make it manageable by people with sql experience in my absence

SELECT content,  
split_part(parse_url(content:Url, 'PATH'), '/', -1) as feature2,
split_part(parse_url(content:Url, 'PATH'), '/', 2) as feature1,
parse_url(content:Url, 'QUERY') as url,
split(parse_url(content:Url, 'QUERY'), '&') as query_values
FROM <table>
1 REPLY 1

Isi
Honored Contributor III

Hello @turagittech ,

Honestly, it all depends on how complex your URLs can get.

UDFs will always be more flexible but less performant than native SQL functions.

That said, if your team mainly works with SQL, trying to solve it natively in Databricks SQL is definitely a valid option.

For the path part, it really depends on how much detail you actually need to keep — whether you just need the last element or the full structure.

But for the query string part, I’d recommend building a map of key-value pairs, which works really well in SQL:

map_from_arrays(
  transform(split(parse_url(url, 'QUERY'), '&'), x -> split_part(x, '=', 1)),
  transform(split(parse_url(url, 'QUERY'), '&'), x -> split_part(x, '=', 2))
) AS query_map

 

Captura de pantalla 2025-10-08 a las 1.57.34.png

If you could share an example of a more complex URL, we could probably build something more tailored that handles both the path and query parts more dynamically. 

Hope this helps, 🙂

Isi