I'm using Azure Databricks.
I have a custom table-valued function that takes a URL as a parameter and outputs a single-row table with certain elements of the URL extracted and labelled.
(I get search-activity URLs, and when they're in a specific format I can retrieve values from the URL path and ?parameters, which I put into separate columns for analysis.)
It works, but it's extremely slow, and I think the split(parse_url(url, 'PATH')) call I do is confusing the query optimiser. I also can't do any kind of logical operations on the output[] array elements inside the function.
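To show the shape I mean (a made-up URL, the real ones differ), the path splits so that the interesting segments land at array indexes 2-4, because split() leaves a leading empty element before the first '/':

-- made-up example URL, just to illustrate the shape
SELECT split(parse_url('https://example.com/search/nsw/north-coast/some-park?propertyCode=ABC123', 'PATH'), '/') AS PathArray;
-- PathArray => ["", "search", "nsw", "north-coast", "some-park"]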
Here's the function:
%sql
CREATE OR REPLACE FUNCTION urlSplitter(url STRING)
RETURNS TABLE(
  Url STRING,
  UrlPath STRING,
  UrlParams STRING,
  SearchState STRING,
  SearchRegion STRING,
  SearchParkName STRING,
  PropertyCode STRING,
  NoOfTokens INT
)
RETURN
WITH src AS (
  SELECT
    -- prepare the URL into its separate components for further processing
    url,
    parse_url(url, 'PATH') AS UrlPath,
    split(parse_url(url, 'PATH'), '/') AS PathArray,
    split(parse_url(url, 'PATH'), '/')[1] AS PathArray1,
    split(parse_url(url, 'PATH'), '/')[2] AS PathArray2,
    split(parse_url(url, 'PATH'), '/')[3] AS PathArray3,
    split(parse_url(url, 'PATH'), '/')[4] AS PathArray4,
    parse_url(url, 'QUERY') AS UrlParams,
    parse_url(url, 'QUERY', 'propertyCode') AS PropertyCode
)
SELECT
  s.Url,
  s.UrlPath,
  s.UrlParams,
  PathArray[2] AS SearchState,
  PathArray[3] AS SearchRegion,
  PathArray[4] AS SearchParkName,
  s.PropertyCode,
  size(PathArray) - 2 AS NoOfTokens
FROM src s
Now, the weird thing is: if I swap the PathArray[x] references in the final SELECT to the PathArrayX columns I create in the initial CTE, it all goes really fast, as long as I don't try to use PathArrayX in a predicate or CASE statement or similar.
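I.e. the final SELECT of the fast variant looks like this (same src CTE as above):

SELECT
  s.Url,
  s.UrlPath,
  s.UrlParams,
  s.PathArray2 AS SearchState,
  s.PathArray3 AS SearchRegion,
  s.PathArray4 AS SearchParkName,
  s.PropertyCode,
  size(s.PathArray) - 2 AS NoOfTokens
FROM src s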
I like to use staged CTEs so I can test parts in isolation and avoid repeated function calls. Plus, this is a very simplified version of what I'm actually doing (though it still exhibits the behaviour): I do more complex stuff later, and I don't want to paste nested split(parse_url(...)) calls all over the place in CASE statements and the like.
The 'slow' function works fine for literals passed to it, but falls apart when referencing a normal table, even with a predicate that retrieves a single row by a unique id (the table is a few GB, partitioned, optimized, Z-ordered, etc.).
Pulling the code out of the function and running it as a 'raw' SQL statement is fast.
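The 'raw' version is roughly the following (a sketch; the activities/ActivityId/ActivityURL names match the plans below):

SELECT
  a.ActivityId,
  a.ActivityURL AS Url,
  parse_url(a.ActivityURL, 'PATH') AS UrlPath,
  parse_url(a.ActivityURL, 'QUERY') AS UrlParams,
  split(parse_url(a.ActivityURL, 'PATH'), '/')[2] AS SearchState,
  split(parse_url(a.ActivityURL, 'PATH'), '/')[3] AS SearchRegion,
  split(parse_url(a.ActivityURL, 'PATH'), '/')[4] AS SearchParkName,
  parse_url(a.ActivityURL, 'QUERY', 'propertyCode') AS PropertyCode,
  size(split(parse_url(a.ActivityURL, 'PATH'), '/')) - 2 AS NoOfTokens
FROM activities a
WHERE a.ActivityId = 1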
Slow plan:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- CollectLimit 100
+- Project [ActivityId#41282, ActivityURL#41289, Url#41327, UrlPath#41328, UrlParams#41334, SearchState#41307, SearchRegion#41308, SearchParkName#41309, PropertyCode#41335, NoOfTokens#41311]
+- SortMergeJoin [coalesce(ActivityUrl#41289, ), isnull(ActivityUrl#41289)], [coalesce(ActivityUrl#41364, ), isnull(ActivityUrl#41364)], Inner
:- Sort [coalesce(ActivityUrl#41289, ) ASC NULLS FIRST, isnull(ActivityUrl#41289) ASC NULLS FIRST], false, 0
: +- ColumnarToRow
: +- PhotonResultStage
: +- PhotonShuffleExchangeSource
: +- PhotonShuffleMapStage
: +- PhotonShuffleExchangeSink hashpartitioning(coalesce(ActivityUrl#41289, ), isnull(ActivityUrl#41289), 200)
: +- PhotonProject [ActivityID#41282, ActivityURL#41289]
: +- PhotonAdapter
: +- FileScan parquet *[ActivityID#41282,ActivityURL#41289,PartitionKey#41303] Batched: true, DataFilters: [], Format: Parquet, Location: PreparedDeltaFileIndex(1 paths)[abfss://*.dfs.core.windows.net/*..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ActivityID:int,ActivityURL:string>
+- Sort [coalesce(ActivityUrl#41364, ) ASC NULLS FIRST, isnull(ActivityUrl#41364) ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(coalesce(ActivityUrl#41364, ), isnull(ActivityUrl#41364), 200), ENSURE_REQUIREMENTS, [id=#19269]
+- Project [Url#41327, UrlPath#41328, UrlParams#41334, PathArray#41329[2] AS SearchState#41307, PathArray#41329[3] AS SearchRegion#41308, PathArray#41329[4] AS SearchParkName#41309, PropertyCode#41335, (size(PathArray#41329, true) - 2) AS NoOfTokens#41311, ActivityUrl#41364]
+- Project [ActivityUrl#41289 AS url#41327, parse_url(ActivityUrl#41289, PATH, false) AS UrlPath#41328, split(parse_url(ActivityUrl#41289, PATH, false), /, -1) AS PathArray#41329, parse_url(ActivityUrl#41289, QUERY, false) AS UrlParams#41334, parse_url(ActivityUrl#41289, QUERY, propertyCode, false) AS PropertyCode#41335, ActivityUrl#41289 AS ActivityUrl#41364]
+- ColumnarToRow
+- PhotonResultStage
+- PhotonGroupingAgg(keys=[ActivityUrl#41289], functions=[])
+- PhotonShuffleExchangeSource
+- PhotonShuffleMapStage
+- PhotonShuffleExchangeSink hashpartitioning(ActivityUrl#41289, 200)
+- PhotonGroupingAgg(keys=[ActivityUrl#41289], functions=[])
+- PhotonProject [ActivityURL#41289]
+- PhotonAdapter
+- FileScan parquet *[ActivityURL#41289,PartitionKey#41303] Batched: true, DataFilters: [], Format: Parquet, Location: PreparedDeltaFileIndex(1 paths)[abfss://*.dfs.core.windows.net/*..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ActivityURL:string>
Fast plan:
== Physical Plan ==
CollectLimit 100
+- Project [ActivityId#41482, ActivityURL#41489, ActivityUrl#41489 AS Url#41504, parse_url(ActivityUrl#41489, PATH, false) AS UrlPath#41505, parse_url(ActivityUrl#41489, QUERY, false) AS UrlParams#41506, split(parse_url(ActivityUrl#41489, PATH, false), /, 4)[2] AS SearchState#41507, split(parse_url(ActivityUrl#41489, PATH, false), /, 5)[3] AS SearchRegion#41508, split(parse_url(ActivityUrl#41489, PATH, false), /, 6)[4] AS SearchParkName#41509, parse_url(ActivityUrl#41489, QUERY, propertyCode, false) AS PropertyCode#41510, (size(split(parse_url(ActivityUrl#41489, PATH, false), /, -1), true) - 2) AS NoOfTokens#41511]
+- *(1) ColumnarToRow
+- PhotonResultStage
+- PhotonAdapter
+- FileScan parquet *[ActivityID#41482,ActivityURL#41489,PartitionKey#41503] Batched: true, DataFilters: [], Format: Parquet, Location: PreparedDeltaFileIndex(1 paths)[abfss://*.dfs.core.windows.net/*, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ActivityID:int,ActivityURL:string>
== Photon Explanation ==
Photon does not fully support the query because:
Unsupported expression(s): parse_url(ActivityUrl#41489, PATH, false), parse_url(ActivityUrl#41489, QUERY, false), parse_url(ActivityUrl#41489, PATH, false), parse_url(ActivityUrl#41489, PATH, false), parse_url(ActivityUrl#41489, PATH, false), parse_url(ActivityUrl#41489, QUERY, propertyCode, false), parse_url(ActivityUrl#41489, PATH, false)
reference node:
Project [ActivityId#41482, ActivityURL#41489, ActivityUrl#41489 AS Url#41504, parse_url(ActivityUrl#41489, PATH, false) AS UrlPath#41505, parse_url(ActivityUrl#41489, QUERY, false) AS UrlParams#41506, split(parse_url(ActivityUrl#41489, PATH, false), /, 4)[2] AS SearchState#41507, split(parse_url(ActivityUrl#41489, PATH, false), /, 5)[3] AS SearchRegion#41508, split(parse_url(ActivityUrl#41489, PATH, false), /, 6)[4] AS SearchParkName#41509, parse_url(ActivityUrl#41489, QUERY, propertyCode, false) AS PropertyCode#41510, (size(split(parse_url(ActivityUrl#41489, PATH, false), /, -1), true) - 2) AS NoOfTokens#41511]
Is this a bug? Are there alternative approaches?
Would writing this in Python and registering it make a difference? (I'd rather not, as the %sql function is under source control, managed along with the other SQL objects.)
The usage is:
SELECT a.url, u.*
FROM activities a, LATERAL urlSplitter(a.url) u
WHERE id = 1