How to make backreferences in regexp_replace repl string work correctly in Databricks SQL?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-26-2022 08:15 AM
Both of these work in Spark SQL:
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '${one}')
However, neither work in Databricks SQL. I found that this works for the numeric backreference but I can't find any way to make named backreferences work in Databricks. The problem seems to be around the repl string of regexp_replace.
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$1')
However, when used in the regex itself, it only works without the backslash:
select 'aabc' RLIKE '^(\\w)$1' dollar, 'aabc' RLIKE '^(\\w)\$1' escdollar -- dollar = true and escdollar = false
select 'aabc' RLIKE '^(?<one>\\w)${one}' name -- also works
Does anyone know of a way to get named backreferences to work in repl string? Or better yet, to get Spark SQL syntax to work properly in Databricks?
- Labels:
-
Databricks SQL
-
Regex
-
Spark sql
-
SQL
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-11-2022 07:04 AM
Hello @Stephen Wilcoxon
Could you please share the expected output in Spark SQL?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
02-22-2023 03:09 PM
I apologize - I apparently lost track of this...
In Spark (but not in Databricks), both of these:
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '${one}')
result in:
14567890abc
In general, I've found Databricks to work at least as well as Spark but it fails to work correctly in this situation.
In Databricks, you have to use these variations (which are not standard Spark):
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$\{one}')
(note that the named capture requires escaping both the $ and the { )
Upon further testing, $# and ${name} do not work in RLIKE after all (in Databricks at least - have not tested Spark). For example:
regexp_extract('aabc', '^(\\w)$1', 0)
returns "a" (not "aa") but '^(\\w)\1' works correctly.
![](/skins/images/582998B45490C7019731A5B3A872C751/responsive_peak/images/icon_anonymous_message.png)
![](/skins/images/582998B45490C7019731A5B3A872C751/responsive_peak/images/icon_anonymous_message.png)