cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

How to make backreferences in regexp_replace repl string work correctly in Databricks SQL?

Twilight
New Contributor III

Both of these work in Spark SQL:

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '${one}')

However, neither work in Databricks SQL. I found that this works for the numeric backreference but I can't find any way to make named backreferences work in Databricks. The problem seems to be around the repl string of regexp_replace.

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$1')

However, when used in the regex itself, it only works without the backslash:

select 'aabc' RLIKE '^(\\w)$1' dollar, 'aabc' RLIKE '^(\\w)\$1' escdollar   -- dollar = true and escdollar = false
select 'aabc' RLIKE '^(?<one>\\w)${one}' name   -- also works

Does anyone know of a way to get named backreferences to work in repl string? Or better yet, to get Spark SQL syntax to work properly in Databricks?

2 REPLIES 2

User16764241763
Honored Contributor

Hello @Stephen Wilcoxon​ 

Could you please share the expected output in Spark SQL?

I apologize - I apparently lost track of this...

In Spark (but not in Databricks), both of these:

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '${one}')

result in:

14567890abc

In general, I've found Databricks to work at least as well as Spark but it fails to work correctly in this situation.

In Databricks, you have to use these variations (which are not standard Spark):

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$\{one}')

(note that the named capture requires escaping both the $ and the { )

Upon further testing, $# and ${name} do not work in RLIKE after all (in Databricks at least - have not tested Spark). For example:

regexp_extract('aabc', '^(\\w)$1', 0)

returns "a" (not "aa") but '^(\\w)\1' works correctly.

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!