cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

How to make backreferences in regexp_replace repl string work correctly in Databricks SQL?

Twilight
New Contributor III

Both of these work in Spark SQL:

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '${one}')

However, neither work in Databricks SQL. I found that this works for the numeric backreference but I can't find any way to make named backreferences work in Databricks. The problem seems to be around the repl string of regexp_replace.

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$1')

However, when used in the regex itself, it only works without the backslash:

select 'aabc' RLIKE '^(\\w)$1' dollar, 'aabc' RLIKE '^(\\w)\$1' escdollar   -- dollar = true and escdollar = false
select 'aabc' RLIKE '^(?<one>\\w)${one}' name   -- also works

Does anyone know of a way to get named backreferences to work in repl string? Or better yet, to get Spark SQL syntax to work properly in Databricks?

2 REPLIES 2

User16764241763
Honored Contributor

Hello @Stephen Wilcoxon​ 

Could you please share the expected output in Spark SQL?

I apologize - I apparently lost track of this...

In Spark (but not in Databricks), both of these:

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '${one}')

result in:

14567890abc

In general, I've found Databricks to work at least as well as Spark but it fails to work correctly in this situation.

In Databricks, you have to use these variations (which are not standard Spark):

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$\{one}')

(note that the named capture requires escaping both the $ and the { )

Upon further testing, $# and ${name} do not work in RLIKE after all (in Databricks at least - have not tested Spark). For example:

regexp_extract('aabc', '^(\\w)$1', 0)

returns "a" (not "aa") but '^(\\w)\1' works correctly.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.