cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

How to make backreferences in regexp_replace repl string work correctly in Databricks SQL?

Twilight
New Contributor III

Both of these work in Spark SQL:

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '${one}')

However, neither work in Databricks SQL. I found that this works for the numeric backreference but I can't find any way to make named backreferences work in Databricks. The problem seems to be around the repl string of regexp_replace.

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$1')

However, when used in the regex itself, it only works without the backslash:

select 'aabc' RLIKE '^(\\w)$1' dollar, 'aabc' RLIKE '^(\\w)\$1' escdollar   -- dollar = true and escdollar = false
select 'aabc' RLIKE '^(?<one>\\w)${one}' name   -- also works

Does anyone know of a way to get named backreferences to work in repl string? Or better yet, to get Spark SQL syntax to work properly in Databricks?

2 REPLIES 2

User16764241763
Honored Contributor

Hello @Stephen Wilcoxonโ€‹ 

Could you please share the expected output in Spark SQL?

I apologize - I apparently lost track of this...

In Spark (but not in Databricks), both of these:

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '${one}')

result in:

14567890abc

In general, I've found Databricks to work at least as well as Spark but it fails to work correctly in this situation.

In Databricks, you have to use these variations (which are not standard Spark):

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$\{one}')

(note that the named capture requires escaping both the $ and the { )

Upon further testing, $# and ${name} do not work in RLIKE after all (in Databricks at least - have not tested Spark). For example:

regexp_extract('aabc', '^(\\w)$1', 0)

returns "a" (not "aa") but '^(\\w)\1' works correctly.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group