Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Bug - Databricks requires extra escapes in the replacement string of regexp_replace (compared to Spark)

Twilight
New Contributor III

In Spark (but not Databricks), these work:

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '$3$2$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '${three}${two}${one}')

In Databricks, you have to use these instead:

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$3\$2\$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$\{three}\$\{two}\$\{one}')

All of the expressions above produce 3214567890abc.
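What these expressions compute can be sanity-checked outside Spark. Here is a minimal illustration in Python's re module (note this is Python's engine, not Spark's or Databricks'; Python writes named backreferences as \g<name> rather than Java's ${name}):

```python
import re

# Swap the first three word characters, mirroring the regexp_replace
# examples above: groups one/two/three capture '1', '2', '3', and the
# replacement emits them in reverse order.
pattern = r'^(?P<one>\w)(?P<two>\w)(?P<three>\w)'
result = re.sub(pattern, r'\g<three>\g<two>\g<one>', '1234567890abc')
print(result)  # 3214567890abc
```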

1 ACCEPTED SOLUTION


Twilight
New Contributor III

On further querying and reading, I think this IS a bug. The Photon documentation states:

"Photon is an ANSI-compliant engine designed to be compatible with modern Apache Spark APIs and just works with your existing code — SQL, Python, R, Scala and Java — no rewrite required."

At least for SQL regexp_replace, this is not the case.

@Suteja Kanuri, any response or further thoughts?


5 REPLIES

Anonymous
Not applicable

@Stephen Wilcoxon: No, it is not a bug. Databricks uses a different flavor of regular expression syntax than Apache Spark. In particular, Databricks uses Java's regular expression syntax, whereas Apache Spark uses Scala's. Java and Scala differ in some details of their regular expression syntax, which may require different escaping of certain characters.

For example, in Java regular expressions, the backslash character itself needs to be escaped with another backslash. So, if you want the replacement to contain a literal backslash followed by the letter 'n', you need to write four backslashes in the replacement string: "\\\\n". In Scala regular expressions, you only need two backslashes: "\\n".

Therefore, if you are porting regular expressions from Spark to Databricks, you may need to adjust the regular expression and/or the replacement string to match Databricks' syntax.

Hope this helps.
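Java's regex library handles exactly this problem with Matcher.quoteReplacement, which escapes the two characters that are special in a Java-style replacement string: backslash and dollar sign. A minimal Python sketch of that escaping rule (the helper name is mine, not a Databricks or Spark API):

```python
def quote_replacement(s: str) -> str:
    """Escape backslash and dollar sign so the string can be used
    literally in a Java-style regex replacement (mirrors the behavior
    of java.util.regex.Matcher.quoteReplacement)."""
    # Escape backslashes first so the dollar-sign escapes added below
    # are not themselves doubled.
    return s.replace('\\', '\\\\').replace('$', '\\$')

print(quote_replacement('price: $3'))  # price: \$3
```

Running the workaround strings from the original post through an escaper like this is one way to keep source regexes portable instead of hand-doubling the escapes.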

Twilight
New Contributor III

That's an extremely nonintuitive change. Given that Databricks at least appears to be based on Spark and Scala (Spark and Scala versions are specified in the cluster config, with no mention of Java/JVM versions), why would Databricks choose to change the Spark regex behavior? And if changing the regex behavior, why choose Java - why not something superior like Perl (or even PCRE)?

For SQL, it would be 1000x more useful if Databricks matched Spark. Is there any chance that this decision will be revisited?

At an absolute minimum, can this be made extremely prominent in the docs? Given your answer, I just tried Googling and absolutely nothing comes up about differences in SQL regex (or SQL syntax at all) between Databricks and Spark.


Anonymous
Not applicable

@Stephen Wilcoxon: I shall raise this as a bug. Meanwhile, since you have found a way to work around this discrepancy, is there anything else I can help you with?

EWhitley
New Contributor III

For example, in Java regular expressions, the backslash character itself needs to be escaped with another backslash. So, if you want the replacement to contain a literal backslash followed by the letter 'n', you need to write four backslashes in the replacement string: "\\\\n".

Just wanted to say thank you for posting that. I was struggling with the same thing today and didn't think to escape the backslashes. Fixed things completely.
