Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Bug - Databricks requires extra escapes in the replacement string of regexp_replace (compared to Spark)

Twilight
New Contributor III

In Spark (but not Databricks), these work:

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '$3$2$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '${three}${two}${one}')

In Databricks, you have to use these instead:

regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$3\$2\$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$\{three}\$\{two}\$\{one}')

All of the expressions above produce 3214567890abc.
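What these expressions compute can be sanity-checked outside Spark. Here is a minimal illustration in Python's re module (note this is Python's engine, not Spark's or Databricks'; Python writes named backreferences as \g<name> rather than Java's ${name}):

```python
import re

# Swap the first three word characters, mirroring the regexp_replace
# examples above: groups one/two/three capture '1', '2', '3', and the
# replacement emits them in reverse order.
pattern = r'^(?P<one>\w)(?P<two>\w)(?P<three>\w)'
result = re.sub(pattern, r'\g<three>\g<two>\g<one>', '1234567890abc')
print(result)  # 3214567890abc
```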

1 ACCEPTED SOLUTION


Twilight
New Contributor III

On further querying and reading, I think this IS a bug. The Photon documentation states:

"Photon is an ANSI-compliant engine designed to be compatible with modern Apache Spark APIs and just works with your existing code — SQL, Python, R, Scala and Java — no rewrite required."

At least for SQL regexp_replace, this is not the case.

@Suteja Kanuri, any response or further thoughts?


5 REPLIES

Anonymous
Not applicable

@Stephen Wilcoxon: No, it is not a bug. Databricks uses a different flavor of regular expression syntax than Apache Spark. In particular, Databricks uses Java's regular expression syntax, whereas Apache Spark uses Scala's. Java and Scala differ in some details of their regular expression syntax, which may require different escaping of certain characters.

For example, in Java regular expressions, the backslash character itself needs to be escaped with another backslash. So, if you want the replacement to contain a literal backslash followed by the letter 'n', you need to write four backslashes in the replacement string: "\\\\n". In Scala regular expressions, you only need two backslashes: "\\n".

Therefore, if you are porting regular expressions from Spark to Databricks, you may need to adjust the regular expression and/or the replacement string to match Databricks' syntax.

Hope this helps.
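Java's regex library handles exactly this problem with Matcher.quoteReplacement, which escapes the two characters that are special in a Java-style replacement string: backslash and dollar sign. A minimal Python sketch of that escaping rule (the helper name is mine, not a Databricks or Spark API):

```python
def quote_replacement(s: str) -> str:
    """Escape backslash and dollar sign so the string can be used
    literally in a Java-style regex replacement (mirrors the behavior
    of java.util.regex.Matcher.quoteReplacement)."""
    # Escape backslashes first so the dollar-sign escapes added below
    # are not themselves doubled.
    return s.replace('\\', '\\\\').replace('$', '\\$')

print(quote_replacement('price: $3'))  # price: \$3
```

Running the workaround strings from the original post through an escaper like this is one way to keep source regexes portable instead of hand-doubling the escapes.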

Twilight
New Contributor III

That's an extremely nonintuitive change. Given that Databricks at least appears to be based on Spark and Scala (Spark and Scala versions are specified in the cluster config, with no mention of Java/JVM versions), why would Databricks choose to change the Spark regex behavior? And if changing the regex behavior, why choose Java - why not something superior like Perl (or even PCRE)?

For SQL, it would be 1000x more useful if Databricks matched Spark. Is there any chance that this decision will be revisited?

At an absolute minimum, can this be made extremely prominent in the docs? Given your answer, I just tried Googling and absolutely nothing comes up about differences in SQL regex (or SQL syntax at all) between Databricks and Spark.


Anonymous
Not applicable

@Stephen Wilcoxon: I shall raise this as a bug. Meanwhile, since you have found a way to work around this discrepancy, is there anything else I can help you with?

EWhitley
New Contributor III

For example, in Java regular expressions, the backslash character itself needs to be escaped with another backslash. So, if you want the replacement to contain a literal backslash followed by the letter 'n', you need to write four backslashes in the replacement string: "\\\\n".

Just wanted to say thank you for posting that. I was struggling with the same thing today and didn't think to escape the backslashes. Fixed things completely.
