02-27-2023 09:01 AM
In Spark (but not Databricks), these work:
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '$3$2$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '${three}${two}${one}')
In Databricks, you have to use these instead:
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$3\$2\$1')
regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$\{three}\$\{two}\$\{one}')
All regexes above result in 3214567890abc
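For reference, here is the Databricks-side workaround as a complete, runnable statement (a minimal sketch; it assumes a Databricks SQL warehouse or cluster, and the expected output is taken from the results above):
-- Swap the first three characters using escaped group references
SELECT
  regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$3\$2\$1') AS by_number,
  regexp_replace('1234567890abc', '^(?<one>\\w)(?<two>\\w)(?<three>\\w)', '\$\{three}\$\{two}\$\{one}') AS by_name;
-- both columns: 3214567890abc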
03-13-2023 04:54 AM
@Stephen Wilcoxon: No, it is not a bug. Databricks uses a different flavor of regular expression syntax than Apache Spark. In particular, Databricks uses Java's regular expression syntax, whereas Apache Spark uses Scala's regular expression syntax. Java and Scala have some differences in their regular expression syntax, which may require different escaping of certain characters.
For example, in Java regular expressions, the backslash character itself needs to be escaped with another backslash. So, if you want the replacement to produce a literal backslash followed by the letter 'n', you need to use four backslashes in the replacement string: "\\\\n". In Scala regular expressions, you only need two backslashes: "\\n".
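As an illustrative sketch of that escaping (an assumption about default string-literal handling: Spark-based SQL parsers treat backslashes in single-quoted literals as escapes, so each backslash must itself be doubled before the regex engine ever sees it):
-- '\\\\n' parses to the two-character sequence \\n; the regex engine
-- then reads \\ as an escaped backslash, so the output contains a
-- literal backslash followed by n.
SELECT regexp_replace('abc', 'b', '\\\\n') AS replaced;
-- expected result: a\nc  (four characters: a, \, n, c)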
Therefore, if you are porting regular expressions from Spark to Databricks, you may need to adjust the regular expression and/or the replacement string to match Databricks' syntax.
Hope this helps.
03-13-2023 07:45 AM
That's an extremely nonintuitive change. Given that Databricks at least appears to be based on Spark and Scala (Spark and Scala versions are specified in the cluster config, with no mention of Java/JVM versions), why would Databricks choose to change the Spark regex behavior? And if changing the regex behavior, why choose Java's - why not something superior like Perl's (or even PCRE)?
For SQL, it would be 1000x more useful if Databricks matched Spark. Is there any chance that this decision will be revisited?
At an absolute minimum, can this be made extremely prominent in the docs? Given your answer, I just tried Googling and absolutely nothing comes up about differences in SQL regex (or SQL syntax at all) between Databricks and Spark.
03-13-2023 07:54 AM
On further querying and reading, I think this IS a bug. Photon specifies:
"Photon is an ANSI-compliant engine designed to be compatible with modern Apache Spark APIs and just works with your existing code โ SQL, Python, R, Scala and Java โ no rewrite required."
At least for SQL regexp_replace, this is not the case.
@Suteja Kanuri Any response or further thoughts?
03-14-2023 01:06 AM
@Stephen Wilcoxon: I shall raise this as a bug. Meanwhile, since you have found a way to work around this discrepancy, is there anything else I can help you with?
01-18-2024 11:20 AM
For example, in Java regular expressions, the backslash character itself needs to be escaped with another backslash. So, if you want the replacement to produce a literal backslash followed by the letter 'n', you need to use four backslashes in the replacement string: "\\\\n".
Just wanted to say thank you for posting that. I was struggling with the same thing today and didn't think to escape the backslashes. Fixed things completely.