Databricks Community

eimis_pacheco · ‎02-16-2022

Hi community,

We need to remove characters that take more than 4 bytes, using PySpark in Databricks, because Amazon Redshift does not support them. Does anyone know how I can accomplish this?

Thank you very much in advance.

Regards

Shalabh007 · ‎11-29-2022

Assuming you have a string-type column in a PySpark DataFrame, one possible approach is:

Identify the total number of characters in each value of the column (say
Identify the number of bytes taken by each character (say b).
Use the substring() function to select the first n characters, where n = floor(4 / b).
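
A rough sketch of the byte-counting step, in case it helps. It measures each character's UTF-8 size and, instead of truncating with substring(), drops the characters that exceed a limit, since the question asks for removal. The names utf8_len, strip_wide_chars and clean_column are made up for illustration, and the max_bytes default of 3 is only an assumption: a UTF-8 character is at most 4 bytes, so a limit of 4 would remove nothing, and you should set the limit to whatever your Redshift column actually accepts.

```python
def utf8_len(ch):
    """Return how many bytes one character takes when encoded as UTF-8."""
    # surrogatepass keeps a stray lone surrogate from raising an error
    return len(ch.encode("utf-8", errors="surrogatepass"))


def strip_wide_chars(value, max_bytes=3):
    """Drop every character whose UTF-8 encoding is longer than max_bytes.

    max_bytes=3 is an assumed limit, not something stated in the thread.
    """
    if value is None:
        return None  # leave SQL NULLs untouched
    return "".join(ch for ch in value if utf8_len(ch) <= max_bytes)


def clean_column(df, col_name, max_bytes=3):
    """Apply strip_wide_chars to one string column of a PySpark DataFrame."""
    # Imported here so the two helpers above can be used without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    strip_udf = F.udf(lambda v: strip_wide_chars(v, max_bytes), StringType())
    return df.withColumn(col_name, strip_udf(F.col(col_name)))
```

You would call it as clean_column(df, "my_col"), where df and "my_col" stand in for your own DataFrame and column name. A Python UDF is slower than built-in column functions, so treat this as a starting point rather than a tuned solution.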

How to remove more than 4 byte characters using pyspark in databricks?
