Data Engineering

How to remove characters of more than 4 bytes using PySpark in Databricks?

eimis_pacheco
Contributor

Hi community,

We need to remove characters of more than 4 bytes using PySpark in Databricks, since these are not supported by Amazon Redshift. Does anyone know how I can accomplish this?

Thank you very much in advance

Regards

1 REPLY

Shalabh007
Honored Contributor

Assuming you have a string-type column in a PySpark DataFrame, one possible way could be:

  1. identify the total number of characters for each value in the column (say
  2. identify the number of bytes taken by each character (say b)
  3. use the substring() function to select the first n characters, where n = floor(4 / b)
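
For comparison, here is a minimal sketch of a different approach from the steps above: rather than truncating with substring(), it measures each character's UTF-8 byte length and drops the ones over a limit. The names `strip_wide_chars`, `clean_column` and `MAX_BYTES_PER_CHAR` are hypothetical, not part of PySpark. UTF-8 never uses more than 4 bytes per character, so the default limit of 3 is an assumption that the goal is to drop 4-byte characters such as emoji; adjust it to whatever your Redshift column actually accepts.

```python
from typing import Optional

# Assumption: characters needing more than this many UTF-8 bytes are removed.
# A limit of 3 drops 4-byte characters (e.g. emoji); change it as needed.
MAX_BYTES_PER_CHAR = 3


def strip_wide_chars(value: Optional[str],
                     max_bytes: int = MAX_BYTES_PER_CHAR) -> Optional[str]:
    """Return value without characters whose UTF-8 encoding exceeds max_bytes."""
    if value is None:
        # Keep SQL NULLs as NULLs.
        return None
    return "".join(ch for ch in value if len(ch.encode("utf-8")) <= max_bytes)


def clean_column(df, column: str, max_bytes: int = MAX_BYTES_PER_CHAR):
    """Hypothetical helper: apply strip_wide_chars to one string column."""
    # Imported lazily so the pure-Python function above can be used
    # and tested without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    strip_udf = F.udf(lambda s: strip_wide_chars(s, max_bytes), StringType())
    return df.withColumn(column, strip_udf(F.col(column)))
```

On a DataFrame `df` with a string column named, say, `comment`, this could be used as `df = clean_column(df, "comment")`. A Python UDF is slower than built-in column functions, so on large tables it may be worth checking whether a built-in such as `regexp_replace` can do the same job.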