How to remove characters of more than 4 bytes using PySpark in Databricks?
02-16-2022 08:49 PM
Hi community,
We need to remove characters of more than 4 bytes using PySpark in Databricks, since these are not supported by Amazon Redshift. Does anyone know how I can accomplish this?
Thank you very much in advance
Regards
Labels: PySpark
1 REPLY
11-29-2022 03:00 AM
Assuming you have a string-type column in a PySpark DataFrame, one possible approach is:
- identify total number of characters for each value in column (say
- identify the number of bytes taken by each character (say b)
- use the substring() function to select the first n characters, where n = floor(4 / b)
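The "bytes taken by each character" idea can also be applied per character rather than per string. Below is a minimal sketch, not a tested Databricks solution: it assumes the goal is to drop every character whose UTF-8 encoding is longer than a chosen byte limit. The names `strip_wide_chars`, `with_clean_column` and the default `max_bytes=3` (which drops 4-byte characters such as emoji) are hypothetical choices for illustration; adjust the limit to whatever your Redshift column actually accepts.

```python
def strip_wide_chars(text, max_bytes=3):
    """Drop every character whose UTF-8 encoding is longer than max_bytes.

    max_bytes=3 is an assumed default: it removes 4-byte characters
    (for example emoji) and keeps everything shorter.
    """
    if text is None:
        return None  # keep SQL NULLs as NULLs
    return "".join(ch for ch in text if len(ch.encode("utf-8")) <= max_bytes)


def with_clean_column(df, column, max_bytes=3):
    """Return df with `column` passed through strip_wide_chars.

    Hypothetical helper; pyspark is imported here so the pure-Python
    function above can be used and tested without a Spark installation.
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    clean = F.udf(lambda s: strip_wide_chars(s, max_bytes), StringType())
    return df.withColumn(column, clean(F.col(column)))
```

On a DataFrame this would be called as `df = with_clean_column(df, "comment")`, where `comment` is a placeholder column name. A Python UDF is slower than built-in column functions, so for large tables it may be worth checking whether a `regexp_replace` on the column can do the same job.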

