Removing non-ascii and special character in pyspark

RohiniMathur · 09-23-2019

I am running Spark 2.4.4 with Python 2.7, and my IDE is PyCharm.

The input file (.csv) contains encoded values in some columns, as shown below.

The file data looks like this:

COL1,COL2,COL3,COL4

CM, 503004, (d$όνυ$F|'.h*Λ!ψμ=(.ξ; ,.ʽ|!3-2-704

The output I am trying to get is:

CM,503004,,3-2-704 ---- all encoded and non-ASCII values removed.

Code I tried:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Spark").getOrCreate()
df = spark.read.csv("filepath\Customers_v01.csv", header=True, sep=",")
myres = df.rdd.map(lambda x: x[1].encode().decode('utf-8'))
print(myres.collect())

But this gives only:

503004 -- it prints only the COL2 value.

Please share your suggestions. Is it possible to fix this issue in PySpark?

Thanks a lot
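
For reference, the attempt above returns only COL2 because `x[1]` selects the second field of each row, and encoding then decoding as UTF-8 does not remove any characters. One possible approach, sketched below under some assumptions, is to strip everything except ASCII letters, digits and hyphens from every column. The names `KEEP_PATTERN`, `clean_value` and `clean_dataframe` are hypothetical, and the set of characters to keep is a guess based on the sample row, not a confirmed rule.

```python
import re

# Assumption: only ASCII letters, digits and hyphens are worth keeping.
# Everything else (non-ASCII characters, punctuation, spaces) is removed.
KEEP_PATTERN = r"[^A-Za-z0-9-]"


def clean_value(value):
    """Plain-Python version of the rule, useful for checking it on sample values."""
    if value is None:
        return None
    return re.sub(KEEP_PATTERN, "", value)


def clean_dataframe(df):
    """Apply the same rule to every column of a Spark DataFrame.

    pyspark is imported lazily so that clean_value can be tried without Spark.
    Assumes all columns are strings, as they are when a CSV is read without
    schema inference.
    """
    from pyspark.sql.functions import col, regexp_replace

    return df.select(
        [regexp_replace(col(c), KEEP_PATTERN, "").alias(c) for c in df.columns]
    )
```

With the `df` from the code above, `clean_dataframe(df).show()` would display the cleaned columns. Note that with this rule the COL3 value in the sample row becomes `dFh` rather than an empty string, because it contains a few ASCII letters. Getting an empty COL3 would need an additional rule that the sample alone does not pin down.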
