<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Data ingestion issue with THAI data in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/data-ingestion-issue-with-thai-data/m-p/108053#M42963</link>
    <description>&lt;P&gt;I have a use case where my file has data in Thai characters. The source location is Azure Blob Storage, where the files are stored in text format. I am using the following code to read the file, but when I download the data from the catalog, it encloses the data in quotes, which I don't want.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;input_df = (
            spark.read.format("text")
            .option("ignoreLeadingWhiteSpace", "false")
            .option("ignoreTrailingWhiteSpace", "false")
            .option("encoding", encoding)
            .option("keepUndefinedRows", True)
            .load(file_path)
            .withColumn("decoded_text", expr(f"regexp_replace(decode(value, '{encoding}'), '^\"|\"$', '')"))
            .drop("value")
            .withColumnRenamed("decoded_text", "value")
        )&lt;/LI-CODE&gt;</description>
    <pubDate>Fri, 31 Jan 2025 11:15:57 GMT</pubDate>
    <dc:creator>hardeeksharma</dc:creator>
    <dc:date>2025-01-31T11:15:57Z</dc:date>
    <item>
      <title>Data ingestion issue with THAI data</title>
      <link>https://community.databricks.com/t5/data-engineering/data-ingestion-issue-with-thai-data/m-p/108053#M42963</link>
      <description>&lt;P&gt;I have a use case where my file has data in Thai characters. The source location is Azure Blob Storage, where the files are stored in text format. I am using the following code to read the file, but when I download the data from the catalog, it encloses the data in quotes, which I don't want.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;input_df = (
            spark.read.format("text")
            .option("ignoreLeadingWhiteSpace", "false")
            .option("ignoreTrailingWhiteSpace", "false")
            .option("encoding", encoding)
            .option("keepUndefinedRows", True)
            .load(file_path)
            .withColumn("decoded_text", expr(f"regexp_replace(decode(value, '{encoding}'), '^\"|\"$', '')"))
            .drop("value")
            .withColumnRenamed("decoded_text", "value")
        )&lt;/LI-CODE&gt;</description>
      <pubDate>Fri, 31 Jan 2025 11:15:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-ingestion-issue-with-thai-data/m-p/108053#M42963</guid>
      <dc:creator>hardeeksharma</dc:creator>
      <dc:date>2025-01-31T11:15:57Z</dc:date>
    </item>
    <item>
      <title>Re: Data ingestion issue with THAI data</title>
      <link>https://community.databricks.com/t5/data-engineering/data-ingestion-issue-with-thai-data/m-p/108177#M42989</link>
      <description>&lt;P&gt;Do the quotes exist in original data?&lt;/P&gt;</description>
      <pubDate>Fri, 31 Jan 2025 18:48:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-ingestion-issue-with-thai-data/m-p/108177#M42989</guid>
      <dc:creator>Lakshay</dc:creator>
      <dc:date>2025-01-31T18:48:16Z</dc:date>
    </item>
  </channel>
</rss>

