Data ingestion issue with THAI data

hardeeksharma · 01-31-2025

I have a use case where my file contains Thai characters. The source is Azure Blob Storage, where the files are stored as plain text. I am using the following code to read the file, but when I download the data from the catalog, the values are enclosed in quotes, which I don't want.

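from pyspark.sql.functions import expr

# `spark`, `encoding`, and `file_path` are assumed to be defined earlier.
input_df = (
    spark.read.format("text")
    .option("ignoreLeadingWhiteSpace", "false")
    .option("ignoreTrailingWhiteSpace", "false")
    .option("encoding", encoding)
    .option("keepUndefinedRows", True)
    .load(file_path)
    # Decode each line, then strip one leading and one trailing double quote.
    .withColumn("decoded_text", expr(f"regexp_replace(decode(value, '{encoding}'), '^\"|\"$', '')"))
    .drop("value")
    .withColumnRenamed("decoded_text", "value")
)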

Lakshay · 01-31-2025

Do the quotes exist in the original data?
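
One way to check this outside Spark is to look at a few raw lines from a local copy of the file. The following is only a minimal sketch: the helper names `has_outer_quotes` and `strip_outer_quotes` and the sample strings are hypothetical, not from the original post, and it assumes the file's encoding is known (UTF-8 here).

```python
import re

# Hypothetical helpers for illustration; not part of the original post.
# Same pattern as the regexp_replace call in the question.
_OUTER_QUOTES = re.compile(r'^"|"$')


def has_outer_quotes(raw: bytes, encoding: str = "utf-8") -> bool:
    """Return True if the decoded line starts or ends with a double quote."""
    text = raw.decode(encoding)
    return text.startswith('"') or text.endswith('"')


def strip_outer_quotes(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode the line and drop a leading and a trailing double quote, if present."""
    return _OUTER_QUOTES.sub("", raw.decode(encoding))


# Made-up sample lines standing in for raw bytes read from the source file.
sample_quoted = '"สวัสดี"'.encode("utf-8")
sample_plain = "สวัสดี".encode("utf-8")

print(has_outer_quotes(sample_quoted))
print(has_outer_quotes(sample_plain))
print(strip_outer_quotes(sample_quoted))
```

If the raw lines already carry the quotes, they come from the source file. If they do not, the quotes are more likely being added at a later step, for example when the data is exported or downloaded.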