Databricks Community

hardeeksharma · 01-31-2025

I have a use case where my file contains Thai characters. The source location is Azure Blob Storage, where the files are stored in text format. I am using the following code to read the file, but when I download the data from the catalog, the values are enclosed in quotes, which I don't want.

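from pyspark.sql.functions import expr

# `encoding` and `file_path` are assumed to be defined earlier in the notebook.
input_df = (
    spark.read.format("text")
    # The whitespace options are documented for the CSV reader; they are kept here as posted.
    .option("ignoreLeadingWhiteSpace", "false")
    .option("ignoreTrailingWhiteSpace", "false")
    .option("encoding", encoding)
    .option("keepUndefinedRows", True)
    .load(file_path)
    # Decode each line with the source encoding, then strip one leading and one trailing double quote.
    .withColumn("decoded_text", expr(f"regexp_replace(decode(value, '{encoding}'), '^\"|\"$', '')"))
    .drop("value")
    .withColumnRenamed("decoded_text", "value")
)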
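
To see what the pattern in the `regexp_replace` call does on its own, here is a minimal sketch using Python's `re` module rather than Spark. The helper name `strip_outer_quotes` and the sample strings are hypothetical. The pattern `^"|"$` is the one from the snippet above, and for this simple pattern Python's `re` is expected to behave like the Java regex engine that Spark uses.

```python
import re

# Same pattern as in the Spark expression: a quote at the very start or at the very end.
OUTER_QUOTES = re.compile(r'^"|"$')


def strip_outer_quotes(text: str) -> str:
    """Remove one leading and one trailing double quote, if present."""
    return OUTER_QUOTES.sub("", text)


# Hypothetical sample values: quoted, unquoted, and quoted with inner quotes.
samples = ['"สวัสดี"', 'สวัสดี', '"a "quoted" word"']
for s in samples:
    print(repr(s), "->", repr(strip_outer_quotes(s)))
```

Only the outermost quotes are removed and inner quotes are left alone. If downloaded values still show quotes after this step, one possibility is that they are added when the result is exported (for example by CSV quoting) rather than being part of the stored value.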

Lakshay · 01-31-2025

Do the quotes exist in the original data?
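
One way to answer this is to inspect a local copy of one source file outside Spark. The following is a rough sketch in plain Python. The function name `count_quoted_lines`, the temporary sample file, and the use of UTF-8 are all assumptions for illustration; substitute the real file path and whichever encoding the source files actually use.

```python
import os
import tempfile


def count_quoted_lines(path: str, encoding: str) -> tuple[int, int]:
    """Return (lines wrapped in double quotes, total non-empty lines)."""
    quoted = total = 0
    with open(path, "r", encoding=encoding, newline="") as fh:
        for line in fh:
            text = line.rstrip("\r\n")
            if not text:
                continue  # skip blank lines
            total += 1
            if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
                quoted += 1
    return quoted, total


# Hypothetical sample standing in for a file copied from blob storage.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", encoding="utf-8", delete=False
) as tmp:
    tmp.write('"สวัสดี"\nขอบคุณ\n')
    sample_path = tmp.name

print(count_quoted_lines(sample_path, "utf-8"))
os.remove(sample_path)
```

If the first number is zero for the real files, the quotes are not in the source and are more likely introduced somewhere between ingestion and download.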


Data ingestion issue with THAI data
