Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How can I efficiently remove backslashes during a COPY INTO load in Databricks?

Yutaro
New Contributor III

I’m using Databricks’ COPY INTO to load data from a CSV file into a Delta table. My input CSV looks like this:

CSV file:
column1(string),column2(string)
"[\,\,111\,222\,]","012\"34"

After running COPY INTO, my Delta table currently contains:

column1(string)      column2(string)
[\,\,111\,222\,]     012"34

However, I’d like to remove all backslashes so that the table ends up as:

column1(string)      column2(string)
[,,111,222,]         012"34

What is the most efficient way to strip out backslashes as part of the COPY INTO operation (without a separate UPDATE or extra write)?
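For reference, the load itself is a plain COPY INTO along these lines (the table name, file path, and options here are placeholders, not my exact command):

COPY INTO my_delta_table
FROM '/path/to/input.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true');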

Please excuse any grammatical errors, as I’m not very proficient in English.

1 ACCEPTED SOLUTION

Brahmareddy
Honored Contributor II

Hi Yutaro,

You're doing great, and your question is very clear! In your case, the most efficient approach is to first load the raw CSV data into a temporary or staging Delta table, and then insert the cleaned data into your final table using a SELECT with regexp_replace to strip out the backslashes. For example, after loading into the temp table, you can run: INSERT INTO final_table SELECT regexp_replace(column1, '\\\\', ''), regexp_replace(column2, '\\\\', '') FROM temp_table;. The pattern '\\\\' is needed because the SQL parser and the regex engine each consume one level of escaping, leaving a pattern that matches a single literal backslash. This approach avoids an UPDATE on the final table and gives you full control over cleaning the data as it's loaded. Let me know if you want help automating this or doing it with PySpark too!
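Here is a minimal end-to-end sketch of that pattern (the table names, schema, and file path are placeholders, and the CSV is assumed to have a header row):

-- Stage the raw CSV as-is in a Delta table
CREATE TABLE IF NOT EXISTS temp_table (column1 STRING, column2 STRING);

COPY INTO temp_table
FROM '/path/to/input.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true');

-- Write cleaned rows to the final table; '\\\\' matches one literal backslash
INSERT INTO final_table
SELECT regexp_replace(column1, '\\\\', ''),
       regexp_replace(column2, '\\\\', '')
FROM temp_table;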

Regards,

Brahma


