How can I efficiently remove backslashes during a COPY INTO load in Databricks?
Thursday
I’m using Databricks’ COPY INTO to load data from a CSV file into a Delta table. My input CSV looks like this:
After running COPY INTO, my Delta table currently contains:
[\,\,111\,222\,] | 012"34 |
However, I’d like to remove all backslashes so that the table ends up as:
[,,111,222,] | 012"34 |
What is the most efficient way to strip out backslashes as part of the COPY INTO operation (without a separate UPDATE or extra write)?
Please excuse any grammatical errors, as I’m not very proficient in English.
Friday
Hi Yutaro,
You're doing great, and your question is very clear! A straightforward way to handle this is to first load the raw CSV into a temporary or staging Delta table with COPY INTO, and then insert the cleaned data into your final table with a SELECT that uses regexp_replace to strip out the backslashes. Because the cleanup happens as part of the insert, this avoids a separate UPDATE pass on the final table and gives you full control over how the data is cleaned as it is loaded. A minimal end-to-end sketch follows below. Let me know if you want help automating this or doing it with PySpark too!
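Here is a minimal sketch of the two-step approach, assuming the staging and final tables already exist; the table names (temp_table, final_table), column names, and input path are placeholders, so adjust them to your schema:

```sql
-- Step 1: load the raw CSV as-is into a staging Delta table
-- (temp_table, final_table, the columns, and the path are placeholders).
COPY INTO temp_table
FROM '/path/to/input/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true');

-- Step 2: insert cleaned rows into the final table.
-- In Spark SQL, the string literal '\\\\' is the regex \\ ,
-- which matches a single literal backslash.
INSERT INTO final_table
SELECT
  regexp_replace(column1, '\\\\', ''),
  regexp_replace(column2, '\\\\', '')
FROM temp_table;
```

If you want to skip the staging table entirely, COPY INTO also accepts a SELECT over the source files, so the same regexp_replace cleanup can happen in a single write. A sketch of that variant, with the same placeholder names as above:

```sql
-- Single-pass variant: transform the rows while COPY INTO loads them.
COPY INTO final_table
FROM (
  SELECT
    regexp_replace(column1, '\\\\', '') AS column1,
    regexp_replace(column2, '\\\\', '') AS column2
  FROM '/path/to/input/'
)
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true');
```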
Regards,
Brahma

