Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How can I efficiently remove backslashes during a COPY INTO load in Databricks?

Yutaro
New Contributor III

I’m using Databricks’ COPY INTO to load data from a CSV file into a Delta table. My input CSV looks like this:

CSV file:
column1(string),column2(string)
"[\,\,111\,222\,]","012\"34"

After running COPY INTO, my Delta table currently contains:

column1(string)      column2(string)
[\,\,111\,222\,]     012"34

However, I’d like to remove all backslashes so that the table ends up as:

column1(string)      column2(string)
[,,111,222,]         012"34

What is the most efficient way to strip out backslashes as part of the COPY INTO operation (without a separate UPDATE or extra write)?
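For reference, the load itself is a plain COPY INTO along these lines (the table name, file path, and options here are placeholders, not my exact command):

COPY INTO my_delta_table
FROM '/path/to/input.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true');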

Please excuse any grammatical errors, as I’m not very proficient in English.

1 ACCEPTED SOLUTION

Brahmareddy
Honored Contributor II

Hi Yutaro,

You're doing great, and your question is very clear! In your case, the most efficient approach is to first load the raw CSV data into a temporary or staging Delta table, and then insert the cleaned data into your final table using a SELECT with regexp_replace to strip out the backslashes. For example, after loading into the temp table, you can run: INSERT INTO final_table SELECT regexp_replace(column1, '\\\\', ''), regexp_replace(column2, '\\\\', '') FROM temp_table;. The pattern '\\\\' is needed because the SQL parser and the regex engine each consume one level of escaping, leaving a pattern that matches a single literal backslash. This approach avoids an UPDATE on the final table and gives you full control over cleaning the data as it's loaded. Let me know if you want help automating this or doing it with PySpark too!
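Here is a minimal end-to-end sketch of that pattern (the table names, schema, and file path are placeholders, and the CSV is assumed to have a header row):

-- Stage the raw CSV as-is in a Delta table
CREATE TABLE IF NOT EXISTS temp_table (column1 STRING, column2 STRING);

COPY INTO temp_table
FROM '/path/to/input.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true');

-- Write cleaned rows to the final table; '\\\\' matches one literal backslash
INSERT INTO final_table
SELECT regexp_replace(column1, '\\\\', ''),
       regexp_replace(column2, '\\\\', '')
FROM temp_table;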

Regards,

Brahma


