
An unidentified special character is added to the outbound file when data is transformed in Databricks. Please help with suggestions?

JK2021
New Contributor III

Data from an external source is copied to ADLS, then picked up by Databricks; this massaged data is written to an outbound file. A special character � (a question mark in a black diamond, i.e. the Unicode replacement character) is seen in some fields of the outbound file and may break existing code. Its source has not been identified.

1 ACCEPTED SOLUTION

Accepted solution: Prabakar (Esteemed Contributor III) — see the replies below.

10 REPLIES

Prabakar
Esteemed Contributor III

Hi @Jazmine Kochan​ , what type of data is being copied? Does the data have any Unicode characters or symbols like ç ã,...?

JK2021
New Contributor III

Hi Prabakar,

Thanks for the prompt response.

It is a text file with customer data.

I have not seen such characters in the data myself, but in free-text entry fields this kind of data could be entered by a client.

JK2021
New Contributor III

So yes, text could contain such characters.

Prabakar
Esteemed Contributor III

So the cause of the issue is those Unicode characters. I believe there should be a fix for this. I shall check and get back here.

JK2021
New Contributor III

Thanks much!

JK2021
New Contributor III

Hi Prabakar

Could it be the developer's code that is adding this special character?

Prabakar
Esteemed Contributor III

This is an encoding issue. You can try specifying the encoding explicitly while reading the file:

.option("encoding", "UTF-16LE")

Please refer to the below:

https://docs.microsoft.com/en-us/azure/databricks/kb/data-sources/json-unicode

https://community.databricks.com/s/question/0D53f00001HKHnfCAH/issues-with-utf16-files-and-unicode-c...
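To see why the black-diamond question mark appears, here is a minimal plain-Python sketch (not Spark-specific) showing that decoding bytes with the wrong codec and replacement-on-error injects U+FFFD, the exact character described in the question; decoding with the correct codec round-trips cleanly. The sample string is hypothetical customer data.

```python
# Hypothetical customer data containing accented (non-ASCII) characters.
text = "Ação"

# The source system writes the file as UTF-8 bytes.
raw = text.encode("utf-8")

# Decoding with the wrong codec and errors="replace" injects U+FFFD (�),
# the "question mark in a black diamond" seen in the outbound file.
wrong = raw.decode("ascii", errors="replace")

# Decoding with the codec that matches the file round-trips cleanly.
right = raw.decode("utf-8")
```

This is why the `.option("encoding", ...)` value must match how the file was actually written, not what the reader assumes by default.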

JK2021
New Contributor III

Do I need to encode and decode too? Currently the incorrect data is displayed. @Prabakar Ammeappin​

-werners-
Esteemed Contributor III

Are you sure it is Databricks which puts the special character in place?

It could also have happened during the copy of the external system to ADLS.

If you use Azure Data Factory, for example, you have to define the encoding (UTF-8, UTF-16, ...).
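One quick way to check what encoding a file was actually written with, as suggested above, is to look at its first bytes for a byte-order mark (BOM). A small sketch using only the Python standard library (the function name and return labels are my own, not from this thread):

```python
import codecs

def sniff_bom(data: bytes) -> str:
    """Guess a file's encoding from its leading byte-order mark, if any."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be"
    return "unknown"

# Example: a UTF-16-LE file starts with the bytes FF FE.
sniff_bom(b"\xff\xfeH\x00i\x00")
```

Note that many UTF-8 files carry no BOM at all, so "unknown" here does not rule out UTF-8; it only tells you when a BOM is present.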

JK2021
New Contributor III

Hi

Yes, we checked all the files in the flow. It is the output file from Databricks in which the question-mark character is seen at the beginning of some lines in text fields.
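Since the problem shows up only on some lines of the Databricks output, a small helper that scans the file for U+FFFD can pinpoint the affected lines before and after a fix. A minimal sketch; the function name and sample rows are hypothetical:

```python
def find_replacement_chars(lines):
    """Return 1-based line numbers that contain the U+FFFD replacement character."""
    return [i for i, line in enumerate(lines, start=1) if "\ufffd" in line]

# Hypothetical outbound rows; line 2 has a corrupted accented name.
sample = ["name,city", "Jo\ufffdo,Lisboa", "Anna,Porto"]
find_replacement_chars(sample)  # flags line 2
```

Running this against the file as read from ADLS versus the file after the Databricks write would show at which step the character is introduced.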
