Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
An unidentified special character is added to the outbound file when transformed in Databricks. Please help with suggestions?

JK2021
New Contributor III

Data from an external source is copied to ADLS, where it is picked up by Databricks; the massaged data is then written to an outbound file. A special character � (question mark in a black diamond) appears in some fields of the outbound file and may break existing code. Its source has not been identified.

1 ACCEPTED SOLUTION

10 REPLIES

Prabakar
Databricks Employee

Hi @Jazmine Kochan​, what type of data is being copied? Does the data contain any Unicode characters or symbols such as ç, ã, ...?

JK2021
New Contributor III

Hi Prabakar,

Thanks for the prompt response.

It is a text file with customer data.

I have not seen such characters in the data, but these are free-text entry fields, so this kind of data could be entered by a client.

JK2021
New Contributor III

So yes, text could contain such characters.

Prabakar
Databricks Employee

So the cause of the issue is those Unicode characters. I believe there should be a fix for this. I shall check and get back here.

JK2021
New Contributor III

Thanks much!

JK2021
New Contributor III

Hi Prabakar

Could it be the developer's code that is adding this special character?

Prabakar
Databricks Employee

This looks like an encoding issue. You can try setting the encoding option while reading the file:

.option("encoding", "UTF-16LE")

Please refer to the links below:

https://docs.microsoft.com/en-us/azure/databricks/kb/data-sources/json-unicode

https://community.databricks.com/s/question/0D53f00001HKHnfCAH/issues-with-utf16-files-and-unicode-c...
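The "question mark in a black diamond" is U+FFFD, the Unicode replacement character, which appears when bytes are decoded with the wrong encoding. A minimal sketch in plain Python of how it arises, and why specifying the correct encoding on read fixes it (the sample text is illustrative):

```python
# A field containing accented text, stored as UTF-16LE bytes on disk.
raw = "ção".encode("utf-16-le")

# Decoding UTF-16LE bytes as UTF-8 produces U+FFFD replacement characters,
# which render as the black-diamond question mark in the output file.
wrong = raw.decode("utf-8", errors="replace")
assert "\ufffd" in wrong

# Decoding with the correct encoding recovers the original text.
right = raw.decode("utf-16-le")
assert right == "ção"
```

The same principle applies to the Spark reader: `.option("encoding", "UTF-16LE")` tells Spark how to interpret the raw bytes, so no replacement characters are produced.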

JK2021
New Contributor III

Do I need to encode and decode too? Currently, incorrect data is displayed. @Prabakar Ammeappin​

-werners-
Esteemed Contributor III

Are you sure it is Databricks which puts the special character in place?

It could also have happened during the copy of the external system to ADLS.

If you use Azure Data Factory, for example, you have to define the encoding (UTF-8, UTF-16, ...).
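One way to check where in the flow the corruption happens is to inspect the first bytes of each intermediate file: UTF-16 and BOM-prefixed UTF-8 files announce themselves with a byte-order mark. A small sketch, assuming local access to the file bytes (the `sniff_bom` helper is hypothetical, not part of any tool mentioned here):

```python
import codecs

def sniff_bom(data: bytes) -> str:
    """Guess a file's encoding from its byte-order mark, if any."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return "unknown"

# A file written as UTF-16LE with a BOM is detected...
utf16_sample = codecs.BOM_UTF16_LE + "name,city\n".encode("utf-16-le")
assert sniff_bom(utf16_sample) == "utf-16-le"

# ...while plain UTF-8 without a BOM gives no signal either way.
utf8_sample = "name,city\n".encode("utf-8")
assert sniff_bom(utf8_sample) == "unknown"
```

A file that sniffs as UTF-16 but is read as UTF-8 downstream would explain the replacement characters.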

JK2021
New Contributor III

Hi,

Yes, we checked all the files in the flow. It is the output file from Databricks in which the question mark character appears at the beginning of some lines in text fields.
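To narrow down which step introduces the character, each file in the flow can be scanned for U+FFFD and the affected line numbers reported. A sketch (the `find_mojibake` helper is hypothetical):

```python
def find_mojibake(lines):
    """Return (line_number, line) pairs containing U+FFFD replacement chars."""
    return [(i, line) for i, line in enumerate(lines, 1) if "\ufffd" in line]

# Example: only the second line starts with a replacement character.
sample = ["good line\n", "\ufffdbad start\n", "another good line\n"]
assert find_mojibake(sample) == [(2, "\ufffdbad start\n")]
```

If only the Databricks output file flags lines, the decode step in the Databricks read is the likely culprit; if the ADLS input already flags them, the corruption happened upstream.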
