Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Lakeflow Pipelines: trying to read an accented file with spark.readStream but it fails

AmarKap
New Contributor

Trying to read an accented file (French characters), but spark.readStream is not working and the special characters turn into something strange (e.g. �).

 
spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "text")
    .option("encoding", "ISO-8859-1")

Tried both ISO-8859-1 and UTF-8.
Tried with and without .option("cloudFiles.format", "text").
The files do not have a .txt extension.

 

1 REPLY

K_Anudeep
Databricks Employee

Hello @AmarKap ,

When Spark decodes CP1252 bytes as UTF-8 or ISO-8859-1, you’ll see the replacement character (�).
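A quick local illustration (plain Python, not Databricks-specific) of why � appears, assuming the file really contains CP1252 bytes:

```python
# The byte 0xE9 is "é" in Windows-1252/CP1252, but on its own it is
# an invalid UTF-8 sequence, so a UTF-8 decoder substitutes the
# replacement character U+FFFD (�).
raw = "é".encode("cp1252")                    # b'\xe9'
bad = raw.decode("utf-8", errors="replace")   # '�'
good = raw.decode("cp1252")                   # 'é'
print(repr(raw), bad, good)
```

This is the same substitution Spark performs when the declared encoding does not match the bytes on disk.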

Can you try reading the file like this:

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "text")
      .option("encoding", "windows-1252")  # or "CP1252"
      .load("s3://.../path"))

Anudeep