Data Engineering

Issues loading files csv files that contain BOM (Byte Order Mark) character

harraz
New Contributor III

I keep getting an error when creating a DataFrame or stream from certain CSV files whose header contains a BOM (Byte Order Mark) character.

This is the error message:

AnalysisException: [RequestId=e09c7c8d-2399-4d6a-84ae-216e6a9f8f6e ErrorClass=INVALID_PARAMETER_VALUE.INVALID_FIELD_LENGTH] CreateTable column_1.name too long. Maximum length is 255 characters.

Below is the full error message:

File /databricks/spark/python/pyspark/sql/streaming/readwriter.py:1481, in DataStreamWriter.toTable(self, tableName, format, outputMode, partitionBy, queryName, **options)
   1479 if queryName is not None:
   1480     self.queryName(queryName)
-> 1481 return self._sq(self._jwrite.toTable(tableName))
 
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

And below is the code that normally works if the file's header does not have the BOM. Notice I tried adding the encoding option, but that doesn't work.

from pyspark.sql.functions import input_file_name, current_timestamp, lit

def startRawStream(schema):

    # Configure Auto Loader
    streaming_query = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", raw_checkpoint_path)
        .option("sep", ",")
        .option("inferSchema", "true")
        .option("lineSep", "\r\n")  # Specify the Windows-style EOL character (CRLF)
        .option("header", "false")
        .option("encoding", "UTF-8-sig")
        .option("pathGlobfilter", file_pattern)
        .load(f"{external_location}")
        .select("*",
                lit("unspecified").alias("hospital"),
                input_file_name().alias("source_file"),
                current_timestamp().alias("processing_time"))
        .writeStream
        .option("checkpointLocation", raw_checkpoint_path)
        .trigger(availableNow=True)
        .toTable(raw_table_name))

    return streaming_query
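
One workaround I'm considering is stripping the BOM after the load instead of at parse time, since as far as I can tell Spark's CSV encoding option expects a Java charset name, so Python's "UTF-8-sig" alias probably isn't recognized. Below is a minimal sketch (untested, and the function name is just something I made up), assuming that with header "false" the BOM survives as a leading U+FEFF in the first column's values and that the column is read as a string:

from pyspark.sql import functions as F

def strip_bom_from_first_column(df):
    # With header=false the BOM shows up as a leading U+FEFF in the first
    # field of the first record, so scrub it from the first column's values.
    first = df.columns[0]
    return df.withColumn(first, F.regexp_replace(F.col(first), "^\ufeff", ""))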

Let me know if you have a solution for this.

I think I need to modify the files, but I don't know how to do that in Databricks. I don't want to download the files to my local machine since they are large and I'm not allowed to. The files are in an S3 bucket, and I would like to fix them there or in Databricks using dbutils.
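
To be concrete, this is the rough shape of the in-place fix I have in mind (untested; it assumes the S3 bucket is mounted so the files are reachable under /dbfs, and that each file fits in driver memory, otherwise this would have to stream the files in chunks):

import codecs

def strip_bom_in_place(dir_path):
    # Rewrite every CSV file in dir_path that starts with the UTF-8 BOM.
    for info in dbutils.fs.ls(dir_path):
        if not info.name.endswith(".csv"):
            continue
        # dbutils.fs paths use the dbfs: scheme; open() needs the /dbfs mount.
        local_path = info.path.replace("dbfs:", "/dbfs")
        with open(local_path, "rb") as f:
            data = f.read()
        if data.startswith(codecs.BOM_UTF8):
            with open(local_path, "wb") as f:
                f.write(data[len(codecs.BOM_UTF8):])

strip_bom_in_place(external_location)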

Any help is appreciated.

1 REPLY

Anonymous
Not applicable

Hi @mohamed harraz

Great to meet you, and thanks for your question!

Let's see if your peers in the community have an answer to your question. Thanks.