I keep getting an error when creating a DataFrame or stream from certain CSV files where the header contains a BOM (Byte Order Mark) character.
This is the error message:
AnalysisException: [RequestId=e09c7c8d-2399-4d6a-84ae-216e6a9f8f6e ErrorClass=INVALID_PARAMETER_VALUE.INVALID_FIELD_LENGTH] CreateTable column_1.name too long. Maximum length is 255 characters.
Below is the relevant part of the stack trace:
File /databricks/spark/python/pyspark/sql/streaming/readwriter.py:1481, in DataStreamWriter.toTable(self, tableName, format, outputMode, partitionBy, queryName, **options)
   1479 if queryName is not None:
   1480     self.queryName(queryName)
-> 1481 return self._sq(self._jwrite.toTable(tableName))

File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):
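In case it's useful, a quick way to confirm the BOM really is present: dbutils.fs.head decodes the first bytes of a file as UTF-8, so the BOM shows up as '\ufeff' (the path below is just a placeholder):

# Peek at the start of one of the problem files.
# "s3://my-bucket/landing/example.csv" is a placeholder path.
head = dbutils.fs.head("s3://my-bucket/landing/example.csv", 20)
print(repr(head))  # prints something like '\ufeffcol_a,col_b' when the BOM is present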
And below is the code, which works fine when the file's header does not have the BOM. Notice I tried adding the encoding option, but that didn't help (I suspect because "UTF-8-sig" is a Python codec name, not a Java charset name, which is what Spark's CSV reader expects).
from pyspark.sql.functions import input_file_name, current_timestamp, lit

def startRawStream(schema):
    # Configure Auto Loader
    streaming_query = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", raw_checkpoint_path)
        .option("sep", ",")
        .option("inferSchema", "true")
        .option("lineSep", "\r\n")  # Windows-style EOL character (CRLF)
        .option("header", "false")
        .option("encoding", "UTF-8-sig")  # my attempt at handling the BOM; didn't help
        .option("pathGlobFilter", file_pattern)
        .load(f"{external_location}")
        .select("*",
                lit("unspecified").alias("hospital"),
                input_file_name().alias("source_file"),
                current_timestamp().alias("processing_time"))
        .writeStream
        .option("checkpointLocation", raw_checkpoint_path)
        .trigger(availableNow=True)
        .toTable(raw_table_name))
    return streaming_query
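One workaround I've been considering instead of rewriting the files: strip a leading BOM from whatever column names schema inference produces, before the writeStream. This is an untested sketch; toDF is just a projection of the same columns, so I believe it works on a streaming DataFrame too:

# Untested sketch: break the chain after .load() so the inferred column
# names can be cleaned before the rest of the pipeline runs.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", raw_checkpoint_path)
    # ... same remaining options as in the function above ...
    .load(f"{external_location}"))

# '\ufeff' is how the UTF-8 BOM appears once decoded; lstrip removes it
# only from the front of a column name and leaves clean names untouched.
df = df.toDF(*[c.lstrip("\ufeff") for c in df.columns])

# then continue with .select(...).writeStream... exactly as above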
Let me know if you have a solution for this.
I think I need to modify the files themselves, but I don't know how to do that in Databricks. I don't want to download the files to my local machine, since they are large and I'm not allowed to anyway. The files are in an S3 bucket, and I would like to fix them there, or from a Databricks notebook using dbutils.
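For what it's worth, this is the rough shape of what I had in mind, as an untested sketch. It assumes the bucket is mounted (I'm using /mnt/raw as a placeholder mount point) and that the /dbfs FUSE path is readable and writable from the cluster:

import shutil

BOM = b"\xef\xbb\xbf"            # the UTF-8 BOM bytes
MOUNT = "dbfs:/mnt/raw/"         # placeholder: the S3 bucket's mount point
FUSE = "/dbfs/mnt/raw/"          # same location through the local FUSE path

for f in dbutils.fs.ls(MOUNT):
    if f.isDir() or not f.name.endswith(".csv"):
        continue
    local = FUSE + f.name
    with open(local, "rb") as src:
        if src.read(3) != BOM:
            continue                       # no BOM, leave this file alone
        # the read position is now past the BOM, so stream the rest of
        # the file to a temp copy in chunks (nothing large held in memory)
        with open(local + ".nobom", "wb") as dst:
            shutil.copyfileobj(src, dst)
    dbutils.fs.rm(f.path)                      # remove the original
    dbutils.fs.mv(f.path + ".nobom", f.path)   # move the clean copy into place

Does something like this make sense, or is there a better way to do it in place?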
Any help is appreciated