I keep getting an error when creating a DataFrame or stream from certain CSV files where the header contains a BOM (Byte Order Mark) character.
This is the error message:
AnalysisException: [RequestId=e09c7c8d-2399-4d6a-84ae-216e6a9f8f6e ErrorClass=INVALID_PARAMETER_VALUE.INVALID_FIELD_LENGTH] CreateTable column_1.name too long. Maximum length is 255 characters.
Below is the relevant part of the stack trace:
File /databricks/spark/python/pyspark/sql/streaming/readwriter.py:1481, in DataStreamWriter.toTable(self, tableName, format, outputMode, partitionBy, queryName, **options)
   1479 if queryName is not None:
   1480     self.queryName(queryName)
-> 1481 return self._sq(self._jwrite.toTable(tableName))

File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):
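In case it's useful, a quick way to confirm the BOM really is present: dbutils.fs.head decodes the first bytes of a file as UTF-8, so the BOM shows up as '\ufeff' (the path below is just a placeholder):

# Peek at the start of one of the problem files.
# "s3://my-bucket/landing/example.csv" is a placeholder path.
head = dbutils.fs.head("s3://my-bucket/landing/example.csv", 20)
print(repr(head))  # prints something like '\ufeffcol_a,col_b' when the BOM is present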
And below is the code, which works fine when the file's header does not have the BOM. Notice I tried adding the encoding option, but that didn't help (I suspect because "UTF-8-sig" is a Python codec name, not a Java charset name, which is what Spark's CSV reader expects).
from pyspark.sql.functions import input_file_name, current_timestamp, lit

def startRawStream(schema):
    # Configure Auto Loader
    streaming_query = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", raw_checkpoint_path)
        .option("sep", ",")
        .option("inferSchema", "true")
        .option("lineSep", "\r\n")  # Windows-style EOL character (CRLF)
        .option("header", "false")
        .option("encoding", "UTF-8-sig")  # my attempt at handling the BOM; didn't help
        .option("pathGlobFilter", file_pattern)
        .load(f"{external_location}")
        .select("*",
                lit("unspecified").alias("hospital"),
                input_file_name().alias("source_file"),
                current_timestamp().alias("processing_time"))
        .writeStream
        .option("checkpointLocation", raw_checkpoint_path)
        .trigger(availableNow=True)
        .toTable(raw_table_name))
    return streaming_query
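One workaround I've been considering instead of rewriting the files: strip a leading BOM from whatever column names schema inference produces, before the writeStream. This is an untested sketch; toDF is just a projection of the same columns, so I believe it works on a streaming DataFrame too:

# Untested sketch: break the chain after .load() so the inferred column
# names can be cleaned before the rest of the pipeline runs.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", raw_checkpoint_path)
    # ... same remaining options as in the function above ...
    .load(f"{external_location}"))

# '\ufeff' is how the UTF-8 BOM appears once decoded; lstrip removes it
# only from the front of a column name and leaves clean names untouched.
df = df.toDF(*[c.lstrip("\ufeff") for c in df.columns])

# then continue with .select(...).writeStream... exactly as above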
Let me know if you have a solution for this.
I think I need to modify the files themselves, but I don't know how to do that in Databricks. I don't want to download the files to my local machine, since they are large and I'm not allowed to anyway. The files are in an S3 bucket, and I would like to fix them there, or from a Databricks notebook using dbutils.
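For what it's worth, this is the rough shape of what I had in mind, as an untested sketch. It assumes the bucket is mounted (I'm using /mnt/raw as a placeholder mount point) and that the /dbfs FUSE path is readable and writable from the cluster:

import shutil

BOM = b"\xef\xbb\xbf"            # the UTF-8 BOM bytes
MOUNT = "dbfs:/mnt/raw/"         # placeholder: the S3 bucket's mount point
FUSE = "/dbfs/mnt/raw/"          # same location through the local FUSE path

for f in dbutils.fs.ls(MOUNT):
    if f.isDir() or not f.name.endswith(".csv"):
        continue
    local = FUSE + f.name
    with open(local, "rb") as src:
        if src.read(3) != BOM:
            continue                       # no BOM, leave this file alone
        # the read position is now past the BOM, so stream the rest of
        # the file to a temp copy in chunks (nothing large held in memory)
        with open(local + ".nobom", "wb") as dst:
            shutil.copyfileobj(src, dst)
    dbutils.fs.rm(f.path)                      # remove the original
    dbutils.fs.mv(f.path + ".nobom", f.path)   # move the clean copy into place

Does something like this make sense, or is there a better way to do it in place?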
Any help is appreciated