I am attempting to use autoloader to add a number of csv files to a delta table. The underlying csv files have spaces in the attribute names, though (e.g. 'Account Number' instead of 'AccountNumber'). When I run my autoload, I get the following error message:
Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema.
My autoload methodology is:
(spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("cloudFiles.schemaLocation", checkpoint_directory)
.option("header", "True")
.load(data_source)
.writeStream
.format("delta")
.trigger(once=True)
.option("checkpointLocation", checkpoint_directory)
.option("mergeSchema", "true")
.table(table_name))
The solution proposed by the error message is to change the column mapping mode to 'name' in the table properties to allow for spaces:
ALTER TABLE <table_name> SET TBLPROPERTIES (
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5',
'delta.columnMapping.mode' = 'name'
)
The issue I am having is that the table is not created until autoloader runs, so I cannot alter the table properties beforehand. So my question is in two parts:
A. Can I create a blank delta table so that I can alter the table properties prior to autoloading running? For example, I could run:
CREATE TABLE IF NOT EXISTS table_name
which would then allow me to alter the table, and then run autoloader so that I can ingest csvs with spaces in the attribute names.
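For reference, the table properties can also be set in the CREATE statement itself, which would avoid the separate ALTER step entirely. A minimal sketch, assuming the target schema is known up front (the column name and type here are hypothetical examples, not taken from my data):

```sql
-- Sketch: create the Delta table up front with column mapping enabled,
-- so autoloader appends into an existing table instead of creating one.
CREATE TABLE IF NOT EXISTS table_name (
  `Account Number` STRING
)
USING DELTA
TBLPROPERTIES (
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5',
  'delta.columnMapping.mode' = 'name'
);
```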
B. Is there an alternative method (potentially aliasing during the spark.readStream) that would allow me to ingest these csvs with spaces in the attribute names, without needing an existing table with adjusted table properties?
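For context on B, the kind of aliasing I have in mind would replace every character Delta rejects with an underscore before the write. A minimal sketch, where the `sanitize` helper and the commented PySpark application are illustrative, not part of my current pipeline:

```python
import re

# Characters Delta rejects in column names, per the error message: ' ,;{}()\n\t='
INVALID = re.compile(r"[ ,;{}()\n\t=]")

def sanitize(name: str) -> str:
    """Replace each character Delta rejects with an underscore."""
    return INVALID.sub("_", name)

# Applied between readStream and writeStream it might look like (sketch):
# df = spark.readStream.format("cloudFiles")...load(data_source)
# df = df.select([df[c].alias(sanitize(c)) for c in df.columns])
# df.writeStream...
```

This would map e.g. 'Account Number' to 'Account_Number', at the cost of the table no longer matching the source csv headers exactly.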