Bug Report: SDP (DLT) with autoloader not passing through pipe delimiter/separator

ChrisLawford_n1
Contributor II

I am noticing a difference between using autoloader in an interactive notebook vs using it in a Spark Declarative Pipeline (DLT Pipeline). This issue seems to be very similar to this other unanswered post from a few years ago. Bug report: the delimiter option does not work when run on DLT 
The CSV I am trying to ingest uses the pipe character (|) as its separator.

```

    df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("delimiter", "|")
        .option("cloudFiles.schemaLocation", f"/Volumes/{location}/schema")
        .load(f"/Volumes/{path to csv}")
        .writeStream
        .option("checkpointLocation", f"/Volumes/{location}/checkpoint")
        .trigger(availableNow=True)
        .toTable(f"{table_location}")
    )
```
^ running this in a notebook serverless has no issues.

``` python
@Dlt.table(
        name=f"{bronze_raw_schema}.{table_name}",
    )
def load_table():
      return spark.readStream.format("cloudFiles")
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("delimiter", "|")
        .option("header", "true")
        .option("cloudFiles.includeExistingFiles", "true")
        .load(f"/Volumes/{path to csv}")
```

When I run the above in a SDP (DLT) pipeline I get the following error. Which appears to be failing to delimit on the pipe (|) character and is thus hitting a length limit on the header column names.

com.databricks.sql.managedcatalog.UnityCatalogServiceException: [ErrorClass=INVALID_PARAMETER_VALUE] Invalid input: RPC UpdateTable Field managedcatalog.ColumnInfo.name: At columns.0: name "HiPortSystemCode|HiportCode|CashAccountCode|ValuationDate|ContractDate|CashTransactionId|DeletedIndicator|CashTransactionTypeCode|AccountingDate|EntryDate|SettlementDate|OriginalAccountingDate|SecurityCode|AnalysisTypeCode|TransferCashAccountCode|Transact..." too long. Maximum length is 255 characters.

nikhilj0421
Databricks Employee
Databricks Employee

Hi @ChrisLawford_n1, can you try doing a full refresh of this table? The old schema may still be in the schema location. 

ChrisLawford_n1
Contributor II

Hey,
Okay thanks @nikhilj0421. I have now solved the issue but not with a full refresh of the table. I had tried this previously and even deleted the DLT pipeline hoping that would provide me the clean slate if this lingering schema was an issue but when re-running the pipeline it still failed with the same issue. The solution was to change the output schema(in my case to {schema_name}_2) of the table/pipeline and run again. I was then able to again alter the schema back to the original schema_name and this time everything worked without a problem.

Overall I have learned that DLT pipelines must store more information than I previously thought when deleting them or full refreshing them. 

View solution in original post