Bug Report: SDP (DLT) with autoloader not passing through pipe delimiter/separator

ChrisLawford_n1
Contributor

I am noticing a difference between using Auto Loader in an interactive notebook and using it in a Spark Declarative Pipeline (DLT pipeline). The issue seems very similar to an older, unanswered post: "Bug report: the delimiter option does not work when run on DLT".
The CSV I am trying to ingest uses the pipe character (|) as its separator.

```
# Auto Loader streaming read of a pipe-delimited CSV, written to a managed table
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("delimiter", "|")
    .option("cloudFiles.schemaLocation", f"/Volumes/{location}/schema")
    .load(f"/Volumes/{path to csv}")
    .writeStream
    .option("checkpointLocation", f"/Volumes/{location}/checkpoint")
    .trigger(availableNow=True)
    .toTable(f"{table_location}")
)
```
Running this in a serverless notebook works without issue.

```python
import dlt

@dlt.table(
    name=f"{bronze_raw_schema}.{table_name}",
)
def load_table():
    # Same Auto Loader options as in the notebook version above
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("delimiter", "|")
        .option("header", "true")
        .option("cloudFiles.includeExistingFiles", "true")
        .load(f"/Volumes/{path to csv}")
    )
```

When I run the above in an SDP (DLT) pipeline I get the following error, which suggests the file is not being split on the pipe (|) character: the entire header row is being treated as a single column name and hits the 255-character limit.

```
com.databricks.sql.managedcatalog.UnityCatalogServiceException: [ErrorClass=INVALID_PARAMETER_VALUE] Invalid input: RPC UpdateTable Field managedcatalog.ColumnInfo.name: At columns.0: name "HiPortSystemCode|HiportCode|CashAccountCode|ValuationDate|ContractDate|CashTransactionId|DeletedIndicator|CashTransactionTypeCode|AccountingDate|EntryDate|SettlementDate|OriginalAccountingDate|SecurityCode|AnalysisTypeCode|TransferCashAccountCode|Transact..." too long. Maximum length is 255 characters.
```
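For comparison, here is a minimal diagnostic sketch (not from the original post) that can be run as a batch read in a notebook to confirm the delimiter is being applied; the source path is a placeholder standing in for the volume path above:

```python
# Minimal sketch: batch-read the same file and check how many columns are inferred.
source_path = "/Volumes/<catalog>/<schema>/<volume>/<path-to-csv>"  # placeholder

df = (
    spark.read
    .option("header", "true")
    .option("delimiter", "|")
    .csv(source_path)
)
print(len(df.columns))  # should be > 1 when the pipe delimiter is honoured
```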

2 REPLIES

nikhilj0421
Databricks Employee

Hi @ChrisLawford_n1, can you try doing a full refresh of this table? The old schema may still be in the schema location. 
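As a side note, a full refresh can be started from the pipeline UI or via the Pipelines REST API. The snippet below is a minimal sketch, assuming a workspace URL, a personal access token in DATABRICKS_TOKEN, and the pipeline ID (all placeholders):

```python
import os
import requests

# Minimal sketch: start a pipeline update with full_refresh=True via the Pipelines REST API.
host = "https://<workspace-url>"          # placeholder workspace URL
token = os.environ["DATABRICKS_TOKEN"]    # assumed personal access token
pipeline_id = "<pipeline-id>"             # placeholder pipeline ID

resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": f"Bearer {token}"},
    json={"full_refresh": True},
)
resp.raise_for_status()
print(resp.json())  # response includes the update_id of the triggered run
```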

ChrisLawford_n1
Contributor

Hey,
Okay, thanks @nikhilj0421. I have now solved the issue, but not with a full refresh of the table. I had tried that previously, and had even deleted the DLT pipeline hoping that would give me a clean slate if a lingering schema was the issue, but on re-running the pipeline it still failed in the same way. The solution was to change the output schema of the table/pipeline (in my case to {schema_name}_2) and run again. I was then able to switch the schema back to the original schema_name, and this time everything worked without a problem.

Overall, I have learned that DLT pipelines must retain more state than I previously thought, even after being deleted or fully refreshed.
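For anyone hitting the same thing, one way to make that schema switch low-friction is to drive the target schema from a pipeline configuration value. The sketch below is only an illustration of that idea (the configuration key `bronze_raw_schema`, the table name, and the path are hypothetical placeholders), not code from the original pipeline:

```python
import dlt

# Hypothetical sketch: take the output schema from the pipeline configuration so it can
# be pointed at f"{schema_name}_2" and later back to the original without code changes.
bronze_raw_schema = spark.conf.get("bronze_raw_schema")  # assumed pipeline config key
table_name = "my_table"                                  # placeholder table name

@dlt.table(name=f"{bronze_raw_schema}.{table_name}")
def load_table():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("delimiter", "|")
        .option("header", "true")
        .option("cloudFiles.includeExistingFiles", "true")
        .load("/Volumes/<path-to-csv>")  # placeholder path
    )
```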