
Autoloader creates columns not present in the source

ks1248
New Contributor III

I have been exploring Autoloader to ingest gzipped JSON files from an S3 source.

The notebook fails on the first run due to a schema mismatch. After I re-run the notebook, the schema evolves and the ingestion runs successfully.

On analysing the schema of the Delta table created by the ingestion, I found two new columns, `id` and `optionsDefaults`.

These columns are not present in the original data, and they contain no values, only nulls.

Is there something I might be missing?

Accepted Solutions

ks1248
New Contributor III

Hi @Debayan Mukherjee, @Kaniz Fatma,

Thank you for replying to my question.

I was able to figure out the issue. I was creating the schema and checkpoint folders under the same path as the source location for the Autoloader. As a result, the source data also included the schema and checkpoint metadata files, so the inferred schema changed every time the Autoloader notebook ran.

I fixed this by pointing the schema and checkpoint locations to a path outside the source location.
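
For anyone hitting the same problem, here is a minimal sketch of what that layout might look like in PySpark. The bucket, paths, and the `is_inside` / `start_ingest` helpers are hypothetical names made up for illustration, not taken from my notebook. The `cloudFiles.schemaLocation` and `checkpointLocation` options are the standard Auto Loader and Structured Streaming settings, and `start_ingest` assumes a Databricks `spark` session is passed in.

```python
from posixpath import normpath
from urllib.parse import urlparse

# Hypothetical locations: schema and checkpoint live OUTSIDE the source path.
SOURCE_PATH = "s3://example-bucket/raw/events/"
SCHEMA_PATH = "s3://example-bucket/_autoloader/events/schema/"
CHECKPOINT_PATH = "s3://example-bucket/_autoloader/events/checkpoint/"


def is_inside(path: str, parent: str) -> bool:
    """Return True if `path` is `parent` itself or nested anywhere below it."""
    p, q = urlparse(path), urlparse(parent)
    if (p.scheme, p.netloc) != (q.scheme, q.netloc):
        return False  # different scheme or bucket, cannot be nested
    child = normpath(p.path or "/")
    root = normpath(q.path or "/")
    return child == root or child.startswith(root.rstrip("/") + "/")


def start_ingest(spark, source, schema_loc, checkpoint_loc, target_table):
    """Start an Auto Loader stream, refusing metadata paths under the source."""
    for loc in (schema_loc, checkpoint_loc):
        if is_inside(loc, source):
            raise ValueError(f"{loc} is inside the source path {source}")

    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")  # gzipped JSON is decompressed by file extension
        .option("cloudFiles.schemaLocation", schema_loc)
        .load(source)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_loc)
        .trigger(availableNow=True)  # process the available files, then stop
        .toTable(target_table)
    )
```

In a notebook this would be called as something like `start_ingest(spark, SOURCE_PATH, SCHEMA_PATH, CHECKPOINT_PATH, "events_bronze")`, where the table name is again just a placeholder. The path check is optional, but it catches the mistake I made before the stream starts.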


4 REPLIES

Debayan
Esteemed Contributor III

Hi, could you please provide a screenshot (before and after) and, if possible, the notebook content?

Kaniz
Community Manager

Hi @Keshav Saini, we haven't heard from you since the last response from @Debayan Mukherjee, and I was checking back to see if his suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

Kaniz
Community Manager

Hi @Keshav Saini​, I sincerely appreciate your help with the question you've posted. Thank you for being a valuable member of our community.
