Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Schema evolution in Autoloader not evolving beyond version 0

RoelofvS
New Contributor II

I am working through the current version of the standard AutoLoader demo, i.e. 

   dbdemos.install('auto-loader')
That is, data gets read into a dataframe, but it is never written to a target table.
Notebook is "01-Auto-loader-schema-evolution-Ingestion"
Compute is a "15.4 LTS ML (includes Apache Spark 3.5.0, Scala 2.12)"
 
The demo works fine until schema evolution occurs the first time, i.e. there is a file
.../dbdemos_autoloader/raw_data/inferred_schema/_schemas/0
containing the evolved schema.
When I add another column that should be evolved, e.g. "new_column2", and rerun
display(get_stream()), the new column does not get reflected.
But if I delete .../_schemas/0, file 0 gets recreated and the new column reflects.
Alternatively, if I point
.option("cloudFiles.schemaLocation", f"{volume_folder}/inferred_schema")
to a new destination, e.g.
.option("cloudFiles.schemaLocation", f"{volume_folder}/inferred_schema2")
then a new _schemas/0 gets created with the expected added column present.
In summary, I never see a version "1" file getting created.
On the terminal, I can manually create a "1" file, e.g. "touch 1", just as a test of Linux permissions.

Any thoughts on why evolution never goes beyond version 0?
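For reference, this is roughly how I understand the stream gets built (simplified from the notebook; the JSON format and the abbreviated paths are my assumptions, so adjust to your workspace):

```python
# Roughly how the demo builds the stream (simplified; paths abbreviated).
autoloader_options = {
    "cloudFiles.format": "json",
    "cloudFiles.inferColumnTypes": "true",
    "cloudFiles.schemaLocation": ".../dbdemos_autoloader/raw_data/inferred_schema",
    # addNewColumns is the default, but I also set it explicitly:
    "cloudFiles.schemaEvolutionMode": "addNewColumns",
}

def get_stream(spark, source_path=".../dbdemos_autoloader/raw_data"):
    # Auto Loader read: schema is tracked under schemaLocation/_schemas
    return (spark.readStream
                 .format("cloudFiles")
                 .options(**autoloader_options)
                 .load(source_path))
```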

Brahmareddy
Valued Contributor III

Hi @RoelofvS,

How are you doing today? As I understand it, first make sure cloudFiles.schemaEvolutionMode is set to addNewColumns to enable automatic schema updates for new columns. If schema versions aren't updating in the same location, try pointing to a new schema location to reset schema tracking temporarily, though this shouldn't be needed under normal conditions. Check the file access permissions on the inferred_schema directory, as permission issues could prevent schema updates. Running the demo on a fresh cluster or with a different schema location can help identify whether the environment is affecting schema evolution. Lastly, consider testing with a different Databricks runtime version to rule out runtime-specific issues.

Give it a try and let me know.

Regards,

Brahma

RoelofvS
New Contributor II

Hello Brahma,
Thank you for your response. To answer your suggestions:

1) cloudFiles.schemaEvolutionMode: It is default behaviour, but I have also added it explicitly
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")

2) "try pointing to a new schema location to reset schema" - I have repointed it as a test, and a new file 0 gets created. I have also just renamed 0 to zero in the terminal, and a new file 0 gets created. In both cases, schema evolution picked up the new column(s).
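A quick way to see which version files Auto Loader has actually written is plain Python from a notebook cell (the helper name is my own, not part of the demo):

```python
import os

def list_schema_versions(schema_dir):
    """Return the integer schema versions present in a _schemas directory,
    skipping renamed files such as 'zero' whose names are not plain digits."""
    return sorted(int(name) for name in os.listdir(schema_dir) if name.isdigit())

# e.g. list_schema_versions(".../dbdemos_autoloader/raw_data/inferred_schema/_schemas")
```

In my case this would only ever return [0], never [0, 1].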

3) "file access permissions" - they are always rwxrwxrwx for the files, and drwxrwxrwx for the directories.

4) Other: I have also tried with a fresh cluster, and with different locations. I have tested with the latest runtime version via "use_current_cluster=True", and also with the cluster version that the demo creates itself.

Extra info:

It definitely reads the latest version of the evolution file. I have edited 0 (or 1) with vi and changed the first line "v1" to "v2". An error gets thrown about not accepting "v2", but also a second error with "UnknownFieldException", which is expected in the demo. The latter error does not get raised in my normal testing.
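For anyone else poking at these files: a small helper (my own, not part of the demo) to read that "v1"-style header without editing the file in vi:

```python
def schema_file_header(path):
    """Return the first line of an Auto Loader schema file; in my files it
    is a version marker like 'v1', followed by the JSON schema itself."""
    with open(path, "r") as fh:
        return fh.readline().strip()

# e.g. schema_file_header(".../inferred_schema/_schemas/0")
```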

I managed to get evolution to work as expected, but once only. This involved renaming 0 to zero, adding a new column, copying the new 0 to 1, adding new columns, and after that just adding new columns with no fiddling in between. But I reset the demo and could not get it working again.

I wonder if anyone else has the demo up and running, and could confirm whether they get the same issue or not.
Basically the frames called are:
f2 to reset the demo, with $reset_all_data=true
f11 to do the initial inference
Then playing with
f16 to add a new column name each time
f17 to load and display the dataframe, to check whether the new column got picked up after an "UnknownFieldException" message.

Kind regards - Roelof

Brahmareddy
Valued Contributor III

Hi @RoelofvS,

I have gone through your response and here is my suggestion below.

Make sure to allow a slight delay or checkpoint refresh after each schema change so that Auto Loader registers updates fully. Given that copying the new 0 to 1 prompted schema evolution once, try incremental versioning by creating successive files (0, 1, etc.) after each change. Additionally, consider restarting the stream whenever you make schema modifications, as this can help Auto Loader refresh the schema properly. Setting an explicit schema with .schema() for the initial load may also stabilize the evolution process by providing a structured base. For more insight, enable detailed logging to trace the schema evolution steps and check for any timing or metadata conflicts.
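The .schema() suggestion could look roughly like this; note that the column names in the DDL string are placeholders, not the demo's real fields:

```python
# Placeholder base schema as a DDL string (Spark's .schema() accepts DDL).
base_schema_ddl = "id BIGINT, name STRING, new_column STRING"

def read_with_base_schema(spark, source_path, schema_location):
    # Pinning a base schema; with addNewColumns, columns that appear later
    # in the data should still be added on top of this base.
    return (spark.readStream
                 .format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .option("cloudFiles.schemaLocation", schema_location)
                 .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
                 .schema(base_schema_ddl)
                 .load(source_path))
```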

Hope this helps!

Good day.

Regards,

Brahma

RoelofvS
New Contributor II

Hello @Brahmareddy,
I have tried the above, without success.

> enable detailed logging to trace schema evolution steps
Please can you guide me through the steps, or share a URL? We are on AWS.

Kind regards - Roelof

 

RoelofvS
New Contributor II

Hello @Brahmareddy,

> enable detailed logging to trace schema evolution steps
Please can you still advise on the above?

Kind regards - Roelof
