11-09-2021 04:55 AM
11-09-2021 04:58 AM
Hi @dataEngineer! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Otherwise, I will get back to you soon. Thanks.
11-09-2021 05:57 AM
One option is to delete the underlying Delta files, or to add the mergeSchema option (set to true) while you are writing to the Delta table.
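Something like this in PySpark, as a rough sketch (the DataFrame df and the target path are placeholders):
# mergeSchema lets an append add new columns to the existing Delta table schema
(df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/path/to/delta/table"))  # placeholder path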
11-09-2021 06:01 AM
Thanks for the quick response. I added the mergeSchema option set to true, but I am still unable to load the data.
11-10-2021 01:37 AM
It looks like mergeSchema is not taking effect.
You can try enabling it for the whole Spark session with the spark.databricks.delta.schema.autoMerge.enabled setting.
A better solution is to pass a schema (column names and types) when you read your CSV file.
mergeSchema is handy, but it does not always work or give the results you want.
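For reference, a minimal sketch of enabling it at the session level:
# Enables automatic schema merging for Delta writes in this Spark session
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")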
11-11-2021 04:10 AM
If it is a full load and you don't have much time to research the options, the next thing you can do is go to the underlying storage, delete the physical Delta table, and do a full reload.
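If you go that route, a rough sketch on Databricks (the path and DataFrame df are placeholders, and this permanently removes the table files, so be careful):
# WARNING: irreversibly deletes all files of the Delta table at this path
dbutils.fs.rm("/path/to/delta/table", recurse=True)
# Then rewrite the table from scratch so it picks up the new schema
df.write.format("delta").mode("overwrite").save("/path/to/delta/table")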
11-10-2021 09:58 PM
Thanks Werners for your reply.
How do I pass a schema (column names and types) when reading a CSV file?
11-10-2021 11:47 PM
That can be done directly in SQL (with the COPY INTO command) or by using DataFrames (the classic way).
As you started out with SQL:
-- The example below loads CSV files without headers on ADLS Gen2 using COPY INTO.
-- By casting the data and renaming the columns, you can put the data in the schema you want
COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
FROM (
SELECT _c0::bigint key, _c1::int index, _c2 textData
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
)
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'
(Snippet from the Azure Docs)
There is also an 'inferSchema' option, which tries to determine the schema itself by reading the data twice. But the quality of the result varies; it might not give you what you expect (e.g. DoubleType instead of DecimalType), and it is slower because the data is read twice.
Using DataFrames is similar, but you read the CSV with PySpark/Scala using a manually defined schema or inferSchema (see the sketch below).
(https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/)
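As a rough illustration in PySpark (the column names, types, and paths follow the COPY INTO example above and are placeholders, so adjust them to your data):
from pyspark.sql.types import StructType, StructField, LongType, IntegerType, StringType

# Define the schema explicitly instead of relying on inferSchema
schema = StructType([
    StructField("key", LongType(), True),
    StructField("index", IntegerType(), True),
    StructField("textData", StringType(), True),
])

# Read the CSV with the explicit schema (header=False since the files have no header row)
df = spark.read.csv("abfss://container@storageAccount.dfs.core.windows.net/base/path",
                    schema=schema, header=False)

# Write to the Delta table; mergeSchema is only needed if new columns must be added
(df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target"))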
11-11-2021 04:55 AM
Thanks Werners for your quick response.
11-13-2021 10:54 PM
_c0 etc. are indeed column names. If you read a CSV without defining column names (or the file has no header), those are the default names Spark assigns.
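For example (the path is a placeholder):
# Reading a headerless CSV without a schema: Spark assigns default column names
df = spark.read.csv("/path/to/file.csv")
df.printSchema()
# root
#  |-- _c0: string (nullable = true)
#  |-- _c1: string (nullable = true)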