
dataEngineer3
New Contributor II

Hi All,

I am trying to read a CSV file from the data lake and load the data into a SQL table using COPY INTO, but I am running into an issue (see the attached screenshot).

I created a table with 6 columns, matching the data in the CSV file, but I am unable to load the data.

Can anyone help me with this?


Kaniz
Community Manager

Hi @dataEngineer3! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first; otherwise, I will get back to you soon. Thanks.

Sebastian
Contributor

One option is to delete the underlying Delta files; another is to add the mergeSchema = true option while writing to the Delta table.
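
For illustration, a minimal PySpark sketch of the write-time mergeSchema option. The CSV path and target table name are placeholders, and `spark` is the SparkSession that Databricks notebooks provide:

# Read the CSV (placeholder path) and append it to the Delta table
# with schema evolution enabled on the write.
df = (spark.read
      .option("header", "true")
      .csv("abfss://container@storageAccount.dfs.core.windows.net/base/path/file.csv"))

(df.write
   .format("delta")
   .option("mergeSchema", "true")   # allow new/changed columns to merge into the table schema
   .mode("append")
   .saveAsTable("my_target_table")) # placeholder table name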

dataEngineer3
New Contributor II

Thanks for the quick response. I added the mergeSchema = true option, but I am still unable to load the data.

-werners-
Esteemed Contributor III

It looks like mergeSchema is not active.

You can try enabling it for the whole Spark session with the spark.databricks.delta.schema.autoMerge.enabled setting.
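
A minimal sketch of enabling that session setting in a notebook cell (run it before the load; `spark` is the notebook's SparkSession):

# Enable Delta schema auto-merge for the whole Spark session.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")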

A better solution is to pass a schema for your CSV file (column names and types).

mergeSchema is pretty interesting, but it does not always work or give the results you want.

The next thing you can do, if it is a full load and you do not have much time to research the options, is to go to the underlying storage, delete the physical Delta table, and do a full reload.

dataEngineer3
New Contributor II

Thanks, Werners, for your reply.

How do I pass a schema (column names and types) for the CSV file?

-werners-
Esteemed Contributor III

That can be done directly in SQL (with the COPY INTO command) or by using DataFrames (the classic way).

As you started out with SQL:

-- The example below loads CSV files without headers on ADLS Gen2 using COPY INTO.
-- By casting the data and renaming the columns, you can put the data in the schema you want
COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
FROM (
  SELECT _c0::bigint key, _c1::int index, _c2 textData
  FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
)
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'

(Snippet from the Azure Docs)

There is also an 'inferSchema' option, which tries to determine the schema automatically (by reading the data twice). The quality of the result varies, though: it might not give you what you expect (DoubleType instead of DecimalType, for example), and it is slower because the data is read twice.
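
As a rough PySpark sketch (placeholder path), schema inference on a CSV read looks like this:

# Let Spark infer the column types; the extra pass over the data makes this slower.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("abfss://container@storageAccount.dfs.core.windows.net/base/path"))

df.printSchema()   # check the inferred types before loading into the table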

Using DataFrames is similar, but you read the CSV with PySpark/Scala and either a manually defined schema or inferSchema.

(https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/)
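
A minimal PySpark sketch of the manually defined schema approach. The three columns mirror the SQL example above and are hypothetical; adjust the names and types to your six columns, and replace the path and table name:

from pyspark.sql.types import StructType, StructField, LongType, IntegerType, StringType

# Hypothetical schema; no inference pass is needed because the types are given.
schema = StructType([
    StructField("key", LongType(), True),
    StructField("index", IntegerType(), True),
    StructField("textData", StringType(), True),
])

df = (spark.read
      .schema(schema)
      .option("header", "false")
      .csv("abfss://container@storageAccount.dfs.core.windows.net/base/path"))

(df.write
   .format("delta")
   .mode("append")
   .saveAsTable("my_target_table"))   # placeholder table name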

dataEngineer3
New Contributor II

Thanks, Werners, for your quick response.

  1. My table has 6 columns; how do I list those 6 column names in the SELECT, as in SELECT _c0::bigint key, _c1::int index, _c2 textData?
  2. Do you mean _c0, _c1, ... are the column names?

-werners-
Esteemed Contributor III

_c0 etc. are indeed column names. If you read a CSV without defining column names (or read the file without a header), those are the default names Spark assigns.
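
A quick PySpark illustration (placeholder path; the renamed column names are hypothetical):

# Reading without a header or schema gives columns named _c0, _c1, _c2, ...
df = spark.read.csv("abfss://container@storageAccount.dfs.core.windows.net/base/path")
df.printSchema()

# Rename them afterwards if you want meaningful names (one name per column).
df = df.toDF("key", "index", "textData", "col4", "col5", "col6")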
