Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

dataEngineer3
New Contributor II

Hi All,

I am trying to read a CSV file from the data lake and load the data into a SQL table using COPY INTO.

I am facing an issue:

[error screenshot attached]

Here I created a table with 6 columns, the same as the data in the CSV file, but I am unable to load the data.

Can anyone help me with this?

9 REPLIES

Kaniz_Fatma
Community Manager

Hi @dataEngineer3! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. If not, I will get back to you soon. Thanks.

Sebastian
Contributor

One option is to delete the underlying Delta files, or to add the option mergeSchema = true while writing to the Delta table.
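
A minimal sketch of what that could look like with COPY INTO (the storage path and target table location below are placeholders, not taken from the original post):

-- Sketch: let COPY INTO evolve the target table's schema via mergeSchema
COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
FILEFORMAT = CSV
COPY_OPTIONS ('mergeSchema' = 'true')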

Thanks for the quick response. I added the option mergeSchema = true, but I am still unable to load the data.

-werners-
Esteemed Contributor III

It looks like mergeSchema is not active.

You can try enabling it for the whole Spark session with the spark.databricks.delta.schema.autoMerge.enabled setting.
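
As a minimal sketch, the setting can be turned on from SQL for the current session:

SET spark.databricks.delta.schema.autoMerge.enabled = true;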

A better solution is to pass a schema for your CSV file (column names and types).

mergeSchema is pretty interesting, but it does not always work or give the results you want.

If it is a full load and you do not have much time to research the options, you can also go to the underlying storage, delete the physical Delta table, and do a full reload.

dataEngineer3
New Contributor II

Thanks, Werners, for your reply.

How do I pass a schema (column names and types) for the CSV file?

-werners-
Esteemed Contributor III

That can be done directly in SQL (with the COPY INTO command) or by using dataframes (the classic way).

As you started out with SQL:

-- The example below loads CSV files without headers on ADLS Gen2 using COPY INTO.
-- By casting the data and renaming the columns, you can put the data in the schema you want
COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
FROM (
  SELECT _c0::bigint key, _c1::int index, _c2 textData
  FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
)
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'

(Snippet from the Azure Docs)

There is also an 'inferSchema' option, which will try to determine the schema itself (by reading the data twice). But the quality of the result varies; it might not give you the result you expected (DoubleType instead of DecimalType, etc.), and it is slower (because the data is read twice).
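
As a rough sketch (again with placeholder paths), inferSchema is passed as a format option instead of casting the columns in the SELECT:

COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'
FORMAT_OPTIONS ('inferSchema' = 'true')
-- add 'header' = 'true' to FORMAT_OPTIONS if the files have a header row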

Using dataframes is similar, but you read the CSV with PySpark/Scala, using a manually defined schema or inferSchema.

(https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/)

Thanks, Werners, for your quick response.

  1. Here I have a table with 6 columns. How do I pass these 6 column names in the SELECT command (SELECT _c0::bigint key, _c1::int index, _c2 textData)?
  2. Do you mean _c0, _c1 are the column names?

-werners-
Esteemed Contributor III

_c0 etc. are indeed column names. If you read a CSV and do not define column names (or you read the file without a header), those are the default names Spark assigns.
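
For the 6-column table in the original question, the SELECT could look something like the sketch below; the column names and types are made up for illustration, since the real schema was not shared:

COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
FROM (
  -- _c0 .. _c5 are the default names for the 6 CSV columns; rename and cast them as needed
  SELECT
    _c0::bigint customer_id,
    _c1 first_name,
    _c2 last_name,
    _c3::int age,
    _c4 city,
    _c5::double amount
  FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
)
FILEFORMAT = CSV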
