cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Using FROM_CSV giving unexpected results

seanstachff
Visitor

Hello, I am trying to use from_csv in the sql warehouse, but I am getting unexpected results:

As a small example I am running:

 

WITH your_table AS (
  SELECT 'a,b,c\n1,"hello, world",3.14\n2,"goodbye, world",2.71' AS csv_column
)

SELECT from_csv(csv_column, 'a INT, b STRING, c DOUBLE', map('header', 'true','multiLine','true')) AS parsed_csv
FROM your_table

 

 which is returning one row with the object:

 

object
a: null
b: "b"
c: null

 

It looks like it's still reading the header and not understanding the linebreak correctly. I would expect it to return an array of objects with two rows following the schema. I've tried a bunch of different combinations of mapping with linesep, header, etc. But I just can not get it to work how I would expect.

 

Does anyone have any advice on this? 

3 REPLIES 3

TakuyaOmi
New Contributor III

Hi, @seanstachff 

Is the following result what you are expecting?

parsed_csv
{"a":1,"b":"hello, world","c":3.14}
{"a":2,"b":"goodbye, world","c":2.71}

If so, you can achieve it with the SQL below.

%sql

WITH your_table AS (
  SELECT 'a,b,c\n1,"hello, world",3.14\n2,"goodbye, world",2.71' AS csv_column
)

SELECT 
  from_csv(
    row, 
    'a INT, b STRING, c DOUBLE', 
    map('header', 'false', 'multiLine', 'false')
  ) AS parsed_csv
FROM your_table
LATERAL VIEW explode(slice(split(csv_column, '\n'), 2, size(split(csv_column, '\n')) - 1)) AS row;

If you’re working with table-formatted data where each column exists as a separate field, there are other methods available as well.

seanstachff
Visitor

Thanks, I didn't read the docs clear enough and see that csv_data is for one row. I'm a big confused as to why it has a header then. My only issue with this is how could you handle the possibility of it having a linebreak in a string value on the csv row

TakuyaOmi
New Contributor III

Is this the kind of result you are expecting?

  • The data is interpreted as multiple rows.
  • The CSV contains multiple data rows (records), each split according to the schema.
  • Even when line breaks are part of a column value, they are not mistaken as row delimiters and are handled correctly as part of the dataset.

Is it something like the result shown in the image?

TakuyaOmi_0-1733363129023.png

 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group