text datatype not supported and data having huge data in text field - how to bring it over

patdev
New Contributor III

Hello all,

I have a data file from the medical field, and one of the fields is a text field containing huge values. The problem is that Databricks does not support the text data type, so how can I bring the data over? I tried conversion and casting in various ways, but so far without success.

I tried importing the data (about 2.5 million records), but once the migration to the Delta table is done it becomes 85 million records because the text field gets broken into multiple rows...

What is the proper way to do this? Please guide.

9 REPLIES

Chaitanya_Raju
Honored Contributor

Hi @Pat Dev

You created the column as a string type in the Delta table, and when importing the data into that table, the values of that particular column are getting split into multiple rows. Is that what is happening?

What is the maximum length of the value in that string column?

Happy Learning!!

Hello,

thank you for replying.

I tried importing the file from CSV into a Delta table directly, and also defining the table first and then importing the file into it.

The column named note is of text type in Postgres. I exported it from there to CSV and then imported it into Delta.

Since the data type is text in the source, the length is not the same for all records. So I tried importing it as varchar(8000) or nvarchar(max), which would be string in Databricks, but the data still splits into multiple rows, making it unusable to query.

What would be the proper way or solution?

thanks

sher
Valued Contributor II

Could you share a sample code snippet?

patdev
New Contributor III

Hello,

This is the process I have followed:

1. Export from Postgres as a CSV file.
2. Migrate the file to a secure S3 location.
3. Use the following code to create the table and move the data:

code:

%sql
SET spark.databricks.delta.schema.autoMerge.enabled = true;

CREATE TABLE IF NOT EXISTS catlog.schema.tablename;

COPY INTO catlog.schema.tablename
FROM (SELECT * FROM 's3://bucket/test.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')

One of the columns is a text column that contains large values, but when the data is imported into the Delta table it gets split into multiple rows, which also makes it unusable to query.

thanks

Anonymous
Not applicable

Please add the option "multiLine", "true" and that should fix it. You may also need to escape quotes.
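For reference, a minimal sketch of how that option might be added to the COPY INTO statement shown earlier (same catlog.schema.tablename and S3 placeholders; the quote and escape settings are assumptions that depend on how the Postgres export was written):

%sql
COPY INTO catlog.schema.tablename
FROM (SELECT * FROM 's3://bucket/test.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'header' = 'true',
  'mergeSchema' = 'true',
  'multiLine' = 'true',   -- keep newlines inside quoted fields within a single record
  'quote' = '"',          -- text qualifier used by the export (assumed)
  'escape' = '"'          -- doubled quotes inside quoted fields (assumed)
)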

patdev
New Contributor III

Hi,

That helped and I was able to bring the data over. However, there might be a limit in Databricks on how much it can bring in, because most of the data in that column now has ... at the end. I think that might mean a limit was reached?

Also, I noticed in the table details that it states the schema contains a very long nested field and cannot be stored in the catalog.

What can be done?

Is there a limit on the Databricks data type?

Anonymous
Not applicable

You're having trouble with the CSV file. If there is a return character in the text, the parser thinks it's a new line and a new record. You'll want to use the option multiLine = true.

Here are the docs https://spark.apache.org/docs/latest/sql-data-sources-csv.html
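One way to sanity-check the parsing before loading (a sketch, assuming the same bucket path as in the thread) is to point a temporary view at the CSV with multiLine enabled and compare the row count against the roughly 2.5 million records exported from Postgres:

%sql
-- temporary view over the raw CSV so the parsed row count can be checked
CREATE OR REPLACE TEMPORARY VIEW csv_check
USING CSV
OPTIONS (
  path 's3://bucket/test.csv',
  header 'true',
  multiLine 'true'
);

SELECT COUNT(*) FROM csv_check;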

patdev
New Contributor III

Hello,

Thanks, I did notice the export has the text qualifier set to ", so what do I need to set?

patdev
New Contributor III

Setting escapeQuotes to false helped bring the huge text data over in that column.

thanks
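For anyone finding this thread later, a rough consolidation of the options reported to work here (same placeholders as before; 'escapeQuotes' is the option name mentioned in the post above and is best treated as an assumption, since the usual read-side options are 'quote' and 'escape'):

%sql
COPY INTO catlog.schema.tablename
FROM (SELECT * FROM 's3://bucket/test.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'header' = 'true',
  'mergeSchema' = 'true',
  'multiLine' = 'true',         -- keep embedded newlines in the large text column within one row
  'escapeQuotes' = 'false'      -- option the original poster reported setting; behaviour may vary
)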
