text datatype not supported and data having huge data in text field - how to bring it over

patdev
New Contributor III

Hello all,

I have a data file from the medical field, and one of the fields is a text field containing huge values. The problem is that Databricks does not support the text data type, so how can I bring the data over? I tried conversion and casting in various ways, but so far without success.

I tried importing the data (about 2.5 million records), but once the migration to the Delta table is done it becomes 85 million records because the text field gets broken into multiple rows...

What is the proper way to do this? Please guide.

9 REPLIES

Chaitanya_Raju
Honored Contributor

Hi @Pat Dev

You created the column as a string type in the Delta table, and when importing the data into that table, the values of that particular column are getting split into multiple rows. Is that what is happening?

What is the maximum length of the value in that string column?

Happy Learning!!

Hello,

thank you for replying.

I tried importing the file from CSV into a Delta table directly, and also defining the table first and then importing the file into it.

The column named note is of text type in Postgres. I exported it from there to CSV and then imported it into Delta.

Since the data type is text in the source, the length is not the same for all records. So I tried importing it as varchar(8000) or nvarchar(max), which would be string in Databricks, but the data still splits into multiple rows, making it unusable to query.

What would be the proper way or solution?

thanks

sher
Valued Contributor II

Could you share a sample code snippet?

patdev
New Contributor III

Hello,

This is the process I have followed:

1. Export from Postgres as a CSV file.
2. Migrate the file to a secure S3 location.
3. Use the following code to create the table and move the data:

code:

%sql
SET spark.databricks.delta.schema.autoMerge.enabled = true;

CREATE TABLE IF NOT EXISTS catlog.schema.tablename;

COPY INTO catlog.schema.tablename
FROM (SELECT * FROM 's3://bucket/test.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')

One of the columns is a text column that contains large values, but when the data is imported into the Delta table it gets split into multiple rows, which also makes it unusable to query.

thanks

Anonymous
Not applicable

Please add the option "multiLine", "true" and that should fix it. You may also need to escape quotes.
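For reference, a minimal sketch of how that option might be added to the COPY INTO statement shown earlier (same catlog.schema.tablename and S3 placeholders; the quote and escape settings are assumptions that depend on how the Postgres export was written):

%sql
COPY INTO catlog.schema.tablename
FROM (SELECT * FROM 's3://bucket/test.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'header' = 'true',
  'mergeSchema' = 'true',
  'multiLine' = 'true',   -- keep newlines inside quoted fields within a single record
  'quote' = '"',          -- text qualifier used by the export (assumed)
  'escape' = '"'          -- doubled quotes inside quoted fields (assumed)
)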

patdev
New Contributor III

Hi,

That helped and I was able to bring the data over. However, there might be a limit in Databricks on how much it can bring in, because most of the data in that column now has ... at the end. I think that might mean a limit was reached?

Also, I noticed in the table details that it states the schema contains a very long nested field and cannot be stored in the catalog.

What can be done?

Is there a limit on the Databricks data type?

Anonymous
Not applicable

You're having trouble with the CSV file. If there is a return character in the text, the parser thinks it's a new line and a new record. You'll want to use the option multiLine = true.

Here are the docs https://spark.apache.org/docs/latest/sql-data-sources-csv.html
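One way to sanity-check the parsing before loading (a sketch, assuming the same bucket path as in the thread) is to point a temporary view at the CSV with multiLine enabled and compare the row count against the roughly 2.5 million records exported from Postgres:

%sql
-- temporary view over the raw CSV so the parsed row count can be checked
CREATE OR REPLACE TEMPORARY VIEW csv_check
USING CSV
OPTIONS (
  path 's3://bucket/test.csv',
  header 'true',
  multiLine 'true'
);

SELECT COUNT(*) FROM csv_check;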

patdev
New Contributor III

Hello,

Thanks, I did notice the export has the text qualifier set to ", so what do I need to set?

patdev
New Contributor III

Setting escapeQuotes to false helped bring the huge text data over in that column.

thanks
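For anyone finding this thread later, a rough consolidation of the options reported to work here (same placeholders as before; 'escapeQuotes' is the option name mentioned in the post above and is best treated as an assumption, since the usual read-side options are 'quote' and 'escape'):

%sql
COPY INTO catlog.schema.tablename
FROM (SELECT * FROM 's3://bucket/test.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'header' = 'true',
  'mergeSchema' = 'true',
  'multiLine' = 'true',         -- keep embedded newlines in the large text column within one row
  'escapeQuotes' = 'false'      -- option the original poster reported setting; behaviour may vary
)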
