01-15-2023 01:01 PM
Hello all,
I have a medical data file, and one of the fields is a text field with very large values. The problem is that Databricks does not support a text data type, so how can I bring the data over? I tried conversions and casts in various ways, but so far without success.
I tried importing the data (about 2.5 million records), but once the migration to the Delta table is done it becomes 85 million records, because it breaks the text field into multiple rows...
What is the proper way to do this? Please guide.
01-15-2023 05:59 PM
Hi @Pat Dev
You created the column type as a string in the Delta table, and when importing the data into that table, the values of that particular column are getting split into multiple rows. Is that what is happening?
What is the maximum length of the value in that string column?
Happy Learning!!
01-16-2023 05:42 AM
Hello,
thank you for replying.
I tried importing the CSV file directly into a Delta table, and also defining the table first and then importing the file into it.
The column named note is of type text in Postgres. I exported from there to CSV and am then importing into Delta.
Since the source data type is text, the length is not the same for all records, so I tried importing it as varchar(8000) or nvarchar(max), which in Databricks would be string. But the data is still splitting into multiple rows, making it unusable to query.
What would be the proper way or solution?
thanks
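For reference, this is roughly how I defined the table when I tried the define-first approach; the names are placeholders except for the note column:
code:
%sql
CREATE TABLE IF NOT EXISTS catlog.schema.tablename (
  record_id STRING,  -- placeholder for the other columns
  note      STRING   -- the Postgres text column, kept as STRING
)
USING DELTA;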
01-15-2023 07:56 PM
Could you share a sample code snippet?
01-16-2023 11:56 AM
Hello,
This is the process I have followed:
1 export from Postgres as a CSV file
2 migrate the file to a secure S3 location
3 use the following code to create the table and move the data:
code:
%sql
SET spark.databricks.delta.schema.autoMerge.enabled = true;
create table if not exists catlog.schema.tablename;
COPY INTO catlog.schema.tablename
FROM (SELECT * FROM 's3://bucket/test.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')
One of the columns is in text format and contains large data, but when the import into the Delta table runs it creates multiple rows, which also makes it unusable to query.
thanks
01-16-2023 12:02 PM
Please add the option "multiLine" = "true" and that should fix it. You may also need to escape quotes.
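For example, based on your COPY INTO snippet (same table and path placeholders as in your post; the escape setting assumes the export doubles embedded quotes, which is the Postgres CSV default):
code:
%sql
COPY INTO catlog.schema.tablename
FROM (SELECT * FROM 's3://bucket/test.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true', 'multiLine' = 'true', 'escape' = '"')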
01-18-2023 03:40 PM
Hi,
That helped, and I was able to bring the data over. However, there might be a limit in Databricks on how much it can bring in, because most of the data in that column now has ... at the end. I think that might mean a limit was reached?
Also, I noticed in the table details that it states the schema contains a very long nested field and cannot be stored in the catalog.
What can be done?
Is there a limit on the Databricks data type?
01-16-2023 10:33 AM
You're having trouble with the CSV file. If there is a return character in the text, the parser thinks it's a new line and a new record. You'll want to use the option multiLine = true.
Here are the docs https://spark.apache.org/docs/latest/sql-data-sources-csv.html
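If you want to sanity-check the row count before rerunning COPY INTO, one option (the view name here is just an example) is to point a temporary view at the same file with the same options and count the rows; it should come back near the 2.5 million source records rather than 85 million:
code:
%sql
CREATE OR REPLACE TEMPORARY VIEW notes_csv_check
USING CSV
OPTIONS (path 's3://bucket/test.csv', header 'true', multiLine 'true', escape '"');

SELECT count(*) FROM notes_csv_check;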
01-16-2023 01:33 PM
Hello,
Thanks, I did notice the export has the text qualifier set to ", so what do I need to set?
01-21-2023 11:39 AM
Setting escapeQuotes to false helped to bring the huge text data into the column.
thanks
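For anyone who finds this later, the command I ended up with looks roughly like this (same placeholder table and path as above), combining the multiLine and escapeQuotes options from this thread:
code:
%sql
COPY INTO catlog.schema.tablename
FROM (SELECT * FROM 's3://bucket/test.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true', 'multiLine' = 'true', 'escapeQuotes' = 'false')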