Partially upload data of 1.2GB
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-22-2024 10:42 PM
Hello Team,
I have file contain in txt format of 1.2gb file.I am trying to upload the data into ms sql server database table.Only 10% of the data able to upload it.
example :
Total records in a file : 51303483
Number of records inserted :10224430
I am using pyspark in databricks.But in the file there is no junk and blank space are present.But still not able to load 100% data.
Please advise any workaround.
Regards
Rohit
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-23-2024 12:03 AM
@RohitKulkarni wrote:But still not able to load 100% data
What do you mean by that? Do you get any error messages or so? How do you try to read this file in detail?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-23-2024 12:10 AM
I have read the file with below script
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-23-2024 12:18 AM
I'm a bit confused to be honest. This is neither PySpark nor really Databricks specific code, this is simply Python.
Do you want to migrate it PySpark and make real use of your Spark infrastructure?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-23-2024 12:26 AM
Honestly if you have read the description correctly. Will be able to come to know what i am trying to achieve .
I am having text file of 1.2GB of data.I am trying to load the data into Azure sql database.
There are total rows :51303483
I am able to read and load the rows :10224430
I am trying to use the Python script and got failed .
After that i am trying to load via pyspark still getting failed
Thanks in adavce
Rohit
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-23-2024 12:23 AM
@Witold @RohitKulkarni I am also confused now , according to the use case , what this python code is doing here , your requirement is simple anyways but could you please elaborate a little bit more .
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-23-2024 12:35 AM - edited 07-23-2024 12:36 AM
To be able to help you, we would need to understand at least the following:
- How does your data look like, i.e what's inside this txt file? Is this structure or semi-structured data?
- You only mention that it fails, without giving any specific details. If you say that only 10% of the initial data is visible in your SQL database, you should be able to easily tell, which data worked and which not. Can you share both examples with us?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-23-2024 01:09 AM
It has structured data :
PAT_KEY|ICD_VERSION|ICD_CODE|ICD_PRI_SEC|ICD_POA|
|516351692|10|M12.123|A|Y
Regards
Rohit
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-23-2024 03:13 AM - edited 07-23-2024 03:14 AM
Why don't you simply use spark to process it? Like:
df = spark.read.option('sep','|').option('header', True).format('csv').load(file_path)
Since it appears that you also have a schema you can avoid inferring it and pass it explicitly:
df = spark.read.option('sep','|').option('header', True).schema('PAT_KEY INT, CD_VERSION INT, CD_CODE STRING, CD_PRI_SEC STRING, CD_POA STRING').format('csv').load(file_path)
Then you process it and write it wherever and in which format you prefer.
If some of your data is corrupted, i.e. not according to the schema, you might want to look into Auto Loader and its rescue data feature.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-23-2024 04:42 AM
I already tried with same syntax read.option. But still i am facing the same issue
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-23-2024 10:55 PM
@RohitKulkarni please explain your case propery then , we hve given you a sample answer not acc to your scenario.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-28-2024 01:53 AM
There was a access lines in the document. Because of this there was partially loading.
Thanks for the support