07-22-2024 10:42 PM
Hello Team,
I have a 1.2 GB file in txt format. I am trying to upload the data into a MS SQL Server database table, but only about 20% of the data is being uploaded.
For example:
Total records in the file: 51303483
Number of records inserted: 10224430
I am using PySpark in Databricks. There are no junk characters or blank lines in the file, but I am still not able to load 100% of the data.
Please advise on any workaround.
Regards
Rohit
07-23-2024 12:03 AM
@RohitKulkarni wrote: "But still not able to load 100% data"
What do you mean by that? Do you get any error messages? How exactly are you reading this file?
07-23-2024 12:10 AM
I have read the file with the script below.
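A minimal sketch of such a plain-Python loader, assuming a pyodbc connection (the connection string, target table, and file path are placeholders, not the original script):

import pyodbc

# Representative sketch only; all connection values below are placeholders.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<db>;UID=<user>;PWD=<pwd>"
)
cur = conn.cursor()
cur.fast_executemany = True  # speeds up executemany against SQL Server

with open("/dbfs/path/to/file.txt") as f:  # placeholder path
    next(f)  # skip the header line
    batch = []
    for line in f:
        fields = line.rstrip("\n").strip("|").split("|")
        if len(fields) == 5:  # lines with a different field count are skipped
            batch.append(fields)
        if len(batch) == 10000:
            cur.executemany("INSERT INTO dbo.icd_codes VALUES (?, ?, ?, ?, ?)", batch)
            batch = []
    if batch:
        cur.executemany("INSERT INTO dbo.icd_codes VALUES (?, ?, ?, ?, ?)", batch)

conn.commit()
conn.close()

Note that a loader of this shape silently drops any line whose field count does not match, which is one way a load can end up partial.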
07-23-2024 12:18 AM
I'm a bit confused, to be honest. This is neither PySpark nor really Databricks-specific code; it is simply Python.
Do you want to migrate it to PySpark and make real use of your Spark infrastructure?
07-23-2024 12:26 AM
Honestly, if you read the description carefully, you will see what I am trying to achieve.
I have a 1.2 GB text file. I am trying to load the data into an Azure SQL database.
Total rows: 51303483
Rows read and loaded: 10224430
I tried using a Python script and it failed.
After that I tried loading via PySpark and it still fails.
Thanks in advance
Rohit
07-23-2024 12:23 AM
@Witold @RohitKulkarni I am also confused now. Given the use case, what is this Python code doing here? Your requirement seems simple anyway, but could you please elaborate a little more?
07-23-2024 12:35 AM - edited 07-23-2024 12:36 AM
To be able to help you, we would need to understand at least the following:
07-23-2024 01:09 AM
It contains structured data:
PAT_KEY|ICD_VERSION|ICD_CODE|ICD_PRI_SEC|ICD_POA|
|516351692|10|M12.123|A|Y
Regards
Rohit
07-23-2024 03:13 AM - edited 07-23-2024 03:14 AM
Why don't you simply use Spark to process it? For example:
df = (spark.read
      .option('sep', '|')
      .option('header', True)
      .format('csv')
      .load(file_path))
Since it appears that you also have a schema, you can avoid inferring it and pass it explicitly:
df = (spark.read
      .option('sep', '|')
      .option('header', True)
      .schema('PAT_KEY INT, ICD_VERSION INT, ICD_CODE STRING, ICD_PRI_SEC STRING, ICD_POA STRING')
      .format('csv')
      .load(file_path))
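If rows are being silently dropped, reading in PERMISSIVE mode with a corrupt-record column makes the failing lines visible. A minimal diagnostic sketch, assuming file_path points at the 1.2 GB file:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# The extra _corrupt_record field receives any line that does not fit the schema.
schema = StructType([
    StructField('PAT_KEY', IntegerType()),
    StructField('ICD_VERSION', IntegerType()),
    StructField('ICD_CODE', StringType()),
    StructField('ICD_PRI_SEC', StringType()),
    StructField('ICD_POA', StringType()),
    StructField('_corrupt_record', StringType()),
])

df = (spark.read
      .option('sep', '|')
      .option('header', True)
      .option('mode', 'PERMISSIVE')
      .option('columnNameOfCorruptRecord', '_corrupt_record')
      .schema(schema)
      .csv(file_path))

df.cache()  # needed before querying the corrupt-record column on its own
df.filter(df['_corrupt_record'].isNotNull()).show(5, truncate=False)
print(df.count())  # compare against the expected 51303483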
Then you process it and write it wherever and in which format you prefer.
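For the Azure SQL target specifically, a hedged sketch using Spark's built-in JDBC writer (server, database, table, and credentials are placeholders):

# Append the DataFrame to an Azure SQL table over JDBC; requires the
# SQL Server JDBC driver on the cluster.
(df.write
   .format('jdbc')
   .option('url', 'jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<db>;encrypt=true')
   .option('dbtable', 'dbo.icd_codes')  # hypothetical target table
   .option('user', '<user>')
   .option('password', '<password>')
   .option('batchsize', 10000)          # larger batches speed up inserts
   .mode('append')
   .save())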
If some of your data is corrupted, i.e. does not conform to the schema, you might want to look into Auto Loader and its rescued data feature.
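A minimal Auto Loader sketch with rescued data enabled (all paths and the table name are placeholders); fields that do not fit the schema land in the _rescued_data column instead of being dropped:

# Incremental CSV ingestion with Auto Loader; schema mismatches are rescued.
df = (spark.readStream
      .format('cloudFiles')
      .option('cloudFiles.format', 'csv')
      .option('sep', '|')
      .option('header', True)
      .option('cloudFiles.schemaLocation', '/tmp/icd/_schema')  # placeholder
      .option('cloudFiles.schemaEvolutionMode', 'rescue')
      .load('/path/to/source_dir'))                             # placeholder

(df.writeStream
   .option('checkpointLocation', '/tmp/icd/_checkpoint')        # placeholder
   .trigger(availableNow=True)  # process all pending files, then stop
   .toTable('icd_staging'))     # hypothetical Delta staging table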
07-23-2024 04:42 AM
I already tried the same read.option syntax, but I am still facing the same issue.
07-23-2024 10:55 PM
@RohitKulkarni please explain your case properly then; we have given you a sample answer, not one tailored to your scenario.
08-28-2024 01:53 AM
There were excess lines in the file. Because of these, the data was only partially loading.
Thanks for the support
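For anyone hitting the same problem, one hedged way to keep such excess or malformed lines from silently truncating a load is Databricks' badRecordsPath option (paths are placeholders):

# Divert unparseable lines to a side location for later inspection instead
# of dropping them inline; badRecordsPath is Databricks-specific.
df = (spark.read
      .option('sep', '|')
      .option('header', True)
      .option('badRecordsPath', '/tmp/icd/_bad_records')  # placeholder
      .schema('PAT_KEY INT, ICD_VERSION INT, ICD_CODE STRING, ICD_PRI_SEC STRING, ICD_POA STRING')
      .csv(file_path))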