Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Partial upload of 1.2GB data

RohitKulkarni
Contributor II

Hello Team,

I have a file in txt format of about 1.2 GB. I am trying to upload the data into an MS SQL Server database table, but only about 10% of the data is getting uploaded.

Example:

Total records in the file: 51303483

Number of records inserted: 10224430

I am using PySpark in Databricks. There are no junk characters or blank lines in the file, but I am still not able to load 100% of the data.

Please advise any workaround.

 

Regards

Rohit

11 REPLIES

Witold
Contributor III

@RohitKulkarni wrote:

But still not able to load 100% data


What do you mean by that? Do you get any error messages? How exactly are you reading this file?

I have read the file with the script below:

import os

def split_file_by_size(file_path, chunk_size_mb=500):
    base_name, ext = os.path.splitext(file_path)
    chunk_size = chunk_size_mb * 1024 * 1024  # Convert MB to bytes
    part_number = 1
    buffer = ''
    file_records = {}

    with open(file_path, 'r') as infile:
        while True:
            chunk = infile.readlines(chunk_size)
            if not chunk:
                break  # Remaining buffer is written after the loop
            buffer += ''.join(chunk)
            if len(buffer) >= chunk_size and '\n' in buffer:
                # Split at the last complete record so no line is cut in half
                last_complete_line = buffer.rfind('\n') + 1
                part_file_name = f"{base_name}.part{part_number}{ext}"
                record_count = write_and_count(buffer[:last_complete_line], part_file_name)
                if record_count > 0:
                    file_records[part_file_name] = record_count
                buffer = buffer[last_complete_line:]
                part_number += 1

    if buffer:
        part_file_name = f"{base_name}.part{part_number}{ext}"
        record_count = write_and_count(buffer, part_file_name)
        if record_count > 0:
            file_records[part_file_name] = record_count

    total_records = sum(file_records.values())
    for file_name, count in file_records.items():
        print(f"{file_name}: {count} records")
    print(f"Total records: {total_records}")

def write_and_count(data, file_path):
    # Only write if data is not empty
    if data.strip():
        with open(file_path, 'w') as outfile:
            outfile.write(data)
        return count_records(file_path)
    return 0

def count_records(file_path):
    with open(file_path, 'r') as infile:
        return sum(1 for line in infile if line.strip())  # Count non-empty lines


container_name = 'marketaccess'  # replace with your actual container name
input_file_path = f"/dbfs/mnt/{container_name}/pn_2022_paticd_diag.txt"

split_file_by_size(input_file_path)
 
It is not giving any error. The data is only getting loaded partially.
 
Regards
Rohit 
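
One quick check (a suggestion, not something from the thread) is to count the physical lines of the source file with Spark and compare that number with the rows that actually reach SQL Server. spark.read.text is a standard Spark API; the path below mirrors the mount used above and may need adjusting:

# Suggested diagnostic sketch: count raw lines in the source file.
# The path mirrors the mount used above; adjust if your layout differs.
raw_lines = spark.read.text("dbfs:/mnt/marketaccess/pn_2022_paticd_diag.txt").count()
print(f"Physical lines in file: {raw_lines}")  # compare with 51303483 and with rows in SQL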

Witold
Contributor III

I'm a bit confused, to be honest. This is neither PySpark nor really Databricks-specific code; it is simply Python.

Do you want to migrate it to PySpark and make real use of your Spark infrastructure?

Honestly, if you read the description carefully, you will see what I am trying to achieve.

I have a text file of 1.2 GB of data. I am trying to load the data into an Azure SQL database.

Total rows: 51303483

Rows I am able to read and load: 10224430

I tried using the Python script and it failed.

After that I tried to load it via PySpark, and it is still failing.

Thanks in advance

Rohit

 

Rishabh-Pandey
Esteemed Contributor

@Witold @RohitKulkarni I am also confused now. Given the use case, what is this Python code doing here? Your requirement seems simple enough anyway, but could you please elaborate a little more?

Rishabh Pandey

Witold
Contributor III

@RohitKulkarni 

To be able to help you, we would need to understand at least the following:

  • What does your data look like, i.e. what's inside this txt file? Is it structured or semi-structured data?
  • You only mention that it fails, without giving any specific details. If only 10% of the initial data is visible in your SQL database, you should be able to easily tell which data loaded and which did not. Can you share examples of both with us?

It has structured data:

PAT_KEY|ICD_VERSION|ICD_CODE|ICD_PRI_SEC|ICD_POA| 

|516351692|10|M12.123|A|Y

Regards

Rohit

Witold
Contributor III

Why don't you simply use Spark to process it? Something like:

 

df = spark.read.option('sep','|').option('header', True).format('csv').load(file_path)

 

Since it appears that you also have a schema, you can avoid inferring it and pass it explicitly:

 

df = spark.read.option('sep','|').option('header', True).schema('PAT_KEY INT, ICD_VERSION INT, ICD_CODE STRING, ICD_PRI_SEC STRING, ICD_POA STRING').format('csv').load(file_path)

 

Then you process it and write it wherever, and in whichever format, you prefer.
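
For the MS SQL Server / Azure SQL target mentioned in this thread, a minimal write sketch using Spark's built-in JDBC data source could look like the one below; the server, database, table, user and password values are placeholders, not details from the thread:

# Hypothetical sketch: write the parsed DataFrame to Azure SQL over JDBC.
# All connection values below are placeholders.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"

(df.write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.pn_2022_paticd_diag")
   .option("user", "<user>")
   .option("password", "<password>")
   .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
   .mode("append")
   .save())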

If some of your data is corrupted, i.e. does not conform to the schema, you might want to look into Auto Loader and its rescued data column.
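
For reference, a minimal Auto Loader sketch, assuming the pipe-delimited files land in the mounted container; the input path, checkpoint location and target table name are placeholders. Rows that do not match the declared schema are captured in the _rescued_data column rather than dropped:

# Hypothetical Auto Loader sketch; paths and table name are placeholders.
(spark.readStream
   .format("cloudFiles")
   .option("cloudFiles.format", "csv")
   .option("sep", "|")
   .option("header", True)
   .schema("PAT_KEY INT, ICD_VERSION INT, ICD_CODE STRING, ICD_PRI_SEC STRING, ICD_POA STRING")
   .load("dbfs:/mnt/marketaccess/incoming/")
   .writeStream
   .option("checkpointLocation", "dbfs:/mnt/marketaccess/_checkpoints/paticd_diag")
   .trigger(availableNow=True)
   .toTable("paticd_diag_bronze"))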

I already tried with the same read.option syntax, but I am still facing the same issue.

@RohitKulkarni please explain your case properly then; we have given you a sample answer, not one tailored to your scenario.

Rishabh Pandey

RohitKulkarni
Contributor II

There were excess lines in the document. Because of this, the data was loading only partially.

Thanks for the support
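
For anyone hitting the same symptom, one way to surface such extra or malformed lines with plain Spark CSV options is sketched below; PERMISSIVE mode and columnNameOfCorruptRecord are standard CSV reader options, and the path is the one used earlier in the thread:

# Sketch: keep malformed lines in a _corrupt_record column instead of losing them silently.
from pyspark.sql import functions as F

df = (spark.read
      .option("sep", "|")
      .option("header", True)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema("PAT_KEY INT, ICD_VERSION INT, ICD_CODE STRING, ICD_PRI_SEC STRING, "
              "ICD_POA STRING, _corrupt_record STRING")
      .csv("dbfs:/mnt/marketaccess/pn_2022_paticd_diag.txt"))

df.filter(F.col("_corrupt_record").isNotNull()).show(20, truncate=False)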

 
