Databricks Community

RohitKulkarni · ‎07-22-2024

Hello Team,

I have file contain in txt format of 1.2gb file.I am trying to upload the data into ms sql server database table.Only 10% of the data able to upload it.

example :

Total records in a file : 51303483

Number of records inserted :10224430

I am using pyspark in databricks.But in the file there is no junk and blank space are present.But still not able to load 100% data.

Please advise any workaround.

Regards

Rohit

Witold · ‎07-23-2024

@RohitKulkarni wrote:
But still not able to load 100% data

What do you mean by that? Do you get any error messages or so? How do you try to read this file in detail?

RohitKulkarni · ‎07-23-2024

I have read the file with below script

import os

def split_file_by_size(file_path, chunk_size_mb=500😞

base_name, ext = os.path.splitext(file_path)

chunk_size = chunk_size_mb * 1024 * 1024 # Convert MB to bytes

part_number = 1

buffer = ''

total_records = 0

file_records = {}

with open(file_path, 'r') as infile:

while True:

chunk = infile.readlines(chunk_size)

if not chunk:

if buffer:

part_file_name = f"{base_name}.part{part_number}{ext}"

record_count = write_and_count(buffer, part_file_name)

if record_count > 0:

file_records[part_file_name] = record_count

break

buffer += ''.join(chunk)

lines = buffer.splitlines()

if len(buffer) >= chunk_size:

# Find the last complete record

partial_chunk = ''.join(chunk)

if '\n' in partial_chunk:

last_complete_line = partial_chunk.rfind('\n') + 1

part_file_name = f"{base_name}.part{part_number}{ext}"

current_chunk = buffer[:last_complete_line]

record_count = write_and_count(current_chunk, part_file_name)

if record_count > 0:

file_records[part_file_name] = record_count

buffer = buffer[last_complete_line:]

part_number += 1

if buffer:

part_file_name = f"{base_name}.part{part_number}{ext}"

record_count = write_and_count(buffer, part_file_name)

if record_count > 0:

file_records[part_file_name] = record_count

total_records = sum(file_records.values())

for file_name, count in file_records.items():

print(f"{file_name}: {count} records")

print(f"Total records: {total_records}")

def write_and_count(data, file_path😞

# Only write if data is not empty

if data.strip():

with open(file_path, 'w') as outfile:

outfile.write(data)

return count_records(file_path)

return 0

def count_records(file_path😞

with open(file_path, 'r') as infile:

return sum(1 for line in infile if line.strip()) # Count non-empty lines

container_name = 'marketaccess' # replace with your actual container name

input_file_path = f"/dbfs/mnt/{container_name}/pn_2022_paticd_diag.txt"

split_file_by_size(input_file_path)

It is not giving any error. The data is getting loaded Partially.

Regards

Rohit

Witold · ‎07-23-2024

I'm a bit confused to be honest. This is neither PySpark nor really Databricks specific code, this is simply Python.

Do you want to migrate it PySpark and make real use of your Spark infrastructure?

RohitKulkarni · ‎07-23-2024

Honestly if you have read the description correctly. Will be able to come to know what i am trying to achieve .

I am having text file of 1.2GB of data.I am trying to load the data into Azure sql database.

There are total rows :51303483

I am able to read and load the rows :10224430

I am trying to use the Python script and got failed .

After that i am trying to load via pyspark still getting failed

Thanks in adavce

Rohit

Rishabh-Pandey · ‎07-23-2024

@Witold @RohitKulkarni I am also confused now , according to the use case , what this python code is doing here , your requirement is simple anyways but could you please elaborate a little bit more .

Rishabh Pandey

Witold · ‎07-23-2024

@RohitKulkarni

To be able to help you, we would need to understand at least the following:

How does your data look like, i.e what's inside this txt file? Is this structure or semi-structured data?
You only mention that it fails, without giving any specific details. If you say that only 10% of the initial data is visible in your SQL database, you should be able to easily tell, which data worked and which not. Can you share both examples with us?

RohitKulkarni · ‎07-23-2024

It has structured data :

|516351692|10|M12.123|A|Y

Regards

Rohit

Witold · ‎07-23-2024

Why don't you simply use spark to process it? Like:

df = spark.read.option('sep','|').option('header', True).format('csv').load(file_path)

Since it appears that you also have a schema you can avoid inferring it and pass it explicitly:

df = spark.read.option('sep','|').option('header', True).schema('PAT_KEY INT, CD_VERSION INT, CD_CODE STRING, CD_PRI_SEC STRING, CD_POA STRING').format('csv').load(file_path)

Then you process it and write it wherever and in which format you prefer.

If some of your data is corrupted, i.e. not according to the schema, you might want to look into Auto Loader and its rescue data feature.