
Not able to read the file content completely using head

saichandu_25 — Fri, 17 May 2024 07:15:54 GMT

Hi,

We want to read the full content of a file and encode it as base64. For that we used the code below:

import base64

file_path = "/path/to/your/file.csv"

file_content = dbutils.fs.head(file_path, 512000000)

encode_content = base64.b64encode(file_content.encode()).decode()

print(encode_content)

The file has 1700 records, but with head we get only 232 records.

With the code above, part of the file content is skipped, so we cannot read the full data and encode it. Could you please suggest a solution?

Re: Not able to read the file content completely using head

-werners- — Fri, 17 May 2024 07:37:12 GMT

The head function only returns part of the file; that is what it is designed to do. The maxBytes you can pass has an upper limit of 64 KB (head(file: java.lang.String, maxBytes: int = 65536): java.lang.String).
You can read the file using Spark (spark.read.csv), plain Python (using pandas or with open(<file>)), or Scala (using scala.io.Source).
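
As a minimal sketch of the plain-Python route mentioned above: the file is opened in binary mode, read in full, and encoded as base64. The function name encode_file_base64 is hypothetical, and the sketch assumes the path is reachable through the local file system of the driver and that the file fits in driver memory.

```python
import base64


def encode_file_base64(file_path: str) -> str:
    """Read the whole file as raw bytes and return its base64 text."""
    # Binary mode ('rb') skips text decoding, so every byte is kept
    # regardless of the file format.
    with open(file_path, "rb") as f:
        content = f.read()
    # b64encode returns bytes; base64 output is pure ASCII.
    return base64.b64encode(content).decode("ascii")
```

A call would look like encode_file_base64(file_path), with file_path pointing at a location that the built-in open function can see.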

Re: Not able to read the file content completely using head

saichandu_25 — Fri, 17 May 2024 07:43:20 GMT

Thanks for the update. Actually, we want to read multiple file formats, and we want to read the file content irrespective of the format; that is why we used head.

with open is not working in the notebook. How can we make that work?

Re: Not able to read the file content completely using head

-werners- — Fri, 17 May 2024 07:45:29 GMT

That is a built-in Python function, so it should work in a Python notebook. You can also use pandas, by the way.
If you use a Scala notebook, you should use a Scala/Java library.
For SQL notebooks: use Python/Scala 🙂

Re: Not able to read the file content completely using head

saichandu_25 — Fri, 17 May 2024 08:00:57 GMT

If we use the code below, it throws an error saying file_path is not correct:

file_path = "/dbfs/path/to/your/file.csv"

with open(file_path, 'rb') as f:
    content = f.read()

Re: Not able to read the file content completely using head

-werners- — Fri, 17 May 2024 08:12:15 GMT

You can use Volumes instead of DBFS:
https://docs.databricks.com/en/connect/unity-catalog/volumes.html#what-path-is-used-for-accessing-files-in-a-volume
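
For illustration, a small sketch of how a volume file path is put together, assuming the /Volumes/<catalog>/<schema>/<volume>/<path> layout described on the linked page. The helper name volume_file_path and the catalog, schema, and volume names in the usage note are hypothetical.

```python
import posixpath


def volume_file_path(catalog: str, schema: str, volume: str, relative_path: str) -> str:
    """Build a /Volumes/... path that the built-in open function can use."""
    # Strip a leading slash so the relative part does not reset the join.
    return posixpath.join("/Volumes", catalog, schema, volume, relative_path.lstrip("/"))
```

With hypothetical names, volume_file_path("main", "default", "landing", "raw/file.csv") gives /Volumes/main/default/landing/raw/file.csv, which can then be passed to open(path, 'rb').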

Re: Not able to read the file content completely using head

saichandu_25 — Tue, 21 May 2024 17:58:39 GMT

Hi, how can we read 500 MB or 1 GB files using the with open method in a Databricks notebook?

Also, if we need to read files that are gigabytes in size, how many worker nodes are needed?

Re: Not able to read the file content completely using head

-werners- — Wed, 22 May 2024 10:01:17 GMT

For data of that size, using Spark might be a good idea (although pure Python would probably still work if the files are reasonable in size; 500 MB might still work).
The number of workers depends on whether you use Spark or pure Python. Pure Python code runs on the driver, so the number of workers is irrelevant.
Spark, however, creates a task per file, and each task uses a CPU core.
Here is a blog that gives you an idea how it works.
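
If the pure-Python route is used for large files, one option is to encode in chunks so the whole file never has to sit in driver memory at once. This is a sketch under assumptions, not a tested Databricks recipe: the function name, the chunk size, and writing the result to a second file are all choices made for illustration. The chunk size has to be a multiple of 3 bytes so that the per-chunk base64 pieces join into the same output as encoding the file in one go.

```python
import base64

# 3 MiB per read; any positive multiple of 3 works (assumed default).
CHUNK_SIZE = 3 * 1024 * 1024


def encode_file_base64_chunked(src_path: str, dst_path: str, chunk_size: int = CHUNK_SIZE) -> None:
    """Stream src_path through base64 into dst_path, one chunk at a time."""
    if chunk_size <= 0 or chunk_size % 3 != 0:
        raise ValueError("chunk_size must be a positive multiple of 3")
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # end of file
            # Only the final chunk can carry '=' padding, because every
            # earlier chunk has a length divisible by 3.
            dst.write(base64.b64encode(chunk))
```

Since this runs on the driver only, the number of workers does not affect it, in line with the point above.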

Re: Not able to read the file content completely using head

saichandu_25 — Thu, 23 May 2024 11:25:52 GMT

Actually, we want to read the files irrespective of their format and push them to GitHub. That is why we are going with the with open method, but the results are not correct after copying to GitHub. We need a solution for reading large files.

Re: Not able to read the file content completely using head

-werners- — Thu, 23 May 2024 14:05:15 GMT

I am curious what the use case is for wanting to load large files into GitHub, which is a code repo.
Different file formats require different parsing; you could build logic for that into your program.
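
One way such format-dependent logic could look is sketched below. The function name load_by_format, the choice of extensions, and the UTF-8 assumption for text formats are all illustrative; unknown formats fall back to raw bytes, which is the safe choice when the content only needs to be copied unchanged.

```python
import csv
import json
from pathlib import Path


def load_by_format(file_path: str):
    """Pick a parser from the file extension; fall back to raw bytes."""
    suffix = Path(file_path).suffix.lower()
    if suffix == ".csv":
        # newline="" lets the csv module handle line endings itself.
        with open(file_path, newline="", encoding="utf-8") as f:
            return list(csv.reader(f))
    if suffix == ".json":
        with open(file_path, encoding="utf-8") as f:
            return json.load(f)
    # Any other format: return the bytes untouched.
    with open(file_path, "rb") as f:
        return f.read()
```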