Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Not able to read the file content completely using head

saichandu_25
New Contributor III

Hi,

We want to read a file's content and encode it into base64. For that we have used the code below:

 

import base64

file_path = "/path/to/your/file.csv"

# read the file content as a string with dbutils.fs.head
file_content = dbutils.fs.head(file_path, 512000000)

# base64-encode the content
encode_content = base64.b64encode(file_content.encode()).decode()
print(encode_content)

The file has 1700 records, but using head we are getting only 232 records.

With the above code, part of the file content is skipped, so we are not able to read the full data and encode it. Could you please provide a solution for this?

9 REPLIES

-werners-
Esteemed Contributor III

The head function only returns part of the file; that is what it does. The maxBytes you can pass has an upper limit of 64K (head(file: java.lang.String, maxBytes: int = 65536): java.lang.String).
You can read the file using Spark (spark.read.csv), plain Python (pandas, or with open(<file>)), or Scala (scala.io.Source).
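A plain-Python version of what you are trying to do could look roughly like this (a sketch only; the path is made up, and it assumes the file is reachable through the /dbfs/ local mount rather than a dbfs:/ URI):

import base64

# Hypothetical path; plain Python sees DBFS through the /dbfs/ local mount
file_path = "/dbfs/path/to/your/file.csv"

# Read the whole file as raw bytes -- no truncation like dbutils.fs.head
with open(file_path, "rb") as f:
    raw_bytes = f.read()

# Base64-encode the full content
encoded = base64.b64encode(raw_bytes).decode("utf-8")
print(len(raw_bytes), len(encoded))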

Thanks for the update. Actually, we want to read multiple file formats and read the file content irrespective of the format; that is why we used head.

with open is not working in the notebook. How can we make that work?

-werners-
Esteemed Contributor III

that is a built-in Python function, so it should work in a Python notebook. You can also use pandas, btw.
If you use a Scala notebook you should use a Scala/Java library.
For SQL notebooks: use Python/Scala 🙂
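For tabular files, a pandas sketch would be along these lines (the path is hypothetical, and pandas only helps for formats it understands, e.g. CSV):

import pandas as pd

# Read the CSV through the /dbfs/ local mount; all records come back,
# not just the first chunk of bytes
df = pd.read_csv("/dbfs/path/to/your/file.csv")
print(len(df))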

If we use the code below, it throws an error saying the file_path is not correct:

file_path = "/dbfs/path/to/your/file.csv"

with open(file_path, 'rb') as f:
    content = f.read()

saichandu_25
New Contributor III

Hi, how can we read 500MB or 1GB files using the with open method in a Databricks notebook?

Also, if we need to read GB-sized files, how many worker nodes are needed?

-werners-
Esteemed Contributor III

For data that size, using Spark might be a good idea (although pure Python would probably still work if the files are of reasonable size; 500MB might still be fine).
The number of workers depends on whether you use Spark or pure Python. Python code runs on the driver, so the number of workers is irrelevant.
Spark, however, creates a task per file, and a task uses a CPU.
Here is a blog that gives you an idea of how it works.
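If you do go with Spark and want the raw bytes irrespective of format, one option is the binaryFile reader (a sketch; the directory path is an assumption, and encoding on the executors avoids pulling gigabytes onto the driver):

from pyspark.sql.functions import base64, col

# binaryFile returns one row per file: path, modificationTime, length, content (bytes)
df = spark.read.format("binaryFile").load("dbfs:/path/to/your/files/")

# Base64-encode the bytes in a distributed way rather than on the driver
encoded_df = df.select("path", base64(col("content")).alias("content_b64"))
encoded_df.show(truncate=20)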

Actually, we want to read the files irrespective of their format and push them to GitHub; that is why we are going with the with open method. But when we use with open, the files do not come out properly after copying to GitHub. We need a solution for reading large files.

 

-werners-
Esteemed Contributor III

I am curious what the use case is for wanting to load large files into GitHub, which is a code repo.
Depending on the file format, different parsing is necessary; you could foresee logic for that in your program.
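A rough way to foresee that logic in plain Python is to branch on the file extension; everything below (the helper names, the extensions, the raw-bytes fallback) is just an illustration:

import base64
import os

def read_and_encode(local_path):
    # Works for any format: read raw bytes and return base64 text
    with open(local_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def parse_file(local_path):
    # Format-aware parsing only where you actually need it
    ext = os.path.splitext(local_path)[1].lower()
    if ext == ".csv":
        import pandas as pd
        return pd.read_csv(local_path)
    if ext == ".json":
        import json
        with open(local_path) as f:
            return json.load(f)
    # Unknown formats: fall back to raw bytes
    with open(local_path, "rb") as f:
        return f.read()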
