05-17-2024 12:15 AM
Hi,
We want to read a file's content and encode it as base64. For that we used the code below:

import base64

file_path = "/path/to/your/file.csv"
file_content = dbutils.fs.head(file_path, 512000000)
encoded_content = base64.b64encode(file_content.encode()).decode()
print(encoded_content)

The file has 1700 records, but with head we only get 232 of them. The content is cut off partway through the file, so we cannot read the full data and encode it. Could you please provide a solution for this?
05-17-2024 12:37 AM
The head function only returns the beginning of a file; that is what it is for. The maxBytes you can pass has an upper limit of 64K (head(file: java.lang.String, maxBytes: int = 65536): java.lang.String).
You can read the full file using Spark (spark.read.csv), plain Python (pandas, or with open(<file>)), or Scala (scala.io.Source).
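For example, a minimal sketch of the plain-Python route, assuming the file is reachable through the /dbfs local path and fits in driver memory:

import base64

file_path = "/dbfs/path/to/your/file.csv"  # hypothetical path under the DBFS mount
with open(file_path, "rb") as f:
    raw_bytes = f.read()  # reads the whole file, not just the first 64K
encoded_content = base64.b64encode(raw_bytes).decode()
print(len(encoded_content))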
05-17-2024 12:43 AM
Thanks for the update. Actually, we want to read multiple file formats and read the file content irrespective of the format; that is why we used head.
with open is not working in the notebook. How can we make that work?
05-17-2024 12:45 AM
That is a built-in Python function, so it should work in a Python notebook. You can also use pandas, by the way.
If you use a Scala notebook you should use a Scala/Java library.
For SQL notebooks: use Python/Scala 🙂
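As a quick illustration (a sketch, assuming a CSV file and the same hypothetical /dbfs path as above), pandas can read it through the local file API:

import pandas as pd

# /dbfs/... exposes DBFS paths to regular Python file APIs (assumed path)
df = pd.read_csv("/dbfs/path/to/your/file.csv")
print(df.shape)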
05-17-2024 01:00 AM
If we use the code below, it throws an error saying the file_path is not correct:

file_path = "/dbfs/path/to/your/file.csv"
with open(file_path, 'rb') as f:
    content = f.read()
05-17-2024 01:12 AM
You can use Volumes instead of DBFS:
https://docs.databricks.com/en/connect/unity-catalog/volumes.html#what-path-is-used-for-accessing-fi...
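For example (a sketch with placeholder catalog/schema/volume names), a Unity Catalog volume path can be opened with plain Python:

import base64

# /Volumes/<catalog>/<schema>/<volume>/... ; the names below are placeholders
volume_path = "/Volumes/my_catalog/my_schema/my_volume/file.csv"
with open(volume_path, "rb") as f:
    content = f.read()
encoded_content = base64.b64encode(content).decode()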
05-21-2024 10:54 AM - edited 05-21-2024 10:58 AM
Hi, how can we read 500MB or 1GB files using the with open method in a Databricks notebook?
Also, if we need to read GB-sized files, how many worker nodes are needed?
05-22-2024 03:01 AM
For data that size, using Spark might be a good idea (although pure Python would probably still work if the files are reasonable in size; 500MB might still work).
The number of workers depends on whether you will be using Spark or pure Python. Python code runs on the driver, so the number of workers is irrelevant.
Spark, however, creates a task per file, and a task uses a CPU.
Here is a blog that gives you an idea how it works.
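One way to keep driver memory in check with pure Python (a sketch, using a placeholder volume path): read the file in chunks whose size is a multiple of 3 bytes, so the per-chunk base64 strings concatenate into exactly the same result as encoding the whole file at once.

import base64

file_path = "/Volumes/my_catalog/my_schema/my_volume/big_file.bin"  # placeholder path
chunk_size = 3 * 1024 * 1024  # multiple of 3, so base64 chunks join without padding issues

encoded_parts = []
with open(file_path, "rb") as f:
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        encoded_parts.append(base64.b64encode(chunk).decode())
encoded_content = "".join(encoded_parts)

This still holds the final encoded string in memory, but it avoids keeping the raw bytes and the encoded copy of the entire file around at the same time.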
05-23-2024 04:25 AM
Actually, we want to read files irrespective of their format and push them to GitHub; that is why we went with the 'with open' method. But when we use the with open method, the results are not correct after copying to GitHub. We need a solution for reading large files.
05-23-2024 07:05 AM
I am curious what the use case is for wanting to load large files into GitHub, which is a code repo.
Depending on the file format, different parsing is necessary; you could foresee logic for that in your program.
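A rough sketch of that kind of dispatch (the paths and formats here are just examples): read every file as raw bytes for the upload, and only parse the formats you actually recognize.

import base64
import json
import os

def read_and_encode(path):
    # works for any format: raw bytes in, base64 string out
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def parse_if_known(path):
    # optional format-specific parsing for the formats we recognize
    ext = os.path.splitext(path)[1].lower()
    if ext == ".json":
        with open(path, "r") as f:
            return json.load(f)
    if ext == ".csv":
        with open(path, "r") as f:
            return f.read().splitlines()
    return None  # unknown format: treat it as opaque bytes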