Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Not able to read the file content completely using head

saichandu_25
New Contributor III

Hi,

We want to read a file's content and encode it into base64. For that we have used the code below:

 

import base64

file_path = "/path/to/your/file.csv"

# head() returns at most the requested number of bytes of the file as a string
file_content = dbutils.fs.head(file_path, 512000000)

encode_content = base64.b64encode(file_content.encode()).decode()

print(encode_content)

The file has 1700 records, but using head we are getting only 232 of them.

With the above code, part of the file content is skipped, so we are not able to read the full data and encode it. Could you please provide a solution for this?

9 REPLIES

-werners-
Esteemed Contributor III

The head function only returns part of the file; that is what it does. The maxBytes you can pass has an upper limit of 64K (head(file: java.lang.String, maxBytes: int = 65536): java.lang.String).
You can read the file using Spark (spark.read.csv), plain Python (pandas, or with open(<file>)), or Scala (using scala.io.Source).
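For example, a minimal sketch of both options (the paths are the placeholders from above; plain Python sees DBFS files under the local /dbfs mount):

# Option 1: Spark reader (distributed, parses the CSV for you)
df = spark.read.csv("dbfs:/path/to/your/file.csv", header=True)

# Option 2: plain Python on the driver, via the local /dbfs mount
with open("/dbfs/path/to/your/file.csv", "rb") as f:
    raw_bytes = f.read()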

Thanks for the update. Actually, we want to read multiple file formats, and we want to read the file content irrespective of the format; that is why we used head.

with open is not working in the notebook. How can we make that work?

-werners-
Esteemed Contributor III

That is a built-in Python function, so it should work in a Python notebook. You can also use pandas, by the way.
If you use a Scala notebook you should use a Scala/Java library.
For SQL notebooks: use Python/Scala 🙂
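A quick pandas sketch under the same assumption (the DBFS file is exposed through the local /dbfs mount):

import pandas as pd

# pandas reads through the driver's local filesystem, so the DBFS path needs the /dbfs prefix
pdf = pd.read_csv("/dbfs/path/to/your/file.csv")
print(len(pdf))  # row count; should show all 1700 records rather than the truncated 232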

If we use the code below, it throws an error saying the file_path is not correct:

file_path = "/dbfs/path/to/your/file.csv"

# open() uses the driver's local filesystem, so DBFS paths need the /dbfs prefix
with open(file_path, 'rb') as f:
    content = f.read()
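One way to narrow down the path error is to check how the same file is addressed by each API (again using the placeholder path; dbutils and Spark take DBFS URIs, while open() goes through the local mount):

import os

# dbutils/Spark address DBFS directly, without the /dbfs prefix
display(dbutils.fs.ls("dbfs:/path/to/your/"))

# the built-in open() uses the driver's local filesystem, where DBFS is mounted at /dbfs
print(os.path.exists("/dbfs/path/to/your/file.csv"))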


Hi, how can we read 500 MB or 1 GB files using the with open method in a Databricks notebook?

Also, if we need to read GB-sized files, how many worker nodes are needed?

-werners-
Esteemed Contributor III

For data that size, using Spark might be a good idea (although pure Python would probably still work if the files are of reasonable size; 500 MB might still work).
The number of workers depends on whether you will be using Spark or pure Python. Python code runs on the driver, so the number of workers is irrelevant.
Spark, however, creates a task per file, and a task uses a CPU.
Here is a blog that gives you an idea of how it works.
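If the files should be read as raw bytes regardless of format, one Spark-side option (a sketch, not code from this thread) is the binaryFile reader:

import base64

# binaryFile loads any file as raw bytes (columns: path, modificationTime, length, content)
df = spark.read.format("binaryFile").load("dbfs:/path/to/your/file.csv")

row = df.select("content").first()
encoded = base64.b64encode(row["content"]).decode("utf-8")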

Actually, we want to read the files irrespective of their format and push them to GitHub. That is why we are going with the with open method, but when we use it, the results are not correct after copying to GitHub. We need a solution to read large files.

 

-werners-
Esteemed Contributor III

I am curious what the use case is for wanting to load large files into GitHub, which is a code repo.
Depending on the file format, different parsing is necessary. You could foresee logic for that in your program.
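If only the raw bytes are needed for the GitHub push, a small format-agnostic helper along these lines could be enough (the function name and the extension hook are illustrative, not from the thread):

import base64
import os

def encode_file(path: str) -> str:
    """Read any file as raw bytes and return its base64-encoded string."""
    ext = os.path.splitext(path)[1].lower()  # branch here if per-format parsing is ever needed
    with open(path, "rb") as f:
        raw = f.read()
    return base64.b64encode(raw).decode("utf-8")

# e.g. encoded = encode_file("/dbfs/path/to/your/file.csv")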
