Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Reading a recently created CSV file with Polars

victorNilsson
New Contributor II

More and more Python packages are transitioning to Polars instead of e.g. pandas. This causes a problem in Databricks when reading a CSV file with pl.read_csv("filename.csv") if the file was created in the same notebook cell that tries to read it. Every time I try to read a newly created CSV file using Polars I get an OSError, but it works totally fine if I read it in the next cell.

[screenshot of the OSError attached]
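
The failing pattern is roughly the following (a sketch; the filename is illustrative):

import polars as pl

csv_file = "test.csv"

# Write the file...
with open(csv_file, "w", encoding="utf-8") as f:
    f.write("a,b\nhello,world\n")

# ...and read it back immediately, in the same cell
df = pl.read_csv(csv_file)  # raises OSError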

A potential solution could be to use e.g.

df = pl.read_csv(csv_file, use_pyarrow=True)

But one cannot always change the read_csv parameters inside third-party packages.

Is there a way to load a newly created CSV file using Polars with pl.read_csv("file.csv") without changing the method call itself?
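
One could imagine monkey-patching pl.read_csv as a stopgap (an untested sketch below; a cleaner fix would be preferable):

import polars as pl

# Untested sketch: make use_pyarrow=True the default for every caller,
# including third-party packages that call pl.read_csv internally.
# Note: packages that did "from polars import read_csv" before this
# runs would bypass the patch.
_orig_read_csv = pl.read_csv

def _read_csv_pyarrow(source, *args, **kwargs):
    kwargs.setdefault("use_pyarrow", True)  # only set if caller didn't
    return _orig_read_csv(source, *args, **kwargs)

pl.read_csv = _read_csv_pyarrow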

3 REPLIES

Pilsner
Contributor III

Hello @victorNilsson 

I tried to replicate this issue on my end, but unfortunately was unsuccessful, as it worked the first time for me. I have, however, still searched for a solution.

I believe the issue you are getting could be linked to the sequencing of events. Databricks may not have finished flushing your CSV through to the underlying storage before Polars attempts to load it. Manually running the commands in two separate cells enforces the sequence, hence you don't get the same issue. To get around this, you can try to force the sync to finish before proceeding by using the following code:


import polars as pl
import os

csv_file = "test.csv"

with open(csv_file, "w", encoding="utf-8") as f:
    f.writelines(["a,b\n", "hello,world\n"])
    f.flush()              # push Python's buffered data to the OS
    os.fsync(f.fileno())   # ask the OS to commit it to disk

df = pl.read_csv(csv_file)

display(df)

From my understanding, f.flush() forces the data out of Python's buffer and into the OS. After that, os.fsync(f.fileno()) ensures the data is actually written to disk rather than just kept in RAM.

An alternative solution could be to implement a delay by simply using "import time" and "time.sleep(1)", however this is not as robust.
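
For completeness, that fallback would look something like this (a sketch; the one-second delay is arbitrary):

import time
import polars as pl

csv_file = "test.csv"

with open(csv_file, "w", encoding="utf-8") as f:
    f.writelines(["a,b\n", "hello,world\n"])

time.sleep(1)  # crude: give the filesystem a moment to settle

df = pl.read_csv(csv_file)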

As I mentioned, I was not able to replicate your issue, so unfortunately I cannot test their effectiveness; however, both have run successfully in my environment (using a serverless cluster). I hope this helps - please let me know how you get on.

Regards - Pilsner

victorNilsson
New Contributor II

Hi, thanks for taking the time to check this.

I also tried serverless, and there I was unable to recreate the problem.

I've narrowed the problem down to a FUSE mount problem. Polars seems to use mmap in shared mode, which seems to be unreliable at best on the mount. A minimal way to test this is:

import mmap
csv_file = "test.csv"

with open(csv_file, "w", encoding="utf8") as f:
    f.writelines(["a,c\n", "hello,world\n"])

with open(csv_file, 'r+') as f:
    # shared mapping -> same OSError as pl.read_csv
    mm = mmap.mmap(f.fileno(), 0, mmap.MAP_SHARED, prot=mmap.PROT_READ)


This gives the exact same problem. Running in private mode works just fine, though:

import mmap
csv_file = "test.csv"

with open(csv_file, "w", encoding="utf8") as f:
    f.writelines(["a,c\n", "hello,world\n"])

with open(csv_file, 'r+') as f:
    # private (copy-on-write) mapping -> no error
    mm = mmap.mmap(f.fileno(), 0, mmap.MAP_PRIVATE, prot=mmap.PROT_READ)


I tried your two suggestions, but it still didn't work, unfortunately :/

import mmap
import os
import time
csv_file = "test.csv"

with open(csv_file, "w", encoding="utf8") as f:
    f.writelines(["a,c\n", "hello,world\n"])
    f.flush()
    os.fsync(f.fileno())

time.sleep(10)

with open(csv_file, 'r+') as f:
    # still fails on the mounted path despite flush/fsync and the sleep
    mm = mmap.mmap(f.fileno(), 0, mmap.MAP_SHARED, prot=mmap.PROT_READ)

Pilsner
Contributor III

Hello @victorNilsson 

Thank you for letting me know how to replicate the issue; I was able to get the same error this time.

I've given the problem another go and think I have been able to fix it by specifying the output path as "/tmp/test.csv". I'm hoping that writing to "/tmp" will still work as you initially intended: the file should not persist across clusters or sessions.

I tested this approach on both your latest and original code and it seemed to run successfully for both.

[screenshot of the successful run attached]
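
In code, the change amounts to something like this (a sketch based on the original example; only the path changes):

import polars as pl

# /tmp lives on the driver's local disk rather than the FUSE-mounted
# workspace/DBFS storage, so shared mmap behaves as expected.
csv_file = "/tmp/test.csv"

with open(csv_file, "w", encoding="utf-8") as f:
    f.writelines(["a,b\n", "hello,world\n"])

df = pl.read_csv(csv_file)  # succeeds in the same cell

display(df)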

Based on my research, I believe this is why you were getting your issue:

By not defining a path, the notebook was writing to the default DBFS location. DBFS is a distributed filesystem, and writing to it means the file is stored in cloud-backed storage instead of on local disk. I believe my initial solution failed because os.fsync requires lower-level disk features (that term may mean more to you than me, but I understand it to mean that direct interaction with the underlying storage is required). This also explains why mmap was causing the errors, as it requires the same features.

The reason it was working across two cells is still likely related to sequencing: once a cell has finished, the file is fully written to DBFS, allowing it to be re-opened in the following cell.

I hope this helps with your issue; please let me know how you get on.

Regards - Pilsner


