
Save file to /tmp

thiagoawstest — Wed, 10 Jul 2024 14:01:11 GMT

Hello, I have Python code that collects data as JSON and sends it to an S3 bucket, and everything works fine. But when there is a lot of data, it runs out of memory.

So I want to save the data locally first, for example in /tmp or dbfs:/tmp, and then send it to S3. But when saving, I get an error saying that the directory or file does not exist, as if the file is generated but cannot be found.

If I mount UC Volumes, then it works.

Are there any restrictions? I'm mounting everything via Unity Catalog, not via DBFS.

Thanks.
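
For reference, a minimal sketch of the "save locally, then send to S3" approach described above. It assumes the records arrive as an iterable of dicts, that a JSON Lines file (one record per line) is an acceptable format, and that boto3 is installed with credentials configured. The function names, bucket, and key are hypothetical, not taken from the original code.

```python
import json
import os
import tempfile


def write_records_to_tmp(records, directory="/tmp"):
    """Stream records to a JSON Lines file on local disk, one record at a time.

    Only one record is held in memory at once, so the full payload is
    never built up as a single string. (Hypothetical helper.)
    """
    fd, path = tempfile.mkstemp(suffix=".jsonl", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record))
            fh.write("\n")
    return path


def upload_and_remove(path, bucket, key):
    """Upload the local file to S3, then delete the local copy.

    Bucket and key are placeholders supplied by the caller.
    """
    import boto3  # imported lazily so the writer works without boto3

    boto3.client("s3").upload_file(path, bucket, key)
    os.remove(path)
```

The file is written with plain Python file I/O, so the path refers to the local file system of the machine running the code, not to dbfs:/tmp.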

Re: Save file to /tmp

RishabhTiwari07 — Thu, 18 Jul 2024 16:45:21 GMT

Hi @thiagoawstest ,

Thank you for reaching out to our community! We're here to help you.

To ensure we provide you with the best support, could you please take a moment to review the response and choose the one that best answers your question? Your feedback not only helps us assist you better but also benefits other community members who may have similar questions in the future.

If you found the answer helpful, consider giving it a kudo. If the response fully addresses your question, please mark it as the accepted solution. This will help us close the thread and ensure your question is resolved.

We appreciate your participation and are here to assist you further if you need it!

Thanks,

Rishabh

Re: Save file to /tmp

JimBiard — Fri, 16 May 2025 02:10:28 GMT

I am experiencing the same problem. I create a file in /tmp and can verify that it exists. But when I try to open the file using PySpark, the file is not found. I noticed that the path I used to create the file is /tmp/foobar.parquet, while the path reported as not found is dbfs:/tmp/foobar.parquet.

Re: Save file to /tmp

JimBiard — Fri, 16 May 2025 15:51:41 GMT

I found what my problem was. I used pandas to save my Parquet file to /tmp, which stored it in the /tmp folder of the compute node's local file system. When I passed the same path to PySpark to load the file, it prepended 'dbfs:' to the path. The file wasn't in dbfs:/tmp, so the call failed. I prepended 'file:' to the path that I passed to PySpark and the call succeeded.
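
A minimal sketch of that fix, in case it helps others. The helper names are hypothetical; it assumes a pandas DataFrame `pdf` and an active SparkSession `spark` are passed in, as they would be in a notebook.

```python
def as_local_uri(path):
    """Prefix an absolute local path with 'file:' so that Spark reads it
    from the local file system instead of resolving it as dbfs:<path>.
    (Hypothetical helper.)
    """
    if not path.startswith("/"):
        raise ValueError("expected an absolute local path, got: " + path)
    return "file:" + path


def pandas_to_spark_via_tmp(pdf, spark, path="/tmp/foobar.parquet"):
    """Write with pandas to local disk, then load the same file with Spark.

    `pdf` is a pandas DataFrame and `spark` a SparkSession, both assumed
    to be supplied by the caller.
    """
    pdf.to_parquet(path)  # pandas writes to the local file system
    return spark.read.parquet(as_local_uri(path))  # reads file:/tmp/foobar.parquet
```

Note that the file exists only on the machine where pandas wrote it, so on a multi-node cluster the executors may not be able to see it.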