"with open" not working with Shared Access Cluster on mounted location

mathijs-fish
New Contributor III

Hi All,

For an application that we are building, we need an encoding detector/UTF-8 enforcer. For this, we used the Python library chardet in combination with "with open". We open a file from a mounted ADLS location (we use a legacy Hive metastore).

When we were using No Isolation Shared clusters, this worked fine, but for security reasons we had to change to Shared Access clusters. Now, however, the encoding detector no longer works.

This is how we detected encoding before:

[screenshot: mathijsfish_1-1701785425743.png, the original encoding-detection code]
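(The screenshot is not reproduced in this archive; a minimal sketch of the pattern described, using a hypothetical mount path, would be:)

import chardet

# Hypothetical mounted path, for illustration only
file_path = "/dbfs/mnt/art/inbound/some_file.csv"

# Read a sample of raw bytes and let chardet guess the encoding
with open(file_path, "rb") as f:
    rawdata = f.read(500000)
print(chardet.detect(rawdata)["encoding"])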

Error using shared access cluster:

[screenshot: mathijsfish_2-1701785466668.png, FileNotFoundError when using a shared access cluster]

After some investigation, we concluded that "with open", as well as the os and glob modules, do not work properly on mounted locations with a shared access cluster. Any idea how we can fix this?

For reference: we are required to use this mounted location and a shared access cluster.


5 REPLIES

Ayushi_Suthar
Databricks Employee

Hi @mathijs-fish, I completely understand your hesitation and appreciate your approach to seeking guidance!

I see you are trying to access external files from a DBFS mount location. As the snapshots you shared show, the error below occurs when accessing the mounted file with "with open" because you are using a shared access mode cluster:

FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/mnt/art/inbound.Ltest/EUT_Alignment_20230630_20230712130221.csv'

This is a known limitation of shared access mode clusters, where the /dbfs path is not accessible. You can try using a single-user cluster instead, which supports Unity Catalog and can access /dbfs.

Please refer:
https://docs.databricks.com/clusters/configure.html#shared-access-mode-limitations
https://docs.databricks.com/en/dbfs/unity-catalog.html#how-does-dbfs-work-in-shared-access-mode
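(To illustrate the limitation: on a shared access cluster, dbutils and Spark still resolve the mount, while POSIX-style access through the /dbfs FUSE path fails. The paths below are hypothetical:)

# Works: dbutils and Spark read through the cluster's governed storage access
display(dbutils.fs.ls("/mnt/art"))
df = spark.read.format("binaryFile").load("/mnt/art/inbound/some_file.csv")

# Fails on shared access mode: the /dbfs FUSE path is not exposed
with open("/dbfs/mnt/art/inbound/some_file.csv", "rb") as f:  # FileNotFoundError
    data = f.read()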

There is also a preview feature, 'Improved Shared Clusters', that addresses some of the limitations of shared clusters.

Leave a like if this helps; follow-ups are appreciated.

Kudos,

Ayushi

mathijs-fish
New Contributor III (Accepted Solution)

@Ayushi_Suthar Thanks! However, this does not solve the issue, because we have to use shared clusters. In the meantime we found a way of detecting the encoding on shared clusters:

import chardet

# Read the first 500 KB of the file as raw bytes via Spark's binaryFile reader,
# which works on shared access clusters, then let chardet guess the encoding
rawdata = (
    spark.read.format("binaryFile")
    .load(file_path)
    .selectExpr("SUBSTR(content, 0, 500000) AS content")
    .collect()[0]
    .content
)
encoding = chardet.detect(rawdata)["encoding"]
print(encoding)
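(This avoids the restricted /dbfs FUSE mount entirely: the bytes are read through Spark, and the SUBSTR keeps the sample collected to the driver down to the first 500 KB.)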

We are able to read the file this way, but is there also a solution for writing back to the storage path?

Ayushi_Suthar
Databricks Employee

Hi @mathijs-fish, thank you for sharing the solution. It will help us update our records and documentation, enabling us to assist other customers more effectively in similar cases.

nagND
New Contributor II

Hi @mathijs-fish @Ayushi_Suthar, I am having the same issue with a shared cluster. I can see the list of PDF files on the mount using dbutils.fs.ls(mount_point), but when I try to read the PDF files using PyPDF, I get: FileNotFoundError: [Errno 2] No such file or directory

Can I read the files after enabling certain settings on the shared cluster? I see that @srajawat can read the PDF.

Looking forward to your reply, thanks
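(The binaryFile approach from the accepted solution above can be adapted to PDFs; a minimal sketch, with a hypothetical mount path, assuming the pypdf package is installed on the cluster:)

import io
from pypdf import PdfReader

# Hypothetical mounted path, for illustration only
pdf_path = "/mnt/art/inbound/some_report.pdf"

# Pull the raw bytes through Spark instead of the restricted /dbfs FUSE mount
rawdata = (
    spark.read.format("binaryFile")
    .load(pdf_path)
    .collect()[0]
    .content
)

# PyPDF can read from an in-memory, file-like buffer
reader = PdfReader(io.BytesIO(rawdata))
print(len(reader.pages))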
