Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Rasterio on shared/standard cluster has no access to proj.db

der
Contributor

We are trying to use rasterio on a Databricks shared/standard cluster with DBR 17.1. Rasterio is installed directly on the cluster as a library.

Code:

import rasterio
rasterio.show_versions()

Output: 

rasterio info:
rasterio: 1.4.3
GDAL: 3.9.3
PROJ: 9.4.1
GEOS: 3.11.1
PROJ DATA: /databricks/native/proj-data
GDAL DATA: /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.12/site-packages/rasterio/gdal_data

System:
python: 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]
executable: /local_disk0/.ephemeral_nfs/envs/pythonEnv-b2347f39-219b-43b3-a5db-676ce38e43ca/bin/python
machine: Linux-5.15.0-1092-azure-x86_64-with-glibc2.39

Python deps:
affine: 2.4.0
attrs: 24.3.0
certifi: 2025.01.31
click: 8.1.7
cligj: 0.7.2
cython: 3.0.12
numpy: 2.1.3
click-plugins: None
setuptools: 74.0.0 

Test script:

import numpy as np
from rasterio.io import MemoryFile
from rasterio.transform import from_origin

meta = {
    "driver": "GTiff",
    "height": 1, "width": 1, "count": 1,
    "dtype": "uint8",
    "crs": "EPSG:2056", # <-- forces an EPSG lookup in proj.db
    "transform": from_origin(0, 1, 1, 1),
}

with MemoryFile() as mem:
    with mem.open(**meta) as ds:
        ds.write(np.zeros((1, 1, 1), dtype="uint8"))

CRSError: The EPSG code is unknown. PROJ: internal_proj_create_from_database: Cannot find proj.db

So we have no access to /databricks/native/proj-data, where the proj.db is stored.
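
For what it's worth, the failure is independent of the file write; any EPSG lookup needs proj.db. A minimal reproduction:

from rasterio.crs import CRS

CRS.from_epsg(2056)  # raises the same CRSError when proj.db is unreachable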

I am aware that there are possible hacks and workarounds:

  1. copying proj.db somewhere users have access and then working with environment variables such as PROJ_DATA (see the sketch after this list)
  2. rewriting the test script so that no EPSG lookup is needed
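
A minimal sketch of workaround 1, assuming proj.db has already been copied to a UC Volume; the path /Volumes/main/geo/proj_data is hypothetical, adjust it to your catalog/schema:

import os

# PROJ 9 reads PROJ_DATA; it must be set before the first CRS lookup,
# ideally before importing rasterio at all.
os.environ["PROJ_DATA"] = "/Volumes/main/geo/proj_data"

import rasterio
from rasterio.crs import CRS

CRS.from_epsg(2056)  # now resolves against the copied proj.db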

However, this just works around the issue. Dear Databricks, could you please change these access rights, or give me more insight into why the current setup is the right way?

8 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @der,

I guess this is related to a limitation of the standard/shared cluster access mode you're using.

https://docs.databricks.com/aws/en/compute/standard-limitations#network-and-file-system-limitations

You can try to go with dedicated access mode, or if that's not an option for you, point the proj.db path to e.g. a UC Volume, which should be accessible from standard access mode.
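
A rough sketch of the copy step, to be run once from a dedicated cluster (where /databricks/native is readable); the Volume path /Volumes/main/geo/proj_data is hypothetical:

import shutil

# Copy the PROJ data files (including proj.db) into a UC Volume that
# standard clusters can read.
shutil.copytree(
    "/databricks/native/proj-data",
    "/Volumes/main/geo/proj_data",
    dirs_exist_ok=True,
)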

der
Contributor

Hi @szymon_dybczak 

Yes, on dedicated access mode it works fine because we have access to /databricks/native/proj-data. From a cost perspective, we want to stay on a standard/shared cluster.

Pushing proj.db to a UC Volume and changing the environment path is what I meant by hack/workaround 1.

I still do not get why "proj-data" is in a path where we have no read access from a standard cluster.

szymon_dybczak
Esteemed Contributor III

This is explained in the limitations page I linked in my previous answer.

To make it easier: 

"Standard compute runs commands as a low-privilege user forbidden from accessing sensitive parts of the filesystem.

POSIX-style paths (/) for DBFS are not supported."

der
Contributor

Which limitations do you mean?

Standard compute runs commands as a low-privilege user forbidden from accessing sensitive parts of the filesystem.

Not sure if proj-data is sensitive data.

[Screenshot der_0-1761147625449.png: permissions of /databricks/native]

So it would be simple for Databricks to change the group to "spark-users" and adjust the permissions:

sudo chown -R root:spark-users native   # hand group ownership to spark-users
sudo chmod -R 750 native                # owner rwx, group read/execute, others none
sudo chmod g+s native                   # setgid: new files inherit the group

No workaround would be needed. They did the same for "licenses", "python3", and so on.

szymon_dybczak
Esteemed Contributor III

Exactly that part. In shared access mode they specifically forbid access to some paths. Of course you can complain about this, but they won't change this behavior just because your library doesn't work as expected - especially since other approaches, like using Volumes or a different access mode, are available.

Chiran-Gajula
New Contributor

Hi @der 

Can you try adding this to your test script?

import os

os.environ["PROJ_LIB"]="/databricks/native/proj-data"


Hopefully users have access to the path /databricks/native/proj-data.


G.Chiranjeevi

der
Contributor

Hi @Chiran-Gajula

Exactly, this is the issue. On a standard cluster there is no access to /databricks/native/proj-data, where the proj.db is stored.

So your proposed solution won't work.
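
A variant that might work even on a standard cluster, assuming the installed wheel bundles its own PROJ data next to the gdal_data directory shown in show_versions() above (binary wheels from PyPI usually do). Since Databricks pre-sets the PROJ environment to /databricks/native/proj-data, the override has to happen before the first CRS lookup:

import importlib.util
import os

# Locate the rasterio package without importing it yet.
pkg_dir = os.path.dirname(importlib.util.find_spec("rasterio").origin)
bundled = os.path.join(pkg_dir, "proj_data")

if os.path.isdir(bundled):  # only present in binary wheels
    os.environ["PROJ_DATA"] = bundled

import rasterio  # import after the environment override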