Wednesday
We are trying to use rasterio on a Databricks shared/standard cluster with DBR 17.1. Rasterio is installed directly on the cluster as a library.
Code:
import rasterio
rasterio.show_versions()
Output:
rasterio info:
rasterio: 1.4.3
GDAL: 3.9.3
PROJ: 9.4.1
GEOS: 3.11.1
PROJ DATA: /databricks/native/proj-data
GDAL DATA: /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.12/site-packages/rasterio/gdal_data
System:
python: 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]
executable: /local_disk0/.ephemeral_nfs/envs/pythonEnv-b2347f39-219b-43b3-a5db-676ce38e43ca/bin/python
machine: Linux-5.15.0-1092-azure-x86_64-with-glibc2.39
Python deps:
affine: 2.4.0
attrs: 24.3.0
certifi: 2025.01.31
click: 8.1.7
cligj: 0.7.2
cython: 3.0.12
numpy: 2.1.3
click-plugins: None
setuptools: 74.0.0
Test script:
import numpy as np
from rasterio.io import MemoryFile
from rasterio.transform import from_origin
meta = {
    "driver": "GTiff",
    "height": 1, "width": 1, "count": 1,
    "dtype": "uint8",
    "crs": "EPSG:2056",  # <-- forces an EPSG lookup in proj.db
    "transform": from_origin(0, 1, 1, 1),
}

with MemoryFile() as mem:
    with mem.open(**meta) as ds:
        ds.write(np.zeros((1, 1, 1), dtype="uint8"))

Error:
CRSError: The EPSG code is unknown. PROJ: internal_proj_create_from_database: Cannot find proj.db
So we have no access to /databricks/native/proj-data, where proj.db is stored.
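A quick check from the notebook illustrates this (a minimal sketch; the exact output depends on the cluster configuration):
import os

proj_path = "/databricks/native/proj-data"
print(os.path.exists(proj_path))      # the directory entry may or may not be visible
print(os.access(proj_path, os.R_OK))  # read permission for the current (low-privilege) user
try:
    print(os.listdir(proj_path)[:5])
except PermissionError as err:
    print("No read access:", err)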
I am aware that there are possible hacks and workarounds, such as copying proj.db to a UC Volume and pointing PROJ at it.
However, these are just workarounds for the issue. Dear Databricks, could you please change these access rights, or give me more insight into why the current setup is the right way?
Wednesday
Hi @der,
I guess this is related to a limitation of the standard/shared cluster access mode you're using.
https://docs.databricks.com/aws/en/compute/standard-limitations#network-and-file-system-limitations
You can try going with dedicated access mode, or if that's not an option for you, then set the path to proj.db to e.g. a UC Volume, which should be accessible from standard access mode.
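For example, a minimal sketch of the Volume approach (the Volume path here is just an example, and it assumes proj.db has already been uploaded there, e.g. copied from a dedicated cluster or taken from the PROJ data package):
import os

# Example/assumed Volume path -- replace with your own catalog/schema/volume
PROJ_DIR = "/Volumes/main/geo/proj_data"

# Point PROJ at the copied proj.db; do this before rasterio is first imported in the session
os.environ["PROJ_DATA"] = PROJ_DIR  # variable name used by PROJ >= 9.1
os.environ["PROJ_LIB"] = PROJ_DIR   # older name, still honored

import rasterio
from rasterio.crs import CRS

print(CRS.from_epsg(2056))  # should now resolve without reading /databricks/native/proj-data
If the variables are only set after PROJ has already been initialized in the session, a notebook or cluster restart may be needed for them to take effect.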
Wednesday
Yes, on dedicated access mode it works fine because we have access to /databricks/native/proj-data. From a cost perspective we want to stay on a standard/shared cluster.
Pushing proj.db to a UC Volume and changing the environment path is what I meant by hack/workaround 1.
I still do not get why "proj-data" is in a path where we have no "read" access from a standard cluster.
Wednesday
This is explained in the limitations section in my previous answer.
Wednesday
To make it easier:
"Standard compute runs commands as a low-privilege user forbidden from accessing sensitive parts of the filesystem.
POSIX-style paths (/) for DBFS are not supported."
Wednesday
Which limitations do you mean?
Standard compute runs commands as a low-privilege user forbidden from accessing sensitive parts of the filesystem.
Not sure if proj-data is sensitive data.
So for Databricks it would be simple to change the group to "spark-users" and adjust the permissions:
sudo chown -R root:spark-users native   # hand the native directory to the spark-users group
sudo chmod -R 750 native                # owner full access, group read/execute, others nothing
sudo chmod g+s native                   # new entries inherit the spark-users group
No workaround would be needed. They did the same for "licenses", "python3", etc.
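For comparison, something like this (run where the paths are readable, e.g. on a dedicated cluster) would show the current ownership and modes; the sibling paths are assumed from the directories mentioned above:
import subprocess

# Paths assumed for illustration: the PROJ data directory plus siblings
# ("licenses", "python3") that are already readable from standard access mode
paths = ["/databricks/native", "/databricks/native/proj-data",
         "/databricks/licenses", "/databricks/python3"]
result = subprocess.run(["ls", "-ld", *paths], capture_output=True, text=True)
print(result.stdout or result.stderr)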
Wednesday - last edited Wednesday
Exactly that part. In shared access mode they have specifically forbidden access to some paths. Of course you can complain about this, but they won't change this behavior just because your library doesn't work as expected, especially since other approaches, like using Volumes or another access mode, are available.
Wednesday
Hi @der
Can you try adding this to your test script?
import os
os.environ["PROJ_LIB"] = "/databricks/native/proj-data"  # tell PROJ where to find proj.db
Hopefully users have access to this path: /databricks/native/proj-data
Wednesday
Hi @Chiran-Gajula,
That is exactly the issue. On a standard cluster there is no access to /databricks/native/proj-data, where proj.db is stored.
So your proposed solution won't work.