05-10-2024 02:57 AM
Hello,
After switching to "shared cluster" usage, a Python job is failing with the following error message:
Py4JJavaError: An error occurred while calling o877.load.
: org.apache.spark.SparkSecurityException: [INSUFFICIENT_PERMISSIONS] Insufficient privileges:
User does not have permission SELECT on any file.
According to the stack trace, the error occurs while reading messages from a Kafka topic (in the spark_.read call):
    288     else:
    289         raw_df = (
    290             self.spark_.read.format("kafka")
    291             .option(
    292                 "kafka.bootstrap.servers",
    293                 self.kafka_secrets.kafka_bootstrap_servers,
    294             )
    295             .option("subscribe", topic.topic)
    296             .option("groupIdPrefix", topic.consumer_group_prefix)
    297             .option("startingOffsets", "earliest")
    298             .option("failOnDataLoss", "false")
    299             .option("includeHeaders", "true")
    300             .options(**self.sasl_ssl_auth_options)
    301             .options(**spark_opts)
--> 302             .load()
    303         ).drop("timestampType")
The job runs fine when "streaming" is enabled, i.e. when we use spark_.readStream instead.
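For comparison, a minimal sketch of the streaming variant that works (placeholder values, not the original code, which passes the secrets and extra options the same way as above):

raw_stream_df = (
    spark.readStream.format("kafka")
    # placeholders - the real job takes these from kafka_secrets / sasl_ssl_auth_options
    .option("kafka.bootstrap.servers", "<bootstrap-servers>")
    .option("subscribe", "<topic-name>")
    .option("startingOffsets", "earliest")
    .option("failOnDataLoss", "false")
    .option("includeHeaders", "true")
    .load()
    .drop("timestampType")
)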
What exactly raises the "INSUFFICIENT_PERMISSIONS" error when using the spark_.read method, and how can we get rid of it?
Usually this error is thrown when someone tries to access data on DBFS or has table ACLs enabled, but neither is the case here.
Context:
Since the behaviour differs between spark_.read and spark_.readStream, my guess is that spark_.read internally tries to access/interact with the Hive metastore.
Any hint on how to eliminate this issue is highly appreciated 😄
05-15-2024 10:41 AM - edited 05-15-2024 10:47 AM
Hi, the reason for this issue could be the shared cluster; Unity Catalog works best with personal clusters or job clusters.
I would suggest trying a personal cluster.
Check out the article below; it might help:
https://community.databricks.com/t5/data-engineering/create-table-using-a-location/td-p/68725
05-15-2024 01:34 PM
Hello @Hkesharwani,
Thanks for replying.
Indeed, as I stated at the beginning of my post, the issue occurs only with shared cluster usage (on a single user cluster all is fine). Since I *have to* switch to a shared cluster (row-level security is currently only available there), it would be great if someone could provide insight into what is causing this issue on shared clusters.
06-07-2024 10:01 PM
Hey @GeKo,
did you find a solution?
07-08-2024 07:30 AM
Hi @sravs_227,
the issue was that the checkpoint directory (used while reading from Kafka) was set to a DBFS folder. We have now switched it to a UC volume as well.
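For anyone hitting the same thing, a minimal sketch of what this looks like (catalog/schema/volume/table names are placeholders):

# raw_stream_df: the streaming DataFrame read from Kafka as shown above
checkpoint_path = "/Volumes/my_catalog/my_schema/checkpoints/kafka_ingest"  # UC volume instead of dbfs:/

query = (
    raw_stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)  # set per streaming query, not at cluster level
    .toTable("my_catalog.my_schema.kafka_raw")
)

The checkpoint location is set per streaming query via the checkpointLocation option on writeStream, not on the cluster.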
Tuesday
Hi @GeKo
The checkpoint directory, is that set at the cluster level, or how do we set it? Can you please help me with this?