Data Governance
Unity Catalog - spark.* functions throwing Py4JSecurityException - org.apache.spark.sql.internal.CatalogImpl.currentCatalog() is not whitelisted on class class org.apache.spark.sql.internal.CatalogImpl

jakubk
Contributor

I'm looking to migrate onto Unity Catalog, but a number of my data ingestion notebooks throw SecurityException/whitelist errors for numerous spark.* functions.

Is there some configuration setting I need to enable to whitelist the spark.* methods/functions?

I know it's because I'm using 'shared' access mode. I've always run 'no isolation shared' clusters before with external tables when using the Hive metastore.

I use externally managed tables, and I use spark.catalog to check whether a table exists before I create it. This is failing with the whitelist error. I can refactor that check to use the information_schema columns, I guess?

But any tips on how to refactor this?
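One way to do that refactor, sketched below under assumptions: query the catalog's information_schema with plain SQL instead of the (non-whitelisted) spark.catalog API. The catalog/schema/table names in the usage comment are hypothetical.

```python
def table_exists(spark, catalog: str, schema: str, table: str) -> bool:
    """Return True if the table is listed in the catalog's information_schema.

    Uses only spark.sql(), which works on Unity Catalog shared clusters,
    rather than the py4j-blocked spark.catalog.tableExists().
    """
    result = spark.sql(
        f"""
        SELECT 1
        FROM {catalog}.information_schema.tables
        WHERE table_schema = '{schema}' AND table_name = '{table}'
        LIMIT 1
        """
    )
    return result.count() > 0

# Hypothetical usage:
# if not table_exists(spark, "main", "bronze", "raw_events"):
#     spark.sql("CREATE TABLE main.bronze.raw_events (...)")
```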

I have multiple TSVs which have free-text comments at the top of the file. I need to skip n lines and process the rest:

    row_rdd = (
        spark.sparkContext
        .textFile(sourceFilePath)
        .zipWithIndex()                              # pair each line with its index
        .filter(lambda row: row[1] >= n_skip_rows)   # drop the leading comment lines
        .map(lambda row: row[0])
    )
    df = spark.read.csv(row_rdd, sep='\t', header="true", inferSchema="true")
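On Unity Catalog shared clusters the RDD API (spark.sparkContext) is one of the blocked surfaces, so the zipWithIndex() approach above fails. A minimal workaround sketch, assuming the files fit on the driver: strip the free-text header with pandas (which supports skiprows natively) and hand the cleaned rows to Spark. The function name and parameters are my own, not an official API.

```python
import pandas as pd


def read_tsv_skipping_header(spark, source_file_path: str, n_skip_rows: int):
    """Read a TSV, skipping n free-text comment lines at the top of the file.

    Avoids the RDD API entirely: pandas does the line skipping on the driver,
    then the result is converted to a Spark DataFrame.
    """
    pdf = pd.read_csv(source_file_path, sep="\t", skiprows=n_skip_rows)
    return spark.createDataFrame(pdf)
```

This trades distributed reading for shared-cluster compatibility, so it only suits files small enough to load on the driver; for large files a two-pass approach (read as text, then re-parse) would be needed instead.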

I also need to process VCFs using the glow library - this doesn't work either.

Are there any docs on what single user access mode actually is? Is it like it's running using someone's credentials as a service account? Can other users connect to it using ODBC/JDBC and an access token? Or is it personal compute which only allows one connection?

7 Replies

karthik_p
Esteemed Contributor

@Jakub K​ There are a few limitations when migrating external tables - for example, optimization is not supported in Unity Catalog - and if you create a cluster with single user access mode, you should be able to work around the whitelist errors. Please follow the steps below: external locations and storage credentials need to be created, and you need proper access on them to perform the upgrade.

https://docs.databricks.com/data-governance/unity-catalog/migrate.html

I don't need help with migrating data from the hive metastore

I'm looking for some design patterns for ingesting new data into unity catalog

Do I really need a dedicated cluster per user to be able to use Unity Catalog and load data? That can't be right.

Shawn_Eary
New Contributor III

@jakubk wrote:
Do I really need a dedicated cluster per user to be able to use unity catalog & load data?? That can't be right

I certainly hope that ain't the case. I can call spark.catalog.tableExists without issue from a Personal Compute cluster, but when I try to call it from a Shared Compute cluster with Access mode = Shared, I get this error:
"py4j.security.Py4JSecurityException: Method public boolean org.apache.spark.sql.internal.CatalogImpl.tableExists(java.lang.String) is not whitelisted ..."

How do I check if a table exists from a Shared Cluster if I'm not allowed to use spark.catalog.tableExists?

I found a workaround for my particular situation where I just needed to check if a table existed. It was based on these posts:
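A sketch of the kind of workaround that fits this situation (my guess at it, since the linked posts aren't shown here): use SQL's SHOW TABLES, which goes through spark.sql() rather than the blocked py4j CatalogImpl methods. The schema and table names are illustrative.

```python
def table_exists_via_sql(spark, schema: str, table: str) -> bool:
    """Shared-cluster-safe existence check using SHOW TABLES ... LIKE."""
    rows = spark.sql(f"SHOW TABLES IN {schema} LIKE '{table}'")
    return rows.count() > 0
```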

Anonymous
Not applicable

Hi @Jakub K​ 

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

No I haven't, I can't see any answers posted?

Anonymous
Not applicable

Hi @Jakub K​ 

I'm sorry you could not find a solution to your problem in the answers provided.

Our community strives to provide helpful and accurate information, but sometimes an immediate solution may only be available for some issues.

I suggest providing more information about your problem, such as specific error messages, error logs or details about the steps you have taken. This can help our community members better understand the issue and provide more targeted solutions.

Alternatively, you can consider contacting the support team for your product or service. They may be able to provide additional assistance or escalate the issue to the appropriate section for further investigation.

Thank you for your patience and understanding, and please let us know if there is anything else we can do to assist you.
