<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Running `pyspark` with `databricks-connect` in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10323#M5529</link>
    <description>&lt;P&gt;These are the permissions on the cluster. Is that what you wanted?&lt;/P&gt;</description>
    <pubDate>Thu, 09 Feb 2023 14:11:02 GMT</pubDate>
    <dc:creator>agagrins</dc:creator>
    <dc:date>2023-02-09T14:11:02Z</dc:date>
    <item>
      <title>Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10309#M5515</link>
      <description>&lt;P&gt;Hiya,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm trying to run `pyspark` with `databricks-connect==11.3.0b0`, but am failing.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The trace I see is&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;&amp;nbsp;File "/home/agagrins/databricks9/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;return_value = get_return_value(&lt;/P&gt;&lt;P&gt;&amp;nbsp;File "/home/agagrins/databricks9/lib/python3.9/site-packages/pyspark/sql/utils.py", line 196, in deco&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;return f(*a, **kw)&lt;/P&gt;&lt;P&gt;&amp;nbsp;File "/home/agagrins/databricks9/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;raise Py4JJavaError(&lt;/P&gt;&lt;P&gt;py4j.protocol.Py4JJavaError: An error occurred while calling o33.sql.&lt;/P&gt;&lt;P&gt;: org.apache.spark.SparkException: There is no Credential Scope.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.unity.UCSDriver$Manager.$anonfun$currentScopeId$1(UCSDriver.scala:94)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at scala.Option.getOrElse(Option.scala:189)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.unity.UCSDriver$Manager.currentScopeId(UCSDriver.scala:94)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.unity.UCSDriver$Manager.currentScope(UCSDriver.scala:97)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.unity.UnityCredentialScope$.currentScope(UnityCredentialScope.scala:100)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.unity.UnityCredentialScope$.getCredentialManager(UnityCredentialScope.scala:128)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at 
com.databricks.unity.CredentialManager$.getUnityApiTokenOpt(CredentialManager.scala:456)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.unity.UnityCatalogClientHelper$.getToken(UnityCatalogClientHelper.scala:34)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.managedcatalog.ManagedCatalogClientImpl.$anonfun$getCatalog$1(ManagedCatalogClientImpl.scala:163)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.managedcatalog.ManagedCatalogClientImpl.$anonfun$recordAndWrapException$1(ManagedCatalogClientImpl.scala:2904)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.managedcatalog.ErrorDetailsHandler.wrapServiceException(ErrorDetailsHandler.scala:25)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.managedcatalog.ErrorDetailsHandler.wrapServiceException$(ErrorDetailsHandler.scala:23)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.managedcatalog.ManagedCatalogClientImpl.wrapServiceException(ManagedCatalogClientImpl.scala:77)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.managedcatalog.ManagedCatalogClientImpl.recordAndWrapException(ManagedCatalogClientImpl.scala:2903)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.managedcatalog.ManagedCatalogClientImpl.getCatalog(ManagedCatalogClientImpl.scala:156)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.sql.managedcatalog.ManagedCatalogCommon.catalogExists(ManagedCatalogCommon.scala:94)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.sql.managedcatalog.PermissionEnforcingManagedCatalog.catalogExists(PermissionEnforcingManagedCatalog.scala:177)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at 
com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.catalogExists(ManagedCatalogSessionCatalog.scala:384)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.sql.DatabricksCatalogManager.isCatalogRegistered(DatabricksCatalogManager.scala:104)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at org.apache.spark.sql.SparkServiceCatalogV2Handler$.catalogOperationV2(SparkServiceCatalogV2Handler.scala:58)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;at com.databricks.service.SparkServiceImpl$.$anonfun$catalogOperationV2$1(SparkServiceImpl.scala:165)&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I've tried to Google "There is no Credential Scope", but to no avail. Anyone have a clue of where to start to look?&lt;/P&gt;</description>
      <pubDate>Wed, 01 Feb 2023 11:24:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10309#M5515</guid>
      <dc:creator>agagrins</dc:creator>
      <dc:date>2023-02-01T11:24:28Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10310#M5516</link>
      <description>&lt;P&gt;Where are you running this?&lt;/P&gt;</description>
      <pubDate>Wed, 01 Feb 2023 11:28:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10310#M5516</guid>
      <dc:creator>sher</dc:creator>
      <dc:date>2023-02-01T11:28:04Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10311#M5517</link>
      <description>&lt;P&gt;I'm starting the run locally with Python 3.9.1 under WSL, but the idea is then to run the job in Databricks on AWS.&lt;/P&gt;</description>
      <pubDate>Wed, 01 Feb 2023 11:32:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10311#M5517</guid>
      <dc:creator>agagrins</dc:creator>
      <dc:date>2023-02-01T11:32:45Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10312#M5518</link>
      <description>&lt;P&gt;Hello @Aigars Grins​. Can you tell me a bit more about what you are trying to run via Databricks Connect? Generally, we recommend using dbx for local development over Databricks Connect.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Could you also provide more information on the type of compute you are connecting to, such as the runtime version, and whether it is running Unity Catalog or the legacy Hive Metastore?&lt;/P&gt;</description>
      <pubDate>Mon, 06 Feb 2023 07:58:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10312#M5518</guid>
      <dc:creator>sergiu</dc:creator>
      <dc:date>2023-02-06T07:58:23Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10313#M5519</link>
      <description>&lt;P&gt;My understanding is that there are three main ways for me to work with Databricks: `databricks-connect`, `databricks-sql-connector`, and `dbx`. I'm trying out all three, for slightly different purposes, to see what fits our workflows best where.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Feb 2023 13:23:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10313#M5519</guid>
      <dc:creator>agagrins</dc:creator>
      <dc:date>2023-02-06T13:23:11Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10314#M5520</link>
      <description>&lt;P&gt;As for the problem above, it seems to have gone away, though I'm not sure why; I don't think I did anything different. But instead I'm now faced with a much more mundane situation.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Again, I'm here trying to make `databricks-connect` work.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I simply do&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;$ python3 -m venv ~/databricks11&lt;/P&gt;&lt;P&gt;$ . ~/databricks11/bin/activate&lt;/P&gt;&lt;P&gt;$ pip install --upgrade pip&lt;/P&gt;&lt;P&gt;$ pip install --upgrade setuptools&lt;/P&gt;&lt;P&gt;$ pip install databricks-connect==11.3.0b0&lt;/P&gt;&lt;P&gt;$ databricks-connect configure&lt;/P&gt;&lt;P&gt;$ databricks-connect test&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My `.databricks-connect` looks like&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;  "host": "&lt;A href="https://dbc-34a0347a-0fd1.cloud.databricks.com/" alt="https://dbc-34a0347a-0fd1.cloud.databricks.com/" target="_blank"&gt;https://dbc-****.cloud.databricks.com&lt;/A&gt;",&lt;/P&gt;&lt;P&gt;  "token": "dapi****",&lt;/P&gt;&lt;P&gt;  "cluster_id": "0110-****",&lt;/P&gt;&lt;P&gt;  "port": "15001"&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I also have some environment variables, just in case&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;DATABRICKS_ADDRESS=&lt;A href="https://dbc-34a0347a-0fd1.cloud.databricks.com/" alt="https://dbc-34a0347a-0fd1.cloud.databricks.com/" target="_blank"&gt;https://dbc-****.cloud.databricks.com&lt;/A&gt;&lt;/P&gt;&lt;P&gt;DATABRICKS_API_TOKEN=dapi****&lt;/P&gt;&lt;P&gt;DATABRICKS_CLUSTER_ID=0110-****&lt;/P&gt;&lt;P&gt;DATABRICKS_PORT=15001&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;But I get an error&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;23/02/03 11:47:17 ERROR SparkClientManager: Fail to get the SparkClient&lt;/P&gt;&lt;P&gt;java.util.concurrent.ExecutionException: 
com.databricks.service.SparkServiceConnectionException: Invalid token&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To connect to a Databricks cluster, you must specify an API token.&lt;/P&gt;&lt;P&gt;API Token: The API token used to confirm your identity to Databricks&lt;/P&gt;&lt;P&gt;  - Learn more about API tokens here: &lt;A href="https://docs.databricks.com/api/latest/authentication.html#generate-a-token" alt="https://docs.databricks.com/api/latest/authentication.html#generate-a-token" target="_blank"&gt;https://docs.databricks.com/api/latest/authentication.html#generate-a-token&lt;/A&gt;&lt;/P&gt;&lt;P&gt;  - Get current value: spark.conf.get("spark.databricks.service.token")&lt;/P&gt;&lt;P&gt;  - Set via conf: spark.conf.set("spark.databricks.service.token", &amp;lt;your API token&amp;gt;)&lt;/P&gt;&lt;P&gt;  - Set via environment variable: export DATABRICKS_API_TOKEN=&amp;lt;your API token&amp;gt;&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;</description>
      <pubDate>Mon, 06 Feb 2023 13:27:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10314#M5520</guid>
      <dc:creator>agagrins</dc:creator>
      <dc:date>2023-02-06T13:27:00Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10315#M5521</link>
      <description>&lt;P&gt;The cluster I'm connecting to runs "11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12)"&lt;/P&gt;</description>
      <pubDate>Mon, 06 Feb 2023 13:28:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10315#M5521</guid>
      <dc:creator>agagrins</dc:creator>
      <dc:date>2023-02-06T13:28:57Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10316#M5522</link>
      <description>&lt;P&gt;Hmm, the connect info looks good to me. Can you try either of the following and see if you still get the error:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Run against a cluster not connected to Unity Catalog (set its access mode to "No isolation shared")&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;               OR&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Try with an earlier runtime, like 10.4 (and the appropriate version of the connector: pip install -U "databricks-connect==10.4.*")&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Lastly, as stated in the &lt;A href="https://docs.databricks.com/dev-tools/databricks-connect.html" alt="https://docs.databricks.com/dev-tools/databricks-connect.html" target="_blank"&gt;documentation&lt;/A&gt;, we recommend running dbx for local development over databricks-connect. Is there anything specific you believe you can do with databricks-connect which you cannot achieve with dbx?&lt;/P&gt;</description>
      <pubDate>Mon, 06 Feb 2023 14:13:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10316#M5522</guid>
      <dc:creator>sergiu</dc:creator>
      <dc:date>2023-02-06T14:13:56Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10317#M5523</link>
      <description>&lt;P&gt;I tried creating a new cluster, for 10.4, but that didn't get me anywhere either. The steps I followed were:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;$ databricks clusters create --json-file cluster.json&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Where `cluster.json` looks like&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;"cluster_name": "test50",&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;"spark_version": "10.4.x-scala2.12",&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;"spark_conf": {&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"spark.databricks.service.client.enabled": true,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"spark.databricks.service.server.enabled": true,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"spark.speculation": true,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"spark.sql.session.timeZone": "UTC"&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;},&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;"spark_env_vars": {&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"PYSPARK_PYTHON": "/databricks/python3/bin/python3"&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;},&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;"node_type_id": "i3.xlarge",&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;"autoscale": {&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"min_workers": 1,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"max_workers": 8&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;},&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;"autotermination_minutes": 10,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;"aws_attributes": {&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"first_on_demand": 0,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"availability": "SPOT_WITH_FALLBACK",&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"zone_id": 
"eu-west-1b",&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"spot_bid_price_percent": 100&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;},&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;"enable_elastic_disk": false,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;"data_security_mode": "SINGLE_USER",&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;"single_user_name": "****"&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;And then&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;$ python3 -m venv ~/databricks12&lt;/P&gt;&lt;P&gt;$ . ~/databricks12/bin/activate&lt;/P&gt;&lt;P&gt;$ pip install --upgrade pip&lt;/P&gt;&lt;P&gt;$ pip install --upgrade setuptools&lt;/P&gt;&lt;P&gt;$ pip install databricks-connect==10.4.18&lt;/P&gt;&lt;P&gt;$ databricks-connect test&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;And the result is as before&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;23/02/09 10:22:14 ERROR SparkServiceRPCClient: Failed to sync with the spark cluster. This could be a intermittent issue, please check your cluster's state and retry.&lt;/P&gt;&lt;P&gt;com.databricks.service.SparkServiceConnectionException: Invalid token&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To connect to a Databricks cluster, you must specify an API token.&lt;/P&gt;&lt;P&gt;API Token: The API token used to confirm your identity to Databricks&lt;/P&gt;&lt;P&gt;&amp;nbsp;- Learn more about API tokens here: &lt;A href="https://docs.databricks.com/api/latest/authentication.html#generate-a-token" target="test_blank"&gt;https://docs.databricks.com/api/latest/authentication.html#generate-a-token&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;- Get current value: spark.conf.get("spark.databricks.service.token")&lt;/P&gt;&lt;P&gt;&amp;nbsp;- Set via conf: spark.conf.set("spark.databricks.service.token", &amp;lt;your API token&amp;gt;)&lt;/P&gt;&lt;P&gt;&amp;nbsp;- Set via environment variable: export DATABRICKS_API_TOKEN=&amp;lt;your API token&amp;gt;&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2023 09:28:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10317#M5523</guid>
      <dc:creator>agagrins</dc:creator>
      <dc:date>2023-02-09T09:28:49Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10318#M5524</link>
      <description>&lt;P&gt;I'm not sure how to test the "Run against a cluster unconnected to Unity Catalog (put it in Access mode - No isolation shared)" suggestion. Could you provide a `cluster.json` with the corresponding settings?&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2023 09:30:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10318#M5524</guid>
      <dc:creator>agagrins</dc:creator>
      <dc:date>2023-02-09T09:30:01Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10319#M5525</link>
      <description>&lt;P&gt;Why then `databricks-connect` and not `dbx`? Well, I'm trying to get both to work.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I posted a related question about `dbx` &lt;A href="https://community.databricks.com/s/feed/0D58Y00009qtFLrSAM" target="test_blank"&gt;https://community.databricks.com/s/feed/0D58Y00009qtFLrSAM&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My hope here is that `databricks-connect` can have a much quicker turnaround time, compared to `dbx`, since no new environments have to be set up.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2023 09:54:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10319#M5525</guid>
      <dc:creator>agagrins</dc:creator>
      <dc:date>2023-02-09T09:54:11Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10320#M5526</link>
      <description>&lt;P&gt;I tried your exact code on my environment and it worked without issue.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Could it be something about the token you are using and its permissions? Is it the same token you are using for the databricks CLI? What workspace permissions does the principal have?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2023 13:33:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10320#M5526</guid>
      <dc:creator>sergiu</dc:creator>
      <dc:date>2023-02-09T13:33:27Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10321#M5527</link>
      <description>&lt;P&gt;Change the data_security_mode field in the cluster config to NO_ISOLATION. It's unlikely to be related to the issue you are facing; more likely it's a problem with the configuration.&lt;/P&gt;&lt;P&gt;But it might be worth double-checking.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2023 13:34:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10321#M5527</guid>
      <dc:creator>sergiu</dc:creator>
      <dc:date>2023-02-09T13:34:58Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10322#M5528</link>
      <description>&lt;P&gt;I use the same token when working with `dbx`, and that works, so I suspect the token itself isn't the problem. I'll check the permissions.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2023 13:47:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10322#M5528</guid>
      <dc:creator>agagrins</dc:creator>
      <dc:date>2023-02-09T13:47:14Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10323#M5529</link>
      <description>&lt;P&gt;These are the permissions on the cluster. Is that what you wanted?&lt;/P&gt;</description>
      <pubDate>Thu, 09 Feb 2023 14:11:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10323#M5529</guid>
      <dc:creator>agagrins</dc:creator>
      <dc:date>2023-02-09T14:11:02Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10324#M5530</link>
      <description>&lt;P&gt;Yes, this is what I was looking for. Does the token belong to the censored principal, or to a principal within the admin group? The token needs to belong to a principal which can attach on the cluster.&lt;/P&gt;</description>
      <pubDate>Fri, 10 Feb 2023 15:10:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10324#M5530</guid>
      <dc:creator>sergiu</dc:creator>
      <dc:date>2023-02-10T15:10:10Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10325#M5531</link>
      <description>&lt;P&gt;Can you put the whole error trace here? Or was the above the full error?&lt;/P&gt;</description>
      <pubDate>Fri, 10 Feb 2023 15:18:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10325#M5531</guid>
      <dc:creator>sergiu</dc:creator>
      <dc:date>2023-02-10T15:18:41Z</dc:date>
    </item>
    <item>
      <title>Re: Running `pyspark` with `databricks-connect`</title>
      <link>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10326#M5532</link>
      <description>&lt;P&gt;How do you make it work on a cluster with Unity Catalog enabled?&lt;/P&gt;</description>
      <pubDate>Fri, 24 Mar 2023 03:06:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/running-pyspark-with-databricks-connect/m-p/10326#M5532</guid>
      <dc:creator>ryojikn</dc:creator>
      <dc:date>2023-03-24T03:06:33Z</dc:date>
    </item>
  </channel>
</rss>

