Summary:
We use Zscaler and are trying to use Databricks Connect to develop PySpark code locally. At first, we received SSL errors on HTTP requests, which we resolved by making sure Python's requests library could find Zscaler's CA certificate (by setting the REQUESTS_CA_BUNDLE env var).
We continued to get SSL errors, this time from the gRPC library used by Spark Connect. We resolved these by setting GRPC_DEFAULT_SSL_ROOTS_FILE_PATH.
But now we receive "Cannot check peer: missing selected ALPN property" from the gRPC library. gRPC runs over HTTP/2 and expects the TLS handshake to negotiate it via ALPN, and MITM proxies like Zscaler don't appear to play nicely with that negotiation.
Is there any workaround for this? Can we use HTTP/1.1 as the protocol for Databricks Connect? Or add an exception for the Databricks domain to our proxy?
Note: The Databricks JDBC driver appears to be unaffected.
System:
Operating system: macOS 14.6
Python version: 3.11
Python Libraries:
databricks-connect==15.4.2
databricks-sdk==0.33.0
delta-spark==3.2.1
pyspark==3.5.3
grpcio==1.66.2
grpcio-status==1.66.2
requests==2.32.3
Steps to Reproduce:
- Be connected to a proxy which conducts man-in-the-middle inspections, such as Zscaler
- Set the Python requests library's CA file using the REQUESTS_CA_BUNDLE env var:
export REQUESTS_CA_BUNDLE=/path/to/my/root/ca.pem
- Set the Python gRPC library's CA file using the GRPC_DEFAULT_SSL_ROOTS_FILE_PATH env var:
export GRPC_DEFAULT_SSL_ROOTS_FILE_PATH=/path/to/my/root/ca.pem
If you skip this step, you will receive the following _InactiveRpcError:
failed to connect to all addresses; last error: UNKNOWN: ipv4:[REDACTED WORKSPACE IP]:443: Ssl handshake failed (TSI_PROTOCOL_FAILURE): SSL_ERROR_SSL: error:1000007d:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
- Execute the following code with your Databricks profile configured:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
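For anyone reproducing this from a script rather than a shell, the two environment variables from the steps above can also be set in Python. This is only a minimal sketch: the helper name configure_ca_bundle is made up for illustration, the path is the same placeholder used above, and setting the variables before importing Databricks Connect is an assumption about the safest ordering rather than a documented requirement.

```python
import os

# Placeholder path from the steps above; substitute the real Zscaler root CA file.
CA_BUNDLE = "/path/to/my/root/ca.pem"


def configure_ca_bundle(ca_path: str) -> None:
    """Point both requests and gRPC at the same CA bundle (hypothetical helper)."""
    # Read by the requests library for HTTPS certificate verification.
    os.environ["REQUESTS_CA_BUNDLE"] = ca_path
    # Read by gRPC when it loads its default TLS root certificates.
    os.environ["GRPC_DEFAULT_SSL_ROOTS_FILE_PATH"] = ca_path


configure_ca_bundle(CA_BUNDLE)

# Assumption: importing after the environment is set is the safest order.
# from databricks.connect import DatabricksSession
# spark = DatabricksSession.builder.getOrCreate()
```

This only gets past the CERTIFICATE_VERIFY_FAILED error; the ALPN error in the stack trace below still occurs.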
Stack Trace:
Traceback (most recent call last):
File ".venv/lib/python3.11/site-packages/pyspark/sql/connect/client/core.py", line 1853, in config
resp = self._stub.Config(req, metadata=self.metadata())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.11/site-packages/grpc/_channel.py", line 1181, in __call__
return _end_unary_response_blocking(state, call, False, None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.11/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:[REDACTED WORKSPACE IP]:443: Cannot check peer: missing selected ALPN property."
debug_error_string = "UNKNOWN:Error received from peer {
grpc_message:"failed to connect to all addresses;
last error: UNKNOWN:
ipv4:[REDACTED WORKSPACE IP]:443:
Cannot check peer: missing selected ALPN property.",
grpc_status:14,
created_time:"2024-10-17T12:51:04.105548+01:00"
}"
>
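Diagnostic sketch:
To check whether the proxy is what drops ALPN, the handshake can be probed with Python's standard ssl module, independently of gRPC. This is a rough sketch under some assumptions: the function names are made up for illustration, the client offers only "h2" as an approximation of what a gRPC client advertises, and traffic is assumed to be intercepted transparently (an explicit HTTP CONNECT proxy would need extra handling not shown here).

```python
import socket
import ssl
from typing import Optional


def build_alpn_context(cafile: Optional[str] = None) -> ssl.SSLContext:
    """Create a verifying TLS client context that offers only HTTP/2 via ALPN."""
    # cafile=None falls back to the system trust store; pass the Zscaler
    # root CA file here when behind the proxy.
    ctx = ssl.create_default_context(cafile=cafile)
    ctx.set_alpn_protocols(["h2"])
    return ctx


def negotiated_alpn(
    host: str,
    port: int = 443,
    cafile: Optional[str] = None,
    timeout: float = 10.0,
) -> Optional[str]:
    """Return the ALPN protocol the peer selected, or None if it selected none."""
    ctx = build_alpn_context(cafile)
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()


# Example call (the hostname is a placeholder for the workspace host):
# print(negotiated_alpn("my-workspace.example.com", cafile="/path/to/my/root/ca.pem"))
```

If this returns None (or the handshake is rejected) on the corporate network but returns "h2" from a network without TLS inspection, that would be consistent with the proxy not passing ALPN through, which is what the error above suggests.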