- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-17-2025 04:33 PM
Thanks for the detailed context—here’s a concise, actionable troubleshooting plan tailored to Databricks with Unity Catalog volumes and Avro + Confluent Schema Registry over APIM with mTLS.
What’s likely going wrong
Based on your description, the initial TLS handshake succeeds (APIM logs show a successful request), but message decoding fails in Spark/Kafka consumers, with errors surfacing only in logs on 14.3 LTS single-user and similar behavior on 15.4 LTS shared clusters. That pattern typically points to one of the following:
-
Avro conversion path not finding/using the keystore/truststore correctly when reading from Unity Catalog volumes, especially if the path or volume access semantics differ from DBFS mount assumptions. This often shows up after handshake success if the schema registry client can connect but cannot complete certificate chain validation or client auth for subsequent calls due to path resolution or scope issues. The Databricks doc that describes using truststore/keystore from Unity Catalog volumes could not be read via Glean, so I can’t verify exact syntax from that page right now.
-
Schema registry auth header or SSL options mismatch for the Confluent client inside from_avro, particularly when using APIM as a gateway. If APIM requires a client certificate, the Confluent Avro deserializer must be given the correct SSL properties, and they must be discoverable on executors.
-
Classpath/library version incompatibility between:
- Spark 3.4/3.5 runtime (14.3 LTS / 15.4 LTS),
- azure-eventhubs-spark_2.12:2.3.22,
- Confluent schema-registry and Avro deserializer libraries (which are implicitly used by
from_avro). This can lead to silent deserialization failures logged in log4j but not thrown to the application, especially with shaded/relocated dependencies.
-
Executor access to the keystore/truststore files within UC volumes. The driver may access them, but executors might fail if paths aren’t accessible in the same way or if you’re referencing a local filesystem path that isn’t distributed.
Recommended fixes and checks
Please try these in order; they’re low-risk and address the most common causes.
-
Validate the path syntax for Unity Catalog volumes and use a path that executors can read:
-
Prefer reading keystore/truststore into the driver/executors’ local filesystem from the UC volume before initializing from_avro, rather than pointing the Confluent client to a UC volume path directly. For example:
- Copy files at cluster start (init script) from UC volume to
/local_disk/…and reference those paths infromAvroOptions. - Or programmatically copy once per session:
dbutils.fs.cp("uc://catalog.schema.volume/keystores/Client_Cert.truststore.jks", "file:/local_disk/keystores/Client_Cert.truststore.jks")Then set:
fromAvroOptions.put("confluent.schema.registry.ssl.truststore.location", "/local_disk/keystores/Client_Cert.truststore.jks")This sidesteps any path scheme issues with UC volumes on executors.
- Copy files at cluster start (init script) from UC volume to
-
-
Confirm the exact options keys used by
from_avro’s Confluent client:- Keys you listed look correct for SSL, but APIM via mTLS sometimes requires:
confluent.schema.registry.urlconfluent.schema.registry.basic.auth.credentials.source(if APIM needs headers)confluent.schema.registry.ssl.keystore.typeand…truststore.typeset explicitly toJKSconfluent.schema.registry.ssl.endpoint.identification.algorithmset toHTTPSor empty depending on APIM’s certs/SANs.
- If APIM hostname differs from cert CN/SAN, try setting
endpoint.identification.algorithmto empty ("") to bypass hostname verification for testing; then fix certs/SANs properly.
- Keys you listed look correct for SSL, but APIM via mTLS sometimes requires:
-
Ensure executors receive the passwords securely:
- Use Spark conf or
fromAvroOptionsvalues set on the driver and broadcast to executors. Avoid relying on only driver-local variables without proper serialization. - Confirm the password strings are non-empty and correct on executors (e.g., via a small map function logging masked presence).
- Use Spark conf or
-
Check library alignment:
- On 14.3 LTS: Spark 3.4.1. On 15.4 LTS: Spark 3.5.x. azure-eventhubs-spark 2.3.22 is built for Spark 3.4+ but ensure no conflicting Avro/Confluent libs pulled transitively.
- If you also added Avro/Confluent dependencies, remove duplicates or align versions so that
org.apache.spark:spark-avroand Confluent deserializers are compatible. - In clusters, prefer using the built-in
spark-avroinstead of adding another Avro jar unless necessary.
-
Verify schema registry calls via APIM with mTLS outside Spark:
- From a notebook, run a plain Java/Scala HTTPS client using the same keystore/truststore files to GET a known schema ID from APIM to confirm full TLS + hostname verification behavior with those paths.
-
Inspect logs for specific errors in the executors:
- Enable debug logging for
io.confluentandorg.apache.avroto catch exact deserialization failures (e.g., unknown schema ID, incompatible reader/writer schema, or SSL handshake on subsequent calls). - Sometimes the first metadata request succeeds, but the per-message decode triggers more schema lookups that fail if the client can’t reuse SSL context.
- Enable debug logging for
-
If you’re using Event Hubs capture Avro vs Kafka APIs, ensure the Avro binary is compatible with
from_avro:- The
from_avrofunction expects the binary payload and a valid schema resolution via the registry; confirm messages are not JSON Avro or have envelopes that need custom parsing beforefrom_avro.
- The
Concrete pattern to try (driver-local copy from UC volume)
This pattern avoids path scheme surprises and ensures executor access.
// Example: copy keystore/truststore from UC volume to local disk
dbutils.fs.mkdirs("file:/local_disk/keystores")
dbutils.fs.cp("uc://<catalog>.<schema>.<volume>/keystores/Client_Cert.truststore.jks", "file:/local_disk/keystores/truststore.jks")
dbutils.fs.cp("uc://<catalog>.<schema>.<volume>/keystores/Client_Cert.keystore.jks", "file:/local_disk/keystores/keystore.jks")
val fromAvroOptions = new java.util.HashMap[String, String]()
fromAvroOptions.put("mode", "PERMISSIVE")
fromAvroOptions.put("confluent.schema.registry.url", "<https://your-apim-endpoint>")
fromAvroOptions.put("confluent.schema.registry.ssl.truststore.location", "/local_disk/keystores/truststore.jks")
fromAvroOptions.put("confluent.schema.registry.ssl.truststore.password", truststorePass)
fromAvroOptions.put("confluent.schema.registry.ssl.truststore.type", "JKS")
fromAvroOptions.put("confluent.schema.registry.ssl.keystore.location", "/local_disk/keystores/keystore.jks")
fromAvroOptions.put("confluent.schema.registry.ssl.keystore.password", keystorePass)
fromAvroOptions.put("confluent.schema.registry.ssl.keystore.type", "JKS")
fromAvroOptions.put("confluent.schema.registry.ssl.key.password", keyPass)
// Optional if APIM certs/SANs are non-standard; remove once certs are fixed
// fromAvroOptions.put("confluent.schema.registry.ssl.endpoint.identification.algorithm", "")
// When creating the DF from Event Hubs payload, ensure the value column is raw binary for from_avro
import org.apache.spark.sql.functions._
val df = rawDf.select(from_avro(col("value"), lit("<schema-id-or-subject>"), fromAvroOptions).as("decoded"))
If you are using subject name strategy instead of direct schema ID, ensure the option to resolve subjects is correctly set and that APIM forwards necessary paths.
Cluster mode considerations
- On single user clusters with UC, ACLs are scoped to your identity; since you granted full privileges on catalog/schema/volume, the driver should read UC volumes. But executors sometimes hit filesystem path differences. The local copy pattern above neutralizes that.
- On shared clusters with UC, make sure your user has access and the files are not stored in a directory that requires other workspace-level permissions.
If the problem persists
Collect and share the following for precise root cause:
- The exact log4j error messages from executors for Avro decoding.
- The options map you pass to
from_avro(masking secrets), including URL and any additional SSL properties. - Confirmation that a standalone HTTPS client using the same JKS pair can GET a schema from APIM in the same cluster.
- Whether messages are Kafka-compatible Avro or Event Hubs payload with an envelope—sometimes a pre-processing step is needed before
from_avro.