I have two GPU clusters on Databricks: one running Databricks Runtime Version 16.4 LTS ML Beta (includes Apache Spark 3.5.2, GPU, Scala 2.12) and another running 16.0 ML (includes Apache Spark 3.5.2, GPU, Scala 2.12). According to the documentation here (https://learn.microsoft.com/en-gb/azure/databricks/release-notes/runtime/16.4lts-ml), the 16.4 LTS ML GPU cluster has the following libraries installed:
- CUDA 12.6
- cublas 12.6.0.22-1
- cusolver 11.6.4.38-1
- cupti 12.6.37-1
- cusparse 12.5.2.23-1
- cuDNN 9.3.0.75-1
- NCCL 2.22.3
- TensorRT 10.2.0.19-1
The documentation for 16.0 ML lists the same libraries.
However, when I print the CUDA/cuDNN versions on either cluster, both report lower versions than documented:
```
import torch

# CUDA version reported by PyTorch
print('CUDA:', torch.version.cuda)

# Decode the integer cuDNN version (e.g. 90100 for 9.1.0)
cudnn = torch.backends.cudnn.version()
cudnn_major = cudnn // 10000
cudnn_minor = (cudnn % 10000) // 100  # was % 1000, which breaks for minor >= 10
cudnn_patch = cudnn % 100
print('cuDNN:', '.'.join(map(str, (cudnn_major, cudnn_minor, cudnn_patch))))
```
Output:
CUDA: 12.4
cuDNN: 9.1.0
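
In case it helps with diagnosis, here is a minimal sketch for listing NVIDIA libraries that are installed as Python packages in the notebook environment. It assumes (I have not confirmed this for these runtimes) that some CUDA/cuDNN libraries come in as pip packages whose names start with `nvidia-`, for example `nvidia-cudnn-cu12`. The helper name `nvidia_pip_packages` is made up for illustration.

```python
from importlib.metadata import distributions


def nvidia_pip_packages():
    """Return {name: version} for installed pip packages named 'nvidia-*'.

    Assumption: CUDA libraries shipped as Python wheels use this prefix.
    """
    found = {}
    for dist in distributions():
        # Some broken installs have no Name field; treat those as empty.
        name = (dist.metadata.get("Name") or "").lower()
        if name.startswith("nvidia-"):
            found[name] = dist.version
    return found


if __name__ == "__main__":
    for name, version in sorted(nvidia_pip_packages().items()):
        print(f"{name}=={version}")
```

If this prints a cuDNN package at 9.1.x, that would suggest the version PyTorch reports comes from a pip-installed library rather than from the system libraries listed in the release notes.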
Further, when I run a TensorFlow model training pipeline, the 16.4 LTS ML cluster runs without error, but the 16.0 ML cluster returns the following error:
Epoch 1/40
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1744832380.305571 2695 service.cc:148] XLA service 0x7f9138003620 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1744832380.305600 2695 service.cc:156] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2025-04-16 19:39:42.151334: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
E0000 00:00:1744832387.521437 2695 cuda_dnn.cc:522] Loaded runtime CuDNN library: 9.1.0 but source was compiled with: 9.3.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
E0000 00:00:1744832390.408174 2695 cuda_dnn.cc:522] Loaded runtime CuDNN library: 9.1.0 but source was compiled with: 9.3.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
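
To make the rule in that error message concrete, here is a small sketch of the check as I understand it from the log text: the loaded cuDNN must have the same major version as the one the code was compiled with, and an equal or higher minor version. The function names are hypothetical, and this is only my reading of the message, not TensorFlow's actual implementation.

```python
def parse_version(text):
    """Turn a dotted version string such as '9.1.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def cudnn_runtime_ok(loaded, compiled):
    """Apply the rule quoted in the log: same major, loaded minor >= compiled minor."""
    loaded_major, loaded_minor = parse_version(loaded)[:2]
    compiled_major, compiled_minor = parse_version(compiled)[:2]
    return loaded_major == compiled_major and loaded_minor >= compiled_minor


print(cudnn_runtime_ok("9.1.0", "9.3.0"))  # False: the case from the log above
print(cudnn_runtime_ok("9.3.0", "9.3.0"))  # True: what the documentation lists
```

So the failure on 16.0 ML is consistent with the 9.1.0 library being the one that actually gets loaded at runtime, even though 9.3.0 is documented.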
Please let me know why this situation happens, and how to avoid it in the future. Thanks!