
Notebook is stuck and cluster goes into waiting state while using spark libraries

BhawaniD
New Contributor

Hey,

We have installed the com.databricks:spark-xml_2.12:0.18.0 library in our VNET-injected Databricks workspace to read XML files from a storage account. The notebook runs successfully for text files when the cluster is started without the library installed. However, when running the notebook with XML files, the cluster enters a waiting state.

Our Databricks subnet has a route table attached, and all traffic is routed through our firewall. When we disassociate the route table from the Databricks public subnet, the notebook runs without any issues, indicating that the firewall is blocking the required connectivity. However, I am unable to determine which ports or FQDNs need to be opened to resolve this issue. 

I would greatly appreciate any thoughts on this! 

3 REPLIES

Witold
Contributor III

Since it's a Maven dependency, it should simply be HTTP/HTTPS on ports 80/443.

Besides, are you aware that native XML support has been included since runtime 14.3? It replaces the spark-xml library.

I don't think it is included in runtime 14.3. I tried running my notebook without installing the library, and it failed straightaway because the library was not installed.

I have opened the firewall rules using service tags on port 443, but it still doesn't help.


"I don't think it is included in the runtime 14.3."

You are wrong here: native XML support is indeed included. Please check the documentation on how to use it properly, as there might be slight differences from spark-xml. It is included because spark-xml is becoming obsolete: XML support will be part of Spark 4. In Databricks, we can already use it today, starting with runtime 14.3.
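For illustration, here is a minimal sketch of what reading XML with the built-in reader might look like. It assumes a runtime with native XML support and the `spark` session that Databricks notebooks predefine. The helper name `read_xml`, the storage path, and the row tag `record` are hypothetical placeholders, not taken from the original post.

```python
def read_xml(spark, path, row_tag):
    """Read XML files into a DataFrame using the built-in "xml" format.

    Assumes a runtime with native XML support (14.3 or later, per the
    reply above), so no spark-xml Maven library is attached to the cluster.
    """
    return (
        spark.read.format("xml")
        .option("rowTag", row_tag)  # XML element that maps to one DataFrame row
        .load(path)
    )


# Hypothetical usage in a notebook, where `spark` is already defined:
# df = read_xml(spark, "abfss://<container>@<account>.dfs.core.windows.net/<folder>/", "record")
# df.printSchema()
```

Because nothing has to be downloaded from Maven at cluster start, this route would also sidestep the firewall question for the library itself. Access to the storage account still has to be allowed.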

 


"I have opened the firewall rules using service tags on port 443 but still it doesn't help."

You might want to consult your network colleagues to configure it properly. Usually Maven downloads the libraries e.g. from here (or one of the mirrors)
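One way to narrow down what the firewall blocks is to test outbound reachability from a notebook on the affected cluster. The sketch below is only an assumption-laden starting point. The hostnames listed are the public Maven Central endpoints, which may not be the full set your workspace contacts, and the names `can_connect`, `report`, and `ENDPOINTS` are hypothetical.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Assumed endpoints (Maven Central); adjust to the repositories you actually use.
ENDPOINTS = [("repo1.maven.org", 443), ("repo.maven.apache.org", 443)]


def report(endpoints):
    """Map each 'host:port' string to whether a TCP connection succeeded."""
    return {f"{host}:{port}": can_connect(host, port) for host, port in endpoints}


# In a notebook cell: print(report(ENDPOINTS))
```

A successful TCP connection only shows that the port is reachable from the driver. A firewall that inspects or rewrites TLS traffic can still break the download, so the firewall's deny logs remain the more reliable source for the exact FQDNs.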
