Databricks Community

KLin · yesterday

Hi everyone,

I have a question regarding networking.

A bit of background first: For security reasons, the current allow-policy from GCP to our on-prem infrastructure is being replaced by a deny-policy for traffic originating from GCP. Therefore, access needs to be granted specifically for each on-prem data source/service that is going to be used from GCP.

Our setup: We have 4 DBX workspaces and the corresponding 4 subnets in the GCP customer-managed VPC. Our compute clusters now use Google Compute Engine and start in the corresponding subnet.

Problem: It seems like the traffic originating from our GCP clusters when connecting to on-prem data sources is coming from all over the place: sometimes from the node subnets and sometimes from the pod subnets.

My question is: What can I do to pinpoint exactly where the traffic originates when we connect to an on-prem data source? Which subnets (node, pod, service) does Databricks use for clusters, SQL warehouses and so on?

Looking forward to our discussions. Thank you!

Best Regards

Alberto_Umana · yesterday

Hi @KLin,

Adding some comments:

Subnets in GCP VPC:

Node Subnets: These are used for the GKE cluster nodes.
Pod Subnets: These are secondary IP ranges allocated for the pods running on the GKE cluster.
Service Subnets: These are secondary IP ranges allocated for the services running within the GKE cluster.

Traffic Origin:

Databricks assigns two IP addresses per node: one for management traffic and one for Spark applications.
The traffic from Databricks clusters to on-premises data sources can originate from either the node subnets or the pod subnets.

Identifying Traffic Origin:

To pinpoint the exact origin of the traffic, you can monitor the IP addresses used by the nodes and pods within your GKE cluster. This can be done by checking the network logs and the IP ranges assigned to your node and pod subnets.
You can also use tools like tcpdump or Wireshark on your on-premises data sources to capture and analyze the incoming traffic, which will help you identify the source IP addresses.
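Once you have a list of source IPs from a capture or from your firewall logs, a small script can sort them into node, pod, and service ranges. This is only a minimal sketch: the CIDR ranges below are hypothetical placeholders, and you would replace them with the primary and secondary ranges of your own workspace subnets.

```python
import ipaddress
from collections import Counter

# Hypothetical CIDR ranges, for illustration only. Replace them with the
# primary (node) and secondary (pod, service) ranges of your own subnet.
SUBNETS = {
    "node": ipaddress.ip_network("10.0.0.0/24"),
    "pod": ipaddress.ip_network("10.1.0.0/16"),
    "service": ipaddress.ip_network("10.2.0.0/20"),
}


def classify_source(ip: str) -> str:
    """Return the name of the range that contains ip, or 'unknown'."""
    addr = ipaddress.ip_address(ip)
    for name, network in SUBNETS.items():
        if addr in network:
            return name
    return "unknown"


def summarize(source_ips):
    """Count how many observed source IPs fall into each range."""
    return Counter(classify_source(ip) for ip in source_ips)


if __name__ == "__main__":
    # Example source IPs as they might appear in a capture (made up).
    observed = ["10.0.0.12", "10.1.4.7", "10.1.200.3", "192.168.1.5"]
    for name, count in summarize(observed).items():
        print(f"{name}: {count}")
```

Anything reported as unknown would be worth a closer look, since it matches none of the ranges you expect for that workspace.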

KLin · yesterday

Hi @Alberto_Umana

Thanks for the timely reply.

With regard to your reply:

The traffic from Databricks clusters to on-premises data sources can originate from either the node subnets or the pod subnets.

A follow-up question from me: How does DBX decide whether to use node subnets or pod subnets? And since DBX already has an incentive to move from GKE to GCE (we have already made the switch but, correct me if I am wrong, it seems like the workspaces are still hosted on GKE), do you know how the switch impacts the workspaces and the subnets?

Thanks a lot!

Alberto_Umana · yesterday

Hi @KLin - no problem! You can share your workspace ID via a DM, and I can try to get more details if need be.

Databricks decides whether to use node subnets or pod subnets based on the specific network configuration and the type of traffic. The node subnets are used for the Google Compute Engine (GCE) virtual machines that host the nodes, while the pod subnets are used for the individual pods within the Google Kubernetes Engine (GKE) clusters.

Regarding the switch from GKE to GCE, the workspaces are still hosted on GKE, which means that the network configurations involving node subnets and pod subnets remain relevant. The switch to GCE primarily impacts the underlying infrastructure but does not change the way workspaces and subnets are managed within the GKE clusters. The workspaces will continue to use the same subnet configurations for nodes and pods as defined during their creation.

KLin · yesterday

Hi @Alberto_Umana, thank you for the detailed explanation. I figured out that for clusters/SQL warehouses that do not have the x-databricks-nextgen-cluster tag, i.e. those still using GKE, the traffic originates from the pod subnet. If the clusters have the GCE tag, the traffic originates from the node subnet. Is there a reason why it is done this way?

Much appreciated!

Alberto_Umana · yesterday

Hi @KLin, happy to help!

Traffic originates from the pod subnet for clusters/SQL warehouses without the x-databricks-nextgen-cluster tag (still using GKE) and from the node subnet for clusters with the GCE tag because of the underlying infrastructure differences between Google Kubernetes Engine (GKE) and Google Compute Engine (GCE).

In GKE, Databricks clusters are implemented as Kubernetes namespaces, and the traffic is managed at the pod level. Each pod within the GKE cluster is assigned an IP address from the pod subnet, which is why traffic originates from the pods subnet.

On the other hand, when using GCE, the clusters are hosted on virtual machines (VMs) rather than Kubernetes pods. These VMs are assigned IP addresses from the node subnet. Therefore, traffic for clusters with the GCE tag originates from the node subnet.

This distinction is made to align with the network configurations and resource management specific to each type of infrastructure (GKE for Kubernetes-based deployments and GCE for VM-based deployments).
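If it helps when planning the firewall rules, the rule of thumb from this thread can be written down as a small helper. This is only a sketch of the behavior observed above, not an official Databricks API: the tag name comes from this thread, while the function name and the idea of passing the cluster tags as a plain dict are assumptions for illustration. Only the presence of the tag is checked, because its value was not discussed here.

```python
# Tag name as mentioned in this thread; its value is not checked here.
NEXTGEN_TAG = "x-databricks-nextgen-cluster"


def expected_source_subnet(tags: dict) -> str:
    """Guess which range a cluster's outbound traffic comes from.

    Hypothetical helper: tags is assumed to be a dict of tag key to value.
    Tagged clusters run on GCE VMs (node subnet); untagged clusters are
    assumed to still run on GKE (pod subnet).
    """
    return "node" if NEXTGEN_TAG in tags else "pod"


if __name__ == "__main__":
    # Made-up tag sets for two clusters.
    print(expected_source_subnet({NEXTGEN_TAG: "true", "team": "data"}))
    print(expected_source_subnet({"team": "data"}))
```

Until every cluster and SQL warehouse in a workspace carries the tag, it is probably safest to allow both the node and the pod range for that workspace.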