Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Networking reduction cost for NATGateway and Shared Catalog

RaulPino
New Contributor III

Use case and context:

We have a Databricks workspace in a specific region, reading and writing files from/to the same region.

We also read from a Shared Catalog owned by a different company, a data provider, which points to multi-region S3 buckets.

The result is that we are incurring high NATGateway-Bytes and DataTransfer-Regional-Bytes bills.

 

Measures that we took to reduce cost:

Implemented an S3 Gateway Endpoint to route traffic between the instances managed by Databricks in private subnets and S3 in the same region. The idea is that this should reduce cost when reading from and writing to our S3 buckets in the same region, and when reading from the shared catalog that points to multi-region buckets, but we are still seeing no reduction in NATGateway-Bytes and DataTransfer-Regional-Bytes costs.

 

Are these costs inevitable? What could be wrong in our networking setup? Is there any other alternative?

Accepted Solutions

RaulPino
New Contributor III

Thanks @Kaniz_Fatma for all the suggestions.

After some days of monitoring NAT costs, I realized that the S3 Gateway Endpoint was actually working. The problem was that I expected the change to show up in costs right away, but it can take a bit more than 24 hours to become visible in AWS Cost Explorer.

From AWS docs: https://docs.aws.amazon.com/cost-management/latest/userguide/ce-exploring-data.html

All costs reflect your usage up to the previous day. For example, if today is December 2, the data includes your usage through December 1.
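To compare NAT costs day by day with that delay in mind, a small helper can build the request for Cost Explorer's `get_cost_and_usage` API. This is only a sketch: the function name `nat_cost_query` and the default usage type string are assumptions, and the usage type in your bill may carry a region prefix, so check the exact value in Cost Explorer first.

```python
from datetime import date, timedelta


def nat_cost_query(today, days=7, usage_types=("NatGateway-Bytes",)):
    """Build keyword arguments for Cost Explorer's get_cost_and_usage.

    The window ends at `today`, and End is exclusive, so the last day
    covered is yesterday, matching the delay described in the AWS docs.
    The default usage type is an assumption; adjust it to your bill.
    """
    start = today - timedelta(days=days)
    return {
        "TimePeriod": {"Start": start.isoformat(), "End": today.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost", "UsageQuantity"],
        "Filter": {
            "Dimensions": {"Key": "USAGE_TYPE", "Values": list(usage_types)}
        },
    }


# Hypothetical date, following the "December 2" example from the docs.
params = nat_cost_query(date(2024, 12, 2), days=3)
print(params["TimePeriod"])
```

With credentials configured, the same dictionary could be passed as `boto3.client("ce").get_cost_and_usage(**params)`.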

 

We already had VPC Flow Logs enabled in the VPC, so using the following queries in CloudWatch Logs Insights, I saw some reduction the first day, but I wasn't sure whether it was a real reduction or just coincidentally lower traffic:

# downloads in total
filter (dstAddr like '10.0.0.1' and not isIpv4InSubnet(srcAddr, '10.0.0.0/16')) | stats sum(bytes) as bytesTransferred
# uploads in total
filter (srcAddr like '10.0.0.1' and not isIpv4InSubnet(dstAddr, '10.0.0.0/16')) | stats sum(bytes) as bytesTransferred
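
If you would rather run these queries from a script than from the console, something like the following could work. It is a minimal sketch, not the method used in this post: `run_insights_query` is a hypothetical helper, and `logs_client` is assumed to behave like `boto3.client("logs")`, whose `start_query` and `get_query_results` calls it relies on.

```python
import time


def run_insights_query(logs_client, log_group, query, start_ts, end_ts,
                       poll_seconds=1.0):
    """Run one Logs Insights query and return its rows as dicts.

    start_ts and end_ts are epoch seconds. logs_client is assumed to
    expose start_query and get_query_results like the boto3 logs client.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_ts,
        endTime=end_ts,
        queryString=query,
    )["queryId"]
    # Poll until the query leaves the Scheduled/Running states.
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    # Each row is a list of {"field": ..., "value": ...} cells.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in response["results"]]


# The "downloads in total" query from this post, as a single string.
DOWNLOADS_QUERY = (
    "filter (dstAddr like '10.0.0.1' "
    "and not isIpv4InSubnet(srcAddr, '10.0.0.0/16')) "
    "| stats sum(bytes) as bytesTransferred"
)
```

Each query has to be submitted on its own, so the uploads query would be a second call with its own query string.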


So I needed to confirm that all inbound and outbound traffic between the subnets and S3 was actually going through the S3 Gateway Endpoint. After some research I found the list of all AWS IP ranges at https://docs.aws.amazon.com/vpc/latest/userguide/aws-ip-ranges.html, and then used this simple script to extract only the S3 IP ranges:

import json

# Load the AWS IP ranges JSON file
# (saved locally from https://ip-ranges.amazonaws.com/ip-ranges.json)
with open('aws-ips.json') as file:
    ip_ranges = json.load(file)

# Filter for S3 IPv4 prefixes in a specific region, e.g., us-east-1
s3_ips = [prefix["ip_prefix"] for prefix in ip_ranges["prefixes"]
          if prefix["service"] == "S3" and prefix["region"] == "us-east-1"]

print(s3_ips)
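
To check whether an individual address from the flow logs belongs to S3, the prefix list can be fed to Python's standard `ipaddress` module. This is a small sketch: `is_s3_address` is a hypothetical helper, and the hard-coded prefixes are only a sample of the us-east-1 ranges used in this post, which may have changed since.

```python
import ipaddress

# Sample of the us-east-1 S3 prefixes used in this post; in practice,
# pass the full s3_ips list produced by the script.
S3_PREFIXES = ["18.34.0.0/19", "54.231.0.0/16", "52.216.0.0/15", "3.5.0.0/19"]
S3_NETWORKS = [ipaddress.ip_network(prefix) for prefix in S3_PREFIXES]


def is_s3_address(address):
    """Return True if the address falls inside any of the S3 prefixes."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in S3_NETWORKS)


print(is_s3_address("52.217.10.20"))  # True: inside 52.216.0.0/15
print(is_s3_address("10.0.0.5"))      # False: private VPC address
```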


With that list, I was able to write more precise Logs Insights queries to check whether there was still any traffic between our NAT and S3:

# downloads from s3
filter (
   dstAddr like '10.0.0.1' and (
           isIpv4InSubnet(srcAddr, '18.34.0.0/19')
           or isIpv4InSubnet(srcAddr, '54.231.0.0/16')
           or isIpv4InSubnet(srcAddr, '52.216.0.0/15')
           or isIpv4InSubnet(srcAddr, '18.34.232.0/21')
           or isIpv4InSubnet(srcAddr, '16.182.0.0/16')
           or isIpv4InSubnet(srcAddr, '3.5.0.0/19')
           or isIpv4InSubnet(srcAddr, '44.192.134.240/28')
           or isIpv4InSubnet(srcAddr, '44.192.140.64/28')
       )
) | stats sum(bytes) as bytesTransferred
# uploads to s3
filter (
   srcAddr like '10.0.0.1' and (
           isIpv4InSubnet(dstAddr, '18.34.0.0/19')
           or isIpv4InSubnet(dstAddr, '54.231.0.0/16')
           or isIpv4InSubnet(dstAddr, '52.216.0.0/15')
           or isIpv4InSubnet(dstAddr, '18.34.232.0/21')
           or isIpv4InSubnet(dstAddr, '16.182.0.0/16')
           or isIpv4InSubnet(dstAddr, '3.5.0.0/19')
           or isIpv4InSubnet(dstAddr, '44.192.134.240/28')
           or isIpv4InSubnet(dstAddr, '44.192.140.64/28')
       )
) | stats sum(bytes) as bytesTransferred


After running these queries, I confirmed that there was no traffic between the NAT and S3, neither downloads nor uploads, right after the S3 Gateway Endpoint was deployed.
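
Since the S3 prefix list changes over time, the filter could also be generated from the list instead of being typed by hand. The following is a rough sketch under that assumption; `build_s3_query` is a hypothetical helper, and `10.0.0.1` stands in for the NAT address as in the queries above.

```python
def build_s3_query(nat_ip, s3_prefixes, direction):
    """Return a Logs Insights query for traffic between the NAT and S3.

    direction is "download" (S3 to NAT) or "upload" (NAT to S3).
    """
    if direction == "download":
        nat_field, peer_field = "dstAddr", "srcAddr"
    elif direction == "upload":
        nat_field, peer_field = "srcAddr", "dstAddr"
    else:
        raise ValueError("direction must be 'download' or 'upload'")
    if not s3_prefixes:
        raise ValueError("s3_prefixes must not be empty")
    # One isIpv4InSubnet check per prefix, joined with "or".
    subnet_checks = " or ".join(
        f"isIpv4InSubnet({peer_field}, '{prefix}')" for prefix in s3_prefixes
    )
    return (
        f"filter ({nat_field} like '{nat_ip}' and ({subnet_checks})) "
        "| stats sum(bytes) as bytesTransferred"
    )


print(build_s3_query("10.0.0.1", ["18.34.0.0/19", "3.5.0.0/19"], "download"))
```

Passing the full `s3_ips` list from the script produces the same shape of query as the hand-written ones, on a single line.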

 

NOTE:

A great tool I didn't know about before this was AWS Reachability Analyzer, which I used to check connectivity between the instances' ENIs in the private subnets and the S3 Gateway Endpoint.

 


4 REPLIES

Kaniz_Fatma
Community Manager

Hi @RaulPino, Reducing costs related to NATGateway-Bytes and DataTransfer-Regional-Bytes in your Databricks environment is crucial for efficient resource management. Let's explore some strategies and potential alternatives:

Azure Databricks Cost Management:

Networking Setup Considerations:

  • S3 Gateway Endpoint: Your implementation of an S3 Gateway Endpoint is a step in the right direction. It should reduce costs when reading and writing to S3 within the same region.
  • However, if you're still experiencing high costs, consider the following:
    • Shared Catalog Traffic: Check which traffic from your shared catalog actually goes through the S3 Gateway Endpoint. A gateway endpoint only covers S3 traffic in the same region as the VPC, so reads from buckets in other regions still leave through the NAT gateway. Ensure that the endpoint is correctly configured and that all same-region traffic is using it.
    • Availability Zones: Check whether the resources behind the NAT gateway (which generate the most traffic) are in the same Availability Zone as the NAT gateway itself. Data transfer within the same Availability Zone is free in AWS. If they are in different zones, consider adjusting your setup.
    • Data Transfer Patterns: Understand the data transfer patterns between your Databricks workspace, S3, and the shared catalog. Are there any unexpected cross-region transfers? Investigate whether certain jobs or clusters are causing excessive data movement.
    • VPC Endpoints: Explore using VPC endpoints for services like S3. These allow private communication between your VPC and supported AWS services without going over the internet. It might further reduce costs and improve security.
    • Direct Connect: If you have an AWS Direct Connect setup, review its data transfer charges. Direct Connect data transfer over a public or private virtual interface has specific usage types.

Reserved Instances and Spot Instances:

  • While not directly related to networking, consider using Reserved Instances for VMs or Databricks clusters. Reserved Instances offer cost savings over On-Demand pricing.
  • Additionally, explore using Spot Instances for non-critical workloads. Spot Instances can significantly reduce costs but come with the caveat that they can be preempted.

Continuous Monitoring and Optimization:

  • Regularly monitor your costs and usage patterns. Adjust your setup based on actual data and performance.
  • Collaborate with your team to ensure everyone understands the cost implications of their workloads.
