02-13-2024 07:27 AM
Use case and context:
We have a Databricks workspace in a specific AWS region, reading and writing files from/to S3 in the same region.
We also read from a shared catalog owned by a different company, a data provider, which points to multi-region S3 buckets.
As a result, we are incurring high NATGateway-Bytes and DataTransfer-Regional-Bytes charges.
Measures that we took to reduce cost:
We implemented an S3 Gateway Endpoint to route traffic between the Databricks-managed instances in private subnets and S3 in the same region. The idea is that this should reduce cost when reading from and writing to our own S3 buckets in the same region, and when reading from the shared catalog pointing to multi-region buckets, but we are still seeing no reduction in NATGateway-Bytes and DataTransfer-Regional-Bytes costs.
Are these costs inevitable? What could be wrong in our networking setup? Is there any other alternative?
02-21-2024 06:04 AM
Thanks @Kaniz_Fatma for all the suggestions.
After some days of monitoring NAT cost, I realized that the S3 Gateway Endpoint was actually working. The problem was that I expected the change to show up in costs right away, but it can take a bit more than 24 hours to become visible in AWS Cost Explorer.
From AWS docs: https://docs.aws.amazon.com/cost-management/latest/userguide/ce-exploring-data.html
All costs reflect your usage up to the previous day. For example, if today is December 2, the data includes your usage through December 1.
We already had VPC Flow Logs enabled, so using the following queries in CloudWatch Logs Insights, I saw some reduction on the first day, but I wasn't sure whether it was a real reduction or just coincidentally lower traffic:
# downloads in total
filter (dstAddr like '10.0.0.1' and not isIpv4InSubnet(srcAddr, '10.0.0.0/16')) | stats sum(bytes) as bytesTransferred
# uploads in total
filter (srcAddr like '10.0.0.1' and not isIpv4InSubnet(dstAddr, '10.0.0.0/16')) | stats sum(bytes) as bytesTransferred
So I needed to confirm that all inbound/outbound traffic between the subnets and S3 was actually going through the S3 Gateway Endpoint. After some research I found all AWS IP ranges here https://docs.aws.amazon.com/vpc/latest/userguide/aws-ip-ranges.html, and then used this simple script to extract only the S3 IP ranges:
import json

# Load the AWS IP ranges JSON file (saved locally as aws-ips.json)
with open('aws-ips.json') as file:
    ip_ranges = json.load(file)

# Filter for S3 IPs in a specific region, e.g., us-east-1
# (the loop variable is named "entry" to avoid shadowing the built-in range)
s3_ips = [
    entry["ip_prefix"]
    for entry in ip_ranges["prefixes"]
    if entry["service"] == "S3" and entry["region"] == "us-east-1"
]
print(s3_ips)
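To spot-check individual addresses that show up in the flow logs, the extracted prefixes can be tested with Python's standard `ipaddress` module. The following is only a minimal sketch: the helper names `s3_prefixes` and `is_s3_address` are hypothetical, and the `sample` dictionary is a hand-made stand-in in the same shape as the downloaded file (its second and third entries are made up for illustration), not real AWS data.

```python
import ipaddress


def s3_prefixes(ip_ranges, region):
    """Return the IPv4 S3 prefixes for one region from a parsed IP ranges dict."""
    return [
        entry["ip_prefix"]
        for entry in ip_ranges["prefixes"]
        if entry["service"] == "S3" and entry["region"] == region
    ]


def is_s3_address(address, prefixes):
    """Return True if the address falls inside any of the given CIDR prefixes."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(prefix) for prefix in prefixes)


# Hand-made sample shaped like the AWS file; only the first entry is taken
# from the ranges listed in this post, the others are illustrative.
sample = {
    "prefixes": [
        {"ip_prefix": "52.216.0.0/15", "region": "us-east-1", "service": "S3"},
        {"ip_prefix": "192.0.2.0/24", "region": "eu-west-1", "service": "S3"},
        {"ip_prefix": "198.51.100.0/24", "region": "us-east-1", "service": "EC2"},
    ]
}

prefixes = s3_prefixes(sample, "us-east-1")
print(prefixes)
print(is_s3_address("52.216.10.1", prefixes))
print(is_s3_address("192.0.2.10", prefixes))
```

Running the same check with another region name is a quick way to see whether an address seen at the NAT belongs to S3 in a different region than the workspace.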
With those ranges I was able to write more precise Logs Insights queries to check whether there was still any traffic between our NAT and S3:
# downloads from s3
filter (
  dstAddr like '10.0.0.1' and (
    isIpv4InSubnet(srcAddr, '18.34.0.0/19')
    or isIpv4InSubnet(srcAddr, '54.231.0.0/16')
    or isIpv4InSubnet(srcAddr, '52.216.0.0/15')
    or isIpv4InSubnet(srcAddr, '18.34.232.0/21')
    or isIpv4InSubnet(srcAddr, '16.182.0.0/16')
    or isIpv4InSubnet(srcAddr, '3.5.0.0/19')
    or isIpv4InSubnet(srcAddr, '44.192.134.240/28')
    or isIpv4InSubnet(srcAddr, '44.192.140.64/28')
  )
) | stats sum(bytes) as bytesTransferred

# uploads to s3
filter (
  srcAddr like '10.0.0.1' and (
    isIpv4InSubnet(dstAddr, '18.34.0.0/19')
    or isIpv4InSubnet(dstAddr, '54.231.0.0/16')
    or isIpv4InSubnet(dstAddr, '52.216.0.0/15')
    or isIpv4InSubnet(dstAddr, '18.34.232.0/21')
    or isIpv4InSubnet(dstAddr, '16.182.0.0/16')
    or isIpv4InSubnet(dstAddr, '3.5.0.0/19')
    or isIpv4InSubnet(dstAddr, '44.192.134.240/28')
    or isIpv4InSubnet(dstAddr, '44.192.140.64/28')
  )
) | stats sum(bytes) as bytesTransferred
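Since the published S3 ranges change over time, the filter can be generated from the prefix list instead of being written by hand. This is a rough sketch under assumptions: `build_s3_filter` is a hypothetical helper, `10.0.0.1` stands for the NAT's private IP as in the queries above, and the output simply mirrors the query shape used in this post.

```python
def build_s3_filter(nat_ip, prefixes, direction):
    """Build a Logs Insights query string for NAT <-> S3 traffic.

    direction is "download" (S3 -> NAT) or "upload" (NAT -> S3).
    """
    if direction == "download":
        nat_field, peer_field = "dstAddr", "srcAddr"
    elif direction == "upload":
        nat_field, peer_field = "srcAddr", "dstAddr"
    else:
        raise ValueError("direction must be 'download' or 'upload'")

    # One isIpv4InSubnet() clause per S3 prefix, joined with "or"
    clauses = "\n    or ".join(
        f"isIpv4InSubnet({peer_field}, '{prefix}')" for prefix in prefixes
    )
    return (
        "filter (\n"
        f"  {nat_field} like '{nat_ip}' and (\n"
        f"    {clauses}\n"
        "  )\n"
        ") | stats sum(bytes) as bytesTransferred"
    )


# Two of the prefixes listed above, used here only as example input
query = build_s3_filter("10.0.0.1", ["3.5.0.0/19", "52.216.0.0/15"], "download")
print(query)
```

The sketch keeps `like` to match the queries above. As far as I know `like` matches substrings, so if other addresses in the VPC start with the same digits, an exact comparison with `=` may be the safer choice.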
After running these queries, I confirmed that there was no traffic, neither download nor upload, between the NAT and S3 from the moment the S3 Gateway Endpoint was deployed.
NOTE:
A great tool I didn't know about before this was AWS Reachability Analyzer, which I used to check connectivity between the instances' ENIs in the private subnets and the S3 Gateway Endpoint.
02-14-2024 01:32 AM - edited 02-14-2024 01:32 AM
Hi @RaulPino, Reducing costs related to NATGateway-Bytes and DataTransfer-Regional-Bytes in your Databricks environment is crucial for efficient resource management. Let's explore some strategies and potential alternatives:
Azure Databricks Cost Management:
Networking Setup Considerations:
Reserved Instances and Spot Instances:
Continuous Monitoring and Optimization: