Scalable API/binary lookups

leelee3000
New Contributor III

We sometimes process large DataFrames that contain a column of IP addresses, and we need to associate an Autonomous System Number (ASN) with each IP address. The ASN information is provided by MaxMind as a binary data file that is only accessible through a Python function. We have tried using a UDF that calls the Python function; however, there are issues accessing the MaxMind binary data file. How can this be done using Databricks? Note that we can do this successfully with pandas DataFrames (on the Databricks platform), but we cannot rely on pandas at scale.

1 REPLY

Kaniz
Community Manager

Hi @leelee3000, You have a couple of options to associate Autonomous System Numbers (ASNs) with IP addresses in Databricks.

 

Let's explore them:

 

Using MaxMind's GeoLite2 ASN Database:

  • MaxMind provides a GeoLite2 ASN Database that contains information about ASNs associated with IPv4 and IPv6 addresses.
  • You can download this database in either binary format (MaxMind DB) or CSV format.
  • I recommend the CSV format for your use case since it's more straightforward to work with in Databricks.
  • The CSV files include details such as the IP network, ASN, and organization associated with each IP address.
  • Here's how you can proceed:
    • Download the GeoLite2-ASN-CSV_{YYYYMMDD}.zip file from MaxMind's developer portal.
    • Extract the zip file to access the CSV files containing the relevant data.
    • In Databricks, you can read these CSV files into a DataFrame using standard SQL or Spark APIs.
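    • Join your existing DataFrame to the ASN data on the network column, which holds IP networks in CIDR notation. Because each network covers a range of addresses, this is a range match (the IP address falls inside the network) rather than a plain equality join.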
    • The autonomous_system_number column will give you the ASN for each IP address, and the autonomous_system_organization column provides the associated organization.
    • This approach allows you to scale efficiently without relying on Pandas.
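
As a rough sketch of the CSV approach above, the CIDR blocks can be converted to integer ranges and joined with a range condition. This assumes IPv4 addresses only and a `blocks_df` read from the IPv4 blocks CSV in the zip with its header row. The function names (`cidr_to_range`, `ip_to_int`, `attach_asn`) and the `ip` column name are illustrative, not part of any MaxMind or Databricks API.

```python
import ipaddress
from typing import Optional, Tuple


def cidr_to_range(network: str) -> Tuple[int, int]:
    """Return the inclusive (first, last) integer addresses of an IPv4 CIDR block."""
    net = ipaddress.IPv4Network(network, strict=False)
    return int(net.network_address), int(net.broadcast_address)


def ip_to_int(ip: Optional[str]) -> Optional[int]:
    """Convert a dotted IPv4 string to an integer; None for missing or malformed input."""
    if ip is None:
        return None
    try:
        return int(ipaddress.IPv4Address(ip))
    except ValueError:
        return None


def attach_asn(ips_df, blocks_df, ip_col="ip"):
    """Left-join a Spark DataFrame of IPs to ASN blocks with a range condition.

    blocks_df is assumed to have the columns network,
    autonomous_system_number and autonomous_system_organization.
    """
    # Imported lazily so the helpers above can be used without Spark.
    from pyspark.sql import functions as F, types as T

    range_type = T.StructType([
        T.StructField("start", T.LongType()),
        T.StructField("end", T.LongType()),
    ])
    to_range = F.udf(cidr_to_range, range_type)
    to_int = F.udf(ip_to_int, T.LongType())

    blocks = (
        blocks_df
        .withColumn("r", to_range(F.col("network")))
        .select(
            F.col("r.start").alias("net_start"),
            F.col("r.end").alias("net_end"),
            "autonomous_system_number",
            "autonomous_system_organization",
        )
    )
    ips = ips_df.withColumn("ip_int", to_int(F.col(ip_col)))
    in_block = (ips["ip_int"] >= blocks["net_start"]) & (ips["ip_int"] <= blocks["net_end"])
    # The blocks table is small compared with the IP data, so it is broadcast here.
    return (
        ips.join(F.broadcast(blocks), in_block, "left")
        .drop("ip_int", "net_start", "net_end")
    )
```

The UDFs here only run the small conversion step; the match itself is a Spark join. A non-equality join against a broadcast table can still be expensive on very large inputs, so it may be worth checking whether Databricks' range join optimization applies to your workload.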

Python UDF with MaxMind's Python Client API:

  • If you prefer using the binary MaxMind DB format, you can create a Python UDF in Databricks.
  • First, download the MaxMind ASN database in binary format.
  • Next, write a Python function that takes an IP address as input and returns the corresponding ASN.
  • In Databricks, create a UDF that calls this Python function for each IP address in your DataFrame.
  • Apply the UDF to the IP address column to generate a new column with the associated ASNs.
  • Remember that Python UDFs can be slower than native Spark operations, especially at scale. However, they provide flexibility when working with custom libraries or APIs.
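
A minimal sketch of the UDF approach follows. File-access problems like the one described in the question often arise because the database file exists only on the driver, or because the reader is opened on the driver and cannot be shipped to the workers. The sketch assumes the `geoip2` package is installed on the cluster and that `sparkContext.addFile` is available in your cluster's access mode. `make_asn_lookup`, `add_asn_column`, and the `ip` and `asn` column names are hypothetical names chosen for illustration.

```python
import os


def make_asn_lookup(open_reader):
    """Build an ip -> ASN function that opens the reader lazily, on first use.

    open_reader is any zero-argument callable returning an object with an
    asn(ip) method, such as geoip2.database.Reader.
    """
    state = {}  # empty when the function is serialized; filled on the worker

    def lookup(ip):
        if ip is None:
            return None
        if "reader" not in state:
            state["reader"] = open_reader()  # failures to open the file surface here
        try:
            return state["reader"].asn(ip).autonomous_system_number
        except Exception:
            # Address not in the database, or a malformed IP string.
            return None

    return lookup


def add_asn_column(spark, df, db_path, ip_col="ip"):
    """Ship the .mmdb file to every worker and add an 'asn' column via a UDF."""
    # Imported lazily so make_asn_lookup can be used without Spark.
    from pyspark.sql import functions as F, types as T

    spark.sparkContext.addFile(db_path)  # copies the file to each node
    file_name = os.path.basename(db_path)

    def open_reader():
        # Runs on the worker: resolve the local copy of the shipped file.
        from pyspark import SparkFiles
        import geoip2.database
        return geoip2.database.Reader(SparkFiles.get(file_name))

    asn_udf = F.udf(make_asn_lookup(open_reader), T.LongType())
    return df.withColumn("asn", asn_udf(F.col(ip_col)))
```

Opening the reader inside the lookup, rather than on the driver, means only the file name is serialized with the UDF, and the file is opened once per deserialized copy of the function instead of once per row. If `addFile` is not an option, passing a path that every worker can read (for example a `/dbfs/...` path) straight to the reader is a possible alternative.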

Remember to handle privacy considerations and comply with data privacy regulations when working with IP addresses and ASNs. MaxMind provides privacy exclusions APIs to help you manage this.

 

Feel free to choose the approach that best suits your requirements and performance needs! 🚀
