I have a python application which I need to connect with databricks (Databricks SQL Connector). The process for that is quite simple -
1. connect with dbr
2. execute your query
3. get the result
4. close the connection
Example from - https://docs.databricks.com/en/dev-tools/python-sql-connector.html#manage-cursors-and-connections
from databricks import sql
import os
connection = sql.connect(server_hostname = os.getenv("DATABRICKS_SERVER_HOSTNAME"),
http_path = os.getenv("DATABRICKS_HTTP_PATH"),
access_token = os.getenv("DATABRICKS_TOKEN"))
cursor = connection.cursor()
cursor.execute("SELECT * from range(10)")
print(cursor.fetchall())
cursor.close()
connection.close()
But the issue here is that connecting to dbr takes almost 15-16 seconds in best case scenarios which is too much for an API I am building.
So to overcome, I implemented a custom "connection pooling" technique -
from databricks import sql
from app.utils.constants import DATABRICKS_SERVER_HOSTNAME, DATABRICKS_HTTP_PATH, DATABRICKS_TOKEN
import threading
import logging
logging.basicConfig(level=logging.INFO)
class ConnectionPool:
def __init__(self, max_size=5):
self.pool = []
self.max_size = max_size
self.lock = threading.Lock()
def create_connection(self):
logging.info("Creating a new connection")
return sql.connect(server_hostname=DATABRICKS_SERVER_HOSTNAME,
http_path=DATABRICKS_HTTP_PATH,
access_token=DATABRICKS_TOKEN)
def get_connection(self):
with self.lock:
if self.pool:
logging.info(f"Reusing connection from pool. Pool size before: {len(self.pool)}")
return self.pool.pop()
else:
logging.info("No available connections in pool, creating a new one")
return self.create_connection()
def release_connection(self, connection):
with self.lock:
if len(self.pool) < self.max_size:
self.pool.append(connection)
logging.info(f"Connection released back to pool. Pool size now: {len(self.pool)}")
else:
connection.close()
logging.info("Pool is full. Closing the connection")
connection_pool = ConnectionPool(max_size=5)
I know there are various factors which I have to factor as part of "best practices" like Manage Connection Lifetimes (Connection Timeout, Max Connection Lifetime), Connection Validation and Retry Logic - which I believe can be a topic of future discussions.
Now my main point currently is, the technique I have used is good enough for a basic production PoC or connection pooling can deteriorate performance (and be a mistake in long-run) or is there any out-of-the box solution for something like this (for ex - in rails active-record also has similar concept but built-in) ?
Any thoughts/suggestions/reviews are welcome 🙂