
Databricks connection pool(ing) using Python

prateekgoyal
New Contributor II

I have a Python application that I need to connect to Databricks (via the Databricks SQL Connector). The process for that is quite simple -

1. Connect to Databricks
2. Execute your query
3. Fetch the result
4. Close the connection

Example from - https://docs.databricks.com/en/dev-tools/python-sql-connector.html#manage-cursors-and-connections

 

from databricks import sql
import os

connection = sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_TOKEN"),
)

cursor = connection.cursor()

cursor.execute("SELECT * from range(10)")
print(cursor.fetchall())

cursor.close()
connection.close()

 

But the issue here is that connecting to Databricks takes almost 15-16 seconds even in the best case, which is too much for the API I am building.

So to overcome this, I implemented a custom "connection pooling" technique -

 

from databricks import sql
from app.utils.constants import DATABRICKS_SERVER_HOSTNAME, DATABRICKS_HTTP_PATH, DATABRICKS_TOKEN
import threading
import logging

logging.basicConfig(level=logging.INFO)

class ConnectionPool:
    def __init__(self, max_size=5):
        self.pool = []
        self.max_size = max_size
        self.lock = threading.Lock()

    def create_connection(self):
        logging.info("Creating a new connection")
        return sql.connect(
            server_hostname=DATABRICKS_SERVER_HOSTNAME,
            http_path=DATABRICKS_HTTP_PATH,
            access_token=DATABRICKS_TOKEN,
        )

    def get_connection(self):
        with self.lock:
            if self.pool:
                logging.info(f"Reusing connection from pool. Pool size before: {len(self.pool)}")
                return self.pool.pop()
            else:
                logging.info("No available connections in pool, creating a new one")
                return self.create_connection()

    def release_connection(self, connection):
        with self.lock:
            if len(self.pool) < self.max_size:
                self.pool.append(connection)
                logging.info(f"Connection released back to pool. Pool size now: {len(self.pool)}")
            else:
                connection.close()
                logging.info("Pool is full. Closing the connection")

connection_pool = ConnectionPool(max_size=5)
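
For reference, usage in a request path would look roughly like this (a simplified sketch; run_query and the sample query are just illustrations):

# Usage sketch: borrow a connection, run the query, always release it back.
def run_query(query: str):
    connection = connection_pool.get_connection()
    try:
        cursor = connection.cursor()
        try:
            cursor.execute(query)
            return cursor.fetchall()
        finally:
            cursor.close()
    finally:
        # Release instead of close, so the next request can skip
        # the ~15 second connection setup.
        connection_pool.release_connection(connection)

print(run_query("SELECT * FROM range(10)"))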

 

I know there are various factors I still have to account for as part of "best practices", like managing connection lifetimes (connection timeout, max connection lifetime), connection validation, and retry logic - which I believe can be a topic for future discussions.
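
As a rough illustration of the connection-validation piece (a sketch only; it probes a pooled connection with a cheap query before reuse and replaces it if it has gone stale):

def is_connection_alive(connection):
    # Validate with a cheap query; a stale or broken connection raises here.
    try:
        cursor = connection.cursor()
        try:
            cursor.execute("SELECT 1")
            cursor.fetchall()
        finally:
            cursor.close()
        return True
    except Exception:
        return False

def get_valid_connection(pool):
    # Borrow from the pool, but swap in a fresh connection if it has gone stale.
    connection = pool.get_connection()
    if not is_connection_alive(connection):
        try:
            connection.close()
        except Exception:
            pass  # already broken; nothing more to do
        connection = pool.create_connection()
    return connection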

Now my main question is: is the technique I have used good enough for a basic production PoC, or can connection pooling like this deteriorate performance (and turn out to be a mistake in the long run)? Or is there an out-of-the-box solution for something like this (for example, Rails' ActiveRecord has a similar concept built in)?

Any thoughts/suggestions/reviews are welcome 🙂

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @prateekgoyal, For best practices, consider managing connection lifetimes by closing and recreating them periodically, validating connections before reuse to avoid issues with stale connections, and implementing retry logic for failed attempts. Be aware of potential issues such as resource leaks and ensure thread safety. While Databricks doesn’t have built-in pooling, third-party libraries like SQLAlchemy and DBUtils offer robust solutions. For example, SQLAlchemy can be configured for efficient pooling with settings like pool_size and max_overflow. Adjust pool sizes according to your needs and implement monitoring to track performance. Leveraging these established libraries can enhance your setup and minimize long-term performance issues.
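
A minimal sketch of that SQLAlchemy configuration (assuming the databricks-sqlalchemy dialect package is installed and the environment variables from the first snippet; the pool settings are illustrative):

import os

from sqlalchemy import create_engine, text

engine = create_engine(
    f"databricks://token:{os.getenv('DATABRICKS_TOKEN')}"
    f"@{os.getenv('DATABRICKS_SERVER_HOSTNAME')}"
    f"?http_path={os.getenv('DATABRICKS_HTTP_PATH')}",
    pool_size=5,         # connections kept open in the pool
    max_overflow=2,      # extra connections allowed under burst load
    pool_recycle=3600,   # proactively recycle connections older than an hour
    pool_pre_ping=True,  # validate a connection before handing it out
)

with engine.connect() as conn:
    # The connection is borrowed from the pool and returned on exit rather
    # than closed, so later requests skip the expensive handshake.
    print(conn.execute(text("SELECT * FROM range(10)")).fetchall())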

prateekgoyal
New Contributor II

Hi @Kaniz_Fatma, thanks for your reply. I have two follow-up questions for you:

  1. When you say "consider managing connection lifetimes by closing and recreating them periodically" - the problem I face with this is that when an API request comes in and I create and close a connection within its synchronous lifecycle, the connection time is almost 15-16 seconds, which adds that much to the API's overall response time. That is not suitable for our application. Am I understanding you correctly?
  2. You suggested using "third-party libraries like SQLAlchemy and DBUtils" - do you mean I shouldn't connect to Databricks directly, but rather go through these libraries to take advantage of their connection pooling?
