
Databricks connection pool(ing) using Python

prateekgoyal
New Contributor II

I have a Python application that I need to connect to Databricks (via the Databricks SQL Connector). The process for that is quite simple -

1. Connect to Databricks
2. Execute your query
3. Fetch the result
4. Close the connection

Example from - https://docs.databricks.com/en/dev-tools/python-sql-connector.html#manage-cursors-and-connections

 

from databricks import sql
import os

connection = sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_TOKEN"),
)

cursor = connection.cursor()

cursor.execute("SELECT * from range(10)")
print(cursor.fetchall())

cursor.close()
connection.close()

 

But the issue here is that connecting to Databricks takes almost 15-16 seconds even in the best case, which is too much for the API I am building.

So to overcome this, I implemented a custom "connection pooling" technique -

 

from databricks import sql
from app.utils.constants import DATABRICKS_SERVER_HOSTNAME, DATABRICKS_HTTP_PATH, DATABRICKS_TOKEN
import threading
import logging

logging.basicConfig(level=logging.INFO)

class ConnectionPool:
    def __init__(self, max_size=5):
        self.pool = []
        self.max_size = max_size
        self.lock = threading.Lock()

    def create_connection(self):
        logging.info("Creating a new connection")
        return sql.connect(
            server_hostname=DATABRICKS_SERVER_HOSTNAME,
            http_path=DATABRICKS_HTTP_PATH,
            access_token=DATABRICKS_TOKEN,
        )

    def get_connection(self):
        with self.lock:
            if self.pool:
                logging.info(f"Reusing connection from pool. Pool size before: {len(self.pool)}")
                return self.pool.pop()
            else:
                logging.info("No available connections in pool, creating a new one")
                return self.create_connection()

    def release_connection(self, connection):
        with self.lock:
            if len(self.pool) < self.max_size:
                self.pool.append(connection)
                logging.info(f"Connection released back to pool. Pool size now: {len(self.pool)}")
            else:
                connection.close()
                logging.info("Pool is full. Closing the connection")

connection_pool = ConnectionPool(max_size=5)
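
For reference, usage in a request path would look roughly like this (a simplified sketch; run_query and the sample query are just illustrations):

# Usage sketch: borrow a connection, run the query, always release it back.
def run_query(query: str):
    connection = connection_pool.get_connection()
    try:
        cursor = connection.cursor()
        try:
            cursor.execute(query)
            return cursor.fetchall()
        finally:
            cursor.close()
    finally:
        # Release instead of close, so the next request can skip
        # the ~15 second connection setup.
        connection_pool.release_connection(connection)

print(run_query("SELECT * FROM range(10)"))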

 

I know there are various factors I still have to account for as part of "best practices", like managing connection lifetimes (connection timeout, max connection lifetime), connection validation, and retry logic - which I believe can be a topic for future discussions.
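
As a rough illustration of the connection-validation piece (a sketch only; it probes a pooled connection with a cheap query before reuse and replaces it if it has gone stale):

def is_connection_alive(connection):
    # Validate with a cheap query; a stale or broken connection raises here.
    try:
        cursor = connection.cursor()
        try:
            cursor.execute("SELECT 1")
            cursor.fetchall()
        finally:
            cursor.close()
        return True
    except Exception:
        return False

def get_valid_connection(pool):
    # Borrow from the pool, but swap in a fresh connection if it has gone stale.
    connection = pool.get_connection()
    if not is_connection_alive(connection):
        try:
            connection.close()
        except Exception:
            pass  # already broken; nothing more to do
        connection = pool.create_connection()
    return connection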

Now my main question is: is the technique I have used good enough for a basic production PoC, or can connection pooling like this deteriorate performance (and turn out to be a mistake in the long run)? Or is there an out-of-the-box solution for something like this (for example, Rails' ActiveRecord has a similar concept built in)?

Any thoughts/suggestions/reviews are welcome 🙂

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @prateekgoyal, For best practices, consider managing connection lifetimes by closing and recreating them periodically, validating connections before reuse to avoid issues with stale connections, and implementing retry logic for failed attempts. Be aware of potential issues such as resource leaks and ensure thread safety. While Databricks doesn’t have built-in pooling, third-party libraries like SQLAlchemy and DBUtils offer robust solutions. For example, SQLAlchemy can be configured for efficient pooling with settings like pool_size and max_overflow. Adjust pool sizes according to your needs and implement monitoring to track performance. Leveraging these established libraries can enhance your setup and minimize long-term performance issues.
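
A minimal sketch of that SQLAlchemy configuration (assuming the databricks-sqlalchemy dialect package is installed and the environment variables from the first snippet; the pool settings are illustrative):

import os

from sqlalchemy import create_engine, text

engine = create_engine(
    f"databricks://token:{os.getenv('DATABRICKS_TOKEN')}"
    f"@{os.getenv('DATABRICKS_SERVER_HOSTNAME')}"
    f"?http_path={os.getenv('DATABRICKS_HTTP_PATH')}",
    pool_size=5,         # connections kept open in the pool
    max_overflow=2,      # extra connections allowed under burst load
    pool_recycle=3600,   # proactively recycle connections older than an hour
    pool_pre_ping=True,  # validate a connection before handing it out
)

with engine.connect() as conn:
    # The connection is borrowed from the pool and returned on exit rather
    # than closed, so later requests skip the expensive handshake.
    print(conn.execute(text("SELECT * FROM range(10)")).fetchall())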

prateekgoyal
New Contributor II

Hi @Kaniz_Fatma, thanks for your reply. I have two follow-up questions for you:

  1. When you say "consider managing connection lifetimes by closing and recreating them periodically" - the problem I face with this is that when an API request comes in and I create and close a connection within its synchronous lifecycle, the connection time is almost 15-16 seconds, which adds that much to the API's overall response time. That is not suitable for our application. Am I understanding you correctly?
  2. You suggested using "third-party libraries like SQLAlchemy and DBUtils" - do you mean I shouldn't connect to Databricks directly, but rather go through these libraries to take advantage of their connection pooling?
