The Databricks SQL Connector is fundamentally synchronous: it has no native
async/await support. This creates an inherent mismatch when you want to use it in an
async Python application.
Why This Matters
When you make a database query to Databricks:
1. Your Python code sends a request over the network
2. Databricks processes the query
3. Results come back over the network
4. Your code waits (blocks) during all of this
In synchronous code, this blocking stops your entire program. In async code, you want
other tasks to run during these waiting periods.
The Solution: Thread-Based Concurrency Wrapper
Since the Databricks connector itself can't be made async, you wrap the synchronous
operations in threads and use async primitives to manage those threads (a minimal
sketch follows the conceptual flow below):
Conceptual Flow:
1. Your async code wants to query Databricks
2. You offload the synchronous Databricks call to a background thread
3. Your async event loop continues running other tasks
4. When the database query completes in the background thread, it returns to your async
code
5. You can run multiple queries concurrently by spinning up multiple threads
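Here's a minimal sketch of that flow. The connection details are placeholders you'd substitute with your own workspace values:

```python
import asyncio
from databricks import sql  # pip install databricks-sql-connector

# Placeholder connection details -- substitute your workspace values.
CONN_ARGS = dict(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abc123",
    access_token="dapi-...",
)

def run_query(query: str) -> list:
    # Synchronous Databricks call: blocks its thread until results return.
    with sql.connect(**CONN_ARGS) as conn:
        with conn.cursor() as cursor:
            cursor.execute(query)
            return cursor.fetchall()

async def query_async(query: str) -> list:
    # Offload the blocking call to a worker thread; the event loop keeps
    # serving other tasks while this await is pending.
    return await asyncio.to_thread(run_query, query)
```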
Two Architectural Approaches
Approach 1: Native Databricks Connector in Threads
- Use the databricks.sql.connect() API directly (not SQLAlchemy)
- Wrap each connection/query in a thread executor
- Simpler, more predictable, fewer layers
- Direct control over connection lifecycle (see the sketch below)
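As a sketch of that lifecycle control, here's a hypothetical wrapper that owns one long-lived connection. It assumes a single connection should not be used by two threads at once, so access is serialized with a lock:

```python
import asyncio
import threading
from databricks import sql

class DatabricksClient:
    # Hypothetical wrapper: one long-lived connection whose lifecycle the
    # application controls explicitly, guarded by a lock.

    def __init__(self, **conn_args):
        self._conn = sql.connect(**conn_args)
        self._lock = threading.Lock()

    def _query(self, statement: str) -> list:
        with self._lock:  # serialize threads sharing this connection
            with self._conn.cursor() as cursor:
                cursor.execute(statement)
                return cursor.fetchall()

    async def query(self, statement: str) -> list:
        return await asyncio.to_thread(self._query, statement)

    def close(self) -> None:
        self._conn.close()
```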
Approach 2: SQLAlchemy Dialect in Threads
- Use SQLAlchemy's ORM/Core with Databricks dialect
- Still wrapped in threads (no true async SQLAlchemy support for Databricks)
- Benefit: fits naturally if your application already uses SQLAlchemy patterns
- Drawback: an additional abstraction layer and connection string format variations
(see the sketch below)
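A sketch of Approach 2, assuming the connector's SQLAlchemy dialect is installed. The URL format below is illustrative and has varied between connector versions, so check the docs for your installed version:

```python
import asyncio
from sqlalchemy import create_engine, text

# Illustrative Databricks dialect URL -- check your connector version's docs.
engine = create_engine(
    "databricks://token:dapi-...@adb-1234567890123456.7.azuredatabricks.net"
    "?http_path=/sql/1.0/warehouses/abc123&catalog=main&schema=default"
)

def run_query(statement: str) -> list:
    with engine.connect() as conn:  # checks a connection out of the pool
        return list(conn.execute(text(statement)))

async def query_async(statement: str) -> list:
    # Same thread-offloading pattern; SQLAlchemy is just another sync layer.
    return await asyncio.to_thread(run_query, statement)
```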
Key Architectural Components
Thread Pool Executor:
- Manages a pool of worker threads
- Prevents creating too many threads (resource exhaustion)
- Reuses threads across multiple queries
- Should match or slightly exceed your connection pool size (example below)
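For explicit control over the thread count, you can hand a bounded ThreadPoolExecutor to loop.run_in_executor. The size of 10 is illustrative, and run_query is reused from the earlier sketches:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# At most 10 queries block worker threads at once; further requests
# queue inside the executor rather than spawning new threads.
executor = ThreadPoolExecutor(max_workers=10, thread_name_prefix="dbsql")

async def query_async(statement: str) -> list:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, run_query, statement)
```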
Connection Pooling:
- SQLAlchemy maintains a pool of database connections
- Connections are expensive to create (network handshake, authentication)
- Pool keeps connections alive and reuses them
- Must configure pool size, overflow, and timeout settings (example below)
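With SQLAlchemy, these settings are keyword arguments to create_engine. The numbers below are illustrative and should line up with your thread pool size:

```python
from sqlalchemy import create_engine

# URL as in the Approach 2 sketch (illustrative format).
DATABRICKS_URL = (
    "databricks://token:dapi-...@adb-1234567890123456.7.azuredatabricks.net"
    "?http_path=/sql/1.0/warehouses/abc123"
)

engine = create_engine(
    DATABRICKS_URL,
    pool_size=10,         # connections kept open in the pool
    max_overflow=5,       # temporary extra connections under burst load
    pool_timeout=30,      # seconds to wait for a free connection
    pool_pre_ping=True,   # verify liveness before handing out a connection
)
```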
Event Loop Integration:
- Your async framework (asyncio) manages the event loop
- asyncio.to_thread() or loop.run_in_executor() bridges the sync and async worlds
- The event loop schedules when thread results are processed
- Allows concurrent execution of multiple database operations (example below)
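Putting the pieces together, concurrency is plain asyncio composition. This reuses query_async from the earlier sketches, and the table names are hypothetical:

```python
import asyncio

async def run_report() -> None:
    # Each query blocks its own worker thread; the event loop overlaps the
    # waits, so total latency approaches the slowest single query.
    users, orders, events = await asyncio.gather(
        query_async("SELECT count(*) FROM main.default.users"),
        query_async("SELECT count(*) FROM main.default.orders"),
        query_async("SELECT count(*) FROM main.default.events"),
    )
    print(users, orders, events)

asyncio.run(run_report())
```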
Connection Requirements for Azure Databricks
You need four essential pieces of information (pulled together in the example below):
1. Workspace Hostname: Your Azure Databricks workspace URL
2. HTTP Path: Points to a specific SQL Warehouse or cluster
3. Authentication Token: Personal Access Token or Azure AD token
4. Catalog/Schema: Unity Catalog and schema names (optional but recommended)
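In practice these usually come from environment variables. The variable names below are conventional rather than required, and catalog/schema keyword support depends on your connector version:

```python
import os
from databricks import sql

connection = sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],  # 1. workspace URL
    http_path=os.environ["DATABRICKS_HTTP_PATH"],              # 2. SQL Warehouse path
    access_token=os.environ["DATABRICKS_TOKEN"],               # 3. PAT or Azure AD token
    catalog="main",    # 4. optional Unity Catalog default
    schema="default",  #    and schema default
)
```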
The Authentication Flow
1. Your application authenticates with Azure Databricks (token-based)
2. Connection is established to a SQL Warehouse (compute resource)
3. Queries are executed on that warehouse
4. Results stream back through the connection
5. Connection is returned to the pool or closed
Performance Considerations
Concurrency vs Parallelism:
- You're achieving concurrency (multiple operations in progress)
- But NOT true parallelism (the Databricks connector still blocks a thread per query)
- Each concurrent query consumes a thread
- Too many threads = diminishing returns or resource exhaustion
Optimal Configuration:
- Thread pool size should match expected concurrent query load
- Connection pool size should align with thread pool
- Consider the SQL Warehouse's query concurrency limits (see the semaphore sketch below)
- Monitor for thread starvation or connection exhaustion
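One simple guard is an asyncio.Semaphore that keeps in-flight queries at or below the warehouse's limit. The limit of 10 is illustrative, and query_async is reused from the earlier sketches:

```python
import asyncio

MAX_CONCURRENT_QUERIES = 10  # match your SQL Warehouse's concurrency limit
_query_slots = asyncio.Semaphore(MAX_CONCURRENT_QUERIES)

async def query_limited(statement: str) -> list:
    # Excess callers wait here instead of piling onto the warehouse.
    async with _query_slots:
        return await query_async(statement)
```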
Limitations to Understand
1. Not True Async I/O: You're not getting non-blocking I/O benefits, just concurrent
thread management
2. Thread Overhead: Each thread has memory overhead (typically 1-8 MB)
3. GIL Considerations: Python's Global Interpreter Lock means threads don't provide CPU
parallelism (but fine for I/O-bound database operations)
4. Connection Limits: SQL Warehouses have maximum concurrent query limits
When This Approach Works Well
- You have multiple independent queries to run concurrently
- Your application is already async-based (FastAPI, aiohttp, etc.)
- You need to integrate Databricks queries without blocking your event loop
- You're I/O bound (waiting on database) not CPU bound
When to Consider Alternatives
- If you only run sequential queries, async adds complexity without benefit
- For very high concurrency (1000+ concurrent queries), consider a queue-based
architecture
- If you need true streaming results, explore Databricks' native streaming APIs
- For simple scripts, synchronous code is clearer and sufficient
The Package Ecosystem
databricks-sql-connector:
- Official Databricks Python connector
- Includes optional SQLAlchemy dialect support
- Synchronous only
- Actively maintained
SQLAlchemy:
- Database abstraction layer
- Provides ORM and Core query building
- Has async support (SQLAlchemy 1.4+) but requires async drivers
- The Databricks dialect doesn't have an async driver
asyncio:
- Python's standard async framework (built-in, no install needed)
- Provides event loop, tasks, and thread executor integration
- Foundation for async/await syntax