Hi @lw2,
The key to getting a "live connection" end-to-end is replacing the manual CSV export with a scheduled pipeline that writes directly to Delta tables, then connecting Power BI to those Delta tables via DirectQuery. Here is a complete approach.
OVERVIEW
Since SQLite is a file-based database (not a server), you cannot point a traditional JDBC live connection at it. The recommended pattern is:
1. Schedule a Databricks job that reads the SQLite file from S3, extracts the data, and upserts it into Delta tables.
2. Connect Power BI to those Delta tables using DirectQuery through a Databricks SQL Warehouse, so Power BI always queries the latest data.
This gives you a near-live pipeline with no manual CSV steps.
STEP 1: COPY THE SQLITE FILE FROM S3 TO THE DRIVER NODE
SQLite requires local file access, so you need to copy the file from S3 to the ephemeral driver storage first. If you are using Unity Catalog volumes, you can upload the SQLite file there instead and read it via the /Volumes path.
Option A: Read from S3 directly using dbutils
import sqlite3
import pandas as pd
# Copy from S3 to local driver storage
dbutils.fs.cp("s3://your-bucket/path/to/database.sqlite", "file:/tmp/database.sqlite")
# Connect with sqlite3
conn = sqlite3.connect("/tmp/database.sqlite")
Option B: Read from a Unity Catalog Volume
import sqlite3
import pandas as pd
# If you uploaded the file to a UC volume
conn = sqlite3.connect("/Volumes/my_catalog/my_schema/my_volume/database.sqlite")
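With either option, if the uploader ever writes the S3 file while your job is copying it, you can end up with a truncated or partial file. A cheap sanity check (my own addition, not required by Databricks) is to verify the 16-byte magic header that every SQLite database file starts with before processing:

```python
import os
import sqlite3
import tempfile

SQLITE_MAGIC = b"SQLite format 3\x00"  # 16-byte header of every SQLite database file

def is_valid_sqlite(path: str) -> bool:
    """Return True if the file at `path` starts with the SQLite magic header."""
    with open(path, "rb") as f:
        return f.read(16) == SQLITE_MAGIC

# Demo: create a real SQLite file locally and check it
fd, path = tempfile.mkstemp(suffix=".sqlite")
os.close(fd)
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE t (id INTEGER)")
conn.commit()
conn.close()

print(is_valid_sqlite(path))  # True
```

In the job you would run this check on /tmp/database.sqlite right after the dbutils.fs.cp and fail fast (or retry) if it returns False.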
STEP 2: READ SQLITE TABLES AND WRITE TO DELTA
Once connected, read each table and write it as a Delta table. Use MERGE (upsert) if you want incremental updates, or overwrite if a full refresh is acceptable.
Full refresh approach (simplest):
# List all tables in the SQLite database
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()
for (table_name,) in tables:
    # Read the SQLite table into a pandas DataFrame
    pdf = pd.read_sql_query(f"SELECT * FROM {table_name}", conn)
    # Convert to Spark DataFrame and write as a Delta table
    df = spark.createDataFrame(pdf)
    df.write.mode("overwrite").saveAsTable(f"my_catalog.my_schema.{table_name}")
conn.close()
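One detail worth knowing: the sqlite_master query above also returns SQLite's internal bookkeeping tables (for example, sqlite_sequence appears as soon as any table uses AUTOINCREMENT). Since internal tables all use the reserved sqlite_ prefix, a small refinement filters them out so you don't accidentally publish them as Delta tables:

```python
import sqlite3

# Demo on an in-memory database: AUTOINCREMENT causes SQLite to create
# the internal sqlite_sequence table behind the scenes
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY AUTOINCREMENT, amount REAL)")
conn.execute("INSERT INTO orders (amount) VALUES (9.99)")
conn.commit()

# Internal tables all use the reserved sqlite_ prefix, so exclude them
cursor = conn.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type='table' AND name NOT LIKE 'sqlite_%';"
)
tables = [row[0] for row in cursor.fetchall()]
print(tables)  # ['orders'] -- sqlite_sequence is filtered out
conn.close()
```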
Incremental upsert approach (if you have a primary key):
pdf = pd.read_sql_query("SELECT * FROM my_table", conn)
df = spark.createDataFrame(pdf)
# Create a temporary view for the MERGE
df.createOrReplaceTempView("source_data")
spark.sql("""
MERGE INTO my_catalog.my_schema.my_table AS target
USING source_data AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
conn.close()
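To make the upsert genuinely incremental, you also want to pull only changed rows out of SQLite rather than the whole table. Assuming your schema has something like a last_modified column (an assumption about your data, not a given), you can keep a high-water mark between runs and feed only newer rows into the MERGE. A minimal sketch of the watermark logic, using an in-memory database for illustration (where you persist the watermark, e.g. in a small Delta table or a job parameter, is up to you):

```python
import sqlite3

# Demo data: three rows with different modification timestamps
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, val TEXT, last_modified TEXT)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        (1, "a", "2024-01-01T00:00:00"),
        (2, "b", "2024-01-02T00:00:00"),
        (3, "c", "2024-01-03T00:00:00"),
    ],
)
conn.commit()

# High-water mark from the previous run (hypothetical stored value)
last_watermark = "2024-01-01T00:00:00"

# Pull only rows modified after the watermark
cursor = conn.execute(
    "SELECT id, val, last_modified FROM my_table WHERE last_modified > ?",
    (last_watermark,),
)
changed_rows = cursor.fetchall()
print(len(changed_rows))  # 2 -- only the rows newer than the watermark

# The next run's watermark is the max timestamp seen in this batch
new_watermark = max(row[2] for row in changed_rows)
print(new_watermark)  # 2024-01-03T00:00:00
conn.close()
```

In the actual job, changed_rows would become the source_data view for the MERGE statement above, and new_watermark would be saved for the next run.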
STEP 3: SCHEDULE THE NOTEBOOK AS A DATABRICKS JOB
Put the code above into a notebook, then create a Databricks Job to run it on a schedule:
1. Go to Workflows in the left sidebar.
2. Click "Create Job."
3. Add a notebook task pointing to your notebook.
4. Set a schedule trigger (for example, every 15 minutes, hourly, or daily, depending on how frequently your SQLite file updates).
5. Choose a cluster or use serverless compute.
Each time the job runs, it will pick up the latest version of your SQLite file from S3 and update the Delta tables.
Documentation on Jobs: https://docs.databricks.com/aws/en/jobs/index.html
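If you prefer to define the job as code rather than through the UI, the same configuration can be expressed as a Jobs API settings document. A rough sketch of what the JSON might look like for an hourly schedule (the job name, notebook path, and email are placeholders, and I have omitted the compute configuration since it depends on whether your workspace uses serverless or classic clusters):

```json
{
  "name": "sqlite-to-delta-sync",
  "tasks": [
    {
      "task_key": "sync_sqlite",
      "notebook_task": {
        "notebook_path": "/Workspace/Users/you@example.com/sqlite_to_delta"
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 * * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  }
}
```

You should be able to create the job from this with the Databricks CLI (something like `databricks jobs create --json @job.json`), though double-check the flag syntax against the CLI version you have installed.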
STEP 4: CONNECT POWER BI VIA DIRECTQUERY
Since you mentioned you already know how to connect Power BI to Databricks through Spark, the key detail for a "live" experience is to use DirectQuery mode (not Import mode) when connecting through a Databricks SQL Warehouse:
1. In Power BI Desktop, use the Databricks connector.
2. Enter the Server Hostname and HTTP Path from your SQL Warehouse connection details.
3. When prompted, choose DirectQuery instead of Import.
With DirectQuery, Power BI sends queries to Databricks at interaction time, so your reports always reflect the latest data in your Delta tables without needing a Power BI refresh schedule.
Documentation on Power BI: https://docs.databricks.com/aws/en/partners/bi/power-bi.html
TIPS
- If your SQLite file is large, consider reading only changed records if your schema supports it (e.g., filter by a last_modified timestamp column in your SQL query).
- Use a SQL Warehouse (not an all-purpose cluster) for the Power BI connection, as SQL Warehouses are optimized for BI query patterns and can auto-scale.
- If your SQLite file updates very frequently and you need near-real-time freshness, consider scheduling the job every few minutes. Most of the extraction work (reading SQLite via pandas) happens on the driver, so a small cluster keeps the per-run overhead low.
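On the large-file tip: pandas can read a query in chunks instead of materializing the whole table at once, which bounds driver memory. pd.read_sql_query accepts a chunksize argument that turns the result into an iterator of DataFrames; in the job you would append each chunk to the Delta table. A small sketch (the saveAsTable call is shown only as a comment, since it needs a live Spark session):

```python
import sqlite3
import pandas as pd

# Demo data: a small table standing in for a large one
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER PRIMARY KEY, v REAL)")
conn.executemany("INSERT INTO big VALUES (?, ?)", [(i, i * 0.5) for i in range(10)])
conn.commit()

# chunksize makes read_sql_query return an iterator of DataFrames,
# so only one chunk is in driver memory at a time
total_rows = 0
n_chunks = 0
for chunk in pd.read_sql_query("SELECT * FROM big", conn, chunksize=4):
    # In the Databricks job you would do something like:
    #   spark.createDataFrame(chunk).write.mode("append").saveAsTable(...)
    total_rows += len(chunk)
    n_chunks += 1

print(n_chunks, total_rows)  # 3 10 -- chunks of 4, 4, and 2 rows
conn.close()
```

Note that with append mode you would want to truncate or overwrite the target table on the first chunk of each run so reruns don't duplicate rows.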
Hope this helps get your pipeline running end to end.
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.