Hi @lw2,
The approach you have been using (copy SQLite to local, read with sqlite3, export to CSV, then manually create tables) works as a one-shot load but, as you noticed, it does not give you an easy path to keep things in sync. Below is a streamlined pattern that reads SQLite directly into Delta tables and can be scheduled so your Power BI reports always reflect the latest data.
OVERVIEW OF THE APPROACH
Because SQLite is a file-based database (not a server you can connect to over the network), Spark's built-in JDBC reader cannot reach it on S3 directly. The proven pattern is:
1. Copy the SQLite file from S3 to the driver's local storage.
2. Use Python's sqlite3 module to query each table.
3. Convert the results to Spark DataFrames and write them as Delta tables.
4. Schedule the notebook with a Lakeflow Job so it runs automatically whenever the SQLite file is updated.
STEP 1: COPY THE SQLITE FILE TO LOCAL STORAGE
If your S3 bucket is configured as an external location in Unity Catalog, you can use dbutils. Otherwise you can use boto3 or the AWS CLI. Here is the dbutils approach:
import os
s3_path = "s3://your-bucket/path/to/database.sqlite"
local_path = "/tmp/database.sqlite"
dbutils.fs.cp(s3_path, "file:" + local_path)
If you prefer boto3:
import boto3
s3 = boto3.client("s3")
s3.download_file("your-bucket", "path/to/database.sqlite", "/tmp/database.sqlite")
Note: /tmp on the driver node is ephemeral, so this file only lives for the duration of the cluster session. That is fine because we re-download it each run.
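Since the download is repeated every run, it can be worth a quick sanity check that the file actually is a SQLite database before querying it (a truncated or failed download would otherwise surface as a confusing sqlite3 error later). A minimal sketch, using the fixed 16-byte magic header every SQLite file starts with (the demo path below is just for illustration):

```python
import os
import sqlite3
import tempfile

def is_sqlite_file(path: str) -> bool:
    # Every SQLite database file begins with this 16-byte magic header.
    with open(path, "rb") as f:
        return f.read(16) == b"SQLite format 3\x00"

# Demo: create a throwaway database and verify it.
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE t (x INTEGER)")
conn.commit()
conn.close()

print(is_sqlite_file(path))  # True
```

In the pipeline you would call `is_sqlite_file("/tmp/database.sqlite")` right after the copy and fail fast if it returns False.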
STEP 2: READ SQLITE TABLES AND WRITE TO DELTA
import sqlite3
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
conn = sqlite3.connect("/tmp/database.sqlite")
# List all tables in the SQLite database
# List all user tables in the SQLite database
# (the NOT LIKE filter excludes SQLite's internal tables such as sqlite_sequence)
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%';",
    conn,
)
catalog = "your_catalog"
schema = "your_schema"
for table_name in tables["name"]:
    pdf = pd.read_sql_query(f"SELECT * FROM [{table_name}]", conn)
    sdf = spark.createDataFrame(pdf)
    sdf.write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.{table_name}")
    print(f"Wrote {len(pdf)} rows to {catalog}.{schema}.{table_name}")
conn.close()
Key notes:
- mode("overwrite") replaces the entire table each run. If you want incremental updates instead, you can use Delta merge (MERGE INTO) with a primary key column as the match condition.
- saveAsTable creates managed Delta tables in Unity Catalog, which Power BI can connect to directly via the Databricks SQL Connector or Partner Connect.
- The bracket syntax around table_name in the SQL handles table names that contain spaces or special characters.
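The bracket-quoting behavior is easy to verify with plain sqlite3, no Spark needed. A quick self-contained demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A table name containing a space is legal in SQLite but must be quoted.
conn.execute("CREATE TABLE [order details] (id INTEGER, qty INTEGER)")
conn.execute("INSERT INTO [order details] VALUES (1, 5)")

table_name = "order details"
rows = conn.execute(f"SELECT * FROM [{table_name}]").fetchall()
print(rows)  # [(1, 5)]
conn.close()
```

Note that Unity Catalog is stricter about identifiers than SQLite, so if your SQLite tables have spaces or other special characters in their names, you may still need to sanitize the name (for example, replace spaces with underscores) before passing it to saveAsTable.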
STEP 3 (OPTIONAL): INCREMENTAL MERGE FOR LARGE TABLES
If your SQLite tables have a reliable primary key and you want to avoid full overwrites, you can use Delta merge:
from delta.tables import DeltaTable

# Re-open the connection if you closed it at the end of Step 2
conn = sqlite3.connect("/tmp/database.sqlite")

for table_name in tables["name"]:
    pdf = pd.read_sql_query(f"SELECT * FROM [{table_name}]", conn)
    updates_df = spark.createDataFrame(pdf)
    full_table = f"{catalog}.{schema}.{table_name}"
    if spark.catalog.tableExists(full_table):
        delta_table = DeltaTable.forName(spark, full_table)
        delta_table.alias("target").merge(
            updates_df.alias("source"),
            "target.id = source.id"  # replace 'id' with your primary key
        ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
    else:
        updates_df.write.saveAsTable(full_table)

conn.close()
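If you would rather not hard-code the merge key per table, SQLite can tell you each table's primary key columns via PRAGMA table_info, and you can build the merge condition from that. A minimal sketch (pure sqlite3, no Spark required; the "orders" table below is just a demo):

```python
import sqlite3

def primary_key_columns(conn: sqlite3.Connection, table_name: str) -> list:
    """Return the primary-key column names of a SQLite table, in key order."""
    info = conn.execute(f"PRAGMA table_info([{table_name}])").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk); pk > 0 marks a
    # primary-key column, and its value is the column's position in the key.
    pk_cols = [(row[5], row[1]) for row in info if row[5] > 0]
    return [name for _, name in sorted(pk_cols)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, line INTEGER, qty INTEGER, "
    "PRIMARY KEY (id, line))"
)
print(primary_key_columns(conn, "orders"))  # ['id', 'line']

# Build the merge condition Step 3 needs:
condition = " AND ".join(
    f"target.{c} = source.{c}" for c in primary_key_columns(conn, "orders")
)
print(condition)  # target.id = source.id AND target.line = source.line
```

Tables whose primary key was declared without a PRIMARY KEY constraint (or that only have the implicit rowid) will return an empty list, so you would fall back to the full-overwrite path for those.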
STEP 4: SCHEDULE WITH A LAKEFLOW JOB
To keep your Delta tables current automatically:
1. Save the code above in a Databricks notebook.
2. Go to Workflows in the left sidebar and create a new Job.
3. Add a task pointing to your notebook.
4. Set a trigger: use a time-based schedule (for example, every hour or daily), or use file arrival triggers if you want it to run whenever the SQLite file is updated in S3.
This way your Power BI dashboards, connected to the Delta tables via the Databricks SQL Connector, will always show up-to-date data without any manual steps.
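If you prefer to define the job as code rather than through the UI, the same configuration can be expressed as JSON for the Jobs API or the Databricks CLI (databricks jobs create --json). The notebook path, cluster id, and bucket below are placeholders, and the file arrival trigger assumes the S3 path is registered as a Unity Catalog external location:

```json
{
  "name": "sqlite-to-delta-sync",
  "tasks": [
    {
      "task_key": "sync",
      "notebook_task": { "notebook_path": "/Workspace/Users/you/sqlite_to_delta" },
      "existing_cluster_id": "your-cluster-id"
    }
  ],
  "trigger": {
    "file_arrival": { "url": "s3://your-bucket/path/to/" }
  }
}
```

For a simple time-based schedule instead, replace the "trigger" block with a "schedule" block containing a quartz_cron_expression and timezone_id.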
ADDITIONAL TIPS
- If the SQLite file is very large and individual tables have millions of rows, consider reading the data in chunks with pd.read_sql_query using the chunksize parameter, then writing each chunk with mode("append").
- If you have not yet set up your S3 bucket as an external location in Unity Catalog, follow the guide here: https://docs.databricks.com/en/connect/storage/index.html
- For the Power BI connection, the Databricks SQL Connector and Partner Connect documentation is here: https://docs.databricks.com/en/partners/bi/power-bi.html
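The chunked-read tip can be sketched with plain pandas and sqlite3. The in-memory demo table below stands in for your real database file, and the Spark write is reduced to a comment, since on Databricks you would append each chunk with spark.createDataFrame(chunk).write.mode("append"):

```python
import sqlite3
import pandas as pd

# Build a small demo table; in your pipeline this is /tmp/database.sqlite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER, val TEXT)")
conn.executemany("INSERT INTO big VALUES (?, ?)", [(i, f"row{i}") for i in range(10)])

total = 0
# With chunksize set, read_sql_query returns an iterator of DataFrames instead
# of one large frame, so the driver never holds the full table in memory.
for chunk in pd.read_sql_query("SELECT * FROM big", conn, chunksize=4):
    # On Databricks: spark.createDataFrame(chunk).write.mode("append").saveAsTable(...)
    total += len(chunk)

print(total)  # 10
conn.close()
```

With mode("append") you would typically truncate or drop the target table first, so a re-run does not duplicate rows.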
Hope this helps you get a fully automated pipeline running.
* This reply was drafted with an agent system I built, which researches and drafts responses from the documentation I have available and from previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect any drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand new features.