Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Read sqlite file from s3 bucket into databricks, creating delta tables

lw2
New Contributor

I have a sqlite database that I want to read into databricks to create delta tables/dataframes in Python that I can export to power BI and have a live connection. When there is new data added to my sqlite data base, the changes will need to reflect in my Power BI.

My sqlite file is uploaded to an s3 bucket. I was able to read the file in by creating a temporary local path and then using the sqlite3.connect(local_path) to connect. I then used pandas to read the sql query to create my pandas dataframe, then output my dataframe to my s3 bucket as a csv file. Once this was output to my s3 bucket, I could then manually create a table from that output file using UI in databricks. I do know how to connect to power BI through spark and was able to get my tables in there, so I don't need help on that part.

However, this does not give me a live connection. How can I read the sqlite file from my s3 bucket into databricks and create delta tables/dataframes? Thank you in advance.

1 REPLY 1

SteveOstrowski
Databricks Employee

Hi @lw2,

The approach you have been using (copy SQLite to local, read with sqlite3, export to CSV, then manually create tables) works as a one-shot load but, as you noticed, it does not give you an easy path to keep things in sync. Below is a streamlined pattern that reads SQLite directly into Delta tables and can be scheduled so your Power BI reports always reflect the latest data.


OVERVIEW OF THE APPROACH

Because SQLite is a file-based database (not a server you can connect to over the network), Spark's built-in JDBC reader cannot reach it on S3 directly. The proven pattern is:

1. Copy the SQLite file from S3 to the driver's local storage.
2. Use Python's sqlite3 module to query each table.
3. Convert the results to Spark DataFrames and write them as Delta tables.
4. Schedule the notebook with a Lakeflow Job so it runs automatically whenever the SQLite file is updated.


STEP 1: COPY THE SQLITE FILE TO LOCAL STORAGE

If your S3 bucket is configured as an external location in Unity Catalog, you can use dbutils. Otherwise you can use boto3 or the AWS CLI. Here is the dbutils approach:

s3_path = "s3://your-bucket/path/to/database.sqlite"
local_path = "/tmp/database.sqlite"

dbutils.fs.cp(s3_path, "file:" + local_path)

If you prefer boto3:

import boto3

s3 = boto3.client("s3")
s3.download_file("your-bucket", "path/to/database.sqlite", "/tmp/database.sqlite")

Note: /tmp on the driver node is ephemeral, so this file only lives for the duration of the cluster session. That is fine because we re-download it each run.
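A failed or partial copy will surface later as a confusing sqlite3 error, so it can be worth a quick sanity check before opening the file. Here is an optional helper (the function name and size threshold are my own choices, not part of any Databricks API); it relies on the fact that every valid SQLite file begins with the 16-byte magic header "SQLite format 3\x00":

```python
import os

def validate_local_copy(local_path, min_bytes=100):
    """Sanity-check that the SQLite file actually arrived on the driver.

    Catches missing, truncated, or non-SQLite files before the path is
    handed to sqlite3.connect().
    """
    if not os.path.exists(local_path):
        raise FileNotFoundError(f"Copy failed: {local_path} not found")
    if os.path.getsize(local_path) < min_bytes:
        raise ValueError(f"{local_path} looks truncated")
    with open(local_path, "rb") as f:
        header = f.read(16)
    if header != b"SQLite format 3\x00":
        raise ValueError(f"{local_path} is not a SQLite database")
    return True
```

Call it right after the copy, e.g. validate_local_copy("/tmp/database.sqlite"), and the run fails fast with a clear message instead of a cryptic "file is not a database" later on.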


STEP 2: READ SQLITE TABLES AND WRITE TO DELTA

import sqlite3
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

conn = sqlite3.connect("/tmp/database.sqlite")

# List all user tables, skipping SQLite's internal bookkeeping tables
# (sqlite_sequence, sqlite_stat1, etc.)
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master "
    "WHERE type='table' AND name NOT LIKE 'sqlite_%';",
    conn,
)

catalog = "your_catalog"
schema = "your_schema"

for table_name in tables["name"]:
    pdf = pd.read_sql_query(f"SELECT * FROM [{table_name}]", conn)
    sdf = spark.createDataFrame(pdf)
    sdf.write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.{table_name}")
    print(f"Wrote {len(pdf)} rows to {catalog}.{schema}.{table_name}")

conn.close()

Key notes:
- mode("overwrite") replaces the entire table each run. If you want incremental updates instead, you can use Delta merge (MERGE INTO) with a primary key column as the match condition.
- saveAsTable creates managed Delta tables in Unity Catalog, which Power BI can connect to directly via the Databricks SQL Connector or Partner Connect.
- The bracket syntax around table_name in the SQL handles table names that contain spaces or special characters.


STEP 3 (OPTIONAL): INCREMENTAL MERGE FOR LARGE TABLES

If your SQLite tables have a reliable primary key and you want to avoid full overwrites, you can use Delta merge:

from delta.tables import DeltaTable

# This loop replaces the overwrite loop in Step 2; re-open the connection
# since it was closed at the end of that step.
conn = sqlite3.connect("/tmp/database.sqlite")

for table_name in tables["name"]:
    pdf = pd.read_sql_query(f"SELECT * FROM [{table_name}]", conn)
    updates_df = spark.createDataFrame(pdf)
    full_table = f"{catalog}.{schema}.{table_name}"

    if spark.catalog.tableExists(full_table):
        delta_table = DeltaTable.forName(spark, full_table)
        delta_table.alias("target").merge(
            updates_df.alias("source"),
            "target.id = source.id"  # replace 'id' with your primary key
        ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
    else:
        updates_df.write.saveAsTable(full_table)

conn.close()


STEP 4: SCHEDULE WITH A LAKEFLOW JOB

To keep your Delta tables current automatically:

1. Save the code above in a Databricks notebook.
2. Go to Workflows in the left sidebar and create a new Job.
3. Add a task pointing to your notebook.
4. Set a trigger: use a time-based schedule (for example, every hour or daily), or use file arrival triggers if you want it to run whenever the SQLite file is updated in S3.
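If you go with the file arrival trigger, the job definition (Jobs API 2.1 style) carries a trigger block pointing at the S3 directory being watched. A rough sketch, where the job name, notebook path, and URL are all placeholders you would substitute (and the URL must be under a Unity Catalog external location):

```json
{
  "name": "sqlite-to-delta-sync",
  "trigger": {
    "pause_status": "UNPAUSED",
    "file_arrival": {
      "url": "s3://your-bucket/path/to/"
    }
  },
  "tasks": [
    {
      "task_key": "sync",
      "notebook_task": {
        "notebook_path": "/Workspace/Users/you/sqlite_to_delta"
      }
    }
  ]
}
```

The same settings are available in the Jobs UI under the task's Schedules & Triggers panel, so you do not need the API unless you prefer configuration as code.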

This way your Power BI dashboards, connected to the Delta tables via the Databricks SQL Connector, will always show up-to-date data without any manual steps.


ADDITIONAL TIPS

- If the SQLite file is very large and individual tables have millions of rows, consider reading the data in chunks with pd.read_sql_query using the chunksize parameter, then writing each chunk with mode("append").
- If you have not yet set up your S3 bucket as an external location in Unity Catalog, follow the guide here: https://docs.databricks.com/en/connect/storage/index.html
- For the Power BI connection, the Databricks SQL Connector and Partner Connect documentation is here: https://docs.databricks.com/en/partners/bi/power-bi.html
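The chunked-read tip above can be sketched as a small helper. The function and its write_chunk callback are hypothetical names of my own; on Databricks, write_chunk would wrap spark.createDataFrame(chunk).write.mode("append").saveAsTable(...):

```python
import sqlite3
import pandas as pd

def copy_table_in_chunks(conn, table_name, write_chunk, chunksize=100_000):
    """Stream a large SQLite table in fixed-size chunks so the whole
    table never has to fit in driver memory at once.

    pd.read_sql_query with chunksize returns an iterator of DataFrames,
    each holding up to `chunksize` rows.
    """
    total = 0
    query = f"SELECT * FROM [{table_name}]"
    for chunk in pd.read_sql_query(query, conn, chunksize=chunksize):
        write_chunk(chunk)  # e.g. append this chunk to the Delta table
        total += len(chunk)
    return total
```

Remember to truncate or recreate the target table before the first append of each run, otherwise repeated runs will duplicate rows.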

Hope this helps you get a fully automated pipeline running.

* This reply was drafted with an agent system I built, which researches responses using the wide set of documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.