Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Read Sqlite file in to create delta table/dataframe with live connection

lw2
New Contributor

I have a SQLite database that I want to read into Databricks to create Delta tables/DataFrames in Python that I can export to Power BI with a live connection. When new data is added to my SQLite database, the changes need to be reflected in Power BI.

My SQLite file is uploaded to an S3 bucket. I was able to read the file by copying it to a temporary local path and then using sqlite3.connect(local_path) to connect. I then used pandas to run a SQL query and create a pandas DataFrame, and wrote that DataFrame back to my S3 bucket as a CSV file. Once it was in my S3 bucket, I could manually create a table from the output file using the Databricks UI. I do know how to connect to Power BI through Spark and was able to get my tables in there, so I don't need help with that part.

However, this does not give me a live connection. How can I read the SQLite file from my S3 bucket into Databricks and create Delta tables/DataFrames to output to Power BI, all while keeping a live connection? Thank you in advance.

3 REPLIES

pradeep_singh
Contributor III

You can ingest and upsert changes from your SQLite file into Delta tables on Databricks on a schedule (or event-driven if possible). Then connect Power BI to those Delta tables via a Databricks SQL Warehouse using DirectQuery. Power BI will query Databricks at interaction/refresh time so users see the latest data without manual CSV steps.

 

Thank You
Pradeep Singh - https://www.linkedin.com/in/dbxdev

aleksandra_ch
Databricks Employee

Hi @lw2 ,

You can also leverage a JDBC connection in Unity Catalog to connect directly to your SQLite database. This approach ensures you always have the freshest data, since it queries the SQLite database directly. Please note that this feature is in Beta, so you might need to enable it in your workspace.

Find all the details here: https://docs.databricks.com/aws/en/connect/jdbc-connection
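For orientation, a Unity Catalog connection is created with SQL along roughly these lines. Treat everything below as a placeholder sketch: the connection type name, the option keys, and the JDBC URL are illustrative assumptions, and the exact Beta syntax is in the linked docs.

```sql
-- Illustrative only: register a JDBC-based connection in Unity Catalog (Beta).
-- The type name, option keys, and URL here are placeholders; follow the
-- linked Databricks documentation for the exact syntax and required driver.
CREATE CONNECTION sqlite_conn TYPE JDBC
OPTIONS (
  url 'jdbc:sqlite:/Volumes/my_catalog/my_schema/my_volume/database.sqlite'
);
```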

Hope it helps!

SteveOstrowski
Databricks Employee

Hi @lw2,

The key to getting a "live connection" end-to-end is replacing the manual CSV export with a scheduled pipeline that writes directly to Delta tables, then connecting Power BI to those Delta tables via DirectQuery. Here is a complete approach.


OVERVIEW

Since SQLite is a file-based database (not a server), you cannot point a traditional JDBC live connection at it. The recommended pattern is:

1. Schedule a Databricks job that reads the SQLite file from S3, extracts the data, and upserts it into Delta tables.
2. Connect Power BI to those Delta tables using DirectQuery through a Databricks SQL Warehouse, so Power BI always queries the latest data.

This gives you a near-live pipeline with no manual CSV steps.


STEP 1: COPY THE SQLITE FILE FROM S3 TO THE DRIVER NODE

SQLite requires local file access, so you need to copy the file from S3 to the ephemeral driver storage first. If you are using Unity Catalog volumes, you can upload the SQLite file there instead and read it via the /Volumes path.

Option A: Read from S3 directly using dbutils

import sqlite3
import pandas as pd

# Copy from S3 to local driver storage
dbutils.fs.cp("s3://your-bucket/path/to/database.sqlite", "file:/tmp/database.sqlite")

# Connect with sqlite3
conn = sqlite3.connect("/tmp/database.sqlite")

Option B: Read from a Unity Catalog Volume

import sqlite3
import pandas as pd

# If you uploaded the file to a UC volume
conn = sqlite3.connect("/Volumes/my_catalog/my_schema/my_volume/database.sqlite")


STEP 2: READ SQLITE TABLES AND WRITE TO DELTA

Once connected, read each table and write it as a Delta table. Use MERGE (upsert) if you want incremental updates, or overwrite if a full refresh is acceptable.

Full refresh approach (simplest):

# List all user tables in the SQLite database (skip SQLite's internal tables)
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%';")
tables = cursor.fetchall()

for (table_name,) in tables:
    # Read the SQLite table into a pandas DataFrame
    # (quote the identifier in case of reserved words or spaces)
    pdf = pd.read_sql_query(f'SELECT * FROM "{table_name}"', conn)

    # Convert to Spark DataFrame and write as a Delta table
    df = spark.createDataFrame(pdf)
    df.write.mode("overwrite").saveAsTable(f"my_catalog.my_schema.{table_name}")

conn.close()

Incremental upsert approach (if you have a primary key):

pdf = pd.read_sql_query("SELECT * FROM my_table", conn)
df = spark.createDataFrame(pdf)

# Create a temporary view for the MERGE
df.createOrReplaceTempView("source_data")

spark.sql("""
MERGE INTO my_catalog.my_schema.my_table AS target
USING source_data AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")

conn.close()
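One caveat: Delta's MERGE raises an error if multiple source rows match the same target key, so if your SQLite extract can contain duplicate ids, deduplicate the source first. A minimal pandas sketch, keeping the most recent row per id (the column names here are illustrative, not from your schema):

```python
import pandas as pd

# Illustrative source extract with a duplicated key (id=1 appears twice)
pdf = pd.DataFrame({
    "id": [1, 1, 2],
    "value": ["old", "new", "x"],
    "last_modified": ["2024-01-01", "2024-02-01", "2024-01-15"],
})

# Keep only the most recent row per id before handing the data to MERGE
deduped = (
    pdf.sort_values("last_modified")
       .drop_duplicates(subset="id", keep="last")
       .reset_index(drop=True)
)
```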


STEP 3: SCHEDULE THE NOTEBOOK AS A DATABRICKS JOB

Put the code above into a notebook, then create a Databricks Job to run it on a schedule:

1. Go to Workflows in the left sidebar.
2. Click "Create Job."
3. Add a notebook task pointing to your notebook.
4. Set a schedule trigger (for example, every 15 minutes, hourly, or daily, depending on how frequently your SQLite file updates).
5. Choose a cluster or use serverless compute.

Each time the job runs, it will pick up the latest version of your SQLite file from S3 and update the Delta tables.

Documentation on Jobs: https://docs.databricks.com/aws/en/jobs/index.html
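If you prefer to define the job programmatically, the payload submitted to the Jobs API (POST /api/2.1/jobs/create) has roughly this shape. The job name, notebook path, and cron expression below are illustrative placeholders:

```python
import json

# Illustrative Jobs API 2.1 payload: run the ingest notebook every 15 minutes.
# The notebook path and cron expression are placeholders for your setup.
job_payload = {
    "name": "sqlite-to-delta-sync",
    "tasks": [
        {
            "task_key": "ingest_sqlite",
            "notebook_task": {"notebook_path": "/Workspace/Users/me/sqlite_ingest"},
        }
    ],
    "schedule": {
        # Quartz cron: at second 0, every 15 minutes
        "quartz_cron_expression": "0 0/15 * * * ?",
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
}

payload_json = json.dumps(job_payload)
```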


STEP 4: CONNECT POWER BI VIA DIRECTQUERY

Since you mentioned you already know how to connect Power BI to Databricks through Spark, the key detail for a "live" experience is to use DirectQuery mode (not Import mode) when connecting through a Databricks SQL Warehouse:

1. In Power BI Desktop, use the Databricks connector.
2. Enter the Server Hostname and HTTP Path from your SQL Warehouse connection details.
3. When prompted, choose DirectQuery instead of Import.

With DirectQuery, Power BI sends queries to Databricks at interaction time, so your reports always reflect the latest data in your Delta tables without needing a Power BI refresh schedule.

Documentation on Power BI: https://docs.databricks.com/aws/en/partners/bi/power-bi.html


TIPS

- If your SQLite file is large, consider reading only changed records if your schema supports it (e.g., filter by a last_modified timestamp column in your SQL query).
- Use a SQL Warehouse (not an all-purpose cluster) for the Power BI connection, as SQL Warehouses are optimized for BI query patterns and can auto-scale.
- If your SQLite file updates very frequently and you need near-real-time freshness, consider scheduling the job every few minutes. The job overhead is low since it runs on the driver node only.
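To make the incremental-read tip concrete, here is a sketch of watermark-based extraction against an in-memory SQLite database. The table and column names are illustrative; in a real job you would persist the watermark between runs (for example in a small state table):

```python
import sqlite3

# Illustrative: only pull rows newer than the last watermark we processed
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, value TEXT, last_modified TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [(1, "a", "2024-01-01"), (2, "b", "2024-02-01"), (3, "c", "2024-03-01")],
)

watermark = "2024-01-15"  # e.g. loaded from a state table or job parameter
rows = conn.execute(
    "SELECT id, value, last_modified FROM my_table WHERE last_modified > ? ORDER BY id",
    (watermark,),
).fetchall()

# Advance the watermark to the newest last_modified we just saw
new_watermark = max(r[2] for r in rows)
conn.close()
```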

Hope this helps get your pipeline running end to end.

* This reply was drafted with an agent system I built, which researches responses using the documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor the system's reliability, and I update it when I detect any drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.