Hi,
We are migrating a local Dash app to the Databricks infrastructure (using Databricks Apps and our Delta Lake).
The local app does the following (among others):
- takes Excel files from the end user
- reads them in memory and transforms them into a pandas DataFrame
- appends the data to the relevant table in SQLite, or updates rows if the unique ID already exists, using SQLAlchemy (simplified sketch below)
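For reference, the local flow is essentially this (a simplified sketch; the table name, columns, and file path are placeholders, and the real app handles more columns):

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///local.db")

def upsert_excel(uploaded_file) -> None:
    # Read the uploaded Excel file straight into a DataFrame (in memory)
    df = pd.read_excel(uploaded_file)
    with engine.begin() as conn:
        # SQLite upsert: insert new rows, overwrite existing rows on id conflict
        for row in df.to_dict(orient="records"):
            conn.execute(
                text(
                    "INSERT INTO records (id, value) VALUES (:id, :value) "
                    "ON CONFLICT(id) DO UPDATE SET value = excluded.value"
                ),
                row,
            )
```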
I have been trying to rebuild these steps the Databricks way, but haven't been able to do it effectively, despite the overwhelming number of options.
My idea was to stay as close as possible to the local app's design and get it working first, so that I can make it more "production-ready" in the longer term. The local app finishes the job in under a second, while the new app is quite slow, even though the two are mostly the same. What I have tried so far, and some ideas:
- Take the Excel file from the user and use databricks-sqlalchemy or databricks-sql-connector to write/update the data in a Delta table via a SQL warehouse. With databricks-sql-connector I am currently using INSERT INTO, and it has been working okay-ish so far (the tables I am working with are not the biggest). I identify the new rows and insert them all. For updates, I DELETE the old rows (matching IDs with the new rows) and INSERT the new rows (first sketch after this list). This feels shaky and complex, and it's quite slow overall.
- I am going to write the Excel files to Volumes for data lineage and auditing purposes (second sketch after this list). From there, I was thinking I could use other tools, such as Spark, but I am not sure about that.
- We already use DBT with seeds in production. If all else fails, I might have to fall back on those tools, although that would take more time to finish the migration project.
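For context, here is roughly what the connector-based write looks like today (simplified; the hostname, HTTP path, token, table name, and columns are placeholders, and the values are built as a literal SQL string, which is part of why it feels shaky):

```python
import pandas as pd
from databricks import sql  # databricks-sql-connector

TABLE = "my_catalog.my_schema.records"  # placeholder table name

def upsert_df(df: pd.DataFrame) -> None:
    with sql.connect(
        server_hostname="<workspace-hostname>",
        http_path="<sql-warehouse-http-path>",
        access_token="<token>",
    ) as conn, conn.cursor() as cursor:
        # Delete existing rows that are about to be replaced (matching on id)
        ids = ",".join(str(int(i)) for i in df["id"])
        cursor.execute(f"DELETE FROM {TABLE} WHERE id IN ({ids})")
        # Insert everything: new rows plus the refreshed ones
        values = ",".join(
            f"({int(r.id)}, '{r.value}')" for r in df.itertuples(index=False)
        )
        cursor.execute(f"INSERT INTO {TABLE} (id, value) VALUES {values}")
```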
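And this is the kind of thing I had in mind for landing the raw files in a volume, a sketch using the Databricks SDK Files API (the catalog/schema/volume names in the path are placeholders):

```python
import io
from databricks.sdk import WorkspaceClient

def archive_upload(filename: str, file_bytes: bytes) -> str:
    # Credentials are picked up from the app's environment
    w = WorkspaceClient()
    volume_path = f"/Volumes/my_catalog/my_schema/raw_excel/{filename}"
    w.files.upload(volume_path, io.BytesIO(file_bytes), overwrite=True)
    return volume_path
```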