Hi,
We are migrating a local Dash app to the Databricks infrastructure (using Databricks Apps and our Delta Lake).
The local app does the following (among others):
- takes Excel files from the end user
- reads them in memory and transforms them into a pandas DataFrame
- appends the data to the relevant table in SQLite, or updates rows if the unique ID already exists, using SQLAlchemy (simplified sketch below)
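For reference, the local flow is essentially this (a simplified sketch; the table name, columns, and file path are placeholders, and the real app handles more columns):

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///local.db")

def upsert_excel(uploaded_file) -> None:
    # Read the uploaded Excel file straight into a DataFrame (in memory)
    df = pd.read_excel(uploaded_file)
    with engine.begin() as conn:
        # SQLite upsert: insert new rows, overwrite existing rows on id conflict
        for row in df.to_dict(orient="records"):
            conn.execute(
                text(
                    "INSERT INTO records (id, value) VALUES (:id, :value) "
                    "ON CONFLICT(id) DO UPDATE SET value = excluded.value"
                ),
                row,
            )
```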
I have been trying to rebuild these steps the Databricks way, but haven't been able to do it effectively, despite the overwhelming number of options.
My idea was to stay as close as possible to the local app's design and get it working first, so that I can make it more "production-ready" in the longer term. The local app finishes the job in under a second, while the new app is quite slow, even though the two are mostly the same. What I have tried so far, and some ideas:
- Take the Excel file from the user and use databricks-sqlalchemy or databricks-sql-connector to write/update the data in a Delta table via a SQL warehouse. With databricks-sql-connector I am currently using INSERT INTO, and it has been working okay-ish so far (the tables I am working with are not the biggest). I identify the new rows and insert them all. For updates, I DELETE the old rows (matching IDs with the new rows) and INSERT the new rows (first sketch after this list). This feels shaky and complex, and it's quite slow overall.
- I am going to write the Excel files to Volumes for data lineage and auditing purposes (second sketch after this list). From there, I was thinking I could use other tools, such as Spark, but I am not sure about that.
- We already use DBT with seeds in production. If all else fails, I might have to fall back on those tools, although that would take more time to finish the migration project.
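For context, here is roughly what the connector-based write looks like today (simplified; the hostname, HTTP path, token, table name, and columns are placeholders, and the values are built as a literal SQL string, which is part of why it feels shaky):

```python
import pandas as pd
from databricks import sql  # databricks-sql-connector

TABLE = "my_catalog.my_schema.records"  # placeholder table name

def upsert_df(df: pd.DataFrame) -> None:
    with sql.connect(
        server_hostname="<workspace-hostname>",
        http_path="<sql-warehouse-http-path>",
        access_token="<token>",
    ) as conn, conn.cursor() as cursor:
        # Delete existing rows that are about to be replaced (matching on id)
        ids = ",".join(str(int(i)) for i in df["id"])
        cursor.execute(f"DELETE FROM {TABLE} WHERE id IN ({ids})")
        # Insert everything: new rows plus the refreshed ones
        values = ",".join(
            f"({int(r.id)}, '{r.value}')" for r in df.itertuples(index=False)
        )
        cursor.execute(f"INSERT INTO {TABLE} (id, value) VALUES {values}")
```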
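And this is the kind of thing I had in mind for landing the raw files in a volume, a sketch using the Databricks SDK Files API (the catalog/schema/volume names in the path are placeholders):

```python
import io
from databricks.sdk import WorkspaceClient

def archive_upload(filename: str, file_bytes: bytes) -> str:
    # Credentials are picked up from the app's environment
    w = WorkspaceClient()
    volume_path = f"/Volumes/my_catalog/my_schema/raw_excel/{filename}"
    w.files.upload(volume_path, io.BytesIO(file_bytes), overwrite=True)
    return volume_path
```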