06-05-2023 08:36 AM
The client receives data from a third party as weekly "data dumps" of a MySQL database copied into an Azure Blob Storage account container (I suspect this is done manually, and I also suspect the changes between the approx 7GB files are very small). I need to automate ingestion of changes into my main catalogue.
I'm working on trying to get view access to the source DB, but in the meantime...
After failing to get a working connection to the Gen 1 storage account using Spark config params with managed identities and a service principal ([AbfsRestOperationException](https://community.databricks.com/s/question/0D58Y00009pvzofSAA/unity-catalogerror-creating-table-errorclassinvalidstate-failed-to-access-cloud-storage-abfsrestoperationexception)), I upgraded to Gen 2, finally figured out that the account was still set to key auth rather than AAD auth, and after multiple days I could access the files from within my Azure Databricks workspace.
So I have the Gen 2 blob storage saved as an external location, accessed via an access connector with read-only permission, and this is as far as I have gotten:
```
# todo: turn this listing into a delta table
# fun fact: apparently there are issues with files in the root folder, but hey, it's working for me
from pyspark.sql import DataFrame

files: DataFrame = spark.sql("LIST 'abfss://source_container@source_sa.dfs.core.windows.net'")

# write output to a delta table, select new/modified files (sketched below)
# for each new/modified file:
dbutils.fs.cp('abfss://fpdcontainer@pdazureblobnew.dfs.core.windows.net/modified_file.sql.gz', 'abfss://test@datalake_dedicated_sa.dfs.core.windows.net/raw/source_name/../modified_file.sql.gz')
dbutils.fs.cp('abfss://test@datalake_dedicated_sa.dfs.core.windows.net/raw/source_name/../modified_file.sql.gz', 'file:/tmp/temp.sql.gz')
```
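For the "select new/modified files" part of that TODO, this is the kind of thing I have in mind (untested sketch; the tracking table name and the LIST output column names are assumptions on my part):
```
# Sketch: persist each week's file listing and anti-join against files already processed.
# Column names (path, name, modification_time) are assumed from the LIST output.
files = spark.sql("LIST 'abfss://source_container@source_sa.dfs.core.windows.net'")

if not spark.catalog.tableExists("main.raw.ingested_files"):
    # first run: everything is new
    files.write.saveAsTable("main.raw.ingested_files")
    new_files = files
else:
    seen = spark.table("main.raw.ingested_files").select("path", "modification_time")
    new_files = files.join(seen, on=["path", "modification_time"], how="left_anti")
    new_files.write.mode("append").saveAsTable("main.raw.ingested_files")

# copy each new/modified file into the raw zone of the lake
for row in new_files.collect():
    dbutils.fs.cp(row["path"],
                  f"abfss://test@datalake_dedicated_sa.dfs.core.windows.net/raw/source_name/{row['name']}")
```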
```
%sh
gzip -d /tmp/temp.sql.gz
head -n 20 /tmp/temp.sql
```
-> confirmed it's outputting a bunch of MySQL queries
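As an aside, the decompression step can also be done from Python rather than %sh, which keeps the whole notebook in one language (same paths as above):
```
# Decompress the local copy without shelling out.
import gzip
import shutil

with gzip.open("/tmp/temp.sql.gz", "rb") as src, open("/tmp/temp.sql", "wb") as dst:
    shutil.copyfileobj(src, dst)
```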
So my original plan was to spin up a temporary sql server using sqlcmd:
```
%sh
curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -
curl https://packages.microsoft.com/config/ubuntu/20.04/prod.list | sudo tee /etc/apt/sources.list.d/msprod.list
```
but the cell hung and never finished.
I kept thinking there had to be a way to ingest a .sql file from Spark, so I went down the rabbit hole of SQL parsers (sqlglot, [the Scala Spark native ParserInterface](https://stackoverflow.com/questions/52108677/how-to-extract-tables-with-data-from-sql-dumps-using-spark), and a Python equivalent), but these all seem geared towards table-level ingestion, not database/schema-level ingestion. (Plus I ran into another [fun bug](https://stackoverflow.com/questions/15434709/checksum-exception-when-reading-from-or-copying-to-hdfs-in-apache-hadoop) when trying to copy back to the external location.)
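For reference, the sqlglot route looks roughly like the sketch below; it only pulls rows out of INSERT statements (which is why it felt table-oriented rather than schema-oriented), and I wouldn't bet on it handling a full ~7 GB dump or every MySQL construct:
```
# Sketch: extract (table, rows) pairs from the INSERT statements in a MySQL dump with sqlglot.
import sqlglot
from sqlglot import exp

with open("/tmp/temp.sql") as f:
    dump = f.read()

rows_by_table = {}
for stmt in sqlglot.parse(dump, read="mysql"):
    if isinstance(stmt, exp.Insert):
        table = stmt.find(exp.Table).name
        values = stmt.find(exp.Values)
        if values is None:
            continue
        for tup in values.expressions:            # each tuple is one row
            row = [lit.this for lit in tup.expressions]
            rows_by_table.setdefault(table, []).append(row)

# Each table's rows could then become a DataFrame:
# spark.createDataFrame(rows_by_table["some_table"])
```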
I'm going to try loading the dump into MySQL now, but I'm still unsure how to access a temporary server set up on the cluster itself (I guess I'd connect via 'localhost'), and I can't find any examples of anyone doing this.
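What I'm picturing for that on-cluster route is just a plain JDBC read against the driver node once the dump has been loaded into a local MySQL instance (untested; database/table names and the secret scope are placeholders):
```
# Sketch: JDBC read against a MySQL server running on the driver node.
# Assumes the MySQL Connector/J driver is installed on the cluster and the dump
# has already been loaded into a local database called "staging" (both assumptions).
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/staging")
    .option("dbtable", "some_table")
    .option("user", "root")
    .option("password", dbutils.secrets.get("my_scope", "mysql-password"))
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)
# Caveat: 'localhost' only resolves to the driver on the driver itself; executors on
# other nodes would need the driver's host/IP to reach that server.
```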
Has anyone got a better/functional solution? I'm lost in the wilderness, please send help.
(My current plan is to manually set up a server on my local env and access it from databricks and copy up a single version of it so I can at least look at the data, and keep trying to fight for a reasonable source of data...)
06-05-2023 02:03 PM
Well, ideally you can dump the data as JSON into blob storage.
Or, when you get access, make a SQL connection with Python and write the tables out to blob storage (Delta is the default format).
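For the JSON route, reading it back in and landing it as Delta would be something like this (paths and table name are placeholders):
```
# Sketch: read weekly JSON dumps from blob storage and land them as a Delta table.
df = spark.read.json("abfss://source_container@source_sa.dfs.core.windows.net/dumps/*.json")
df.write.format("delta").mode("append").saveAsTable("main.raw.source_table")
```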
Otherwise, does this option work?
```
df = spark.read.text("<path to sql.gz file>")  # spark.read handles .gz transparently
df.display()
```
07-07-2023 05:27 AM
Hi!
Sorry for not replying sooner (I had to put this project to the side for a bit).
I was able to dump the data as JSON to blob storage and read it in as a text file after changing the storage setup, etc.
My issue was finding a way to parse it as a database.
Sorry, I know my original post was very unclear - Thanks for your input!
06-13-2023 07:43 AM
@Sylvia VB :
Here are some suggestions and considerations to help you navigate through the issues:
07-07-2023 05:42 AM
Hi!
Thanks so much for this response - sorry I'm so delayed in responding.
I ended up spinning up a dedicated Azure flexible MySQL server rather than trying to set one up on the cluster machines (too many conflicting dependencies). This worked fine, but I definitely think it's not the most efficient solution: reading in the dump took an hour, and then reading the tables into Databricks took multiple hours.
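Roughly what that read looks like via JDBC (a sketch; hostname, credentials, table, key column and bounds are placeholders, and the partition options are what I'd try next time to parallelise the multi-hour load):
```
# Sketch: pull one table from the Azure flexible MySQL server straight into a Delta table,
# using partitioned JDBC reads so the transfer is spread across executors.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://my-flexible-server.mysql.database.azure.com:3306/source_db")
    .option("dbtable", "some_table")
    .option("user", "ingest_user")
    .option("password", dbutils.secrets.get("my_scope", "mysql-password"))
    .option("driver", "com.mysql.cj.jdbc.Driver")
    # without these four options Spark pulls the whole table over a single connection
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "8")
    .load()
)
df.write.format("delta").mode("overwrite").saveAsTable("main.raw.some_table")
```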
I did look into processing the dumps directly in Spark, based on [this thread](https://stackoverflow.com/questions/52108677/how-to-extract-tables-with-data-from-sql-dumps-using-spark), but it seemed a little less than reliable for automated tasks.
I'm going to have to revisit this, because I'll need some kind of change capture/diff from each weekly dump that ideally doesn't involve reprocessing the entire table, but that's a bit further down my list.
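The landing side of that change capture would presumably be a Delta MERGE keyed on each table's primary key; a rough sketch (the table name, key column and the `weekly_df` DataFrame are placeholders):
```
# Sketch: upsert one week's extract into the existing Delta table instead of rewriting it.
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "main.catalogue.some_table")
(
    target.alias("t")
    .merge(weekly_df.alias("s"), "t.id = s.id")   # weekly_df = that week's extract
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```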
Thanks again!
06-14-2023 12:08 AM
Hi @Sylvia VB
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!