Warehousing & Analytics

Recommended ETL workflow for weekly ingestion of .sql.gz "database dumps" from Blob Storage into a Unity Catalog-enabled metastore

scvbelle
New Contributor III

The client receives data from a third party as weekly "datadumps": a MySQL database copied into an Azure Blob Storage container (I suspect this is done manually; I also suspect the changes between the roughly 7 GB files are very small). I need to automate ingestion of the changes into my main catalog.

I'm working on trying to get view access to the source DB, but in the meantime...

After failing to get a working connection to the Gen 1 storage account using Spark config parameters with managed identities and a service principal ([AbfsRestOperationException](https://community.databricks.com/s/question/0D58Y00009pvzofSAA/unity-catalogerror-creating-table-errorclassinvalidstate-failed-to-access-cloud-storage-abfsrestoperationexception)), I upgraded to Gen 2, finally figured out that the account was still set to key auth rather than AAD auth, and after multiple days I could access the files from within my Azure Databricks workspace.

So I have the Gen2 Blob storage saved as an external location accessed via an access connector, with read-only permission, and this is as far as I have gotten:

```
# TODO: delta-tableify this
# fun fact: apparently there are issues with files in the root folder, but hey, it's working for me
from pyspark.sql import DataFrame

files: DataFrame = spark.sql("LIST 'abfss://source_container@source_sa.dfs.core.windows.net'")

# write output to a Delta table, select new/modified files
# for each new/modified file:
dbutils.fs.cp('abfss://fpdcontainer@pdazureblobnew.dfs.core.windows.net/modified_file.sql.gz',
              'abfss://test@datalake_dedicated_sa.dfs.core.windows.net/raw/source_name/../modified_file.sql.gz')
dbutils.fs.cp('abfss://test@datalake_dedicated_sa.dfs.core.windows.net/raw/source_name/../modified_file.sql.gz',
              'file:/tmp/temp.sql.gz')
```
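For the "delta-tableify this" step, here is a rough sketch of one way to track which dump files are new or modified between runs. It assumes a hypothetical tracking table `main.raw.dump_file_log` and that the `LIST` output exposes `path`, `name` and `modification_time` columns; treat it as a starting point, not a finished implementation:

```
# Sketch only: log previously seen dump files in a Delta table (hypothetical name
# main.raw.dump_file_log) and anti-join the latest LIST output against it to find
# new or modified files.
listing = spark.sql("LIST 'abfss://source_container@source_sa.dfs.core.windows.net'")

log_table = "main.raw.dump_file_log"  # hypothetical tracking table
if not spark.catalog.tableExists(log_table):
    # first run: create an empty log with the same schema as the listing
    listing.limit(0).write.saveAsTable(log_table)

seen = spark.table(log_table).select("path", "modification_time")
new_or_modified = listing.join(seen, ["path", "modification_time"], "left_anti")

for row in new_or_modified.collect():
    # copy each new/modified dump into the raw zone (target path as in the snippet above)
    dbutils.fs.cp(
        row["path"],
        f"abfss://test@datalake_dedicated_sa.dfs.core.windows.net/raw/source_name/{row['name']}",
    )

# record what was ingested this run so it is skipped next time
new_or_modified.write.mode("append").saveAsTable(log_table)
```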

```
%sh
gzip -d /tmp/temp.sql.gz
head /tmp/temp.sql
```

-> confirmed it's outputting a bunch of MySQL statements

So my original plan was to spin up a temporary SQL Server instance using sqlcmd:

```
%sh
curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -
curl https://packages.microsoft.com/config/ubuntu/20.04/prod.list | sudo tee /etc/apt/sources.list.d/msprod.list
```

but the cell never resolved.

I kept thinking that there had to be a way to ingest a .sql file from Spark, so I went down the rabbit hole of SQL parsers (sqlglot, [the Scala Spark-native ParserInterface](https://stackoverflow.com/questions/52108677/how-to-extract-tables-with-data-from-sql-dumps-using-spark), and a Python equivalent), but these all seem geared towards table-level ingestion, not database/schema-level ingestion. (Plus I ran into another [fun bug](https://stackoverflow.com/questions/15434709/checksum-exception-when-reading-from-or-copying-to-hdfs-in-apache-hadoop) when trying to copy back to the external location.)

I'm going to try restoring it to MySQL now, but I'm still unsure about connecting to a temporary server set up on the cluster itself (I guess I'd use 'localhost'), and I can't find any examples of anyone doing this.

Has anyone got a better/functional solution? I'm lost in the wilderness, please send help.

(My current plan is to manually set up a server in my local environment, access it from Databricks, and copy up a single version of the data so I can at least look at it, while I keep fighting for a reasonable source of data...)

ACCEPTED SOLUTION

Anonymous
Not applicable

@Sylvia VB:

Here are some suggestions and considerations to help you navigate through the issues:

1. Accessing the MySQL datadumps:
   • As you mentioned, obtaining direct view access to the source MySQL database would be the ideal solution. This would allow you to perform incremental updates instead of relying on weekly datadumps.
   • If you are unable to get direct access, you can continue with the approach of ingesting the datadumps from Azure Blob Storage.
2. Accessing Azure Blob Storage:
   • Since you have upgraded to Gen 2 Blob storage and resolved the authentication issue, you can now access the files in the storage account using Databricks.
   • The code snippet you provided, listing and copying files, should work fine for accessing the modified files.
3. Processing the MySQL datadumps:
   • It seems you are decompressing the SQL dump file and planning to execute the SQL statements using SQL Server tools like sqlcmd.
   • However, executing the SQL statements directly in Spark may be a more efficient approach, as it eliminates the need for an intermediate SQL Server setup.
   • You can read the SQL dump file in Spark and execute the statements using the JDBC driver for MySQL (see the sketch below). This approach allows you to leverage the distributed processing capabilities of Spark and avoid the overhead of setting up and managing temporary SQL servers.
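A minimal sketch of that last suggestion, assuming the dump has already been restored to a reachable MySQL instance; the hostname, secret scope, table and catalog names below are placeholders, not values from this thread:

```
# Sketch only: read one table from a restored MySQL instance over JDBC and land it as
# a Delta table in Unity Catalog. All names/credentials here are hypothetical.
jdbc_url = "jdbc:mysql://restored-mysql-host:3306/source_db"

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "some_table")  # repeat per table in the dump
      .option("user", dbutils.secrets.get("my_scope", "mysql_user"))
      .option("password", dbutils.secrets.get("my_scope", "mysql_password"))
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .load())

(df.write.format("delta")
   .mode("overwrite")
   .saveAsTable("main.source_name.some_table"))  # hypothetical UC target table
```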


REPLIES

etsyal1e2r3
Honored Contributor

Well, ideally you can dump the data as JSON into blob storage.

Or, when you get access, make a SQL connection with Python and create the tables in blob storage (Delta is the default format).

Otherwise, does this option work?

df = spark.read.text("<path to sql.gz file>")  # returns a DataFrame; Spark reads .gz text files transparently
df.display()

scvbelle
New Contributor III

Hi!
Sorry for not replying sooner (I had to set this project aside for a bit).
I was able to dump the JSON blob and read it as a text file after changing the storage, etc.
My issue was finding a way to parse it as a database.
Sorry, I know my original post was very unclear - thanks for your input!


scvbelle
New Contributor III

Hi!
Thanks so much for this response - sorry I'm so delayed in responding.
I ended up spinning up a dedicated Azure flexible MySQL server rather than trying to set one up on the cluster machines (too many conflicting dependencies). This worked fine, but I definitely think it's not the most efficient solution: reading in the dump took an hour, and then reading the tables into Databricks took multiple hours.

I did look into processing the dumps directly in Spark, based on [this](https://stackoverflow.com/questions/52108677/how-to-extract-tables-with-data-from-sql-dumps-using-spark) thread, but it seemed a little less than reliable for automated tasks.

I'm going to have to revisit this, because I'll need some kind of change capture/diff for each weekly dump that ideally doesn't involve reprocessing the entire table, but that's a bit further down my list.

Thanks again!
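For that change-capture step, a rough sketch of one possible approach: load each weekly dump into a staging table and MERGE it into the Delta target so only changed rows are rewritten. The table names and the `id` key column below are hypothetical, and it assumes each source table has a stable primary key.

```
# Sketch: upsert a freshly loaded weekly snapshot into the existing Delta table.
# Table names and the `id` key column are hypothetical placeholders.
from delta.tables import DeltaTable

staging = spark.table("main.staging.some_table")              # this week's full snapshot
target = DeltaTable.forName(spark, "main.source_name.some_table")

(target.alias("t")
 .merge(staging.alias("s"), "t.id = s.id")
 .whenMatchedUpdateAll()      # rewrite rows whose key already exists
 .whenNotMatchedInsertAll()   # insert brand-new rows
 .execute())

# Rows deleted upstream would still need handling, e.g. whenNotMatchedBySourceDelete()
# on Delta Lake versions that support it.
```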

Anonymous
Not applicable

Hi @Sylvia VB

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
