Data Engineering
How to extract source data from on-premises databases into a data lake and load it with Auto Loader?

JesseS
New Contributor

Here is the situation I am working with. I am trying to extract source data from SQL Server databases using the Databricks JDBC connector. I want to write the data out as JSON files to a directory in my data lake, then have Auto Loader ingest those files into a Delta table.

When I use Azure Data Factory to write a single JSON file, the Auto Loader component works perfectly. When I use PySpark to write the JSON data, I get a directory with the name of my file containing multiple JSON part files, and Auto Loader doesn't seem to want to ingest that data. When I convert the DataFrame to a pandas DataFrame, I run into out-of-memory errors.

The current workaround is to load the source data into a PySpark DataFrame and write it to a Delta table, then save it as JSON as a backup in case I need to rebuild the Delta table. We are looking at using Delta Live Tables in the future. Will this workaround impact our ability to use Delta Live Tables, and is there a better way to achieve this?
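For reference, here is roughly what the workaround looks like (a minimal sketch; the server, credentials, table names, and paths are placeholders):

```python
# Rough sketch of the current workaround; connection details and paths are placeholders.
jdbc_url = "jdbc:sqlserver://myserver.example.com:1433;databaseName=mydb"

# Read the source table over JDBC into a Spark DataFrame.
df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.my_source_table")
    .option("user", "my_user")
    .option("password", "my_password")
    .load()
)

# Write to a Delta table for downstream use ...
df.write.format("delta").mode("overwrite").save("/mnt/lake/delta/my_source_table")

# ... and keep a JSON copy as a backup in case the Delta table needs to be rebuilt.
# Spark writes a directory of part files, not a single JSON file.
df.write.mode("overwrite").json("/mnt/lake/raw/my_source_table_json")
```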

1 ACCEPTED SOLUTION

Aashita
Contributor III
  1. To add to @werners point, I would use ADF to load the SQL Server data into ADLS Gen2 as JSON.
  2. Then load these raw JSON files from your ADLS base location into a Delta table using Auto Loader (see the sketch after this list).
  3. Delta Live Tables can be used in this scenario.
  4. You can also register for this workshop, which walks you through the code and explains how to get your data from source to Databricks using Auto Loader: https://pages.databricks.com/202210-AMER-FE-102022-St-Louis-Lakehouse-Hands-On-Lab-_01-Registration-...
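A minimal Auto Loader sketch for step 2 (the ADLS paths, checkpoint location, and target table name are placeholders for your environment):

```python
# Minimal Auto Loader sketch: ingest raw JSON files from ADLS Gen2 into a Delta table.
# Paths and table names below are placeholders.
raw_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/sqlserver/my_table/"
checkpoint_path = "abfss://bronze@mystorageaccount.dfs.core.windows.net/_checkpoints/my_table/"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)  # where the inferred schema is tracked
    .load(raw_path)
)

(
    df.writeStream.format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)  # process all available files, then stop
    .toTable("bronze.my_table")
)
```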


2 REPLIES

-werners-
Esteemed Contributor III

This is due to the distributed/parallel nature of Spark.

You could add a .coalesce(1) to your write statement. That way only one file will be generated, but it will still reside in a directory and have an auto-generated name.

Frankly, I would use Data Factory for ingest. It is cheap and fast; a pain to use, but see the previous two points 🙂
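Something like this (a rough sketch; the output path is a placeholder):

```python
# coalesce(1) forces a single output file, but Spark still writes it inside a
# directory and gives it an auto-generated part-0000x name.
(
    df.coalesce(1)
    .write.mode("overwrite")
    .json("/mnt/lake/raw/my_table_json")
)
```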
