01-12-2023 07:02 AM
Here is the situation I am working with. I am trying to extract source data from SQL Server databases using the Databricks JDBC connector. I want to write that data into a directory in my data lake as JSON files, then have Auto Loader ingest those files into a Delta table.
When I use Azure Data Factory to write a single JSON file, the Auto Loader component works perfectly. When I use PySpark to write the JSON data, I get a folder named after my file that contains multiple JSON part files, and Auto Loader doesn't seem to want to ingest that data. When I convert the DataFrame to a Pandas DataFrame, I run into out-of-memory errors.
The current workaround I am using is to load the source data into a PySpark DataFrame, write it into a Delta table, and then save it as JSON as a backup in case I need to rebuild the Delta table. We are looking at using Delta Live Tables in the future, so will this workaround impact our ability to use Delta Live Tables, and is there a better way to achieve this?
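For reference, here is a minimal PySpark sketch of that workaround (JDBC read, Delta write, JSON backup); the JDBC URL, credentials, table names, and lake paths are hypothetical placeholders.

```python
# Minimal sketch of the current workaround; all connection details and paths are placeholders.
source_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
    .option("dbtable", "dbo.my_source_table")
    .option("user", "my_user")
    .option("password", "my_password")
    .load())

# Primary copy: write straight into a Delta table.
source_df.write.format("delta").mode("overwrite").saveAsTable("bronze.my_source_table")

# Backup copy: note that Spark writes a *directory* of part files, not a single JSON file.
source_df.write.mode("overwrite").json("abfss://lake@mystorage.dfs.core.windows.net/backup/my_source_table/")
```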
Accepted Solutions
01-12-2023 09:41 AM
- To add to @werners point, I would use ADF to load the SQL Server data into ADLS Gen2 as JSON.
- Then load these raw JSON files from your ADLS base location into a Delta table using Auto Loader (a minimal sketch follows below this list).
- Delta Live Tables can be used in this scenario.
- You can also register for this workshop, which walks you through the code and explains how to get your data from source to Databricks using Auto Loader: https://pages.databricks.com/202210-AMER-FE-102022-St-Louis-Lakehouse-Hands-On-Lab-_01-Registration-...
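As a rough illustration of the Auto Loader step, a sketch like the following would stream the raw JSON files into a Delta table; the storage account, paths, and target table name are hypothetical.

```python
# Minimal Auto Loader sketch: stream raw JSON files from ADLS Gen2 into a Delta table.
# The storage account, paths, and table name are hypothetical placeholders.
raw_path = "abfss://lake@mystorage.dfs.core.windows.net/raw/my_source_table/"

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", raw_path + "_schema/")
    .load(raw_path))

(df.writeStream
    .option("checkpointLocation", raw_path + "_checkpoint/")
    .trigger(availableNow=True)
    .toTable("bronze.my_source_table"))
```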
01-12-2023 07:37 AM
This is due to the distributed/parallel nature of Spark.
You could add a .coalesce(1) to your write statement. That way only one file will be generated, BUT it will still reside in a directory and have an auto-generated name (see the sketch below).
Frankly, I would use Data Factory for ingest. Cheap, fast, a pain to use, but see the previous two points 🙂
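A hedged sketch of the coalesce approach, assuming `df` is the extracted DataFrame and with a placeholder output path:

```python
# Write the DataFrame as a single JSON part file; the output path is a placeholder.
(df.coalesce(1)
    .write
    .mode("overwrite")
    .json("abfss://lake@mystorage.dfs.core.windows.net/raw/my_source_table/"))
# Spark still creates a directory at that path containing one part-00000-*.json file;
# the file name itself cannot be chosen this way.
```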