Here is the situation I am working with. I am extracting source data from SQL Server databases using the Databricks JDBC connector. I want to write that data into a directory in my data lake as JSON files, then have AutoLoader ingest those files into a Delta Table.
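Simplified, the extract and landing step looks something like this (the server, secret scope, table, and path names below are placeholders, not my real values):

```python
# Minimal sketch of the extract step, run in a Databricks notebook where
# `spark` and `dbutils` already exist. Connection details are placeholders.
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.SourceTable")
    .option("user", dbutils.secrets.get("my-scope", "sql-user"))
    .option("password", dbutils.secrets.get("my-scope", "sql-password"))
    .load()
)

# Land the data as JSON in the data lake folder that AutoLoader watches
df.write.mode("overwrite").json(
    "abfss://landing@mydatalake.dfs.core.windows.net/source_table/"
)
```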
When I use Azure Data Factory to write a single JSON file, AutoLoader works perfectly. When I use PySpark to write the JSON data, I instead get a folder with the name of my file containing multiple JSON part files, and AutoLoader doesn't seem to ingest that data. When I convert the data frame to a Pandas data frame (so I can write a single file), I run into out-of-memory errors.
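For reference, the AutoLoader stream is set up roughly like this (again, the paths, schema/checkpoint locations, and target table name are placeholders):

```python
# Roughly how AutoLoader reads the landing folder into a Delta table.
stream_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation",
            "abfss://landing@mydatalake.dfs.core.windows.net/_schemas/source_table")
    .load("abfss://landing@mydatalake.dfs.core.windows.net/source_table/")
)

(
    stream_df.writeStream
    .option("checkpointLocation",
            "abfss://landing@mydatalake.dfs.core.windows.net/_checkpoints/source_table")
    .trigger(availableNow=True)
    .toTable("bronze.source_table")
)
```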
The current workaround I am using is to load the source data into a PySpark data frame and write it into a Delta Table, then save it as JSON as a backup in case I need to rebuild the Delta Table. We are looking at using Delta Live Tables in the future, so will this workaround impact our ability to adopt Delta Live Tables, and is there a better way to achieve this?
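To make the workaround concrete, it looks roughly like this (reusing the JDBC DataFrame `df` from the first snippet; table and path names are illustrative):

```python
# 1. Write the extracted data straight into a Delta table
df.write.format("delta").mode("overwrite").saveAsTable("bronze.source_table")

# 2. Also save a JSON copy as a backup for rebuilding the table later
df.write.mode("overwrite").json(
    "abfss://backup@mydatalake.dfs.core.windows.net/source_table/"
)
```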