Data Engineering
How to extract source data from on-premises databases into a data lake and load it with Auto Loader?

JesseS
New Contributor

Here is the situation I am working with. I am trying to extract source data from SQL Server databases using the Databricks JDBC connector. I want to write the data out as JSON files to a directory in my data lake, then have Auto Loader ingest those files into a Delta table.

When I use Azure Data Factory to write a single JSON file, the Auto Loader component works perfectly. When I use PySpark to write the JSON data, I get a directory with the name of my file containing multiple JSON part files, and Auto Loader doesn't seem to want to ingest that data. When I convert the DataFrame to a pandas DataFrame, I run into out-of-memory errors.

The current workaround is to load the source data into a PySpark DataFrame and write it to a Delta table, then save it as JSON as a backup in case I need to rebuild the Delta table. We are looking at using Delta Live Tables in the future. Will this workaround impact our ability to use Delta Live Tables, and is there a better way to achieve this?
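For reference, here is roughly what the workaround looks like (a minimal sketch; the server, credentials, table names, and paths are placeholders):

```python
# Rough sketch of the current workaround; connection details and paths are placeholders.
jdbc_url = "jdbc:sqlserver://myserver.example.com:1433;databaseName=mydb"

# Read the source table over JDBC into a Spark DataFrame.
df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.my_source_table")
    .option("user", "my_user")
    .option("password", "my_password")
    .load()
)

# Write to a Delta table for downstream use ...
df.write.format("delta").mode("overwrite").save("/mnt/lake/delta/my_source_table")

# ... and keep a JSON copy as a backup in case the Delta table needs to be rebuilt.
# Spark writes a directory of part files, not a single JSON file.
df.write.mode("overwrite").json("/mnt/lake/raw/my_source_table_json")
```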

1 ACCEPTED SOLUTION

Aashita
Contributor III
  1. To add to @werners point, I would use ADF to load the SQL Server data into ADLS Gen2 as JSON.
  2. Then load these raw JSON files from your ADLS base location into a Delta table using Auto Loader (see the sketch after this list).
  3. Delta Live Tables can be used in this scenario.
  4. You can also register for this workshop, which walks you through the code and explains how to get your data from source to Databricks using Auto Loader: https://pages.databricks.com/202210-AMER-FE-102022-St-Louis-Lakehouse-Hands-On-Lab-_01-Registration-...
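A minimal Auto Loader sketch for step 2 (the ADLS paths, checkpoint location, and target table name are placeholders for your environment):

```python
# Minimal Auto Loader sketch: ingest raw JSON files from ADLS Gen2 into a Delta table.
# Paths and table names below are placeholders.
raw_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/sqlserver/my_table/"
checkpoint_path = "abfss://bronze@mystorageaccount.dfs.core.windows.net/_checkpoints/my_table/"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)  # where the inferred schema is tracked
    .load(raw_path)
)

(
    df.writeStream.format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)  # process all available files, then stop
    .toTable("bronze.my_table")
)
```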


2 REPLIES

-werners-
Esteemed Contributor III

This is due to the distributed/parallel nature of Spark.

You could add a .coalesce(1) to your write statement. That way only one file will be generated, but it will still reside in a directory and have an auto-generated name.

Frankly, I would use Data Factory for ingest. It is cheap and fast; a pain to use, but see the previous two points 🙂
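Something like this (a rough sketch; the output path is a placeholder):

```python
# coalesce(1) forces a single output file, but Spark still writes it inside a
# directory and gives it an auto-generated part-0000x name.
(
    df.coalesce(1)
    .write.mode("overwrite")
    .json("/mnt/lake/raw/my_table_json")
)
```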
