Data Engineering

How to extract source data from on-premises databases into a data lake and load it with Auto Loader?

JesseS
New Contributor

Here is the situation I am working with. I am trying to extract source data from SQL Server databases using the Databricks JDBC connector. I want to write that data into a directory in my data lake as JSON files, then have Auto Loader ingest those files into a Delta table.

When I use Azure Data Factory to write a single JSON file, the Auto Loader component works perfectly. When I use PySpark to write the JSON data, I get a folder (named after my file) that contains multiple JSON files, and Auto Loader doesn't seem to want to ingest that data. When I convert the DataFrame to a pandas DataFrame, I run into out-of-memory errors.

The current workaround I am using is to load the source data into a PySpark DataFrame and write it into a Delta table, then save it as a JSON file as a backup in case I need to rebuild the Delta table. We are looking at using Delta Live Tables in the future, so will this workaround impact our ability to use Delta Live Tables, and is there a better way to achieve this?
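
For context, here is roughly what that workaround looks like (the server, secret scope, table names, and storage paths below are placeholders, not my real values):

```python
# Rough sketch of the current workaround; connection details and paths are placeholders.
jdbc_url = "jdbc:sqlserver://onprem-sql.example.com:1433;databaseName=SalesDB"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.Orders")
    .option("user", dbutils.secrets.get("my-scope", "sql-user"))
    .option("password", dbutils.secrets.get("my-scope", "sql-password"))
    .load()
)

# Write straight into a Delta table ...
df.write.format("delta").mode("overwrite").saveAsTable("bronze.orders")

# ... and keep a JSON copy as a backup for rebuilding the table later.
df.write.mode("overwrite").json("abfss://backup@mydatalake.dfs.core.windows.net/orders/")
```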


2 REPLIES

-werners-
Esteemed Contributor III

This is due to the distributed/parallel nature of Spark: each partition of the DataFrame is written out as its own file.

You could add .coalesce(1) to your write statement. That way only one file will be generated, BUT it will still reside in a directory and have a funny auto-generated name.
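
For example, a minimal sketch, assuming the source has already been read into a DataFrame named df and the lake path below is a placeholder:

```python
# Collapse the output to a single part file with coalesce(1).
# Spark still writes that one file inside the target directory,
# under an auto-generated part-xxxxx name.
(
    df.coalesce(1)
      .write.mode("overwrite")
      .json("abfss://raw@mydatalake.dfs.core.windows.net/sqlserver/orders/")
)
```

Keep in mind that coalesce(1) funnels the whole write through a single task, which can be slow or memory-heavy for large tables.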

Frankly, I would use Data Factory for the ingest. Cheap, fast, a pain to use, but see the previous two points 🙂

Aashita
Contributor III (Accepted Solution)
  1. To add to @werners point, I would use ADF to load the SQL Server data into ADLS Gen2 as JSON.
  2. Then load those raw JSON files from your ADLS base location into a Delta table using Auto Loader (see the sketch after this list).
  3. Delta Live Tables can be used in this scenario (also sketched below).
  4. You can also register for this workshop, which walks you through the code and explains how to get your data from source to Databricks using Auto Loader: https://pages.databricks.com/202210-AMER-FE-102022-St-Louis-Lakehouse-Hands-On-Lab-_01-Registration-...
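
A minimal Auto Loader sketch for step 2, assuming hypothetical ADLS paths and a placeholder target table name:

```python
# Minimal Auto Loader sketch; paths and table name are placeholders.
raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/sqlserver/orders/"
checkpoint_path = "abfss://raw@mydatalake.dfs.core.windows.net/_checkpoints/orders/"

(
    spark.readStream.format("cloudFiles")                  # Auto Loader source
    .option("cloudFiles.format", "json")                   # raw JSON files written by ADF
    .option("cloudFiles.schemaLocation", checkpoint_path)  # where inferred schema is tracked
    .load(raw_path)
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                            # process all pending files, then stop
    .toTable("bronze.orders")
)
```

And for step 3, roughly the same ingestion expressed as a Delta Live Tables pipeline, again with placeholder names:

```python
# Minimal Delta Live Tables sketch; table name and path are placeholders.
import dlt

@dlt.table(name="orders_bronze", comment="Raw SQL Server extracts ingested with Auto Loader")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("abfss://raw@mydatalake.dfs.core.windows.net/sqlserver/orders/")
    )
```

Both read the same raw JSON folder incrementally, so the same ADF extract can feed either approach.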
