
Data Insertion

bachan
New Contributor II

Scenario: Data from blob storage to SQL db once a week.

The blob storage holds 15 days of data (from the current date out to the next 15 days), stored date-wise in parquet format. Every seven days the next 15-day window is loaded: for the first seven days the data stays the same, and when it refreshes it overwrites the existing 8 days of data and extends the window by another 7 days. Each day has roughly 10 lakh (1 million) rows, so the full 15-day window is about 1.5 crore (15 million) rows.

Now I need to create a pipeline that reads the parquet files and inserts the data into the SQL DB, where the data has to keep accumulating continuously.

My Approach: I have written PySpark code that reads the data from the parquet files and formats it.
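
Roughly, the read/format step looks like this (the storage account, container, folder layout and column name are placeholders, not my real values):

    # Sketch of the read/format step; path and column names are placeholders.
    from pyspark.sql import functions as F

    raw_df = spark.read.parquet(
        "wasbs://<container>@<storage-account>.blob.core.windows.net/daily-data/"
    )

    # Normalise the date column so it can act as the key for the upsert later.
    formatted_df = (
        raw_df
        .withColumn("event_date", F.to_date(F.col("event_date")))
        .dropDuplicates()
    )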

Now I need to insert the new data into the SQL table that already holds data, using the date as the key: if data for a particular date is already present in the SQL table, I use an Azure Function to delete that date's rows and then reinsert the new data; if there is no data for that date, I simply append it to the table. I am using JDBC in Databricks to insert the data.
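
To make the delete-then-append idea concrete, this is roughly what it looks like. The JDBC URL, credentials, table and column names are placeholders; for illustration the delete is shown through a plain JDBC connection from the Databricks driver, whereas in my current setup that delete is what the Azure Function does. formatted_df and event_date are from the read sketch above.

    # Placeholders only: connection details and target table.
    jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
    props = {
        "user": "<user>",
        "password": "<password>",
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }
    target_table = "dbo.daily_data"

    # Dates present in the freshly read parquet files.
    new_dates = [
        row["event_date"]
        for row in formatted_df.select("event_date").distinct().collect()
    ]

    # Delete existing rows for those dates via a direct JDBC connection
    # (illustration only; assumes the MS SQL JDBC driver is on the cluster
    # classpath, which standard Databricks runtimes provide).
    conn = spark.sparkContext._gateway.jvm.java.sql.DriverManager.getConnection(
        jdbc_url, props["user"], props["password"]
    )
    stmt = conn.createStatement()
    for d in new_dates:
        stmt.executeUpdate(f"DELETE FROM {target_table} WHERE event_date = '{d}'")
    stmt.close()
    conn.close()

    # Append the refreshed data, keyed on event_date.
    (formatted_df.write
        .mode("append")
        .jdbc(url=jdbc_url, table=target_table, properties=props))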

Purpose: The goal is to keep appending to the table while preserving the historical data as it is; when new data arrives, only the dates that changed should be overwritten and the remaining data inserted, with the date acting as the key.

 

Issue: When I call the Azure Function I get a timeout error. The data in the SQL table keeps growing day by day, so now the function fails every time with a 504 response.

Because of that I am not able to insert the new data. I can't keep doing it manually every time (deleting the rows for the dates that were refreshed in the parquet files and then inserting the new data), so I want to automate the process. How can I do that?

Note: Earlier, when the data volume was smaller, the whole flow worked fine, but as the data accumulated the Function App started returning 504. I also can't just loop and retry until I get a 200, because until the delete succeeds I cannot insert the new data.

I can use any Azure resource, or a combination of resources. Please suggest an approach.

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @bachan, based on your scenario, you might consider using Azure Data Factory (ADF) for your data pipeline. Azure Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data.

Here is a high-level approach you can take:

1. Use Azure Data Factory to create a pipeline that triggers your PySpark code in Databricks. This PySpark code will read the parquet files from blob storage and format the data (a minimal sketch of such a notebook is at the end of this reply).

2. In your Azure Data Factory pipeline, add a 'Lookup' activity to query your SQL database and check if data for a particular date exists.

3. Depending on the outcome of the 'Lookup' activity, you can add a 'Conditional Split' activity to decide whether to delete existing data or insert new data.

4. If data for a particular date exists, use a 'Stored Procedure' activity to delete the existing data in your SQL database.

5. Then, insert the new data into your SQL database using a 'Copy' activity.

6. If data for a particular date does not exist, use a 'Copy' activity to insert the new data into your SQL database.

This approach should help you automate checking for existing data, deleting it if necessary, and inserting new data. 

Please note that this is a high-level approach, and you might need to adjust it based on your specific requirements and constraints.
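
As an illustration of step 1, the Databricks notebook that ADF triggers could take the run date as a parameter so the rest of the pipeline can work per date. The widget name, storage path and return value below are just examples, not a prescribed contract:

    # Example only: "run_date" is passed in from the ADF Databricks activity.
    dbutils.widgets.text("run_date", "")
    run_date = dbutils.widgets.get("run_date")

    # Read just that date's parquet folder (path is a placeholder) and format it.
    df = spark.read.parquet(
        f"wasbs://<container>@<storage-account>.blob.core.windows.net/daily-data/{run_date}/"
    )

    # Return something small that downstream ADF activities can branch on,
    # e.g. the number of rows read for this date.
    dbutils.notebook.exit(str(df.count()))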

bachan
New Contributor II

Thanks, Kaniz. I tried this solution earlier, but the problem with it is that in ADF I can schedule the pipeline, yet to get pass/fail status by email I had to bring in a Logic App; I could not find a way to both schedule it and get email updates from ADF alone, which is very simple in Databricks.
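
For context, this is the kind of thing I meant by "simple in Databricks": the schedule and the pass/fail emails can sit directly on the job, either in the Jobs UI or via the Jobs API 2.1. A rough sketch (workspace URL, token, cluster id, notebook path, cron expression and addresses are all placeholders):

    import requests

    # Placeholders for the workspace and credentials.
    host = "https://<workspace>.azuredatabricks.net"
    headers = {"Authorization": "Bearer <personal-access-token>"}

    job_spec = {
        "name": "weekly-parquet-to-sql",
        "tasks": [{
            "task_key": "load",
            "existing_cluster_id": "<cluster-id>",
            "notebook_task": {"notebook_path": "/Users/<me>/load_parquet_to_sql"},
        }],
        # Weekly schedule; the cron expression is just an example.
        "schedule": {"quartz_cron_expression": "0 0 6 ? * MON", "timezone_id": "UTC"},
        # Pass/fail emails without any extra Logic App.
        "email_notifications": {
            "on_success": ["<me@example.com>"],
            "on_failure": ["<me@example.com>"],
        },
    }

    resp = requests.post(f"{host}/api/2.1/jobs/create", headers=headers, json=job_spec)
    resp.raise_for_status()
    print(resp.json())  # contains the new job_id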
