Databricks Community

bachan · ‎09-04-2023

Scenario: Data from blob storage to SQL db once a week.

I have 15(from current date to next 15 days) days data into the blob storage, stored date wise in parquet format, and after seven days the next 15 days data will be inserted. Means till 7th day the data will be same and after seven days when data will be refreshed, it will overwrite the existing 8 days data and extend till another 7 days. For one day the data is nearly 10lacs rows, means for whole 15 days, I have nearly 1.5 crore.

Now I need to create a pipeline, where I need to read the parquet file, and I will insert the data into the SQL db, where I need to store the data continuously.

My Approach: I have written a PySpark code to read the data from parquet file, and format it.

Now I need to insert the new data to the SQL table, where the data is present, so I am taking date as the key, if the data for the particular date is present in the SQL table then I will use Azure functions to delete the data for that date and reinsert the new data, and if the data for the particular date is not present then I will append the data into the table. I am using jdbc in databricks to insert the data.

Purpose: Purpose is to append the table and maintain the historical data as it is, if the new data comes then I need to overwrite only that data which is new and insert the remaining data, considering the date as the key.

Issue: when I use Azure functions, it gives me timeout error, because day by day the data is increasing in the SQL table, so the problem is everytime I am getting the timeout error with response as 504.

So in that case I am not able to insert the new data, each and everytime I can not do it manually to delete the data for the dates which are updated in parquet file, and then insert the new data. So I want to automate the process, suggest me how can I do it ?

Note : Earlier when the data was less then the whole code was working fine, but as data started accumulation, function app started giving response as 504, i cannot go for a while statement that until I get response as 200, i cannot insert new data.

I can use any azure resource, or combination of resources. Suggest me the approach here ....

bachan · ‎09-06-2023

Thanks, kaniz, I tried this solution earlier, but the problem associated with this procedure is, in ADF, you can schedule a pipeline, but for getting pass/fail status on emails, I need to use logic app, because I could not find any way, where I can schedule it and get updates on emails, which is very simple in databricks.

Databricks Community

Data Insertion

Connect with Databricks Users in Your Area

Intelligent Data Warehousing: AI/BI for Self-service Analytics

Introducing SAP Databricks

Serverless Compute for Notebooks, Workflows and Pipelines is now Generally Available on Google Cloud

Welcoming BladeBridge to Databricks: Accelerating Data Warehouse Migrations to Lakehouse

Databricks Clean Rooms: Now Generally Available on AWS and Azure