The above set of activities took about 4 hours to explore and design in ADF, given its ease of use, built-in connections, and monitoring; it could probably have taken 4 days or more using Databricks Workflows.
The ADF integration runtime was up for under a minute to complete all these activities, with limited compute usage, probably lower than Workflows in this case since the work is not compute intensive.
Can there be a better solution through Databricks with similar ease of use and less compute?
The detailed steps performed in ADF are illustrated below:
- The first step is the generation of a Session ID:
It will require a Web activity to be created in the ADF pipeline with the POST method and a body incorporating the username and password; the request will hit the Informatica login API endpoint.
In a real implementation, the body can be passed secret values from Key Vault; a random username and password have been placed in the body here for demonstration purposes.
It will generate an output set from which “orgUuid” and “iCsessionid” will be used in the next step, Access Token Generation.
These values are generated dynamically and will change with every execution.
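As a rough sketch, the Web activity could be defined as below. The login URL is illustrative (use the login endpoint for your Informatica pod), and the credentials are placeholders that should come from Key Vault:

```json
{
    "name": "Session_ID_Generation",
    "type": "WebActivity",
    "description": "Illustrative sketch: POST credentials to the Informatica login endpoint. URL and credentials are placeholders.",
    "typeProperties": {
        "url": "https://<informatica-pod>.informaticacloud.com/ma/api/v2/user/login",
        "method": "POST",
        "headers": {
            "Content-Type": "application/json"
        },
        "body": {
            "username": "demo_user@example.com",
            "password": "demo_password"
        }
    }
}
```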
- The second step is Access Token Generation:
It will require another subsequent Web activity to be created in the ADF pipeline with the POST method and the URL for access token generation.
It will have two additional headers, “IDS-SESSION-ID” and “Cookie”, in a particular format alongside the URL; their values will be fetched dynamically from the previous Session ID Generation activity.
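A sketch of this activity is below. The token URL is a placeholder, the exact Cookie format is an assumption (it varies by setup), and the output property names must match what the login call actually returned:

```json
{
    "name": "Access_Token_Generation",
    "type": "WebActivity",
    "description": "Illustrative sketch: exchange the session ID for a JWT. URL and Cookie format are assumptions.",
    "typeProperties": {
        "url": "https://<informatica-host>/<access-token-endpoint>",
        "method": "POST",
        "headers": {
            "IDS-SESSION-ID": "@{activity('Session_ID_Generation').output.iCsessionid}",
            "Cookie": "@{concat('USER_SESSION=', activity('Session_ID_Generation').output.iCsessionid)}"
        },
        "body": "{}"
    }
}
```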
The output of the Access Token Generation step will contain the “jwt_token”, in the format shown below:
- The third step will be Job ID Generation:
It will require another subsequent Web activity to be created in the ADF pipeline with the POST method and the URL for Job ID generation.
It will have the additional parameters “from” and “size” defined in the body.
It will also have two additional headers: “X-INFRA-ORG-ID”, fetched dynamically from the first activity, Session ID Generation, and “Authorization”, in a particular format, fetched dynamically from the previous Access Token Generation activity.
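A sketch of the activity is below; the endpoint, the from/size values, and the Bearer format of the Authorization header are assumptions:

```json
{
    "name": "Job_ID_Generation",
    "type": "WebActivity",
    "description": "Illustrative sketch: submit the job request with paging parameters and dynamic headers.",
    "typeProperties": {
        "url": "https://<informatica-host>/<job-endpoint>",
        "method": "POST",
        "headers": {
            "X-INFRA-ORG-ID": "@{activity('Session_ID_Generation').output.orgUuid}",
            "Authorization": "@{concat('Bearer ', activity('Access_Token_Generation').output.jwt_token)}"
        },
        "body": {
            "from": 0,
            "size": 100
        }
    }
}
```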
The dynamic headers will be as shown below:
The output of the Job ID Generation step:
The output of Job_ID_Generation will include the JobId, trackingURI, and OutputURI, which will be used in subsequent steps for tracking the job and getting the attachment.
- The fourth step will be Track Job Status:
It will require another subsequent Web activity to be created in the ADF pipeline, this time with the GET method and a dynamic URL to track the job status. The URL will incorporate the Job ID generated dynamically in the previous step.
The format of the dynamic URL, concatenated with the Job ID fetched from the previous step, is depicted below.
It will also have the same two additional headers as in the last step: “X-INFRA-ORG-ID”, fetched dynamically from the first activity, Session ID Generation, and “Authorization”, in a particular format, fetched dynamically from the Access Token Generation activity.
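The dynamic URL can be built with a concat expression, as in the sketch below; the status path is a placeholder (alternatively, the trackingURI returned by the previous step could be used directly):

```json
{
    "name": "Track_Job_Status",
    "type": "WebActivity",
    "description": "Illustrative sketch: poll the job status using the Job ID from the previous step.",
    "typeProperties": {
        "url": "@{concat('https://<informatica-host>/<job-endpoint>/', activity('Job_ID_Generation').output.JobId, '/status')}",
        "method": "GET",
        "headers": {
            "X-INFRA-ORG-ID": "@{activity('Session_ID_Generation').output.orgUuid}",
            "Authorization": "@{concat('Bearer ', activity('Access_Token_Generation').output.jwt_token)}"
        }
    }
}
```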
- The fifth step will be to wait and re-check the job status after a certain interval.
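This wait-and-recheck loop maps naturally onto an Until activity that wraps a Wait activity together with the status check from the previous step; the 'COMPLETED' status value is an assumption and should match what the tracking API actually returns:

```json
{
    "name": "Wait_And_Recheck",
    "type": "Until",
    "description": "Illustrative sketch: loop until the job reports completion, pausing between polls.",
    "typeProperties": {
        "expression": {
            "value": "@equals(activity('Track_Job_Status').output.status, 'COMPLETED')",
            "type": "Expression"
        },
        "timeout": "0.01:00:00",
        "activities": [
            {
                "name": "Wait_Interval",
                "type": "Wait",
                "typeProperties": { "waitTimeInSeconds": 60 }
            },
            {
                "name": "Track_Job_Status",
                "type": "WebActivity",
                "description": "The status-check Web activity from the previous step, placed inside the loop.",
                "typeProperties": {
                    "url": "@{concat('https://<informatica-host>/<job-endpoint>/', activity('Job_ID_Generation').output.JobId, '/status')}",
                    "method": "GET",
                    "headers": {
                        "Authorization": "@{concat('Bearer ', activity('Access_Token_Generation').output.jwt_token)}"
                    }
                }
            }
        ]
    }
}
```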
- The sixth step will be Get Attachment:
It will require another subsequent Web activity to be created in the ADF pipeline, again with the GET method, and a dynamic URL to retrieve the attachment. The URL will incorporate the “OutputURI” generated dynamically in the Job ID Generation step.
The format of the dynamic URL, built from the OutputURI fetched from the Job ID Generation step, is depicted below.
It will also have one additional header, “Authorization”, in a particular format fetched dynamically from the Access Token Generation activity, as used in the Job ID Generation and wait-and-check steps.
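A sketch, assuming OutputURI is an absolute URL returned by Job_ID_Generation:

```json
{
    "name": "Get_Attachment",
    "type": "WebActivity",
    "description": "Illustrative sketch: fetch the attachment from the OutputURI returned earlier.",
    "typeProperties": {
        "url": "@{activity('Job_ID_Generation').output.OutputURI}",
        "method": "GET",
        "headers": {
            "Authorization": "@{concat('Bearer ', activity('Access_Token_Generation').output.jwt_token)}"
        }
    }
}
```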
The output of the step, shown in the Response section below, is an unreadable xlsx attachment. We will download it in the next step.
- The seventh step will be Extract Data:
The last step will be a Copy activity to fetch the data. It will have a source dataset of type Binary, with a dynamic URL passed the OutputURI generated in the Job_ID_Generation step shown earlier.
The request method will be GET, and it will require an additional Authorization header carrying the jwt token in the format shown in the earlier steps.
The associated linked service will be of HTTP type, designed with a dynamic URL that can be provided at run time from the previous steps, and with the authentication type set to Anonymous, since authentication is handled at run time with the token.
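A parameterized HTTP linked service along these lines could serve; the baseUrl parameter name is an assumption:

```json
{
    "name": "HTTP_Informatica_LS",
    "properties": {
        "type": "HttpServer",
        "description": "Illustrative sketch: parameterized HTTP linked service; the URL is supplied at run time.",
        "parameters": {
            "baseUrl": { "type": "String" }
        },
        "typeProperties": {
            "url": "@{linkedService().baseUrl}",
            "enableServerCertificateValidation": true,
            "authenticationType": "Anonymous"
        }
    }
}
```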
We would need to configure a sink setting to dump the data into ADLS. The sink dataset designed here is also of type Binary, to capture the attachment as-is, with the dataset properties of ADLS name, container, file path, and file name defined.
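The Copy activity itself could be sketched as below; the dataset names are assumptions, and additionalHeaders injects the Authorization header at run time:

```json
{
    "name": "Extract_Data",
    "type": "Copy",
    "description": "Illustrative sketch: copy the binary attachment from the HTTP source to ADLS.",
    "typeProperties": {
        "source": {
            "type": "BinarySource",
            "storeSettings": {
                "type": "HttpReadSettings",
                "requestMethod": "GET",
                "additionalHeaders": "@{concat('Authorization: Bearer ', activity('Access_Token_Generation').output.jwt_token)}"
            }
        },
        "sink": {
            "type": "BinarySink",
            "storeSettings": { "type": "AzureBlobFSWriteSettings" }
        }
    },
    "inputs": [
        { "referenceName": "Http_Binary_Source_DS", "type": "DatasetReference" }
    ],
    "outputs": [
        { "referenceName": "ADLS_Binary_Sink_DS", "type": "DatasetReference" }
    ]
}
```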
The associated linked service of type ADLS is as shown below:
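A minimal sketch of such an ADLS Gen2 linked service, assuming account-key authentication with the key held in Key Vault (storage account and secret names are placeholders):

```json
{
    "name": "ADLS_Sink_LS",
    "properties": {
        "type": "AzureBlobFS",
        "description": "Illustrative sketch: ADLS Gen2 linked service with the account key read from Key Vault.",
        "typeProperties": {
            "url": "https://<storage-account>.dfs.core.windows.net",
            "accountKey": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "KeyVault_LS",
                    "type": "LinkedServiceReference"
                },
                "secretName": "adls-account-key"
            }
        }
    }
}
```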
The file dumped at the end of the last step is as shown below: