4 weeks ago - last edited 4 weeks ago
I have to create a data pipeline that pushes data (2 JSON files) from the source, Adobe, to ADLS Gen2 using a cron job.
1. How will my ADLS Gen2 know that a new file has arrived in the container from Adobe? I am using Databricks as the orchestrator and ETL tool.
2. What should my network and security approach be when data moves from Adobe to ADLS Gen2?
3. There are 2 files: delivery and tracking. What would the best folder structure be [separate or same folder]? I have created separate folders, but I am not sure if that is a best practice.
Please help
Thanks a lot
4 weeks ago - last edited 4 weeks ago
Hi @Datalight ,
1. You can use file arrival triggers for such a scenario (a quick sketch follows below this list):
Trigger jobs when new files arrive - Azure Databricks | Microsoft Learn
2. This question is stated too broadly. We don't know anything about your current setup. For instance, is your workspace VNet-injected with SCC enabled?
3. I would put them in separate folders (especially if their schemas differ); a layout sketch also follows below.
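For point 1, here is a minimal sketch of creating such a job with the Databricks Python SDK; the job name, notebook path, cluster ID, and storage URL are all placeholders. Note that, as I understand the linked docs, file arrival triggers monitor Unity Catalog external locations or volumes, so the path you point the trigger at needs to be registered as one.

```python
# Minimal sketch using the Databricks Python SDK (databricks-sdk);
# all names and paths below are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.create(
    name="adobe-ingest",
    tasks=[
        jobs.Task(
            task_key="ingest",
            existing_cluster_id="<cluster-id>",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/ingest_adobe"),
        )
    ],
    trigger=jobs.TriggerSettings(
        # Fire the job whenever a new file lands under this path
        file_arrival=jobs.FileArrivalTriggerConfiguration(
            url="abfss://landing@<storage-account>.dfs.core.windows.net/adobe/",
            min_time_between_triggers_seconds=60,
        ),
        pause_status=jobs.PauseStatus.UNPAUSED,
    ),
)
```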
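And for point 3, one possible separate-folder layout (container and account names are placeholders):

```python
# One folder per feed, assuming a "landing" container on the ADLS Gen2
# account; container/account names are placeholders.
BASE = "abfss://landing@<storage-account>.dfs.core.windows.net/adobe"

DELIVERY_PATH = f"{BASE}/delivery/"  # delivery feed files only
TRACKING_PATH = f"{BASE}/tracking/"  # tracking feed files only

# Separate folders let each feed keep its own schema, its own Auto Loader
# checkpoint, and its own file arrival trigger, so a bad file in one feed
# never blocks ingestion of the other.
```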
4 weeks ago - last edited 4 weeks ago
@szymon_dybczak : Thanks a lot. It is VNet-injected. All the resources run on private subnets.
Please let me know if you need any other information.
Thanks a lot
4 weeks ago
Hi @Datalight ,
Yes, I have 2 questions:
1. Where will your cron job be executed from?
2. Does your ADLS account have public network access disabled?
4 weeks ago
1. Where will your cron job be executed from?
Answer: There is no public access for external systems, so the cron job is a Databricks job in Azure.
2. Does your ADLS account have public network access disabled?
No
Kindly share your suggestion.
Many Thanks
4 weeks ago
So from a networking/security perspective, one thing doesn't add up for me. You have a VNet-injected workspace, but your storage account has public access enabled. This is a security risk. You should disable public access and create private endpoints. Then all your resources will be able to talk to each other privately.
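If it's useful, here is a rough sketch of that change with the azure-mgmt-storage Python SDK; all names are placeholders, and most teams do this through the portal, Terraform, or Bicep instead.

```python
# Hedged sketch with the azure-mgmt-storage SDK; subscription, resource
# group, and account names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.storage_accounts.update(
    "<resource-group>",
    "<storage-account>",
    {
        "public_network_access": "Disabled",             # shut the public endpoint
        "network_rule_set": {"default_action": "Deny"},  # deny anything not explicitly allowed
    },
)
# Private endpoints for the blob and dfs sub-resources are then created in
# the VNet so Databricks can still reach the account privately.
```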
4 weeks ago
@szymon_dybczak : Thanks. done. anything else you think should I take care of it with respect to networking/security perspective.
4 weeks ago
You know, it depends on how secure you want your environment to be. For example, you can enable the SCC option in your VNet-injected workspace. When secure cluster connectivity is enabled, customer virtual networks have no open ports and compute resources in the classic compute plane have no public IP addresses.
Then, you can define user-defined routes, apply them to the Databricks subnets, and redirect all outbound traffic through Azure Firewall (sketched below).
But when you don't have specific requirements, you should just stick to the approach used by your organization.
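For the UDR part, a rough sketch with the azure-mgmt-network Python SDK; the firewall IP and resource names are placeholders, and in practice this is usually done with IaC:

```python
# Hedged sketch with the azure-mgmt-network SDK; all names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

net = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Route table that forces all outbound traffic to the firewall's private IP.
net.route_tables.begin_create_or_update(
    "<resource-group>",
    "databricks-udr",
    {
        "location": "<region>",
        "routes": [
            {
                "name": "default-to-firewall",
                "address_prefix": "0.0.0.0/0",
                "next_hop_type": "VirtualAppliance",
                "next_hop_ip_address": "<firewall-private-ip>",
            }
        ],
    },
).result()

# The table is then associated with both Databricks subnets (host and
# container) via their subnet configuration.
```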
4 weeks ago
My workspace does not have Unity Catalog enabled.
4 weeks ago
Datalight,
With regards to 1), I see that you are using the medallion architecture. Have you considered using Auto Loader for the detection and ingestion of new files in ADLS Gen2?
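For instance, a minimal Auto Loader stream for the delivery feed might look like this; the paths and table name are placeholders, and the tracking feed would get its own stream with its own schema and checkpoint locations:

```python
# Minimal Auto Loader sketch for the delivery feed; runs inside a
# Databricks notebook/job, where `spark` is provided. All paths and
# table names are placeholders.
base = "abfss://landing@<storage-account>.dfs.core.windows.net/adobe"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                          # incoming Adobe JSON files
    .option("cloudFiles.schemaLocation", f"{base}/_schemas/delivery")
    .load(f"{base}/delivery/")
)

(
    df.writeStream
    .option("checkpointLocation", f"{base}/_checkpoints/delivery")
    .trigger(availableNow=True)  # process all new files, then stop - fits a scheduled job
    .toTable("bronze.adobe_delivery")
)
```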