a week ago
The upstream is sending 2 files with different schemas.
The Storage Account has Private Endpoints; there is no public access.
No Public IP (NPIP) = yes.
How to design this using only Databricks:
1. Databricks reads the data files from Adobe (via API) and pushes them to the ADLS container.
2. Pull the new data file whenever it becomes available (polling or pulling).
3. I want to replace Event Grid and the Function App with Databricks; please help me understand how to do that.
Thanks
a week ago
Hi @Pratikmsbsvm ,
Okay, since you're going to use Databricks compute for data extraction, and you wrote that your workspace is deployed with the secure cluster connectivity (NPIP) option enabled, you first need to make sure that you have a stable egress IP address.
Assuming that your workspace uses VNet injection (and not a managed VNet), add an explicit outbound method for your workspace: an Azure NAT gateway or user-defined routes (UDRs).
Once you have the stable egress IP issue sorted out, you will then need to write code to fetch the data from Adobe and save it to ADLS.
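As a rough sketch of that fetch-and-land step (run from a Databricks notebook, where dbutils is available): the Adobe endpoint, secret scope names, and ADLS paths below are placeholders, not Adobe's actual API.

```python
import requests

# Placeholder values -- replace with your real Adobe export endpoint, auth, and ADLS paths.
ADOBE_EXPORT_URL = "https://adobe.example.com/exports/latest"          # hypothetical endpoint
ADOBE_TOKEN = dbutils.secrets.get(scope="adobe", key="api_token")      # keep credentials in a secret scope
LANDING_PATH = "abfss://landing@<storage-account>.dfs.core.windows.net/adobe/raw/"

# 1. Pull the export file from Adobe over HTTPS (egress leaves via your NAT gateway / UDR).
resp = requests.get(
    ADOBE_EXPORT_URL,
    headers={"Authorization": f"Bearer {ADOBE_TOKEN}"},
    timeout=300,
)
resp.raise_for_status()

# 2. Write the raw payload to the landing zone in ADLS.
#    dbutils.fs.put works for text payloads; for large or binary files, stream to a
#    local temp file and copy it with dbutils.fs.cp instead.
dbutils.fs.put(f"{LANDING_PATH}adobe_export.json", resp.text, overwrite=True)
```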
If your source data is in one of the following formats, I recommend using Auto Loader:
avro : Avro files
binaryFile : Binary files
csv : CSV files
json : JSON files
orc : ORC files
parquet : Parquet files
text : TXT files
xml : XML files
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. It provides a Structured Streaming source called cloudFiles. To keep it simple: it automatically detects that new files have arrived in the data lake and processes only the new files (with exactly-once semantics).
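For example, a minimal Auto Loader stream for one of your two Adobe files might look like this (paths, schema location, and target table names are illustrative; since the two files have different schemas, you would run one stream per schema):

```python
# Auto Loader: incrementally pick up new files from the landing path and append them
# to a Bronze Delta table.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")   # match the source format (csv, parquet, ...)
    .option("cloudFiles.schemaLocation",
            "abfss://landing@<storage-account>.dfs.core.windows.net/_schemas/adobe_file_a")
    .load("abfss://landing@<storage-account>.dfs.core.windows.net/adobe/raw/file_a/")
    .writeStream
    .option("checkpointLocation",
            "abfss://landing@<storage-account>.dfs.core.windows.net/_checkpoints/adobe_file_a")
    .trigger(availableNow=True)            # process whatever is new, then stop
    .toTable("bronze.adobe_file_a"))
```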
You can connect Auto Loader with a file arrival trigger. When new files arrive in the storage account, an event is generated that automatically starts the workflow, which processes the new files using the Auto Loader mechanism described above.
Trigger jobs when new files arrive - Azure Databricks | Microsoft Learn
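The file arrival trigger can be configured in the Jobs UI or via the Jobs API. Below is a rough sketch that calls the REST endpoint directly; the workspace URL, secret scope, notebook path, cluster settings, and storage URL are all placeholders, and the payload should be verified against the current Jobs API reference before use.

```python
import requests

DATABRICKS_HOST = "https://<workspace-url>"
TOKEN = dbutils.secrets.get(scope="databricks", key="pat")   # PAT or service principal token

job_spec = {
    "name": "adobe-autoloader-ingest",
    "tasks": [{
        "task_key": "ingest",
        "notebook_task": {"notebook_path": "/Repos/ingest/adobe_autoloader"},  # hypothetical notebook
        "job_cluster_key": "ingest_cluster",
    }],
    "job_clusters": [{
        "job_cluster_key": "ingest_cluster",
        "new_cluster": {
            "spark_version": "15.4.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 1,
        },
    }],
    # File arrival trigger: run the job whenever new files land in this location.
    # The path typically needs to be registered as a Unity Catalog external location or volume.
    "trigger": {
        "file_arrival": {"url": "abfss://landing@<storage-account>.dfs.core.windows.net/adobe/raw/"},
        "pause_status": "UNPAUSED",
    },
}

resp = requests.post(f"{DATABRICKS_HOST}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=job_spec)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```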
a week ago
Hello @Pratikmsbsvm
Good day
Here is the design for your requirements.
[ SAP / Salesforce / Adobe ]
            │
            ▼
Ingestion Layer (via ADF / Synapse / Partner Connectors / REST API)
            │
            ▼
┌────────────────────────────┐
│  Azure Data Lake Gen2      │  (Storage layer - centralized)
│  + Delta Lake for ACID     │
└────────────────────────────┘
            │
            ▼
Azure Databricks (Primary Workspace)
  ├─ Bronze: Raw Data
  ├─ Silver: Cleaned & Transformed
  └─ Gold: Aggregated / Business Logic Applied
            │
            ├──> Load to Hightouch / Mad Mobile (via REST APIs / Hightouch Sync)
            └──> Share curated Delta Tables to Other Databricks Workspace (via Delta Sharing or External Table Mount)
Use Azure Data Factory or partner connectors (like Fivetran, which we often use in our projects) to ingest data from:
SAP → via OData / RFC connectors
Salesforce → via REST / Bulk API
Adobe → via API or S3 data export
Store all raw and processed data in ADLS Gen2, in Delta Lake format
Organize the Lakehouse zones (a minimal sketch follows this list):
Bronze: Raw ingested files
Silver: Cleaned & de-duplicated
Gold: Ready for consumption (BI / API sync)
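A condensed PySpark sketch of how those zones can be chained with Delta tables; the table names, business key, and aggregation below are placeholders, not part of the original design.

```python
from pyspark.sql import functions as F

# Bronze -> Silver: clean and de-duplicate the raw ingested data.
bronze = spark.table("bronze.adobe_file_a")
silver = (bronze
    .dropDuplicates(["record_id"])                     # hypothetical business key
    .withColumn("ingest_date", F.to_date("ingest_ts")))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.adobe_file_a")

# Silver -> Gold: aggregate for consumption (BI / API sync).
gold = (spark.table("silver.adobe_file_a")
    .groupBy("ingest_date")
    .agg(F.count("*").alias("record_count")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.adobe_daily_counts")
```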
Securely share Delta tables from one workspace to another without copying data
Works across different cloud accounts
Use Unity Catalog (if available) for fine-grained access control
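On the consuming side, a shared table can be read either through Unity Catalog (Databricks-to-Databricks sharing) or with the open delta-sharing connector. Both lines below use placeholder share, catalog, and table names.

```python
# Databricks-to-Databricks sharing: the provider's share appears as a catalog
# in the consumer workspace.
df = spark.table("shared_catalog.gold.adobe_daily_counts")   # placeholder catalog.schema.table

# Open sharing protocol (any Spark / pandas client) via the delta-sharing connector:
# pip install delta-sharing
import delta_sharing
pdf = delta_sharing.load_as_pandas("/path/to/config.share#my_share.gold.adobe_daily_counts")
```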
Encrypt data at rest (ADLS) and in transit
Use service principals or managed identities for secure access between services
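For example, direct ABFS access with a service principal can be configured as below. The preferred pattern today is a Unity Catalog storage credential / external location, but the Spark conf route still works; the storage account, tenant ID, and secret scope names are placeholders.

```python
storage_account = "<storage-account>"

# OAuth (client credentials) access to ADLS Gen2 with a service principal.
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="adls", key="sp-client-id"))
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="adls", key="sp-client-secret"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
```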
Sources → Ingestion → Delta Lakehouse → Destinations
[SAP, SFDC, Adobe]   [ADF, APIs]   [Bronze, Silver, Gold]   [Hightouch, Mad Mobile, Other DBX]
                                                                      ▲
                                                                      │
                                  Cross-Workspace Access (Delta Sharing / Mounting / Jobs)
Let me know if this helps