<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Solution Design Recommendation on Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/solution-design-recommendation-on-databricks/m-p/131975#M49306</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/184483"&gt;@tyhatwar785&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1. Should metadata and file download be separate jobs/notebooks or combined?&lt;BR /&gt;Keep them in separate notebooks, but orchestrate them as tasks within a single Databricks Job. Separate tasks give you better error handling and per-task retries.&lt;/P&gt;&lt;P&gt;2. Cluster recommendations&lt;BR /&gt;Start with a general-purpose cluster (e.g., Standard_DS4_v2: 28 GB memory, 8 vCPUs) with autoscaling enabled.&lt;/P&gt;&lt;P&gt;3. Parallelism&lt;BR /&gt;If all processing runs inside Databricks, use Spark parallelism to distribute the downloads across the cluster; for purely I/O-bound API calls on the driver, Python async (aiohttp) is a lighter-weight alternative.&lt;/P&gt;&lt;P&gt;4. Best practices&lt;/P&gt;&lt;P&gt;Retries: Use Databricks job-level retries, and add custom retry logic (e.g., with exponential backoff) in your download code or UDF.&lt;/P&gt;&lt;P&gt;Error handling: Use Python’s try/except with structured logging (the logging library) for better observability.&lt;/P&gt;&lt;P&gt;Monitoring: Integrate with Databricks Lakehouse Monitoring, or send metrics/logs to an external monitoring tool.&lt;/P&gt;</description>
    <pubDate>Mon, 15 Sep 2025 12:20:08 GMT</pubDate>
    <dc:creator>nikhilmohod-nm</dc:creator>
    <dc:date>2025-09-15T12:20:08Z</dc:date>
    <item>
      <title>Solution Design Recommendation on Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/solution-design-recommendation-on-databricks/m-p/131935#M49295</link>
      <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;We need to design a pipeline in Databricks to:&lt;/P&gt;&lt;P&gt;1. Call a metadata API (returns XML per keyword), parse, and consolidate into a combined JSON.&lt;/P&gt;&lt;P&gt;2. Use this metadata to generate dynamic links for a second API, download ZIPs, unzip, and extract specific HTML files into ADLS.&lt;/P&gt;&lt;P&gt;Looking for suggestions on: Solution design – should metadata and file download be separate jobs/notebooks or combined?&lt;/P&gt;&lt;P&gt;Cluster recommendations – what type/size of cluster is suitable for this workload?&lt;/P&gt;&lt;P&gt;Parallelism – should we use Python async (aiohttp) or Spark parallelism for faster execution?&lt;/P&gt;&lt;P&gt;Best practices – retries, error handling, checkpointing for flaky APIs. Would appreciate guidance on how to design this efficiently.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2025 07:12:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/solution-design-recommendation-on-databricks/m-p/131935#M49295</guid>
      <dc:creator>tyhatwar785</dc:creator>
      <dc:date>2025-09-15T07:12:24Z</dc:date>
    </item>
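The unzip-and-extract step described in point 2 of the question can be sketched with Python's standard library. The function name and the in-memory approach are illustrative assumptions; the actual HTTP download of the ZIP and the write to ADLS are omitted from this sketch:

```python
import io
import zipfile


def extract_html_members(zip_bytes: bytes) -> dict:
    """Return {member_name: content} for the .html files inside a ZIP payload.

    zip_bytes would come from the second API's ZIP download; persisting the
    extracted files to ADLS is left out of this sketch.
    """
    extracted = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            # Keep only the HTML members; skip everything else in the archive.
            if name.lower().endswith(".html"):
                extracted[name] = zf.read(name)
    return extracted
```

Because the ZIP is handled entirely in memory, no temporary files are needed on the cluster's local disk for small archives.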
    <item>
      <title>Re: Solution Design Recommendation on Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/solution-design-recommendation-on-databricks/m-p/131975#M49306</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/184483"&gt;@tyhatwar785&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1. Should metadata and file download be separate jobs/notebooks or combined?&lt;BR /&gt;Keep them in separate notebooks, but orchestrate them as tasks within a single Databricks Job. Separate tasks give you better error handling and per-task retries.&lt;/P&gt;&lt;P&gt;2. Cluster recommendations&lt;BR /&gt;Start with a general-purpose cluster (e.g., Standard_DS4_v2: 28 GB memory, 8 vCPUs) with autoscaling enabled.&lt;/P&gt;&lt;P&gt;3. Parallelism&lt;BR /&gt;If all processing runs inside Databricks, use Spark parallelism to distribute the downloads across the cluster; for purely I/O-bound API calls on the driver, Python async (aiohttp) is a lighter-weight alternative.&lt;/P&gt;&lt;P&gt;4. Best practices&lt;/P&gt;&lt;P&gt;Retries: Use Databricks job-level retries, and add custom retry logic (e.g., with exponential backoff) in your download code or UDF.&lt;/P&gt;&lt;P&gt;Error handling: Use Python’s try/except with structured logging (the logging library) for better observability.&lt;/P&gt;&lt;P&gt;Monitoring: Integrate with Databricks Lakehouse Monitoring, or send metrics/logs to an external monitoring tool.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2025 12:20:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/solution-design-recommendation-on-databricks/m-p/131975#M49306</guid>
      <dc:creator>nikhilmohod-nm</dc:creator>
      <dc:date>2025-09-15T12:20:08Z</dc:date>
    </item>
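The custom retry logic recommended in the reply can be sketched as a small Python helper for flaky API calls. The function name, attempt count, and backoff values are illustrative assumptions, not part of the original post:

```python
import random
import time


def call_with_retries(fn, max_attempts=4, base_delay=1.0):
    """Call a flaky zero-argument function, retrying with exponential backoff.

    Jitter is added to each delay so parallel workers don't retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the job.
            # Wait base_delay * 2^(attempt - 1), plus jitter.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))
```

A Databricks job-level retry on the task then acts as a second safety net for failures that persist beyond these in-process attempts.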
  </channel>
</rss>

