Hi Team,
We need to design a pipeline in Databricks to:
1. Call a metadata API (returns XML per keyword), parse the responses, and consolidate them into a combined JSON (first sketch below).
2. Use this metadata to generate dynamic links for a second API, download the ZIPs, unzip them, and extract specific HTML files into ADLS (second sketch below).
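For step 1, a minimal sketch of what we currently have in mind (the endpoint, keyword list, XML tag names, and mount path below are placeholders, not our real values):

```python
import json
import requests
import xml.etree.ElementTree as ET

METADATA_API = "https://example.com/metadata"          # placeholder endpoint
KEYWORDS = ["kw1", "kw2"]                              # placeholder keyword list
OUTPUT_PATH = "/dbfs/mnt/adls/metadata/combined.json"  # ADLS mounted via DBFS (placeholder)

def fetch_metadata(keyword: str) -> dict:
    """Call the metadata API for one keyword and parse its XML response."""
    resp = requests.get(METADATA_API, params={"keyword": keyword}, timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.text)
    # "record" and its child tags are illustrative; the real schema differs.
    return {
        "keyword": keyword,
        "records": [{c.tag: c.text for c in rec} for rec in root.findall(".//record")],
    }

# Consolidate all keywords into one combined JSON document on ADLS.
combined = [fetch_metadata(kw) for kw in KEYWORDS]
with open(OUTPUT_PATH, "w") as f:
    json.dump(combined, f)
```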
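Step 2 would then be roughly this shape (the link template and the "id" field used to build it are hypothetical; ours come from the metadata above):

```python
import io
import json
import os
import zipfile
import requests

DOWNLOAD_URL = "https://example.com/files/{file_id}.zip"  # placeholder link template
ADLS_DIR = "/dbfs/mnt/adls/html/"                         # placeholder target directory
os.makedirs(ADLS_DIR, exist_ok=True)

with open("/dbfs/mnt/adls/metadata/combined.json") as f:
    combined = json.load(f)

for entry in combined:
    for rec in entry["records"]:
        url = DOWNLOAD_URL.format(file_id=rec["id"])  # "id" is a hypothetical field
        resp = requests.get(url, timeout=120)
        resp.raise_for_status()
        # Unzip in memory and keep only the HTML members we need.
        with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
            for name in zf.namelist():
                if name.endswith(".html"):
                    with zf.open(name) as src, open(ADLS_DIR + os.path.basename(name), "wb") as dst:
                        dst.write(src.read())
```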
Looking for suggestions on:
1. Solution design – should metadata collection and file download be separate jobs/notebooks, or combined?
2. Cluster recommendations – what type/size of cluster is suitable for this workload?
3. Parallelism – should we use Python async (aiohttp) or Spark parallelism for faster execution?
4. Best practices – retries, error handling, and checkpointing for flaky APIs.

Would appreciate guidance on how to design this efficiently.
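For concreteness on point 3, the aiohttp option we're weighing looks something like this (URL list and concurrency cap are placeholders; we'd tune the cap to whatever the API tolerates):

```python
import asyncio
import aiohttp

URLS = ["https://example.com/files/1.zip"]  # placeholder; built from the metadata in practice
CONCURRENCY = 10                            # arbitrary cap on in-flight requests

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str) -> bytes:
    # The semaphore keeps us from hammering the API with unbounded requests.
    async with sem:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.read()

async def download_all(urls: list[str]) -> list[bytes]:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

results = asyncio.run(download_all(URLS))
```

The Spark alternative we've considered is distributing the URL list with sc.parallelize(urls).foreachPartition(...) and downloading inside the partitions, but that ties API throughput to cluster size, which is part of why we're unsure.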
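On point 4, our default would be plain exponential backoff on transient failures, plus a small manifest of completed downloads in ADLS for checkpointing; attempt counts and delays here are arbitrary:

```python
import time
import requests

def get_with_retries(url: str, max_attempts: int = 5, base_delay: float = 1.0) -> requests.Response:
    """Retry timeouts, connection errors, and 5xx responses with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code < 500:
                resp.raise_for_status()  # 4xx fails immediately; retrying won't help
                return resp
            # 5xx: treat as transient and fall through to the backoff sleep
        except (requests.ConnectionError, requests.Timeout):
            pass  # transient network failure; retry
        if attempt == max_attempts:
            raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
        time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

For checkpointing we'd write the list of completed keywords/URLs to a small JSON manifest in ADLS and skip anything already listed on rerun; happy to hear if there's a more idiomatic pattern.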
Thanks!