Hi Team,
We need to design a pipeline in Databricks to:
1. Call a metadata API (returns XML per keyword), parse the responses, and consolidate them into a combined JSON (first sketch below).
2. Use this metadata to generate dynamic links for a second API, download the ZIPs, unzip them, and extract specific HTML files into ADLS (second sketch below).
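For step 1, a minimal sketch of what we currently have in mind (the endpoint, keyword list, XML tag names, and mount path below are placeholders, not our real values):

```python
import json
import requests
import xml.etree.ElementTree as ET

METADATA_API = "https://example.com/metadata"          # placeholder endpoint
KEYWORDS = ["kw1", "kw2"]                              # placeholder keyword list
OUTPUT_PATH = "/dbfs/mnt/adls/metadata/combined.json"  # ADLS mounted via DBFS (placeholder)

def fetch_metadata(keyword: str) -> dict:
    """Call the metadata API for one keyword and parse its XML response."""
    resp = requests.get(METADATA_API, params={"keyword": keyword}, timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.text)
    # "record" and its child tags are illustrative; the real schema differs.
    return {
        "keyword": keyword,
        "records": [{c.tag: c.text for c in rec} for rec in root.findall(".//record")],
    }

# Consolidate all keywords into one combined JSON document on ADLS.
combined = [fetch_metadata(kw) for kw in KEYWORDS]
with open(OUTPUT_PATH, "w") as f:
    json.dump(combined, f)
```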
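Step 2 would then be roughly this shape (the link template and the "id" field used to build it are hypothetical; ours come from the metadata above):

```python
import io
import json
import os
import zipfile
import requests

DOWNLOAD_URL = "https://example.com/files/{file_id}.zip"  # placeholder link template
ADLS_DIR = "/dbfs/mnt/adls/html/"                         # placeholder target directory
os.makedirs(ADLS_DIR, exist_ok=True)

with open("/dbfs/mnt/adls/metadata/combined.json") as f:
    combined = json.load(f)

for entry in combined:
    for rec in entry["records"]:
        url = DOWNLOAD_URL.format(file_id=rec["id"])  # "id" is a hypothetical field
        resp = requests.get(url, timeout=120)
        resp.raise_for_status()
        # Unzip in memory and keep only the HTML members we need.
        with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
            for name in zf.namelist():
                if name.endswith(".html"):
                    with zf.open(name) as src, open(ADLS_DIR + os.path.basename(name), "wb") as dst:
                        dst.write(src.read())
```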
Looking for suggestions on:
1. Solution design – should metadata collection and file download be separate jobs/notebooks, or combined?
2. Cluster recommendations – what type/size of cluster is suitable for this workload?
3. Parallelism – should we use Python async (aiohttp) or Spark parallelism for faster execution?
4. Best practices – retries, error handling, and checkpointing for flaky APIs.

Would appreciate guidance on how to design this efficiently.
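For concreteness on point 3, the aiohttp option we're weighing looks something like this (URL list and concurrency cap are placeholders; we'd tune the cap to whatever the API tolerates):

```python
import asyncio
import aiohttp

URLS = ["https://example.com/files/1.zip"]  # placeholder; built from the metadata in practice
CONCURRENCY = 10                            # arbitrary cap on in-flight requests

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str) -> bytes:
    # The semaphore keeps us from hammering the API with unbounded requests.
    async with sem:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.read()

async def download_all(urls: list[str]) -> list[bytes]:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

results = asyncio.run(download_all(URLS))
```

The Spark alternative we've considered is distributing the URL list with sc.parallelize(urls).foreachPartition(...) and downloading inside the partitions, but that ties API throughput to cluster size, which is part of why we're unsure.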
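On point 4, our default would be plain exponential backoff on transient failures, plus a small manifest of completed downloads in ADLS for checkpointing; attempt counts and delays here are arbitrary:

```python
import time
import requests

def get_with_retries(url: str, max_attempts: int = 5, base_delay: float = 1.0) -> requests.Response:
    """Retry timeouts, connection errors, and 5xx responses with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code < 500:
                resp.raise_for_status()  # 4xx fails immediately; retrying won't help
                return resp
            # 5xx: treat as transient and fall through to the backoff sleep
        except (requests.ConnectionError, requests.Timeout):
            pass  # transient network failure; retry
        if attempt == max_attempts:
            raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
        time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

For checkpointing we'd write the list of completed keywords/URLs to a small JSON manifest in ADLS and skip anything already listed on rerun; happy to hear if there's a more idiomatic pattern.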
Thanks!