02-06-2025 05:11 AM
Hi, I have read the blog "Deploying Deepseek-R1-Distilled-Llama Models on Databricks" at https://www.databricks.com/blog/deepseek-r1-databricks
I am new to using custom models that are not available as part of the foundation models.
According to the blog, I need to download a DeepSeek distilled model from Hugging Face to my volume, register it with MLflow, and serve it using provisioned throughput. Can someone help me with the following questions?
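For context, here is my rough understanding of the download-and-register step from the blog. This is just a sketch based on my reading; the volume path and catalog/schema/model names are placeholders I made up:

```python
# Sketch of the download + MLflow registration steps (paths and names are placeholders).
from huggingface_hub import snapshot_download
import mlflow
from transformers import AutoModelForCausalLM, AutoTokenizer

volume_path = "/Volumes/main/default/deepseek"  # placeholder UC volume path

# 1) Download the distilled model weights from Hugging Face into the volume
snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    local_dir=volume_path,
)

# 2) Load the model and tokenizer, then log and register them in Unity Catalog
tokenizer = AutoTokenizer.from_pretrained(volume_path)
model = AutoModelForCausalLM.from_pretrained(volume_path)

mlflow.set_registry_uri("databricks-uc")
with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="model",
        task="llm/v1/chat",  # chat task, as required for provisioned throughput serving
        registered_model_name="main.default.deepseek_r1_distill_llama_70b",  # placeholder
    )
```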
If I want to download the 70B model, the recommended compute is g6e.4xlarge, which has 128 GB of CPU memory and 48 GB of GPU memory. To clarify, do I need this specific compute only for the MLflow registration of the model?
Additionally, the blog states:
"You donโt need GPUs per se to deploy the model within the notebook, as long as the compute has sufficient memory capacity."
Does this refer only to serving the model? Or can I complete both the MLflow registration and the serving deployment using a compute instance with 128 GB of CPU memory and no GPU?
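For the serving step, I assume it would look something like this sketch using the Databricks SDK. The endpoint name, model name, and throughput values are placeholders, not taken from the blog:

```python
# Sketch of creating a provisioned-throughput serving endpoint (names/values are placeholders).
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()

w.serving_endpoints.create(
    name="deepseek-r1-distill-llama-70b",  # placeholder endpoint name
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="main.default.deepseek_r1_distill_llama_70b",  # registered model
                entity_version="1",
                min_provisioned_throughput=0,
                max_provisioned_throughput=9500,  # tokens/sec; pick a band the endpoint UI offers
            )
        ]
    ),
)
```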
Thanks
02-07-2025 12:10 AM
Hi @kbmv ,
Based on my experience deploying Deepseek-R1-Distilled-Llama on Databricks, here are my answers to your questions:
02-07-2025 01:23 AM
Thanks @Isi for the detailed explanation. Things are clear now.