02-06-2025 06:07 AM
Hi, I came across the blog "Deploying DeepSeek-R1-Distilled-Llama Models on Databricks" at https://www.databricks.com/blog/deepseek-r1-databricks
I am new to using custom models that are not available as part of the foundation models.
According to the blog, I need to download a DeepSeek distilled model from Hugging Face to my volume, register it in MLflow, and serve it with provisioned throughput. Can someone help me with the following questions?
- If I want to download the 70B model, the recommended compute is g6e.4xlarge, which has 128GB CPU memory and 48GB GPU memory. To clarify, do I need this specific compute only for the MLflow registration of the model? Additionally, the blog states: "You don't need GPUs per se to deploy the model within the notebook, as long as the compute has sufficient memory capacity." Does this refer to serving the model only? Or can I complete both the MLflow registration and the serving deployment using a compute instance with 128GB CPU memory and no GPU?
- For provisioned throughput, when I select my registered model for serving, what will my per-hour usage pricing be? Will deepseek-r1-distilled-llama-70b be priced the same as Llama 3.3 70B, and deepseek-r1-distilled-llama-8b the same as Llama 3.1 8B, as listed at https://www.databricks.com/product/pricing/foundation-model-serving, or will the pricing be different?
- For custom RAG chains or agent models, I have seen the option to select a compute type such as CPU, GPU Small, etc. Will that be the case for my distilled model, or does point 2 apply? If so, what would be the recommendation for the 70B and 8B variants? Attaching a screenshot.
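For context on the memory question in point 1, here is the rough back-of-the-envelope calculation I was working from (a sketch, not from the blog; it assumes fp16/bf16 weights at 2 bytes per parameter and ignores KV cache, activations, and framework overhead):

```python
# Approximate memory needed just to hold a model's weights.
# Assumption: weights in fp16/bf16 (2 bytes per parameter); this ignores
# KV cache, activations, and framework overhead, so real usage is higher.

def weight_memory_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough GB required to hold the raw weights in memory."""
    return num_params_billion * bytes_per_param  # billions of params * bytes each = GB

print(weight_memory_gb(8))   # 8B model: ~16 GB of weights
print(weight_memory_gb(70))  # 70B model: ~140 GB of weights
```

By that estimate, the 70B weights alone (~140 GB) already exceed both the 48GB GPU memory and the 128GB CPU memory of a single g6e.4xlarge, which is part of why I'm unclear on what compute is actually required at each step.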
Thanks
(Posted on the wrong board originally; I wasn't able to move or delete it, so I recreated the same question here.)
Accepted Solutions
02-07-2025 01:27 AM