Hello everyone,
I'm working on a translation project involving documents up to 100 pages long, in 17 different languages, and I'm looking for the best approach to achieve high-quality translations in this multilingual context.
Single model vs. multi-model approach
- Is it better to use a single multilingual model or to train separate models for each source language?
- If I go with a single model, is it possible to progressively add each new language by retraining the model multiple times without losing the ability to translate into previously trained languages?
- Lastly, if I'm using the same source language, can I train the model to translate into multiple target languages without needing a dedicated model for each source-target combination?
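To make the last question concrete, here is a minimal sketch of what I have in mind for a single model with several target languages: each training example carries a target-language tag in front of the source text, so one model can learn which language to produce. The helper names (`tag_source`, `build_examples`) and the `>>fr<<` tag format are my own assumptions for illustration, loosely following the convention some multilingual MT models use; they are not taken from the Databricks article or from run_translation.py.

```python
from typing import Dict, List


def tag_source(text: str, target_lang: str) -> str:
    """Prepend a target-language tag to the source text (hypothetical format)."""
    return f">>{target_lang}<< {text}"


def build_examples(source_text: str, translations: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn one source sentence and its translations into tagged training pairs.

    `translations` maps a target-language code to the reference translation.
    """
    return [
        {"source": tag_source(source_text, lang), "target": target}
        for lang, target in sorted(translations.items())
    ]


if __name__ == "__main__":
    pairs = build_examples(
        "Hello everyone",
        {"fr": "Bonjour à tous", "es": "Hola a todos"},
    )
    for pair in pairs:
        print(pair["source"], "->", pair["target"])
```

If this kind of tagging is the right idea, adding a target language would mostly mean adding more tagged pairs to the same dataset rather than training a new model per pair, but I'd appreciate confirmation.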
Model
I'm planning to use Databricks to train the model, following the advice in this article (Fine-Tuning Large Language Models) and using Hugging Face's translation script, run_translation.py. Would this approach be effective for achieving quality translations across a wide range of languages?
Using Databricks functions for common languages
Databricks offers a built-in translation function (ai_translate), but it currently only supports translations between French, English, and Spanish. If one of these languages matches my translation requirements, would it make sense to prioritize this solution? Is it potentially more effective than tools like DeepL, which haven't fully met my client's expectations?
Thanks in advance for any advice and insights on the best approach to take!