Greetings @Maylin,

A single many-to-many multilingual model, fine-tuned jointly across your 17 languages, is usually the best trade-off between quality, scalability, and operational simplicity; combine it with lightweight adapters to preserve quality and add languages over time. Use built-in translation functions for any supported pairs as strong baselines, and run a focused evaluation against services like DeepL on your domain data before committing.
Model strategy
- Multilingual NMT can match or exceed bilingual systems for many directions, especially when low- and mid-resource languages benefit from transfer learning.
- One model covers all language pairs, reducing deployment/monitoring complexity versus maintaining many separate models.
Adding languages over time
- Naively fine-tuning on a new language risks catastrophic forgetting; mitigate with replay (mixing in prior-language data), regularization (e.g., EWC or L2), and adapter/LoRA layers (see the sketch after this list).
- Use conservative learning rates, early stopping, and per-language validation to detect regressions when you introduce new languages.
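To make that concrete, here is a minimal sketch of adapters plus replay using Hugging Face peft and datasets. The checkpoint, the JSONL paths, the target modules, and the 80/20 mixing ratio are all assumptions to adapt to your setup:

```python
# pip install transformers peft datasets
from datasets import interleave_datasets, load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import M2M100ForConditionalGeneration

# Illustrative checkpoint; substitute your own multilingual baseline.
base = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

# LoRA adapters: only the small adapter matrices train, so the shared
# multilingual backbone is largely shielded from catastrophic forgetting.
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in M2M100
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()

# Replay: mix ~20% prior-language data into the new-language stream.
# The JSONL paths are hypothetical placeholders for your prepared corpora.
new_lang_ds = load_dataset("json", data_files="new_language.jsonl", split="train")
replay_ds = load_dataset("json", data_files="prior_languages.jsonl", split="train")
train_ds = interleave_datasets(
    [new_lang_ds, replay_ds], probabilities=[0.8, 0.2], seed=42
)
# train_ds then feeds your usual Seq2SeqTrainer / Accelerate loop.
```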
Multiple targets per source
- Many-to-many models naturally translate one source into multiple targets using language tags/prompts, with no dedicated model per pair (see the sketch after this list).
- Shared representations encourage positive transfer between related targets and improve consistency.
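As an illustration of the tagging mechanism, a single M2M100 checkpoint (an assumption here; mBART-style models work similarly with their own tag conventions) can serve several targets from one source just by switching the forced target-language token:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

name = "facebook/m2m100_418M"  # illustrative many-to-many checkpoint
tokenizer = M2M100Tokenizer.from_pretrained(name)
model = M2M100ForConditionalGeneration.from_pretrained(name)

tokenizer.src_lang = "en"
encoded = tokenizer("The shipment arrives on Friday.", return_tensors="pt")

# Same weights for every direction: the target language is selected by
# forcing its language tag as the first decoder token.
for tgt in ["de", "fr", "ja"]:
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt)
    )
    print(tgt, tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```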
Training on Databricks with HF scripts
- The Hugging Face translation script is a solid foundation; on Databricks, pair it with distributed training (e.g., Accelerate/DeepSpeed), MLflow tracking, and robust data pipelines.
- Key practices: balanced/temperature sampling across languages (sketched below), a shared tokenizer (e.g., SentencePiece), and per-direction metrics (BLEU/chrF/COMET) with stratified validation sets.
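For the temperature-sampling piece, the usual formula weights each language by its corpus size raised to 1/T; a small sketch with made-up corpus sizes:

```python
import numpy as np

def temperature_probs(example_counts, T=5.0):
    """Sampling probabilities p_i = n_i**(1/T) / sum_j n_j**(1/T).

    T=1 reproduces size-proportional sampling; larger T flattens the
    distribution so low-resource languages are seen more often.
    """
    counts = np.array(list(example_counts.values()), dtype=float)
    weights = counts ** (1.0 / T)
    probs = weights / weights.sum()
    return dict(zip(example_counts, probs))

# Illustrative corpus sizes per language pair
sizes = {"en-de": 4_000_000, "en-lv": 300_000, "en-is": 50_000}
print(temperature_probs(sizes, T=5.0))
```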
Handling long documents
- Translate at sentence level with smart segmentation, but include a small preceding/following context window to improve cohesion, pronouns, and terminology (one chunking approach is sketched after this list).
- Add a document-level QA pass for terminology consistency, named entities, and formatting; consider glossary/term constraints and post-editing for high-stakes content.
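A pragmatic approximation of that context window is to translate small multi-sentence chunks rather than isolated sentences, so the model always sees local context. A rough sketch; the checkpoint is illustrative, and the naive regex splitter should be replaced by a proper segmenter in production:

```python
import re
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

name = "facebook/m2m100_418M"  # illustrative checkpoint
tokenizer = M2M100Tokenizer.from_pretrained(name)
model = M2M100ForConditionalGeneration.from_pretrained(name)

def split_sentences(text):
    # Naive splitter for illustration only.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def translate_document(text, src="en", tgt="de", window=3):
    # Translate in small multi-sentence chunks so pronouns and terminology
    # benefit from the surrounding sentences within each chunk.
    tokenizer.src_lang = src
    sentences = split_sentences(text)
    out = []
    for i in range(0, len(sentences), window):
        chunk = " ".join(sentences[i:i + window])
        enc = tokenizer(chunk, return_tensors="pt", truncation=True)
        gen = model.generate(**enc, forced_bos_token_id=tokenizer.get_lang_id(tgt))
        out.append(tokenizer.batch_decode(gen, skip_special_tokens=True)[0])
    return " ".join(out)
```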
Using Databricks ai_translate
- If your language pairs are supported, use it for rapid, production-grade baselining and batch workflows within Databricks (see the notebook sketch after this list).
- Even then, validate on your own samples with automatic scores and human review to check adequacy, style, and consistency.
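In a Databricks notebook (where `spark` and `display` are predefined), a batch baseline can be a single SQL expression. The catalog, table, and column names below are hypothetical, and availability of `ai_translate` depends on your workspace and region:

```python
# Assumption: ai_translate is enabled in this workspace and the target
# language is supported; table and column names are placeholders.
translated = spark.sql("""
    SELECT
      doc_id,
      body,
      ai_translate(body, 'es') AS body_es
    FROM main.docs.source_documents
""")
display(translated)
```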
DeepL vs bespoke models
- DeepL is strong on many European pairs, but performance varies by language and domain; your client's content and style requirements are decisive.
- Run a bake-off on representative documents comparing ai_translate (where supported), DeepL, and your fine-tuned multilingual model; score with COMET (sketched after this list) and targeted human review.
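For the automatic side of the bake-off, the unbabel-comet package handles reference-based scoring; the record below is a made-up example, and in practice you would build one record per segment per system:

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

ckpt = download_model("Unbabel/wmt22-comet-da")  # reference-based COMET model
scorer = load_from_checkpoint(ckpt)

# One record per segment per system; these strings are illustrative.
data = [
    {
        "src": "Der Vertrag endet am 31. März.",
        "mt": "The contract ends on March 31.",
        "ref": "The contract expires on 31 March.",
    },
]
result = scorer.predict(data, batch_size=8, gpus=0)
print(result.system_score)  # corpus-level score
print(result.scores)        # per-segment scores
```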
Recommended plan
- Start from a high-quality multilingual baseline and jointly fine-tune on all 17 languages using temperature sampling and balanced batches.
- Use per-language or language-cluster adapters/LoRA; when adding new languages, adopt replay + regularization to prevent forgetting.
- Build a doc pipeline: sentence segmentation, small context windows, terminology glossaries, and a document-level QA pass.
- Execute an evaluation bake-off (automatic + human) on your domain data; choose based on quality, throughput, cost, and governance.
- Operationalize on Databricks with distributed training, MLflow dashboards per language/direction (a minimal logging sketch follows), and continuous quality monitoring.
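For those per-direction dashboards, one lightweight pattern is a metric key per language pair; `direction_metrics` below is a stand-in for whatever your evaluation job produces:

```python
import mlflow

# Placeholder scores; your eval job would compute these per direction.
direction_metrics = {
    ("en", "de"): {"bleu": 31.2, "chrf": 58.4},
    ("en", "lv"): {"bleu": 22.7, "chrf": 49.1},
}

with mlflow.start_run(run_name="mt-eval"):
    for (src, tgt), metrics in direction_metrics.items():
        for name, value in metrics.items():
            # One metric key per language direction, e.g. "bleu_en-de"
            mlflow.log_metric(f"{name}_{src}-{tgt}", value)
```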
Hope this helps, Louis.