Volume 637, pages 587–593 (2025)
Abstract
Creating the Babel Fish, a tool that helps individuals translate speech between any two languages, requires advanced technological innovation and linguistic expertise. Although conventional speech-to-speech translation systems composed of multiple subsystems performing translation in a cascaded fashion exist1,2,3, scalable and high-performing unified systems4,5 remain underexplored. To address this gap, here we introduce SEAMLESSM4T (Massively Multilingual and Multimodal Machine Translation), a single model that supports speech-to-speech translation (from 101 to 36 languages), speech-to-text translation (from 101 to 96 languages), text-to-speech translation (from 96 to 36 languages), text-to-text translation (96 languages) and automatic speech recognition (96 languages). Built using a new multimodal corpus of automatically aligned speech translations and other publicly available data, SEAMLESSM4T is one of the first multilingual systems that can translate from and into English for both speech and text. Moreover, it outperforms the existing state-of-the-art cascaded systems, achieving up to 8% and 23% higher BLEU (Bilingual Evaluation Understudy) scores in speech-to-text and speech-to-speech tasks, respectively. Beyond quality, when tested for robustness, our system is, on average, approximately 50% more resilient against background noise and speaker variations in speech-to-text tasks than the previous state-of-the-art systems. We evaluated SEAMLESSM4T on added toxicity and gender bias to assess translation safety. For the former, we included two strategies for added toxicity mitigation, working at either training or inference time. Finally, all contributions in this work are publicly available for non-commercial use to propel further research on inclusive speech translation technologies.
Main
The Babel Fish from The Hitchhiker’s Guide to the Galaxy is a fictional tool that translates between two languages. In the contemporary global landscape, characterized by increasing interconnectivity and mobile sociality, the social imperative to actualize these technologies and facilitate on-demand speech-to-speech translation (S2ST) both in the digital and in the physical worlds has never been greater. Despite the centrality of speech in everyday communication, machine translation (MT) systems today remain text-oriented. See Supplementary Information section I.1 for more details on why speech should be prioritized in MT. Although single, unimodal models such as No Language Left Behind (NLLB)6 pushed text-to-text translation (T2TT) coverage to more than 200 languages, unified S2ST models are far from achieving similar scope or performance. This disparity could be attributed to many causes, but audio data scarcity and modelling constraints remain key obstacles.
Existing S2ST systems have three main shortcomings. First, these systems tend to focus on high-resource languages, leaving many low-resource languages behind. Second, these systems mostly service translation from a source language into English (X–eng), not the reverse (eng–X). Third, most S2ST systems rely heavily on the cascading of several subsystems; for example, automatic speech recognition (ASR) + T2TT + text-to-speech (TTS). Although direct systems exist1,4,5, they do not match the performance of their cascaded counterparts7. See Supplementary Information section I.2 for more details on the current technical landscape.
To address these limitations, we introduce SEAMLESSM4T (Massively Multilingual and Multimodal Machine Translation), a unified system that supports ASR, T2TT, speech-to-text translation (S2TT), text-to-speech translation (T2ST) and S2ST. To build this, we created a corpus of more than 470,000 h of automatically aligned speech translations (SEAMLESSALIGN) using a new sentence embedding space (Sentence-level Multimodal and Language-Agnostic Representations, or SONAR)8. We then combined a filtered subset of this corpus with human-labelled and pseudo-labelled data to develop the first multitasking system that performs S2ST from more than 100 languages into 36 languages, S2TT and ASR into 96 languages, zero-shot T2ST into 36 languages, as well as T2TT for 96 languages (see Table 1 for a comparative overview of language coverage and Supplementary Information section II for more details). Because of the unified architecture of SEAMLESSM4T (Fig. 1), the model can perform T2TT, S2TT or S2ST for non-English directions (X–X) in a zero-shot manner. It can also perform T2ST without being trained explicitly for this task. As a result of pretraining the speech encoder of SEAMLESSM4T on large amounts of unlabelled speech data (see section ‘Unsupervised speech pretraining’), it can handle utterances mixing two or more languages.
Fig. 1: The three main blocks of UNITY2 (S2ST fine-tuning) with its non-autoregressive (NAR) T2U model are shown on the top left. Multitask-UNITY2, with its additional text encoder, is shown on the bottom left. A breakdown of the components of SEAMLESSM4T-V2 (a multitask-UNITY2 model) is shown on the right, with the side panel showing the teacher T2U model used for pseudo-labelling (M4).
To evaluate the quality of outputs of our model, we used several existing metrics spanning tasks and modalities, as well as four main evaluation datasets. For example, we used chrF2++9 for T2TT, BLEU10 for S2TT, ASR-BLEU5 for S2ST and WER for ASR. See Supplementary Table 2 for details. We also tested our models for resilience against background noise and speaker variation, as well as on other fronts for responsible deployment: gender bias, using the MULTILINGUAL HOLISTICBIAS datasets, and added toxicity, using new speech-based metrics (ASR-ETOX and MuTox). We mitigate added toxicity with a filtering strategy at training time and a beam-filtering strategy at inference time11.
Apart from building SEAMLESSM4T, we also discuss the social implications of our work and how it may contribute to greater degrees of world-readiness12 in the long run (see section ‘Social impact and conclusion’). To spur future research, we make the data tools, code and two sizes of SEAMLESSM4T models publicly available for non-commercial use.
In the subsequent sections, we trace the key results in developing our SEAMLESSM4T models. First, we outline our efforts to mine aligned speech and text data, from speech language identification to the mining of aligned speech and text segments with modality-agnostic encoders. Next, we report the main results of our direct translation systems, trained in part with the aforementioned automatically aligned speech data. These results highlight the task versatility of SEAMLESSM4T models, which achieve multilingual state-of-the-art performance in ASR, T2TT, S2TT and S2ST. Then, we support the reported results with human evaluation analysis. Finally, we delineate our efforts to mitigate added toxicity and evaluate the robustness of our models to gender variations.
Data
Training speech translation systems requires labelled data; that is, speech-to-text and speech-to-speech aligned data. However, those resources are very limited for low-resource languages. We build on the multilingual and multimodal embedding space of SONAR8 and a large collection of raw speech and texts, as described in the Methods, to automatically mine aligned resources, complementing existing human-labelled and pseudo-labelled data.
Speech language identification
Processing raw speech from the web involves segmenting utterances into shorter chunks, followed by language identification. Building on an open-source model trained on VoxLingua107 with the ECAPA-TDNN architecture13,14, we developed a new speech-based language identification (LID) model covering all 100 languages featured in this work (see Methods, ‘Audio processing and speech LID’ for more details).
To measure the precision and recall of LID models, we report F1 scores on the test data in Extended Data Table 1. The results are given for the 100 SEAMLESSM4T languages (Overall) and the 79 languages common to SEAMLESSM4T and VoxLingua107 (Intersection). Note that the macro-F1 on all languages for VL107HF is low because 21 languages are not covered by this model. We find that training on the additional languages slightly decreases the overall performance for the common set of languages, a direct consequence of the larger number of closely related languages. For example, Zulu (zul) is very often confused with Nyanja (nya), Igbo (ibo) with Yoruba (yor), and Modern Standard Arabic (arb) with Moroccan Arabic (ary) and Egyptian Arabic (arz). Our model improves classification accuracy (F1 difference greater than 5%) for 17 languages, with an average gain of 14.6% (not counting newly covered languages), while decreasing classification accuracy for 12 languages (with an average loss of 9.8%). We further filtered the data by applying a threshold on the LID score (likelihood). Language-specific thresholds were tuned to maximize the F1 score on the development data. By filtering out 8% of the data, we were able to further increase the F1 score by almost 3%.
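To make the thresholding step concrete, the following minimal sketch shows how language-specific LID-score thresholds could be applied when filtering segments; the threshold values, default value and record structure are hypothetical placeholders, not those used in this work.

```python
# Minimal sketch of LID-score filtering (hypothetical thresholds and data).
# Each prediction carries the predicted language and the LID likelihood score;
# segments scoring below the language-specific threshold are discarded.

from dataclasses import dataclass

@dataclass
class Segment:
    audio_path: str
    predicted_lang: str
    lid_score: float  # likelihood assigned by the LID model

# Hypothetical per-language thresholds, tuned on development data to maximize F1.
LID_THRESHOLDS = {"zul": 0.85, "nya": 0.85, "arb": 0.70, "eng": 0.50}
DEFAULT_THRESHOLD = 0.60

def keep_segment(seg: Segment) -> bool:
    """Return True if the segment passes the language-specific LID threshold."""
    threshold = LID_THRESHOLDS.get(seg.predicted_lang, DEFAULT_THRESHOLD)
    return seg.lid_score >= threshold

segments = [
    Segment("a.wav", "zul", 0.91),
    Segment("b.wav", "zul", 0.42),  # plausibly a confusion with nya; filtered out
]
filtered = [s for s in segments if keep_segment(s)]
print(f"kept {len(filtered)} of {len(segments)} segments")
```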
SONAR text embedding space
To mine automatically aligned translation data from language-identified segments, we rely on language- and modality-agnostic encoders. To this end, we build on the SONAR embedding space developed in ref. 8. Currently, we provide a single text encoder and decoder for 200 languages and speech encoders for 37 languages. The list of 200 languages is identical to the language list of the NLLB project6. In multilingual similarity search, xsim and xsim++15 are two well-known proxy metrics evaluating multilingual embedding spaces for the purpose of mining. As shown in Table 2, SONAR substantially outperforms other popular approaches such as LASER316 or LaBSE17 with lower xsim and xsim++.
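To illustrate what xsim measures, here is a minimal sketch of a cross-lingual similarity-search error rate over paired sentence embeddings; the random vectors stand in for real SONAR encoder outputs, and the plain nearest-neighbour criterion is a simplification of the margin-based scoring typically used in practice.

```python
# Minimal sketch of an xsim-style error rate: for each source embedding,
# retrieve the nearest target embedding by cosine similarity and count an
# error whenever the retrieved index is not the aligned translation.
# Random vectors stand in for real SONAR sentence embeddings.

import numpy as np

def xsim_error_rate(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    # L2-normalize so that the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)            # index of most similar target
    errors = (nearest != np.arange(len(src))).mean()  # aligned pairs share an index
    return float(errors)

rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 1024))           # e.g. English sentence embeddings
tgt = src + 0.1 * rng.normal(size=src.shape)  # noisy stand-ins for translations
print(f"xsim-style error rate: {xsim_error_rate(src, tgt):.3f}")
```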
Furthermore, we evaluated the SONAR text encoders and decoders on T2TT tasks. The average performance over 200 languages is competitive compared with the medium-sized NLLB dense model, despite the replacement of encoder–decoder attention in SONAR with a fixed-size embedding bottleneck (see the T2TT columns of Extended Data Table 2). This result suggests that encoder–decoder attention is not a prerequisite for reasonable translation accuracy. For more details on SONAR, see D2.
Training speech encoders
The speech encoders were trained with a teacher–student approach on speech transcriptions only (see Methods, 'SONAR'). Evaluating iterations of each speech encoder in an end-to-end loop, that is, by mining and then training S2TT or S2ST translation systems, would be compute-intensive. Instead, we connected the speech encoder with the SONAR text decoder and evaluated this zero-shot S2TT system as a proxy for the quality of the encoder. As shown in the S2TT columns of Extended Data Table 2, the SONAR speech encoders compare favourably on the FLORES6 and FLEURS18 datasets with WHISPER-LARGE-V2, a model trained on massive amounts of supervised data. Gaps in accuracy can be observed in some high-resource languages such as German, Russian or Portuguese, but SONAR outperforms WHISPER-LARGE-V2 in several low-resource languages such as Swahili or Bengali (see Supplementary Table 8).
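As an illustration of the teacher–student setup, the sketch below regresses a pooled speech-encoder output onto a frozen text-teacher embedding of the transcription; the toy encoder, the mean-squared-error objective and all tensor shapes are assumptions made for illustration, not the actual SONAR training recipe.

```python
# Minimal sketch of teacher-student training for a speech encoder (PyTorch).
# Assumption: the student's pooled output is regressed onto the frozen
# sentence embedding produced by the text teacher for the transcription.
# The encoder here is a stand-in, not the actual SONAR/speech-encoder module.

import torch
import torch.nn as nn

EMB_DIM = 1024

class ToySpeechEncoder(nn.Module):
    def __init__(self, feat_dim: int = 80, emb_dim: int = EMB_DIM):
        super().__init__()
        self.proj = nn.Linear(feat_dim, emb_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) -> mean-pool over time to a fixed-size vector
        return self.proj(feats).mean(dim=1)

student = ToySpeechEncoder()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

# One hypothetical training step on a batch of (speech features, teacher embeddings).
speech_feats = torch.randn(8, 200, 80)  # stand-in for filterbank features
teacher_emb = torch.randn(8, EMB_DIM)   # frozen text-teacher embeddings of transcripts

optimizer.zero_grad()
student_emb = student(speech_feats)
loss = nn.functional.mse_loss(student_emb, teacher_emb)
loss.backward()
optimizer.step()
print(f"teacher-student MSE loss: {loss.item():.4f}")
```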
SEAMLESSALIGN
The SONAR text and speech encoders were used to mine three types of aligned data: (1) English speech to non-English texts (Sen2Txx); (2) non-English speech to English texts (Sxx2Ten); and (3) non-English speech to English speech (Sxx2Sen). SEAMLESSALIGN provides 202,796 h of audio in Sen2Txx, 239,767 h of audio in Sxx2Ten and 29,161 h of audio in Sxx2Sen. These aligned data were mined from a total of 2.5 million hours of raw audio (of which English is nearly 40%). The SONAR speech encoders were trained on 43,772 h of supervised ASR data. For statistics per language, see Supplementary Table 8.
For the text domain, we use the same data consolidated by the NLLB project6. The amount varies from 33 million and 55 million sentences for low-resource languages such as Maltese and Swahili, respectively, to 22,000 million sentences for English.
Except for Maltese, for which we only had access to a small amount of raw audio, we were able to mine more than 100 h of speech alignments with English speech for all languages. The alignments with English texts reached a thousand hours for most languages and exceeded 10,000 h for high-resource languages. Overall, SEAMLESSALIGN covers 37 languages for a total of 470,000 h.
Adding such large amounts of data to train a multilingual translation system is a substantial computational challenge. As described in the Methods, ‘Modelling’, not all of these data were used for modelling, but only a subset with the highest SONAR alignment scores. As our mined data can help support many different use cases, we open-sourced the meta-data needed to guide its recreation (up to a SONAR threshold of 1.15; see Methods, ‘SpeechAlign’) to allow the community to rebuild SEAMLESSALIGN and use it for their own purposes. The optimal threshold can thus be tuned based on the task, balancing dataset size and alignment quality. Our mining code is also open-sourced in the STOPES library.
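The trade-off between dataset size and alignment quality described above can be sketched as a simple score-threshold filter over mined pairs; the record layout below is hypothetical, and the 1.15 value is the open-sourcing threshold quoted above rather than a recommended training threshold.

```python
# Minimal sketch of score-threshold filtering of mined speech-text pairs.
# The record layout is hypothetical; only the idea of trading dataset size
# against alignment quality via a SONAR-score cut-off is taken from the text.

from typing import Iterable, Iterator

def filter_mined_pairs(pairs: Iterable[dict], threshold: float = 1.15) -> Iterator[dict]:
    """Yield only the pairs whose alignment score is at or above the threshold."""
    for pair in pairs:
        if pair["sonar_score"] >= threshold:
            yield pair

mined = [
    {"audio": "clip_001.wav", "text": "Hola, ¿cómo estás?", "sonar_score": 1.31},
    {"audio": "clip_002.wav", "text": "Unrelated caption",  "sonar_score": 1.02},
]

# A stricter threshold yields a smaller but better-aligned training subset.
for kept in filter_mined_pairs(mined, threshold=1.15):
    print(kept["audio"], kept["sonar_score"])
```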
Modelling multitask translation systems
Combining modelling techniques outlined in the Methods, ‘Modelling’, with additional data from SEAMLESSALIGN (see Methods, ‘Data’), we trained SEAMLESSM4T models in two sizes: large with 2.3B parameters and medium with 1.2B parameters. SEAMLESSM4T-MEDIUM is intended to be an accessible test bed to either fine-tune, improve on or engage in analysis with. We further trained an improved version of the large SEAMLESSM4T, dubbed SEAMLESSM4T-V2, with a better speech encoder (see Methods, ‘Unsupervised speech pretraining’) and a more powerful unit decoder (see section ‘S2ST fine-tuning’). All SEAMLESSM4T models support 96 source languages in the text modality and more than 100 source languages in the speech modality. On the target side, the models can output 96 languages in text form and 35 in speech form. The amount of supervised data per direction and per source (for example, M4 or SEAMLESSALIGN) is detailed in Supplementary Tables 12 and 13. This shows that, for some translation directions and given the lack of supervised data, our models will be evaluated zero-shot.
We evaluated our models on all four supervised tasks (T2TT, ASR, S2TT and S2ST) as well as on the zero-shot task of text-to-speech translation (T2ST, also referred to as cross-lingual text-to-speech synthesis19). To generate text hypotheses, we decoded with beam search. We scored T2TT with chrF2++ and S2TT with BLEU. We measured BLEU with SacreBLEU and provide the signatures in Supplementary Table 2. For ASR, we scored with WER (word error rate) on normalized transcriptions and references, following ref. 20.
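As a minimal sketch of this text-side scoring, the snippet below computes chrF2++ and BLEU with sacrebleu and WER with jiwer on a toy pair; SacreBLEU is the tool named above for BLEU, whereas using it for chrF++ and using jiwer for WER are assumptions made here for illustration.

```python
# Minimal sketch of the text-side scoring used here: chrF2++ for T2TT,
# BLEU for S2TT and WER for ASR.

import sacrebleu
import jiwer

hypotheses = ["the cat sat on the mat"]
references = ["the cat is sitting on the mat"]

# chrF2++ = chrF with beta=2 (sacrebleu default) plus word n-grams up to order 2.
chrf2pp = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
wer = jiwer.wer(references, hypotheses)  # word error rate on the raw toy strings

print(f"chrF2++: {chrf2pp.score:.1f}  BLEU: {bleu.score:.1f}  WER: {wer:.2%}")
```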
During S2ST and T2ST inference, we performed two-pass beam-search decoding: the best hypothesis from the first-pass decoding is embedded with the text decoder and sent to a text-to-unit module (T2U; see section 'S2ST fine-tuning') to search for the best unit-sequence hypothesis. We used a beam width of 5 for both searches. We evaluated S2ST and T2ST accuracy with ASR-BLEU using WHISPER models. We set the decoding temperature of WHISPER to zero and used greedy decoding to ensure deterministic behaviour of the ASR model. The transcribed hypotheses, as well as the references, are normalized following ref. 20 before computing BLEU scores.
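A minimal sketch of this ASR-BLEU protocol, assuming the openai-whisper package and sacrebleu as tooling; the audio path and reference string are placeholders. Greedy, temperature-zero decoding keeps the ASR step deterministic, as described above.

```python
# Minimal sketch of ASR-BLEU for S2ST output into English.

import whisper
import sacrebleu
from whisper.normalizers import EnglishTextNormalizer

asr_model = whisper.load_model("large-v2")
normalize = EnglishTextNormalizer()

generated_audio = "translated_output.wav"      # speech produced by the S2ST model (placeholder)
reference_text = "The weather is nice today."  # target-language reference (placeholder)

# temperature=0.0 with no beam size gives deterministic greedy decoding.
result = asr_model.transcribe(generated_audio, temperature=0.0, beam_size=None)
hypothesis = normalize(result["text"])
reference = normalize(reference_text)

asr_bleu = sacrebleu.corpus_bleu([hypothesis], [[reference]])
print(f"ASR-BLEU: {asr_bleu.score:.1f}")
```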
Comparison with cascaded approaches for speech translation
On the set of languages supported by both SEAMLESSM4T and WHISPER, we compare in Table 3 (S2TT columns) the performance of our direct S2TT model with that of cascaded models, namely combinations of WHISPER ASR models and NLLB T2TT models. SEAMLESSM4T-V2 surpasses the cascaded models with fewer than 3B parameters in X–eng directions by 4.6 BLEU points (from 22.0 to 26.6) and in eng–X directions by 1 BLEU point (from 21.1 to 22.2). We also added to the comparison in Table 3 cascaded models with the large NLLB-3.3B T2TT model. These models exceed 4B parameters and are largely surpassed by SEAMLESSM4T-V2 in X–eng (+3.9); they only marginally outperform SEAMLESSM4T-V2 in eng–X directions, by 0.2 BLEU points.
Compared with previous direct S2TT state-of-the-art models that lagged behind cascaded systems (for example, AUDIOPALM-2-8B-AST; ref. 21), SEAMLESSM4T-V2 improves the FLEURS X–eng S2TT BLEU score by 6.9 points (from 19.7 to 26.6; that is, an improvement of 35%).
Table 3 (S2ST columns) also compares S2ST between SEAMLESSM4T models and cascaded models. For S2ST, we explore two options for cascading: (1) three-stage with ASR, T2TT and TTS and (2) two-stage with S2TT and TTS. Both types of cascaded systems rely on a TTS model to synthesize translated speech; for this, we use YOURTTS22 when synthesizing English speech and MMS23 when synthesizing speech in the 26 non-English comparison languages (the overlap between the languages supported by SEAMLESSM4T and by the TTS systems of MMS). Our SEAMLESSM4T-LARGE outperforms the two-stage cascaded models on FLEURS X–eng directions by 8 ASR-BLEU points (17.8–25.8). It also outperforms stronger three-stage cascaded models (WHISPER-LARGE-V2 + NLLB-3.3B + YOURTTS) by 2.1 ASR-BLEU points (23.7–25.8). The improved SEAMLESSM4T-V2 further strengthens this lead on S2ST FLEURS X–eng with an additional +3.9 ASR-BLEU points (25.8–29.7). On CVSS, SEAMLESSM4T-V2 outperforms the two-stage cascaded model (WHISPER-LARGE-V2 + YOURTTS) by a large margin of 9.6 ASR-BLEU points (29.6–39.2). On FLEURS S2ST eng–X directions, we reduce the evaluation set to the 26 languages supported by both the TTS of MMS and SEAMLESSM4T. The medium-sized model (SEAMLESSM4T-MEDIUM) scores an average ASR-BLEU of 15.8. SEAMLESSM4T-LARGE achieves an average ASR-BLEU of 20.9, and with its improved speech encoder and non-autoregressive T2U model, SEAMLESSM4T-V2 further gains +5.2 ASR-BLEU points (20.9–26.1). By contrast, the best three-stage cascaded system with MMS (WHISPER-LARGE-V2 + NLLB-3.3B + MMS) scores an average 22.7 ASR-BLEU; that is, SEAMLESSM4T-V2 surpasses state-of-the-art cascaded models by 15% (22.7–26.1).
We share in Supplementary Information section IV.1 evaluation results for the tasks of S2TT and S2ST with additional metrics, including our modality-agnostic BLASER 2.0.
Multitasking results
We report in Table 4 results on the FLEURS benchmark for the tasks of ASR and zero-shot T2ST (X–eng and eng–X), and the related FLORES benchmark for T2TT (X–eng and eng–X). In ASR, SEAMLESSM4T-LARGE outperforms WHISPER-LARGE-V220 on the overlapping 77 supported languages with a WER reduction of 46% (from 41.7 to 22.6), whereas SEAMLESSM4T-V2 improves over WHISPER-LARGE-V2 by 56% (from 41.7 to 18.5). We also compared in Supplementary Table 9 against MMS23 on FLEURS-54, a subset of FLEURS languages that MMS and WHISPER both support. SEAMLESSM4T-V2 outperforms the MMS variants evaluated with CTC by more than 38% WER (from 31.0 to 19.1), but it is surpassed by the variants that leverage monolingual n-gram language models (5% better WER with 18.6).
In the T2TT support task, results in Table 4 show that our SEAMLESSM4T models are on par with NLLB-3.3B (ref. 6) in both X–eng and eng–X directions.
We next evaluated SEAMLESSM4T models on the task of T2ST in a zero-shot way. Given that FLEURS collected three recordings by three different native speakers for each sample, we randomly selected one for the task of T2ST (the input being text). We report in Table 4 (the T2ST columns) a comparison between SEAMLESSM4T models and cascaded models with NLLB and either YOURTTS (English TTS) or MMS (non-English TTS) for synthesizing translated text. We averaged ASR-BLEU scores over 88 X–eng directions (the overlap between FLEURS and the languages supported by SEAMLESSM4T). We also averaged ASR-BLEU over 26 eng–X directions (overlap of SEAMLESSM4T with TTS models of MMS). Compared with cascaded models, the zero-shot capability of SEAMLESSM4T-LARGE V2 is on par with NLLB-3.3B + YOURTTS in X–eng and outperforms NLLB-3.3B + MMS by more than +3.9 ASR-BLEU points in eng–X (from 23.7 to 27.6). This result demonstrates that (1) the quality of SEAMLESSM4T on zero-shot T2ST is on par with the supervised tasks and (2) non-English speech source is the most challenging input to translate with our model.
To further understand where the improvements in FLEURS S2TT X–eng directions were coming from, we bucketed languages by resource level (see the exact list of languages in Supplementary Table 12) and report average BLEU scores per resource level in Table 5. The results show that SEAMLESSM4T strongly improves the quality of translation from low-resource languages, with an improvement of +10.2 BLEU (from 18.0 to 28.2, that is, a 57% improvement over AUDIOPALM-2-8B-AST). In the column Low*, we also average over low-resource directions that are supervised in AUDIOPALM-2-8B-AST. The gain of +7.8 BLEU in that subset of directions suggests that the improvement goes beyond sheer supervision and should instead be attributed to the quality of the supervised data and the training recipes.
Automatic and human evaluation
Semantic accuracy in speech translation is generally evaluated with the automatic metric BLEU10 for S2TT or its extension ASR-BLEU for S2ST. Moreover, we use BLASER 2.0 (ref. 24), an extension of BLASER25, which now enables modality-agnostic evaluation and quality estimation for both speech and text.
To complement the utility of automatic metrics, we also relied on extensive human evaluation of our models. In the following, we provide human evaluation for S2TT and S2ST tasks with the XSTS (cross-lingual semantic textual similarity) protocol26 and MOS (mean opinion score) protocol for speech outputs (see Methods, ‘Human evaluation’) on the FLEURS test set. However, we cover a limited number of models (SEAMLESSM4T-LARGE, SEAMLESSM4T-LARGE V2 and a cascaded baseline composed of WHISPER-LARGE-V2 for ASR, NLLB 3.3B for translation and YOURTTS or MMS for TTS for the S2ST task) and translation directions (23 languages from and into English, 10 languages X–eng for MOS) because of resource restrictions.
XSTS scores show that SEAMLESSM4T-LARGE V2 outperforms both the cascaded baseline systems and SEAMLESSM4T-LARGE in terms of both average language-level XSTS score and win rate (the fraction of evaluated languages for which XSTS performance is superior), for all tasks and language directions, with high confidence. For the S2ST task, in which the relative performance of SEAMLESSM4T-V2 was the strongest, the win rate of SEAMLESSM4T-V2 approaches 100% compared with both the cascaded baseline and SEAMLESSM4T-LARGE for both X–eng and eng–X directions, with average language-level XSTS scores about 0.5 points higher than the cascaded baseline and 0.36–0.51 points higher than SEAMLESSM4T-LARGE for eng–X and X–eng, respectively. See Supplementary Tables 22 and 24 for full language-level and summarized XSTS results, respectively.
We also measure the quality of speech output for the S2ST task using the MOS protocol, which assesses (1) sound quality, (2) clarity of speech and (3) naturalness. We find that, across all MOS aspects, SEAMLESSM4T-LARGE V2 generally tends to be preferred to SEAMLESSM4T-LARGE, which in turn tends to be preferred to the cascaded baselines. The exception is X–eng, for which SEAMLESSM4T-LARGE generations are strongly preferred (+1 point average difference between SEAMLESSM4T-LARGE and SEAMLESSM4T-LARGE V2 generations); this unexpected result may be a consequence of differences in model architecture that otherwise improved generation quality. See Supplementary Tables 23 and 24 for full language-level and summarized MOS results, respectively.
Using the XSTS evaluations of the S2ST task, BLASER 2.0 (averaged over all evaluation items in a given language direction) achieves superior Spearman correlations with calibrated language-level XSTS scores in both the X–eng direction (0.845 for BLASER 2.0 compared with 0.74 for ASR-BLEU) and, in particular, the eng–X direction (0.81 for BLASER 2.0 compared with 0.246 for ASR-BLEU). Similar results hold for the S2TT task (see Supplementary Table 21 for full results).
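The metric-to-human correlation reported above can be reproduced in outline with scipy's Spearman rank correlation over language-level scores; the numbers below are placeholders, not the actual evaluation data.

```python
# Minimal sketch of correlating automatic metrics with human XSTS judgements
# at the language level, using Spearman's rank correlation. The score vectors
# below are placeholders, not the values reported in the paper.

from scipy.stats import spearmanr

# One entry per evaluated language direction (hypothetical values).
xsts_scores     = [3.9, 4.1, 3.5, 4.4, 3.2, 4.0]
blaser2_scores  = [3.7, 4.0, 3.3, 4.3, 3.1, 3.9]
asr_bleu_scores = [24.0, 31.5, 18.2, 35.0, 21.0, 15.5]

rho_blaser, _ = spearmanr(blaser2_scores, xsts_scores)
rho_bleu, _ = spearmanr(asr_bleu_scores, xsts_scores)
print(f"Spearman(BLASER 2.0, XSTS) = {rho_blaser:.3f}")
print(f"Spearman(ASR-BLEU, XSTS)   = {rho_bleu:.3f}")
```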
Finally, we tested our models for robustness to noise and speaker variations by creating open robustness benchmarks based on FLEURS (see Methods, 'Robustness'). We find that SEAMLESSM4T-V2 is, on average, approximately 42% and 66% more resilient against background noise and speaker variation, respectively, compared with WHISPER-LARGE-V2 (see full results in Supplementary Information section V.2).
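One common way to construct such a noise-robustness benchmark is to mix background noise into clean utterances at a controlled signal-to-noise ratio; the sketch below shows this mixing step with numpy and is an assumption about the general approach, not the exact benchmark recipe used here.

```python
# Minimal sketch of adding background noise to a clean utterance at a target
# signal-to-noise ratio (SNR), the kind of perturbation used in noise-robustness
# benchmarks. Placeholder random signals stand in for real audio.

import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so that the mixture has the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)  # loop/trim noise to match the speech length
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + noise

rng = np.random.default_rng(0)
clean = rng.normal(scale=0.1, size=16000)   # 1 s of placeholder speech at 16 kHz
babble = rng.normal(scale=0.1, size=8000)   # placeholder background noise
noisy = mix_at_snr(clean, babble, snr_db=10.0)
print(f"mixture RMS: {np.sqrt(np.mean(noisy ** 2)):.4f}")
```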
Responsible AI
Toxicity
Toxicity can be defined as instances of profanity or language that may incite hate, violence or abuse against an individual or a group (such as a religion, race or gender). For massively multilingual toxicity classification of text, the ETOX toolkit appears to be the openly accessible option with the largest language coverage27. In the context of speech translation, we are primarily concerned with added toxicity, that is, the introduction in translations of toxic elements not present in the source utterances. Speech toxicity has been evaluated for English in ref. 28 and, recently, for tens of languages with MuTox29.
Therefore, for multilingual toxicity detection in speech and text, we used ETOX (or ASR-ETOX for S2ST) and MuTox (for both speech and text) as metrics to detect and evaluate added toxicity. For mitigation, we implemented two techniques: before training, we filtered out training pairs with imbalanced toxicity; at inference time, we applied MinTox11 (see Methods, 'Toxicity detection').
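The training-time mitigation (filtering out pairs with imbalanced toxicity) can be sketched with a simple wordlist-based detector in the spirit of ETOX; the wordlists and record layout below are hypothetical placeholders, not the actual resources used in this work.

```python
# Minimal sketch of filtering training pairs with "imbalanced" toxicity:
# a pair is dropped when the target side contains more wordlist matches than
# the source side (added toxicity). The wordlists below are placeholders.

TOXIC_WORDLISTS = {
    "eng": {"badword", "slur"},   # hypothetical entries
    "spa": {"palabrota"},
}

def toxicity_count(text: str, lang: str) -> int:
    """Count wordlist matches in a whitespace-tokenized, lowercased sentence."""
    tokens = text.lower().split()
    return sum(tok in TOXIC_WORDLISTS.get(lang, set()) for tok in tokens)

def has_added_toxicity(src: str, src_lang: str, tgt: str, tgt_lang: str) -> bool:
    return toxicity_count(tgt, tgt_lang) > toxicity_count(src, src_lang)

pairs = [
    ("that movie was terrible", "eng", "esa película fue terrible", "spa"),
    ("that movie was terrible", "eng", "esa palabrota fue terrible", "spa"),  # added toxicity
]
clean_pairs = [p for p in pairs if not has_added_toxicity(*p)]
print(f"kept {len(clean_pairs)} of {len(pairs)} pairs")
```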
We computed added toxicity on two datasets (FLEURS and HOLISTICBIAS27) across 24 translation directions with English (arb, ben, cat, ces, dan, deu, est, fin, fra, hin, ind, ita, nld, pes, pol, por, rus, slk, spa, swh, tgl, tur, urd, vie), using the languages at the intersection of the coverage of our systems and the languages in which MuTox has been benchmarked (note that MuTox has wider language coverage, similar to SONAR, but has been benchmarked in only 30 languages29). See results for more translation directions evaluated with ETOX in Supplementary Information section VI.1. Table 6 shows that, although levels and types of added toxicity vary significantly as a function of language and dataset, the added toxicity in our systems has a relatively low prevalence (<0.4%), consistent across our two toxicity detection metrics. Table 6 also shows that MinTox mitigates added toxicity consistently. The lowest toxicity across all modalities, directions and datasets is consistently obtained with SEAMLESSM4T-V2 + MinTox, achieving reductions of toxicity of up to 5% in terms of MuTox (and up to 80% in terms of ETOX) compared with the same model without MinTox, and of up to 20% in terms of MuTox (up to 90% in terms of ETOX) compared with the baseline (see complete results in Supplementary Information section VI.1).
Gender bias
Gender bias in the context of MT can be defined as errors in grammatical gender determination. This bias may manifest explicitly as an overgeneralization to one gender when translating non-gendered forms into gendered ones (for example, outputs favouring masculine representations), or as a lack of robustness, where the quality of the translation varies for sentences that differ only in gender inflection.
Previous work on this matter is mostly in the text modality30,31,32 and tends to be English-centric, with few demographic axes and multilingual references. Similar efforts for the speech modality remain sparse33,34.
We used MULTILINGUAL HOLISTICBIAS35 and its speech extension to compare the performance of S2TT and S2ST. The eng–X direction enables comparing performance in the presence of masculine or feminine references, and the X–eng direction enables robustness comparisons in translations when we alter gender inflection. A typical example of the English–Spanish language pair would be ‘I’m a homemaker’ and the corresponding translations ‘Soy amo de casa’ and ‘Soy ama de casa’ in Spanish. When translating from English to Spanish, we can measure if the system overgeneralizes to one gender, whereas in the other direction, we can evaluate the robustness of the translation to gender inflection (see Methods, ‘Speech exte