Our preliminary results were presented in (Rajda et al. 2022), and the final results in (Augustyniak et al. 2023), reviewed at NeurIPS’23.
Benchmark results - F1 Macro scores
Models
Model | Inf. time [s] | #params | #langs | base\(^a\) | data | reference |
---|---|---|---|---|---|---|
mT5 | 1.69 | 277M | 101 | T5 | \(CC^b\) | (Xue et al. 2021) |
LASER | 1.64 | 52M | 93 | BiLSTM | \(OPUS^c\) | (Artetxe and Schwenk 2019) |
mBERT | 1.49 | 177M | 104 | BERT | Wiki | (Devlin et al. 2019) |
MPNet** | 1.38 | 278M | 53 | XLM-R | \(OPUS^c\), \(MUSE^d\), \(Wikititles^e\) | (Reimers and Gurevych 2020) |
XLM-R-dist** | 1.37 | 278M | 53 | XLM-R | \(OPUS^c\), \(MUSE^d\), \(Wikititles^e\) | (Reimers and Gurevych 2020) |
XLM-R | 1.37 | 278M | 100 | XLM-R | CC | (Conneau et al. 2020) |
LaBSE | 1.36 | 470M | 109 | BERT | CC, Wiki + mined bitexts | (Feng et al. 2020) |
DistilmBERT | 0.79 | 134M | 104 | BERT | Wiki | (Sanh et al. 2020) |
mUSE-dist** | 0.79 | 134M | 53 | DistilmBERT | \(OPUS^c\), \(MUSE^d\), \(Wikititles^e\) | (Reimers and Gurevych 2020) |
mUSE-transformer* | 0.65 | 85M | 16 | transformer | mined QA + bitexts, SNLI | (Yang et al. 2020) |
mUSE-cnn* | 0.12 | 68M | 16 | CNN | mined QA + bitexts, SNLI | (Yang et al. 2020) |
\* mUSE models were used in their TensorFlow implementation, in contrast to the other models, which were used in torch
\(^a\) The base model is either the monolingual model on which the given model was built or another multilingual model that was adapted
\(^b\) Colossal Clean Crawled Corpus in its multilingual version (mC4)
\(^c\) multiple datasets from the OPUS website (https://opus.nlpl.eu)
\(^d\) bilingual dictionaries from MUSE (https://github.com/facebookresearch/MUSE)
\(^e\) titles of Wikipedia articles in multiple languages only
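Because the table mixes torch-based (sentence-transformers / Hugging Face) and TensorFlow-based (TF Hub) implementations, the sketch below shows one way the per-batch inference times could be measured comparably. It is a minimal illustration under assumptions, not the exact benchmarking script; the model handles (`sentence-transformers/LaBSE`, the mUSE TF Hub URL) and the sample sentences are placeholders.

```python
# Minimal sketch of comparable wall-clock timing for torch- and TensorFlow-based
# encoders. Model handles and sample texts are assumptions, not the benchmark setup.
import time

from sentence_transformers import SentenceTransformer  # torch-based models
import tensorflow_hub as hub                            # mUSE (TensorFlow)

texts = ["This movie was great!", "Der Film war schrecklich."] * 16  # dummy batch

def timed(fn, *args, repeats=10):
    """Return the average wall-clock time of fn(*args) over several runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

# torch-based encoder, e.g. LaBSE via sentence-transformers (assumed handle)
labse = SentenceTransformer("sentence-transformers/LaBSE")
print("LaBSE:", timed(labse.encode, texts), "s per batch")

# TensorFlow-based encoder: multilingual USE from TF Hub (assumed handle)
muse = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")
print("mUSE:", timed(muse, texts), "s per batch")
```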
Results
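All reported numbers are F1 Macro scores, i.e. the unweighted mean of per-class F1 over the sentiment classes, so minority classes count as much as the majority one. A minimal sketch with scikit-learn, where `y_true` and `y_pred` stand for gold and predicted labels from any classifier above (illustrative values, not benchmark data):

```python
# Minimal sketch of the reported metric (F1 Macro); the labels are illustrative,
# not benchmark data.
from sklearn.metrics import f1_score

y_true = ["negative", "neutral", "positive", "positive", "negative", "neutral"]
y_pred = ["negative", "positive", "positive", "positive", "neutral",  "neutral"]

# Macro averaging: compute F1 for each class separately, then take the
# unweighted mean, so rare classes weigh as much as frequent ones.
print(f1_score(y_true, y_pred, average="macro"))
```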
References
Artetxe, Mikel, and Holger Schwenk. 2019. “Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond.” Transactions of the Association for Computational Linguistics 7 (September): 597–610. https://doi.org/10.1162/tacl_a_00288.
Augustyniak, Łukasz, Szymon Woźniak, Marcin Gruza, Piotr Gramacki, Krzysztof Rajda, Mikołaj Morzy, and Tomasz Kajdanowicz. 2023. “Massively Multilingual Corpus of Sentiment Datasets and Multi-Faceted Sentiment Classification Benchmark.” Computing Research Repository arXiv:2306.07902. https://arxiv.org/abs/2306.07902.
Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. “Unsupervised Cross-Lingual Representation Learning at Scale.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8440–51. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.747.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–86. Minneapolis, Minnesota: Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423.
Feng, Fangxiaoyu, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. “Language-agnostic BERT Sentence Embedding.” Computing Research Repository arXiv:2007.01852. https://arxiv.org/abs/2007.01852.
Rajda, Krzysztof, Lukasz Augustyniak, Piotr Gramacki, Marcin Gruza, Szymon Woźniak, and Tomasz Kajdanowicz. 2022. “Assessment of Massively Multilingual Sentiment Classifiers.” In Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis, 125–40. Dublin, Ireland: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.wassa-1.13.
Reimers, Nils, and Iryna Gurevych. 2020. “Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation.” In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 4512–25. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.365.
Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.” Computing Research Repository arXiv:1910.01108. https://arxiv.org/abs/1910.01108.
Xue, Linting, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. “mT5: A Massively Multilingual Pre-Trained Text-to-Text Transformer.” In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 483–98. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.41.
Yang, Yinfei, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, et al. 2020. “Multilingual Universal Sentence Encoder for Semantic Retrieval.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 87–94. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-demos.12.