MMS Dataset Card

Dataset Card for https://huggingface.co/datasets/Brand24/mms

Ease of use

One of the key ideas behind creating our library of datasets was to prioritize ease of use for researchers. Recognizing the importance of accessibility and convenience, we chose the HuggingFace platform as the storage and distribution platform for the datasets. HuggingFace provides a user-friendly interface and a wide range of tools and resources, making it easy for researchers to access and utilize the datasets.

To further enhance usability, we took the initiative to gather all the necessary citations for the datasets included in our library. By unifying the citations, we aimed to simplify and expedite the process of generating citations for researchers who utilize our datasets. This step reduces the time and effort required for researchers to acknowledge the datasets’ sources properly.

However, it is essential to note that while we have taken steps to streamline the citation process, researchers should still independently verify the licenses of the datasets, especially if they intend to use them for purposes beyond strict academic research. Ensuring compliance with licensing requirements is crucial to maintaining ethical and legal data use standards.

Overall, our overarching goal in creating this unified corpus of datasets is to accelerate academic sentiment analysis research. By providing a comprehensive collection of high-quality datasets and facilitating their accessibility, we aim to support researchers in exploring and advancing sentiment analysis techniques and methodologies.

Data ready to slice and dice and train a model

Our dataset is designed to be versatile and allows researchers to slice and dice the data for training and modeling according to their specific needs. Drawing from the field of linguistic typology, which examines the characteristics of languages, we have incorporated various linguistic features into our dataset selection process. These features include the text itself, sentiment labels, the original dataset source, domain, language, language family, genus, the presence or absence of definite and indefinite articles, the number of cases, word order, negative morphemes, polar questions, the position of negative morphemes, prefixing vs. suffixing, coding of nominal plurals, and grammatical genders. By offering these features as filtering options in our library, we let researchers easily access datasets that match their desired linguistic typology criteria.

For instance, researchers can download datasets specific to Slavic languages with interrogative word order for polar questions or datasets from the Afro-Asiatic language family without morphological case-making. This flexibility empowers researchers to tailor their analyses and models to their linguistic interests and research questions.

import datasets

mms_dataset = datasets.load_dataset("Brand24/mms")
mms_dataset_df = mms_dataset["train"].to_pandas()

All features in dataset

mms_dataset_df.sample(5)
_id text label original_dataset domain language Family Genus Definite articles Indefinite articles Number of cases Order of subject, object, verb Negative morphemes Polar questions Position of negative word wrt SOV Prefixing vs suffixing Coding of nominal plurality Grammatical genders cleanlab_self_confidence
1117023 1117023 hlucnost mi prijde uplne v pohode, pere dobre,... 2 cs_mall_product_reviews reviews cs Indo-European Slavic no article no article 6-7 SVO negative affix interrogative word order MorphNeg weakly suffixing plural suffix masculine, feminine, neuter 0.679376
824580 824580 “فندق جميل ولكن الخدمة جدا سيئه”. . الخدمة غير... 0 ar_hard reviews ar Afro-Asiatic Semitic definite affix no article 3 SVO negative particle interrogative intonation only SNegVO weakly suffixing mixed morphological plural masculine, feminine 0.725264
6014593 6014593 刚开始不习惯…之后还挺好用的…很轻便 很细…调节长度也很方便 2 zh_multilan_amazon reviews zh Sino-Tibetan Chinese no article indefinite word same as one no morphological case-making SVO negative particle question particle SNegVO little affixation no plural noun classifiers 0.907645
5313872 5313872 Чемпионы. И этим все сказано. 2 ru_sentiment social_media ru Indo-European Slavic no article no article 6-7 SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine, neuter 0.109386
4290632 4290632 “@UnCharroDice: Y no ha de sobrar, quien con c... 1 es_twitter_sentiment social_media es Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix masculine, feminine 0.164549

Linguistic Typology

The field of language typology focuses on studying the similarities and differences among languages. These differences can be categorized into phonological (sounds), syntactic (structures), lexical (vocabulary), and theoretical aspects. Linguistic typology analyzes the current state of languages, contrasting with genealogical linguistics, which examines historical relationships between languages.

Genealogical linguistics studies language families and genera. A language family consists of languages that share a common ancestral language, while genera are branches within a language family. The Indo-European family, for example, includes genera such as Slavic, Romance, Germanic, and Indic. Over 7000 languages are categorized into approximately 150 language families, with Indo-European, Sino-Tibetan, Turkic, Afro-Asiatic, Nilo-Saharan, Niger-Congo, and Eskimo-Aleut being some of the largest families.

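Within linguistic typology, languages are described using various linguistic features. Our work focuses on sentiment classification and selects ten relevant typological features. The list below first describes the basic dataset fields (text through Genus) and then the ten typological features: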

  • text: The feature text represents the actual text of the sentiment dataset. It is of type string and contains the text samples or sentences for sentiment analysis.
  • label: The feature label corresponds to the sentiment labels of the text samples. It is of type ClassLabel and has three possible values: negative, neutral, and positive. These labels indicate the sentiment or emotional polarity associated with the text.
  • original_dataset: The feature original_dataset refers to the name or identifier of the original dataset from which the text samples were extracted. It is of type string and provides information about the source dataset.
  • domain: The feature domain represents the domain or topic of the sentiment dataset. It is of type string and provides context regarding the subject matter of the text samples.
  • language: The feature language indicates the language of the text samples in the sentiment dataset. It is of type string and specifies the language in which the text is written.
  • Family: The feature Family represents the language family to which a specific language belongs. It is of type string and provides information about the broader categorization of languages into language families.
  • Genus: The feature Genus corresponds to the genus or branch within a language family. It is of type string and indicates the specific subgrouping of languages within a language family.
  • Definite article: Half of the languages do not use the definite article, which signals uniqueness or definiteness of a concept.
  • Indefinite article: Half of the languages do not use the indefinite article, with some languages using a separate article or the numeral “one.”
  • Number of cases: Languages vary greatly in the number of morphological cases used.
  • Order of subject, verb, and object: Different languages have different word orderings, with variations like SOV, SVO, VSO, VOS, OVS, and OSV.
  • Negative morphemes: Negative morphemes indicate clausal negation in declarative sentences.
  • Polar questions: Questions with yes/no answers, which can be formed using question particles, interrogative morphology, or intonation.
  • Position of the negative morpheme: The position of the negative morpheme can vary in relation to subjects and objects.
  • Prefixing vs. suffixing: Languages differ in their use of prefixes and suffixes in inflectional morphology.
  • Coding of nominal plurals: Plurals can be expressed through morphological changes or the use of plurality indicator morphemes.
  • Grammatical genders: Languages vary in the number of grammatical genders used, or may not use the concept at all.

These language features are available as filtering options in our library. Users can download specific facets of the collection, such as datasets in Slavic languages with interrogative word order for polar questions or datasets from the Afro-Asiatic language family without morphological case-making.
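The two facets mentioned above can be expressed as ordinary pandas boolean masks. The following is a minimal sketch, assuming the column names and values that appear in the tables of this card ("Family", "Genus", "Polar questions", "Number of cases"). The four-row toy frame is a stand-in for the full mms_dataset_df and is not part of the dataset API.

```python
import pandas as pd

# Toy stand-in for mms_dataset_df; typology values are copied from rows shown in this card.
typology_df = pd.DataFrame(
    {
        "language": ["cs", "pl", "he", "ar"],
        "Family": ["Indo-European", "Indo-European", "Afro-Asiatic", "Afro-Asiatic"],
        "Genus": ["Slavic", "Slavic", "Semitic", "Semitic"],
        "Polar questions": [
            "interrogative word order",
            "question particle",
            "question particle",
            "interrogative intonation only",
        ],
        "Number of cases": ["6-7", "6-7", "no morphological case-making", "3"],
    }
)

# Slavic languages that form polar questions through interrogative word order.
slavic_word_order = typology_df[
    (typology_df["Genus"] == "Slavic")
    & (typology_df["Polar questions"] == "interrogative word order")
]

# Afro-Asiatic languages without morphological case-making.
afro_asiatic_no_case = typology_df[
    (typology_df["Family"] == "Afro-Asiatic")
    & (typology_df["Number of cases"] == "no morphological case-making")
]

print(slavic_word_order["language"].tolist())
print(afro_asiatic_no_case["language"].tolist())
```

The same masks should apply unchanged to the full DataFrame, provided its columns carry the names shown in the tables of this card.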

Datasheets for Datasets

The datasheets provide detailed information about the datasets, including data collection methods, annotation guidelines, and potential biases. They also specify the intended uses and potential limitations of the datasets.

The initial pool of sentiment datasets was gathered through an extensive search using sources such as Google Scholar, GitHub repositories, and the HuggingFace datasets library. This search yielded a total of 345 datasets.

To ensure the quality of the datasets, a set of quality assurance criteria was applied to manually filter the initial pool of datasets. The following criteria were used:

  1. Strong Annotations: Datasets containing weak annotations, such as labels based on emoji occurrence or automatically generated through classification by machine learning models, were rejected. This decision was made to minimize the presence of noise in the datasets, ensuring higher quality annotations.
  2. Well-Defined Annotation Protocol: Datasets without sufficient information about the annotation protocol, including whether the annotation was done manually or automatically and the number of annotators involved, were rejected. This step aimed to avoid merging datasets with contradicting annotation instructions, ensuring consistency across the selected datasets.
  3. Numerical Ratings: Datasets with numerical ratings were accepted. Specifically, Likert-type 5-point scales were mapped into three class sentiment labels. Ratings 1 and 2 were mapped to “negative,” rating 3 was mapped to “neutral,” and ratings 4 and 5 were mapped to “positive.” This mapping allowed for consistent sentiment labeling across the datasets.
  4. Three Classes Only: Datasets annotated with binary sentiment labels were rejected. The decision to focus on datasets with three sentiment classes (negative, neutral, and positive) was made based on the unsatisfactory performance of binary sentiment labeling in three-class settings.
  5. Monolingual Datasets: In cases where a dataset contained samples in multiple languages, it was divided into independent datasets for each constituent language. This approach ensured that the corpus includes separate datasets for different languages, allowing for targeted analysis and evaluation.
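The rating rule in criterion 3 can be sketched as a small function. This is an illustration of the mapping as described above, not the authors' preprocessing code; the function name likert_to_sentiment is hypothetical.

```python
def likert_to_sentiment(rating: int) -> str:
    """Map a 5-point Likert rating to a three-class sentiment label."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a rating between 1 and 5, got {rating!r}")
    if rating <= 2:
        return "negative"  # ratings 1 and 2
    if rating == 3:
        return "neutral"  # rating 3
    return "positive"  # ratings 4 and 5


mapped = [likert_to_sentiment(r) for r in range(1, 6)]
print(mapped)
```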

By applying these quality assurance criteria, we were able to filter the initial pool of sentiment datasets and select a final set of 79 datasets that met the specified standards for inclusion in the multilingual corpus.
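The summary code that follows reads a label_name column that is not created anywhere in the code shown in this card. The following is a minimal sketch of one way to derive it, assuming the integer ids map to names as 0 = negative, 1 = neutral, 2 = positive. The ids 0 and 2 match rows displayed in this card; the id 1 is inferred from the class order and should be verified. If the label feature is a ClassLabel, as described earlier, its int2str method would be the authoritative source for these names.

```python
import pandas as pd

# Assumed id-to-name mapping: 0 and 2 agree with rows shown in this card, 1 is inferred.
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}

# Toy stand-in for mms_dataset_df with only the label column.
labels_df = pd.DataFrame({"label": [2, 0, 1, 2]})
labels_df["label_name"] = labels_df["label"].map(LABEL_NAMES)

print(labels_df["label_name"].tolist())
```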

f"We cover {mms_dataset_df.original_dataset.nunique()} datasets in {mms_dataset_df.language.nunique()} languages."
'We cover 79 datasets in 27 languages.'
f"The classes that we cover: {mms_dataset_df.label_name.unique()}"
"The classes that we cover: ['positive' 'neutral' 'negative']"

Limitations

Despite the fact that our collection is the largest public collection of multilingual sentiment datasets, it still covers only 27 languages. The collection of datasets is highly biased towards the Indo-European family of languages, English in particular. We attribute this bias to the general culture of scientific publishing and its enforcement of English as the primary carrier of scientific discovery. Our work’s main potential negative social impact is that the models developed and trained using the provided datasets may still exhibit better performance for the major languages. This could further perpetuate the existing language disparities and inequality in sentiment analysis capabilities across different languages. Addressing this limitation and working towards more equitable representation and performance across languages is crucial to avoid reinforcing language biases and the potential marginalization of underrepresented languages. The ethical implications of such disparities should be thoroughly discussed and considered.

Data Quality

An important limitation of our dataset collection is a significant variance in sample quality across all datasets and all languages. The figure above presents the distribution of the self-confidence label-quality score for each data point, computed with cleanlab (Northcutt, Jiang, and Chuang 2021). The distribution of quality is skewed in favor of popular languages, with low-resource languages suffering from data quality issues. A related limitation is caused by an unequal distribution of data modalities across languages. For instance, our benchmark clearly shows that all models universally underperform when tested on Portuguese datasets. This is the direct result of the fact that data points for Portuguese almost exclusively represent the domain of social media. As a consequence, some combinations of filtering facets in our dataset collection produce very little data (e.g., asking for social media data in the Germanic genus of Indo-European languages will produce a significantly larger dataset than asking for news data representing Afro-Asiatic languages).

Finally, we acknowledge the lack of internal coherence of annotation protocols between datasets and languages. We have enforced strict quality criteria and rejected all datasets published without the annotation protocol, but we were unable, for obvious reasons, to unify annotation guidelines. The annotation of sentiment expressions and the assignment of sentiment labels are heavily subjective and, at the same time, influenced by cultural and linguistic features. Unfortunately, it is possible that semantically similar utterances will be assigned conflicting labels if they come from different datasets or modalities.

Filter examples by annotation quality

We know how important data quality is for the model training process. Hence, we added cleanlab scores to each of the 6M+ examples in all datasets. It is now possible to filter examples according to the level of data quality you need for training.

We can sort examples by data quality. Cleanlab’s self-confidence is a label-quality score for classification datasets, where lower scores indicate labels that are less likely to be correct. Hence, for the best quality we want the highest scores.
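To make the score concrete: self-confidence is the probability that a model assigns to the label an example was given. The sketch below computes it with NumPy on made-up predicted probabilities. It illustrates the definition only; how the probabilities behind the cleanlab_self_confidence column were obtained is not described in this card.

```python
import numpy as np

# Hypothetical out-of-sample predicted probabilities for four examples over
# the classes (negative, neutral, positive). These numbers are invented.
pred_probs = np.array(
    [
        [0.05, 0.10, 0.85],
        [0.70, 0.20, 0.10],
        [0.30, 0.30, 0.40],
        [0.10, 0.15, 0.75],
    ]
)
given_labels = np.array([2, 0, 0, 1])  # labels as annotated in the dataset

# Self-confidence: the predicted probability of the given label for each example.
self_confidence = pred_probs[np.arange(len(given_labels)), given_labels]

# Example indices ordered from the most to the least trustworthy label.
ranking = np.argsort(-self_confidence)

print(self_confidence)
print(ranking)
```

In this toy data the last example scores lowest because the model puts most of its probability on "positive" while the given label is "neutral", which is the kind of disagreement that marks a label as suspect.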

clean_labels_data = mms_dataset_df.sort_values(by="cleanlab_self_confidence", ascending=False).head(10_000)
clean_labels_data.head()
_id text label original_dataset domain language Family Genus Definite articles Indefinite articles Number of cases Order of subject, object, verb Negative morphemes Polar questions Position of negative word wrt SOV Prefixing vs suffixing Coding of nominal plurality Grammatical genders cleanlab_self_confidence label_name
3075302 3075302 Great addition to any fan's yard! Show your te... 2 en_amazon reviews en Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender 0.999981 positive
629922 629922 مخيب للأمل. . ىحَ 0 ar_hard reviews ar Afro-Asiatic Semitic definite affix no article 3 SVO negative particle interrogative intonation only SNegVO weakly suffixing mixed morphological plural masculine, feminine 0.999964 negative
2858237 2858237 This is a great flag to display your love of A... 2 en_amazon reviews en Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender 0.999950 positive
3110031 3110031 One of the best knives I now proudly own! Am a... 2 en_amazon reviews en Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender 0.999950 positive
2052971 2052971 Amen! My Savior Loves! Wonderful testimony! 2 en_amazon reviews en Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender 0.999948 positive
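A global top-N selection like the one above tends to be dominated by a few high-resource datasets. Two alternatives are a fixed score threshold and a per-language selection. The following is a minimal sketch, assuming only the language and cleanlab_self_confidence columns. The cut-off of 0.5 and the second English score (0.40) are invented for illustration; the other scores are copied from rows shown in this card.

```python
import pandas as pd

# Toy stand-in for mms_dataset_df.
quality_df = pd.DataFrame(
    {
        "language": ["en", "en", "ru", "es", "ar"],
        "cleanlab_self_confidence": [0.999981, 0.40, 0.109386, 0.164549, 0.725264],
    }
)

MIN_CONFIDENCE = 0.5  # hypothetical cut-off; tune it for your task

# Option 1: keep every example at or above the threshold.
high_quality = quality_df[quality_df["cleanlab_self_confidence"] >= MIN_CONFIDENCE]

# Option 2: keep the best-scored example(s) of each language, so that
# lower-scoring languages are not dropped entirely.
TOP_PER_LANGUAGE = 1
best_per_language = (
    quality_df.sort_values("cleanlab_self_confidence", ascending=False)
    .groupby("language")
    .head(TOP_PER_LANGUAGE)
)

print(high_quality["language"].tolist())
print(sorted(best_per_language["language"]))
```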

Datasets

We added all necessary citations to the HuggingFace dataset card. You can find them under the citation key. We also added a helper function to parse them.

We can load citations as strings, which makes them easy to add to a BibTeX file.

from mms_benchmark.citations import get_citations
print(get_citations(mms_dataset["train"], citation_as_dict=False)["pl_polemo"])
@inproceedings{dataset_pl_polemo,
    title = "Multi-Level Sentiment Analysis of {P}ol{E}mo 2.0: Extended Corpus of Multi-Domain Consumer Reviews",
    author = "Koco{\'n}, Jan  and
        Mi{\l}kowski, Piotr  and
        Za{\'s}ko-Zieli{\'n}ska, Monika",
    booktitle = "Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/K19-1092",
    doi = "10.18653/v1/K19-1092",
    pages = "980--991"
}
% ------------------------------------------------------------------------------------------

Or as dictionaries, which are easier to work with programmatically.

citations = get_citations(mms_dataset["train"], citation_as_dict=True)
citations["pl_polemo"]
{'pages': '980--991',
 'doi': '10.18653/v1/K19-1092',
 'url': 'https://aclanthology.org/K19-1092',
 'publisher': 'Association for Computational Linguistics',
 'address': 'Hong Kong, China',
 'year': '2019',
 'month': 'November',
 'booktitle': 'Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)',
 'author': "Koco{\\'n}, Jan  and\nMi{\\l}kowski, Piotr  and\nZa{\\'s}ko-Zieli{\\'n}ska, Monika",
 'title': 'Multi-Level Sentiment Analysis of {P}ol{E}mo 2.0: Extended Corpus of Multi-Domain Consumer Reviews',
 'ENTRYTYPE': 'inproceedings',
 'ID': 'dataset_pl_polemo'}
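A dictionary in this shape can be turned back into a BibTeX entry with a few lines of plain Python. The sketch below assumes only the 'ENTRYTYPE' and 'ID' keys seen in the output above. The function name entry_to_bibtex, the alphabetical field order, and the brace-delimited values are choices made for this illustration and are not part of the mms_benchmark helper.

```python
def entry_to_bibtex(entry: dict) -> str:
    """Render a citation dict with 'ENTRYTYPE' and 'ID' keys as a BibTeX entry."""
    # Everything except the two bookkeeping keys is an ordinary BibTeX field.
    fields = {k: v for k, v in entry.items() if k not in ("ENTRYTYPE", "ID")}
    body = ",\n".join(
        f"    {key} = {{{value}}}" for key, value in sorted(fields.items())
    )
    return f"@{entry['ENTRYTYPE']}{{{entry['ID']},\n{body}\n}}"


# Shortened example with values taken from the pl_polemo entry above.
example_entry = {
    "year": "2019",
    "doi": "10.18653/v1/K19-1092",
    "ENTRYTYPE": "inproceedings",
    "ID": "dataset_pl_polemo",
}
rendered = entry_to_bibtex(example_entry)
print(rendered)
```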

Show all datasets with citations in a table

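# DATASET_COLS is not defined elsewhere in this card; the list below is an assumption inferred from the header of the table that follows.
DATASET_COLS = ["language", "original_dataset", "domain", "Family", "Genus", "Definite articles", "Indefinite articles", "Number of cases", "Order of subject, object, verb", "Negative morphemes", "Polar questions", "Position of negative word wrt SOV", "Prefixing vs suffixing", "Coding of nominal plurality", "Grammatical genders", "citation"]
mms_dataset_df["citation"] = mms_dataset_df["original_dataset"].apply(lambda x: f'[@{citations[x]["ID"]}]')
mms_dataset_df[DATASET_COLS].drop_duplicates().sort_values("language").reset_index(drop=True)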
language original_dataset domain Family Genus Definite articles Indefinite articles Number of cases Order of subject, object, verb Negative morphemes Polar questions Position of negative word wrt SOV Prefixing vs suffixing Coding of nominal plurality Grammatical genders citation
0 ar ar_arsentdl social_media Afro-Asiatic Semitic definite affix no article 3 SVO negative particle interrogative intonation only SNegVO weakly suffixing mixed morphological plural masculine, feminine [@dataset_ar_arsentdl]
1 ar ar_semeval_2017 mixed Afro-Asiatic Semitic definite affix no article 3 SVO negative particle interrogative intonation only SNegVO weakly suffixing mixed morphological plural masculine, feminine [@dataset_semeval_2017]
2 ar ar_oclar reviews Afro-Asiatic Semitic definite affix no article 3 SVO negative particle interrogative intonation only SNegVO weakly suffixing mixed morphological plural masculine, feminine [@dataset_ar_oclar]
3 ar ar_labr reviews Afro-Asiatic Semitic definite affix no article 3 SVO negative particle interrogative intonation only SNegVO weakly suffixing mixed morphological plural masculine, feminine [@dataset_ar_labr]
4 ar ar_syria_corpus social_media Afro-Asiatic Semitic definite affix no article 3 SVO negative particle interrogative intonation only SNegVO weakly suffixing mixed morphological plural masculine, feminine [@dataset_ar_bbn]
5 ar ar_brad reviews Afro-Asiatic Semitic definite affix no article 3 SVO negative particle interrogative intonation only SNegVO weakly suffixing mixed morphological plural masculine, feminine [@dataset_ar_brad]
6 ar ar_bbn social_media Afro-Asiatic Semitic definite affix no article 3 SVO negative particle interrogative intonation only SNegVO weakly suffixing mixed morphological plural masculine, feminine [@dataset_ar_bbn]
7 ar ar_astd social_media Afro-Asiatic Semitic definite affix no article 3 SVO negative particle interrogative intonation only SNegVO weakly suffixing mixed morphological plural masculine, feminine [@dataset_ar_astd]
8 ar ar_hard reviews Afro-Asiatic Semitic definite affix no article 3 SVO negative particle interrogative intonation only SNegVO weakly suffixing mixed morphological plural masculine, feminine [@dataset_ar_hard]
9 bg bg_twitter_sentiment social_media Indo-European Slavic definite word distinct from demonstrative no article no morphological case-making SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine, neuter [@dataset_twitter_sentiment]
10 bs bs_twitter_sentiment social_media Indo-European Slavic no article no article 5 SVO negative particle question particle other strongly suffixing plural suffix masculine, feminine, neuter [@dataset_twitter_sentiment]
11 cs cs_facebook social_media Indo-European Slavic no article no article 6-7 SVO negative affix interrogative word order MorphNeg weakly suffixing plural suffix masculine, feminine, neuter [@dataset_cs_social_media]
12 cs cs_mall_product_reviews reviews Indo-European Slavic no article no article 6-7 SVO negative affix interrogative word order MorphNeg weakly suffixing plural suffix masculine, feminine, neuter [@dataset_cs_social_media]
13 cs cs_movie_reviews reviews Indo-European Slavic no article no article 6-7 SVO negative affix interrogative word order MorphNeg weakly suffixing plural suffix masculine, feminine, neuter [@dataset_cs_social_media]
14 cs cs_news_stance social_media Indo-European Slavic no article no article 6-7 SVO negative affix interrogative word order MorphNeg weakly suffixing plural suffix masculine, feminine, neuter [@dataset_cs_social_media]
15 de de_twitter_sentiment social_media Indo-European Germanic definite word distinct from demonstrative indefinite word same as one 4 no dominant order negative particle interrogative word order more than one position strongly suffixing plural suffix masculine, feminine, neuter [@dataset_twitter_sentiment]
16 de de_omp social_media Indo-European Germanic definite word distinct from demonstrative indefinite word same as one 4 no dominant order negative particle interrogative word order more than one position strongly suffixing plural suffix masculine, feminine, neuter [@dataset_de_omp]
17 de de_sb10k social_media Indo-European Germanic definite word distinct from demonstrative indefinite word same as one 4 no dominant order negative particle interrogative word order more than one position strongly suffixing plural suffix masculine, feminine, neuter [@dataset_de_sb10k]
18 de de_ifeel social_media Indo-European Germanic definite word distinct from demonstrative indefinite word same as one 4 no dominant order negative particle interrogative word order more than one position strongly suffixing plural suffix masculine, feminine, neuter [@dataset_dai_labor]
19 de de_dai_labor social_media Indo-European Germanic definite word distinct from demonstrative indefinite word same as one 4 no dominant order negative particle interrogative word order more than one position strongly suffixing plural suffix masculine, feminine, neuter [@dataset_dai_labor]
20 de de_multilan_amazon reviews Indo-European Germanic definite word distinct from demonstrative indefinite word same as one 4 no dominant order negative particle interrogative word order more than one position strongly suffixing plural suffix masculine, feminine, neuter [@dataset_multilan_amazon]
21 en en_vader_twitter social_media Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_en_vader]
22 en en_vader_nyt news Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_en_vader]
23 en en_vader_movie_reviews reviews Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_en_vader]
24 en en_vader_amazon reviews Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_en_vader]
25 en en_twitter_sentiment social_media Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_twitter_sentiment]
26 en en_tweets_sanders social_media Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_en_tweets_sanders]
27 en en_tweet_airlines social_media Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_en_tweet_airlines]
28 en en_silicone_sem chats Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_en_silicone]
29 en en_sentistrength social_media Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_en_sentistrength]
30 en en_semeval_2017 mixed Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_semeval_2017]
31 en en_poem_sentiment poems Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_en_poem_sentiment]
32 en en_per_sent news Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_en_per_sent]
33 en en_multilan_amazon reviews Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_multilan_amazon]
34 en en_financial_phrasebank_sentences_75agree news Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_en_financial_phrasebank_sentences_75agree]
35 en en_dai_labor social_media Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_dai_labor]
36 en en_amazon reviews Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_en_amazon]
37 en en_silicone_meld_s chats Indo-European Germanic definite word distinct from demonstrative indefinite word distinct from one 2 SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_en_silicone]
38 es es_twitter_sentiment social_media Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix masculine, feminine [@dataset_twitter_sentiment]
39 es es_semeval2020 social_media Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix masculine, feminine [@dataset_semeval_2020]
40 es es_multilan_amazon reviews Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix masculine, feminine [@dataset_multilan_amazon]
41 es es_muchocine reviews Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix masculine, feminine [@dataset_es_muchocine]
42 es es_paper_reviews reviews Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle interrogative word order SNegVO strongly suffixing plural suffix masculine, feminine [@dataset_es_paper_reviews]
43 fa fa_sentipers reviews Indo-European Iranian no article indefinite word same as one 2 SOV negative affix question particle MorphNeg weakly suffixing plural suffix no grammatical gender [@dataset_fa_sentipers]
44 fr fr_dai_labor social_media Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle question particle OptDoubleNeg strongly suffixing plural suffix masculine, feminine [@dataset_dai_labor]
45 fr fr_ifeel social_media Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle question particle OptDoubleNeg strongly suffixing plural suffix masculine, feminine [@dataset_dai_labor]
46 fr fr_multilan_amazon reviews Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle question particle OptDoubleNeg strongly suffixing plural suffix masculine, feminine [@dataset_multilan_amazon]
47 he he_hebrew_sentiment social_media Afro-Asiatic Semitic definite affix indefinite word same as one no morphological case-making SVO negative particle question particle SNegVO weakly suffixing plural suffix masculine, feminine [@dataset_he_hebrew_sentiment]
48 hi hi_semeval2020 social_media Indo-European Indic no article no article 3 SOV negative particle question particle SONegV strongly suffixing plural suffix masculine, feminine [@dataset_semeval_2020]
49 hr hr_sentiment_news_document news Indo-European Slavic no article no article 5 SVO negative particle question particle other strongly suffixing plural suffix masculine, feminine, neuter [@dataset_hr_sentiment_news_document]
50 hr hr_twitter_sentiment social_media Indo-European Slavic no article no article 5 SVO negative particle question particle other strongly suffixing plural suffix masculine, feminine, neuter [@dataset_twitter_sentiment]
51 hu hu_twitter_sentiment social_media Uralic Ugric definite word distinct from demonstrative indefinite word distinct from one 10 or more no dominant order negative particle question particle SNegVO strongly suffixing plural suffix no grammatical gender [@dataset_twitter_sentiment]
52 it it_evalita2016 social_media Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle interrogative intonation only SNegVO strongly suffixing plural suffix masculine, feminine [@dataset_it_evalita2016]
53 it it_multiemotions social_media Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle interrogative intonation only SNegVO strongly suffixing plural suffix masculine, feminine [@dataset_it_multiemotions]
54 ja ja_multilan_amazon reviews Japanese Japanese no article indefinite word distinct from one 8-9 SOV negative affix question particle MorphNeg strongly suffixing plural suffix no grammatical gender [@dataset_multilan_amazon]
55 lv lv_ltec_sentiment social_media Indo-European Baltic demonstrative word used as definite article indefinite word same as one 5 SVO negative affix question particle MorphNeg weakly suffixing plural suffix masculine, feminine [@dataset_lv_ltec_sentiment]
56 pl pl_twitter_sentiment social_media Indo-European Slavic no article no article 6-7 SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine, neuter [@dataset_twitter_sentiment]
57 pl pl_polemo reviews Indo-European Slavic no article no article 6-7 SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine, neuter [@dataset_pl_polemo]
58 pl pl_klej_allegro_reviews reviews Indo-European Slavic no article no article 6-7 SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine, neuter [@dataset_pl_klej_allegro_reviews]
59 pl pl_opi_lil_2012 social_media Indo-European Slavic no article no article 6-7 SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine, neuter [@dataset_pl_opi_lil_2012]
60 pt pt_dai_labor social_media Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine [@dataset_dai_labor]
61 pt pt_ifeel social_media Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine [@dataset_dai_labor]
62 pt pt_tweet_sent_br social_media Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine [@dataset_pt_tweet_sent_br]
63 pt pt_twitter_sentiment social_media Indo-European Romance definite word distinct from demonstrative indefinite word same as one no morphological case-making SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine [@dataset_twitter_sentiment]
64 ru ru_sentiment social_media Indo-European Slavic no article no article 6-7 SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine, neuter [@dataset_ru_sentiment]
65 ru ru_twitter_sentiment social_media Indo-European Slavic no article no article 6-7 SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine, neuter [@dataset_twitter_sentiment]
66 sk sk_twitter_sentiment social_media Indo-European Slavic no article no article 6-7 SVO negative affix interrogative word order MorphNeg weakly suffixing plural suffix masculine, feminine, neuter [@dataset_twitter_sentiment]
67 sl sl_sentinews news Indo-European Slavic no article no article 6-7 SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine, neuter [@Bučar2018]
68 sl sl_twitter_sentiment social_media Indo-European Slavic no article no article 6-7 SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine, neuter [@dataset_twitter_sentiment]
69 sq sq_twitter_sentiment social_media Indo-European Albanian definite affix indefinite word distinct from one 4 SVO negative particle question particle SNegVO strongly suffixing plural suffix masculine, feminine [@dataset_twitter_sentiment]
70 sr sr_movie_reviews reviews Indo-European Slavic no article no article 5 SVO negative particle question particle other strongly suffixing plural suffix masculine, feminine, neuter [@dataset_sr_serb_movie_reviews]
71 sr sr_senticomments reviews Indo-European Slavic no article no article 5 SVO negative particle question particle other strongly suffixing plural suffix masculine, feminine, neuter [@dataset_sr_senticomments]
72 sr sr_twitter_sentiment social_media Indo-European Slavic no article no article 5 SVO negative particle question particle other strongly suffixing plural suffix masculine, feminine, neuter [@dataset_twitter_sentiment]
73 sv sv_twitter_sentiment social_media Indo-European Germanic definite affix indefinite word same as one 2 SVO negative particle interrogative word order more than one position strongly suffixing plural suffix common, neuter [@dataset_twitter_sentiment]
74 th th_wongnai_reviews reviews Tai-Kadai Kam-Tai no article indefinite word distinct from one no morphological case-making SVO negative auxiliary verb question particle SNegVO little affixation mixed morphological plural noun classifiers [@dataset_th_wongnai_reviews]
75 th th_wisesight_sentiment social_media Tai-Kadai Kam-Tai no article indefinite word distinct from one no morphological case-making SVO negative auxiliary verb question particle SNegVO little affixation mixed morphological plural noun classifiers [@dataset_th_wisesight_sentiment]
76 ur ur_roman_urdu mixed Indo-European Indic no article no article 2 SOV negative affix question particle SONegV strongly suffixing plural suffix masculine, feminine [@dataset_ur_roman_urdu]
77 zh zh_hotel_reviews reviews Sino-Tibetan Chinese no article indefinite word same as one no morphological case-making SVO negative particle question particle SNegVO little affixation no plural noun classifiers [@dataset_zh_hotel_reviews]
78 zh zh_multilan_amazon reviews Sino-Tibetan Chinese no article indefinite word same as one no morphological case-making SVO negative particle question particle SNegVO little affixation no plural noun classifiers [@dataset_multilan_amazon]
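
As a rough illustration of how the typology columns in the table above can drive dataset selection, the sketch below filters a small hand-built metadata table with pandas. The rows are copied from the table. The column names `cases` and `word_order` are hypothetical stand-ins chosen for this example; only `language`, `original_dataset`, `domain`, `Family`, and `Genus` appear in the code snippets of this card.

```python
import pandas as pd

# A few rows copied from the table above. The column names "cases" and
# "word_order" are assumptions made for this sketch, not confirmed names.
meta = pd.DataFrame(
    [
        ("fa", "fa_sentipers", "reviews", "Indo-European", "Iranian", "2", "SOV"),
        ("hi", "hi_semeval2020", "social_media", "Indo-European", "Indic", "3", "SOV"),
        ("ja", "ja_multilan_amazon", "reviews", "Japanese", "Japanese", "8-9", "SOV"),
        ("pl", "pl_polemo", "reviews", "Indo-European", "Slavic", "6-7", "SVO"),
        ("sv", "sv_twitter_sentiment", "social_media", "Indo-European", "Germanic", "2", "SVO"),
    ],
    columns=["language", "original_dataset", "domain", "Family", "Genus", "cases", "word_order"],
)

# Keep only Indo-European datasets whose dominant word order is SOV.
mask = (meta["Family"] == "Indo-European") & (meta["word_order"] == "SOV")
selected = meta.loc[mask, "original_dataset"].tolist()
print(selected)
```

The same boolean-mask pattern should carry over to the full corpus once the real column names are substituted.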

Dataset Stats

Datasets per language

pd.DataFrame(mms_dataset_df.groupby("language").original_dataset.nunique().sort_values(ascending=False))
original_dataset
language
en 17
ar 9
de 6
es 5
pl 4
cs 4
pt 4
sr 3
fr 3
th 2
sl 2
ru 2
it 2
hr 2
zh 2
bg 1
ja 1
lv 1
hu 1
hi 1
sk 1
he 1
sq 1
fa 1
sv 1
bs 1
ur 1

Labels per language

pd.DataFrame(mms_dataset_df.groupby(by=["language", "label_name"]).count()["text"])
text
language label_name
ar negative 138899
neutral 192774
positive 600402
bg negative 13930
neutral 28657
positive 19563
bs negative 11974
neutral 11145
positive 13064
cs negative 39674
neutral 59200
positive 97413
de negative 104667
neutral 100071
positive 111149
en negative 304939
neutral 290823
positive 1734724
es negative 108733
neutral 122493
positive 187486
fa negative 1602
neutral 5091
positive 6832
fr negative 84187
neutral 43245
positive 83199
he negative 2279
neutral 243
positive 6097
hi negative 4992
neutral 6392
positive 5615
hr negative 19757
neutral 19470
positive 38367
hu negative 8974
neutral 17621
positive 30087
it negative 4043
neutral 4193
positive 3829
ja negative 83982
neutral 41979
positive 83819
lv negative 1378
neutral 2618
positive 1794
pl negative 77422
neutral 62074
positive 97192
pt negative 56827
neutral 55165
positive 45842
ru negative 31770
neutral 48106
positive 31054
sk negative 14431
neutral 12842
positive 29350
sl negative 33694
neutral 50553
positive 29296
sq negative 6889
neutral 14757
positive 22638
sr negative 25089
neutral 32283
positive 18996
sv negative 16266
neutral 13342
positive 11738
th negative 9326
neutral 28616
positive 34377
ur negative 5239
neutral 8585
positive 5836
zh negative 117967
neutral 69016
positive 144719
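
The raw counts above suggest that class balance differs considerably between languages. A minimal sketch of how one might turn such counts into per-language label shares is shown below; it uses counts copied from the table for three languages rather than the full `mms_dataset_df`, so it runs without downloading anything.

```python
import pandas as pd

# Counts copied from the table above for three languages.
counts = pd.DataFrame(
    {
        "language": ["he"] * 3 + ["it"] * 3 + ["en"] * 3,
        "label_name": ["negative", "neutral", "positive"] * 3,
        "text": [2279, 243, 6097, 4043, 4193, 3829, 304939, 290823, 1734724],
    }
)

# One row per language, one column per label.
wide = counts.pivot(index="language", columns="label_name", values="text")

# Divide each row by its total so that the shares in a row sum to 1.
shares = wide.div(wide.sum(axis=1), axis=0)
print(shares.round(3))
```

On the full data frame, something like `mms_dataset_df.groupby("language")["label_name"].value_counts(normalize=True)` should give the same shares directly, assuming the column names used in the snippet above.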

Texts in Language Family and Genus

pd.DataFrame(mms_dataset_df.groupby(by=["Family", "Genus"]).count()["text"])
text
Family Genus
Afro-Asiatic Semitic 940694
Indo-European Albanian 44284
Baltic 5790
Germanic 2687719
Indic 36659
Iranian 13525
Romance 799242
Slavic 966366
Japanese Japanese 209780
Sino-Tibetan Chinese 331702
Tai-Kadai Kam-Tai 72319
Uralic Ugric 56682
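
To see how the corpus is distributed at the family level, the genus counts above can be rolled up. The sketch below does this from a hand-copied version of the table; the variable names are illustrative only.

```python
import pandas as pd

# (Family, Genus) -> number of texts, copied from the table above.
genus_counts = pd.Series(
    {
        ("Afro-Asiatic", "Semitic"): 940694,
        ("Indo-European", "Albanian"): 44284,
        ("Indo-European", "Baltic"): 5790,
        ("Indo-European", "Germanic"): 2687719,
        ("Indo-European", "Indic"): 36659,
        ("Indo-European", "Iranian"): 13525,
        ("Indo-European", "Romance"): 799242,
        ("Indo-European", "Slavic"): 966366,
        ("Japanese", "Japanese"): 209780,
        ("Sino-Tibetan", "Chinese"): 331702,
        ("Tai-Kadai", "Kam-Tai"): 72319,
        ("Uralic", "Ugric"): 56682,
    },
    name="text",
)
genus_counts.index.names = ["Family", "Genus"]

# Sum the genus-level counts within each family, then convert to shares.
family_counts = genus_counts.groupby(level="Family").sum()
family_share = family_counts / family_counts.sum()
print(family_share.sort_values(ascending=False).round(3))
```

A rollup like this makes it easier to spot which families are underrepresented before sampling a balanced training set.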

Examples per domain

pd.DataFrame(mms_dataset_df.groupby(by=["domain"]).count()["text"])
text
domain
chats 16781
mixed 94122
news 26413
poems 1052
reviews 4510893
social_media 1515501

Hosting, Licensing, and Maintenance Plan

  • Hosting: The datasets and benchmark are hosted on the HuggingFace Hub, a reliable and scalable cloud infrastructure that ensures accessibility and availability. The platform was chosen for its reliability, performance, and cost-effectiveness.
  • Licensing: We will clearly state the data license under which the datasets are released, ensuring that the terms of use are explicitly defined. We will consider licenses that facilitate research and allow for derivative works, while also addressing potential ethical considerations. See the license in the repository.
  • Maintenance: We (see the Dataset Curators section) are committed to providing ongoing maintenance and support for the datasets and benchmark. This includes regular updates, bug fixes, and responses to user feedback and inquiries. We will also establish a communication channel for users to report issues or request assistance.

References

Northcutt, Curtis, Lu Jiang, and Isaac Chuang. 2021. “Confident Learning: Estimating Uncertainty in Dataset Labels.” Journal of Artificial Intelligence Research 70: 1373–1411.