Run description: Our model applies an extensive preprocessing pipeline to the data, feeds the result into the fast-bert model, and uses a user-defined importance score function to output importance scores.
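The run does not specify the importance score function, so the following is only a minimal sketch of what such a function might look like: a weighted maximum over the classifier's per-type probabilities. The weight map and the helper's input format are hypothetical, not the team's actual design.

```python
# Hypothetical sketch: combine per-type probabilities from the fine-tuned
# classifier into one importance score. The weights are illustrative only.
TYPE_WEIGHTS = {
    "Request-SearchAndRescue": 1.0,
    "Report-EmergingThreats": 0.8,
    "CallToAction-Volunteer": 0.5,
    "Other-Irrelevant": 0.0,
}

def importance_score(type_probs: dict) -> float:
    """Weight each predicted information-type probability and take the max."""
    return max(type_probs.get(t, 0.0) * w for t, w in TYPE_WEIGHTS.items())

# Usage: importance_score({"Report-EmergingThreats": 0.9}) -> 0.72
```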
Run description: This run first applies BERT_base for multi-task learning on information type categorisation and priority estimation. A fine-tuned ELECTRA is then used to re-predict the tweets that BERT does not assign any label in the first stage. The fine-tuned ELECTRA is a ranking model that takes the query description and raw tweet text as inputs, trained on the training tweets that belong to important information types.
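A sketch of the two-stage control flow this run describes. The decision threshold and the two predictor callables are assumptions standing in for the fine-tuned models:

```python
from typing import Callable, Dict, List

THRESHOLD = 0.5  # assumed decision threshold; the run does not state one

def two_stage_labels(
    tweet: str,
    query_desc: str,
    bert_predict: Callable[[str], Dict[str, float]],   # stage 1: type -> prob
    electra_rank: Callable[[str, str], List[str]],     # stage 2: ranked types
) -> List[str]:
    """Keep BERT's labels when it assigns any; otherwise back off to the
    ELECTRA ranker over (query description, tweet text) pairs."""
    probs = bert_predict(tweet)
    labels = [t for t, p in probs.items() if p >= THRESHOLD]
    if not labels:
        labels = electra_rank(query_desc, tweet)[:1]
    return labels
```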
Run description: Similar to Run 2, except that the training data is augmented using distilGPT2 fine-tuned in a multi-task fashion for generation (with control codes).
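One plausible reading of the control-code augmentation, sketched with HuggingFace's text-generation pipeline. The `<ctrl:...>` token format is an assumption, not the team's actual scheme, and the sketch prompts the unadapted base model; the run fine-tunes it on the TREC-IS training data first.

```python
from transformers import pipeline

# Assumed training format: each example is "<ctrl:LABEL> tweet text", so the
# fine-tuned model can be prompted with a control code to generate synthetic
# tweets for that label.
generator = pipeline("text-generation", model="distilgpt2")

def synth_tweets(label: str, n: int = 5) -> list:
    prompt = f"<ctrl:{label}>"
    outs = generator(prompt, max_new_tokens=40, num_return_sequences=n,
                     do_sample=True, temperature=0.9)
    return [o["generated_text"].removeprefix(prompt).strip() for o in outs]
```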
Run description: This run applies bert_base fine-tuned on label-text pairs to estimate which information types a test tweet matches well. Priority is estimated simply by a numeric conversion of the predicted information types.
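A sketch of scoring (label, tweet) pairs with a sequence-pair classifier, plus the kind of fixed type-to-priority map the run implies. The checkpoint, binary "match" head, and priority values are hypothetical:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint fine-tuned on (label description, tweet) pairs
# with a binary matches / does-not-match head.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Assumed fixed mapping from information type to priority score.
TYPE_PRIORITY = {"Request-SearchAndRescue": 1.0, "Other-Irrelevant": 0.1}

def match_score(label_desc: str, tweet: str) -> float:
    inputs = tok(label_desc, tweet, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # P(match)
```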
Run description: ELMo word embeddings are used to represent the tweet text. Other features include a binary indicator of the presence of a location in the tweet text, numeric features for hashtags, URLs, and media metadata, and a one-hot encoding of the crisis category as per the topics file (i.e., earthquake, flood, etc.). The model uses the binary relevance approach for multi-label classification with a Balanced Random Forest Classifier, which randomly undersamples the majority classes on each iteration to address class imbalance.
Run description: ELMo word embeddings are used to represent the tweet text. Other features include a binary indicator of the presence of a location in the tweet text, numeric features for hashtags, URLs, and media metadata, and a one-hot encoding of the crisis category as per the topics file (i.e., earthquake, flood, etc.). The model uses the binary relevance approach for multi-label classification with an 'Easy Ensemble Classifier', which uses a combination of AdaBoost learners to boost minority labels and random undersampling on the majority ones.
Run description: ELMo word embeddings are used to represent the tweet text, with additional TF-IDF features for the 500 most common terms. Other features include a binary indicator of the presence of a location in the tweet text, numeric features for hashtags, URLs, and media metadata, and a one-hot encoding of the crisis category as per the topics file (i.e., earthquake, flood, etc.). The model uses the binary relevance approach for multi-label classification with a Balanced Random Forest Classifier, which randomly undersamples the majority classes on each iteration.
Run description: ELMo word embeddings are used to represent the tweet text, with additional TF-IDF features for the 500 most common terms. Other features include a binary indicator of the presence of a location in the tweet text, numeric features for hashtags, URLs, and media metadata, and a one-hot encoding of the crisis category as per the topics file (i.e., earthquake, flood, etc.). The model uses the binary relevance approach for multi-label classification with an 'Easy Ensemble Classifier', which uses a combination of AdaBoost learners to boost minority labels and random undersampling on the majority ones.
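A sketch of the binary-relevance setup behind the four ELMo runs above, using the two imbalanced-learn ensembles they name. Feature extraction (ELMo embeddings plus handcrafted features) is replaced by a random placeholder matrix:

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier

# X: one row per tweet (ELMo embedding + handcrafted features); Y: one binary
# column per information type. Random data stands in for the real features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1024 + 10))
Y = rng.integers(0, 2, size=(200, 5))

# Binary relevance = one independent classifier per label column.
brf = MultiOutputClassifier(BalancedRandomForestClassifier(random_state=0))
brf.fit(X, Y)

eec = MultiOutputClassifier(EasyEnsembleClassifier(random_state=0))
eec.fit(X, Y)

print(brf.predict(X[:3]), eec.predict(X[:3]))
```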
Run description: This system uses transfer learning to construct embeddings of social media content, learns a mapping from these embeddings to the TREC-IS label space, and generates labels of information type and priority from these models. We use pre-trained RoBERTa models provided by the simpletransformers interface to HuggingFace's Transformers library for these models.
Run description: This system uses transfer learning to construct embeddings of social media content, learns a mapping from these embeddings to the TREC-IS label space, and generates labels of information type and priority from these models. We use pre-trained RoBERTa models provided by the simpletransformers interface to HuggingFace's Transformers library for these models.
Run description: This system uses transfer learning to construct embeddings of social media content, learns a mapping from these embeddings to the TREC-IS label space, and generates labels of information type and priority from these models. We use pre-trained RoBERTa models provided by the simpletransformers interface to HuggingFace's Transformers library for these models.
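A minimal sketch of the simpletransformers setup these runs describe. The label-space size (25 types), toy data, and training arguments are assumptions for illustration:

```python
import pandas as pd
from simpletransformers.classification import MultiLabelClassificationModel

# train_df needs columns ["text", "labels"], where labels is a binary vector
# over the TREC-IS information types (25 assumed here).
train_df = pd.DataFrame({
    "text": ["Bridge out on Route 9, send help", "Lovely weather today"],
    "labels": [[1] + [0] * 24, [0] * 25],
})

model = MultiLabelClassificationModel(
    "roberta", "roberta-base", num_labels=25,
    args={"num_train_epochs": 3, "overwrite_output_dir": True},
    use_cuda=False,
)
model.train_model(train_df)
preds, raw = model.predict(["Flooding reported near the school"])
```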
Run description: This system uses transfer learning to construct embeddings of social media content, learns a mapping from these embeddings to the TREC-IS label space, and generates labels of information type and priority from these models. We use pre-trained RoBERTa models provided by the simpletransformers interface to HuggingFace's Transformers library for these models. We also employ a simple text augmentation heuristic to expand our training data by replacing words with synonyms defined in NLTK.
Run description: This system uses transfer learning to construct embeddings of social media content, learns a mapping from these embeddings to the TREC-IS label space, and generates labels of information type and priority from these models. We use pre-trained RoBERTa models provided by the simpletransformers interface to HuggingFace's Transformers library for these models. We also employ a simple text augmentation heuristic to expand our training data by replacing words with synonyms defined in NLTK.
Run description: This system uses transfer learning to construct embeddings of social media content, learns a mapping from these embeddings to the TREC-IS label space, and generates labels of information type and priority from these models. We use pre-trained RoBERTa models provided by the simpletransformers interface to HuggingFace's Transformers library for these models. We also employ a simple text augmentation heuristic to expand our training data by replacing words with synonyms defined in NLTK.
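A sketch of the synonym-replacement heuristic using NLTK's WordNet. Replacing any word, with some probability, by a random synonym is one simple policy; the replacement rate and policy here are assumptions, not necessarily what these runs used:

```python
import random
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def augment(text: str, p: float = 0.3, seed: int = 0) -> str:
    """Replace each word, with probability p, by a random WordNet synonym."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        syns = {l.name().replace("_", " ")
                for s in wn.synsets(word.lower())
                for l in s.lemmas() if l.name().lower() != word.lower()}
        out.append(rng.choice(sorted(syns)) if syns and rng.random() < p else word)
    return " ".join(out)

print(augment("people trapped inside the collapsed building"))
```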
Run description: This run uses RoBERTa to generate embeddings for tweets and then uses SVM and SVR on the resulting embeddings to generate information types and priorities. To deal with class imbalance, we augment the training data with weak-supervision-based training examples wherein we replace nouns and verbs with synonyms using the WordNet external resource.
Run description: This run uses RoBERTa to generate embeddings for tweets and then uses SVM and SVR on the resulting embeddings to generate information types and priorities. To deal with class imbalance, we augment the training data with weak-supervision-based training examples wherein we replace nouns and verbs with synonyms using the WordNet external resource.
Run description: This run uses RoBERTa to generate embeddings for tweets and then uses SVM and SVR on the resulting embeddings to generate information types and priorities. To deal with class imbalance, we augment the training data with weak-supervision-based training examples wherein we replace nouns and verbs with synonyms using the WordNet external resource.
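A sketch of the embed-then-shallow-model pipeline these runs describe: RoBERTa vectors feeding an SVM for types and an SVR for priority. Mean pooling over the last hidden state is an assumed pooling choice, and the toy labels are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.svm import SVC, SVR

tok = AutoTokenizer.from_pretrained("roberta-base")
enc = AutoModel.from_pretrained("roberta-base")

def embed(texts):
    """Mean-pool RoBERTa's last hidden state into one vector per tweet."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state          # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

X = embed(["roads flooded downtown", "concert tickets on sale"])
svm = SVC().fit(X, [1, 0])        # one such classifier per information type
svr = SVR().fit(X, [0.9, 0.1])    # priority regression
```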
Run description: This run uses RoBERTa to generate embeddings for tweets and then uses SVM and SVR on the resulting embeddings to generate information types and priorities. We also augment the training data using non-humanitarian labels from the CrisisMMD dataset and with our weak-supervision-based text augmentation. From CrisisMMD, we create a classifier that tags tweets with CrisisMMD labels, and we augment the text with those labels.
Run description: This run uses RoBERTa to generate embeddings for tweets and then uses SVM and SVR on the resulting embeddings to generate information types and priorities. We also augment the training data using non-humanitarian labels from the CrisisMMD dataset and with our weak-supervision-based text augmentation. From CrisisMMD, we create a classifier that tags tweets with CrisisMMD labels, and we augment the text with those labels.
Run description: This run uses RoBERTa to generate embeddings for tweets and then uses SVM and SVR on the resulting embeddings to generate information types and priorities. We also augment the training data using non-humanitarian labels from the CrisisMMD dataset and with our weak-supervision-based text augmentation. From CrisisMMD, we create a classifier that tags tweets with CrisisMMD labels, and we augment the text with those labels.
Run description: This run uses RoBERTa to generate embeddings for tweets and then uses SVM and SVR on the resulting embeddings to generate information types and priorities. We also augment the training data using non-humanitarian labels from the CrisisMMD dataset and with our weak-supervision-based text augmentation. From CrisisMMD, we create a classifier that tags tweets with CrisisMMD labels, and we augment the text with those labels. We also integrate CrisisMMD's image labels to generate priority scores for tweets with images, where we then take the max priority label between the text and image.
Run description: This run uses RoBERTa to generate embeddings for tweets and then uses SVM and SVR on the resulting embeddings to generate information types and priorities. We also augment the training data using non-humanitarian labels from the CrisisMMD dataset and with our weak-supervision-based text augmentation. From CrisisMMD, we create a classifier that tags tweets with CrisisMMD labels, and we augment the text with those labels. We also integrate CrisisMMD's image labels to generate priority scores for tweets with images, where we then take the max priority label between the text and image.
Run description: This run uses RoBERTa to generate embeddings for tweets and then uses SVM and SVR on the resulting embeddings to generate information types and priorities. We also augment the training data using non-humanitarian labels from the CrisisMMD dataset and with our weak-supervision-based text augmentation. From CrisisMMD, we create a classifier that tags tweets with CrisisMMD labels, and we augment the text with those labels. We also integrate CrisisMMD's image labels to generate priority scores for tweets with images, where we then take the max priority label between the text and image.
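A sketch of the text/image priority fusion described in the last three runs above; the two per-modality scores are hypothetical inputs produced by the text model and the CrisisMMD image labels:

```python
from typing import Optional

def fused_priority(text_priority: float, image_priority: Optional[float]) -> float:
    """Take the max of the text-based and image-based priority scores;
    fall back to the text score when the tweet has no image."""
    if image_priority is None:
        return text_priority
    return max(text_priority, image_priority)

# Usage: fused_priority(0.4, 0.8) -> 0.8; fused_priority(0.4, None) -> 0.4
```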
Run description: A Random Forest estimator over a TF-IDF vectorizer, with some word cleaning and auto-correction. The current approach is clearly not effective, as most tweets failed to match, but it was submitted anyway.
Run description: A Random Forest estimator over a TF-IDF vectorizer, with some word cleaning and auto-correction. The current approach is clearly not effective, as most tweets failed to match, but it was submitted anyway.
Run description: A Random Forest estimator over a TF-IDF vectorizer, with some word cleaning and auto-correction. The current approach is clearly not effective, as most tweets failed to match, but it was submitted anyway.
Run description: A Random Forest estimator over a TF-IDF vectorizer, with some word cleaning and auto-correction. The current approach is clearly not effective, as most tweets failed to match, but it was submitted anyway.
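A sketch of the TF-IDF + Random Forest baseline these four runs describe. The word cleaning and auto-correction steps are omitted, and the toy data and one-label-per-tweet simplification are for illustration:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

texts = ["power lines down on main st", "donate blood at city hall", "nice sunset"]
labels = ["Report-EmergingThreats", "CallToAction-Donations", "Other-Irrelevant"]

clf = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"),
                    RandomForestClassifier(n_estimators=200, random_state=0))
clf.fit(texts, labels)
print(clf.predict(["flood warning for the river district"]))
```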
Run description: This is a baseline using techniques developed in the 2019-A edition, i.e., multi-task learning transformers for crisis tweet categorisation.
Run description: This run applies BM25 to match a test tweet against each information type required in Task 3. The information-type queries are constructed manually. Tweets not assigned any type are further estimated by an ELECTRA-base ranking model fine-tuned on the training set. In this run, priority is simply converted from the predicted information types, based on an analysis of information types with respect to the priority levels of training tweets.
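A sketch of the BM25 matching step using the rank_bm25 package. The query strings and the score threshold are invented examples; the run's manually constructed queries are not published in this description:

```python
from rank_bm25 import BM25Okapi

tweets = ["people trapped under rubble need rescue",
          "heavy rain and wind expected tonight"]
bm25 = BM25Okapi([t.split() for t in tweets])

# Manually constructed queries, one per information type (illustrative only).
queries = {
    "Request-SearchAndRescue": "trapped rescue help stranded missing",
    "Report-Weather": "rain wind storm forecast flooding",
}

# Score every tweet against each information-type query; a tweet is assigned
# the types whose BM25 scores clear an (assumed) threshold.
for info_type, q in queries.items():
    for tweet, score in zip(tweets, bm25.get_scores(q.split())):
        if score >= 1.0:
            print(info_type, "<-", tweet, round(score, 2))
```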
Run description: This run fine-tunes a BERT-base model for multi-task learning, only on the training tweets labelled with the information types required in Task 3. Information type categorisation is defined as a multi-label classification task and priority estimation as a regression task.
Run description: Similar to R2, except that the training dataset is augmented by the GPT-2 generative model (distilled version), adapted to this domain by fine-tuning with control codes.
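A sketch of the multi-task architecture these two runs describe: one shared BERT encoder with a multi-label type head and a priority regression head, trained with a summed loss. The equal task weighting is an assumption, not stated in the runs:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiTaskBert(nn.Module):
    """Shared BERT encoder; multi-label type head + priority regression head."""
    def __init__(self, n_types: int = 25):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        hidden = self.encoder.config.hidden_size
        self.type_head = nn.Linear(hidden, n_types)   # multi-label logits
        self.prio_head = nn.Linear(hidden, 1)         # priority score

    def forward(self, input_ids, attention_mask):
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.type_head(cls), self.prio_head(cls).squeeze(-1)

def loss_fn(type_logits, prio_pred, type_targets, prio_targets):
    # Equal weighting of the two tasks is an assumption.
    return (nn.functional.binary_cross_entropy_with_logits(type_logits, type_targets)
            + nn.functional.mse_loss(prio_pred, prio_targets))
```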
Run description: XGBoost classifiers and regressors using the same feature set. The feature set comprises basic statistics of the tweet's text, named-entity features, part-of-speech tagging features, and sentiment analysis features.
Run description: XGBoost classifiers and regressors using the same feature set. The feature set comprises basic statistics of the tweet's text, named-entity features, part-of-speech tagging features, and sentiment analysis features.
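A sketch of the handcrafted-feature + XGBoost setup from these two runs. The extractor below covers only the basic text statistics and a VADER sentiment score; the named-entity and POS-tag features are omitted for brevity, and the toy data is illustrative:

```python
import numpy as np
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from xgboost import XGBClassifier, XGBRegressor

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def features(text: str) -> list:
    """Basic text statistics plus a VADER compound sentiment score."""
    tokens = text.split()
    return [len(text), len(tokens),
            sum(w.isupper() for w in tokens),      # all-caps "shouting" words
            text.count("#"), text.count("http"),
            sia.polarity_scores(text)["compound"]]

texts = ["EVACUATE NOW #flood http://t.co/x", "lovely day at the park"]
X = np.array([features(t) for t in texts])
clf = XGBClassifier(n_estimators=50).fit(X, [1, 0])        # information type
reg = XGBRegressor(n_estimators=50).fit(X, [0.9, 0.1])     # priority
```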