Solving Classification Tasks With LLMs for Comparative Research

Mate, Akos, Miklós Sebők, Lukasz Wordliczek, Dariusz Stolicki, Ádám Feldmann. Machine Translation As an Underrated Ingredient? Solving Classification Tasks With Large Language Models for Comparative Research. "Computational Communication Research", Vol. 5 (2), 2023, https://doi.org/10.5117/CCR2023.2.6.MATE

While large language models have revolutionised computational text analysis methods, the field is still tilted towards English language resources. Even as there are pre-trained models for some "smaller" languages, the coverage is far from universal, and pre-training large language models is an expensive and complicated task. This uneven language coverage limits comparative social research in terms of its geographical and linguistic scope. We propose a solution that sidesteps these issues by leveraging transfer learning and open-source machine translation. We use English as a bridge language between Hungarian and Polish bills and laws to solve a classification task related to the Comparative Agendas Project (CAP) coding scheme. Using the Hungarian corpus as training data for model fine-tuning, we categorise the Polish laws into 20 CAP categories. In doing so, we compare the performance of Transformer-based deep learning models (monolinguals, such as BERT, and multilinguals, such as XLM-RoBERTa) and machine learning algorithms (e.g., SVM). Results show that the fine-tuned large language models outperform the traditional supervised learning benchmarks but are themselves surpassed by the machine translation approach. Overall, the proposed solution demonstrates a viable option for applying a transfer learning framework for low-resource languages and achieving state-of-the-art results without requiring expensive pre-training.
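The supervised learning benchmark mentioned above (e.g., SVM) can be illustrated with a minimal sketch: a TF-IDF representation of machine-translated bill titles fed to a linear SVM that predicts CAP major topic codes. The toy corpus, labels, and example titles below are hypothetical illustrations, not data from the study.

```python
# Minimal sketch of a TF-IDF + linear SVM topic classifier, in the spirit of
# the paper's supervised baseline. The texts and CAP-style labels here are
# invented for illustration (1 = Macroeconomics, 7 = Environment).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical English machine translations of bill titles with CAP labels.
train_texts = [
    "act on the central budget of the state",
    "amendment to the personal income tax act",
    "act on the protection of national parks",
    "regulation on air quality and emissions",
]
train_labels = [1, 1, 7, 7]

# TF-IDF turns each title into a sparse term-weight vector; the linear SVM
# learns a separating hyperplane between the topic classes.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)

# Classify an unseen (hypothetical) translated title.
pred = clf.predict(["act on corporate tax rates"])
print(pred[0])
```

In the actual workflow, the training texts would be the machine-translated Hungarian corpus and the test texts the machine-translated Polish laws, with all 20 CAP major topic categories as the label set.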

Keywords: Machine learning, Deep learning, Natural language processing, Classification, Policy topics, Comparative Agendas Project
