Evaluating the potential of language-family-specific generative models for low resource data augmentation: a Faroese case study

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We investigate GPT-Sw3, a generative language model for the Nordic languages, to assess its understanding of low-resource Faroese. Our aim is to demonstrate the advantages of using language-family-specific generative models to augment data for related languages with fewer resources. We evaluate GPT-Sw3 by prompting it for Faroese-to-English translation in zero-, one-, and few-shot settings. We assess these translations with an ensemble score, the arithmetic mean of the BLEU score and a semantic similarity score (SBERT). Moreover, we challenge the model’s Faroese language understanding on a small, curated dataset of Faroese trick sentences. There, we compare the model’s performance with OpenAI’s GPT-3.5 and GPT-4, demonstrating the advantages of using a language-family-specific generative model for navigating non-trivial scenarios. We evaluate the resulting pipeline and use it, as a proof of concept, to create an automatically annotated Faroese semantic textual similarity (STS) dataset.
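
As a rough illustration of the ensemble score described above, the following Python sketch averages a BLEU score with a semantic similarity score. The function names, the rescaling of both scores to a 0-1 range, and the use of cosine similarity over sentence embeddings are assumptions made for illustration; the abstract does not specify these details, and this is not the authors' implementation.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    Assumption: the semantic similarity is the cosine of two sentence
    embeddings (for example, vectors produced by an SBERT-style encoder).
    """
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as "no similarity".
        return 0.0
    return dot / (norm_u * norm_v)


def ensemble_score(bleu: float, similarity: float) -> float:
    """Arithmetic mean of a BLEU score and a semantic similarity score.

    Assumption: both inputs are already on a 0-1 scale, so a BLEU value
    reported on a 0-100 scale should be divided by 100 first.
    """
    return (bleu + similarity) / 2.0
```

In this sketch, a translation with a rescaled BLEU of 0.25 and a similarity of 0.75 receives an ensemble score of 0.5.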
Original language: English
Title of host publication: TBD
Publication status: Accepted/In press - 2024
Event: LREC-COLING 2024 - Torino, Italy
Duration: 20 May 2024 to 25 May 2024
https://lrec-coling-2024.org/

Conference

Conference: LREC-COLING 2024
Country/Territory: Italy
City: Torino
Period: 20/05/24 to 25/05/24

Keywords

  • Semantic Textual Similarity
  • Low-resource language
  • Machine translation
  • Data augmentation
