The PaCorES collection comprises three parallel bilingual bidirectional corpora: Spanish/German, Spanish/English, and Spanish/Chinese (1). The core corpora of the collection consist of literary texts from the late 20th and early 21st centuries, which were manually selected and sentence-aligned with their corresponding translations.
The fundamental step in creating parallel corpora is alignment. Sentence alignment is the task of finding correspondences between source sentences and their equivalent translations in the target text.
Recent advances in automatic alignment tools, including neural network-based methods, have achieved accuracy levels between 90% and 95% for closely related languages like German or English. However, these methods are primarily optimized for non-literary texts, and their accuracy declines significantly with literary texts, necessitating manual revision.
The challenges of sentence alignment are especially pronounced in the Spanish/Chinese language pair due to significant structural and linguistic differences.
This paper describes methods aimed at minimizing the need for subsequent manual revision. To address frequent misalignments caused by improper segmentation, we developed a Python script (2) tailored to the specific linguistic characteristics of each language. We evaluated three well-known tools for sentence alignment: LF-Aligner (Hunalign), Vecalign, and Bertalign (3). Aligning bilingual literary texts poses unique challenges, since most of the translation is interpretative and not based on 1-to-1 mappings between source and target sentences. Existing alignment methods have difficulty coping with the 1-to-many and many-to-many alignments that are common in literary texts.
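To illustrate the kind of language-specific segmentation mentioned above, the following is a minimal sketch and not the PaCorES splitter script itself. The function name `split_sentences`, the `TERMINATORS` table, and the `CLOSERS` set are hypothetical choices made for this example. It assumes that Chinese sentences end at full-width punctuation with no following space, while Spanish, English, and German sentences end at a terminator followed by whitespace.

```python
import re

# Hypothetical terminator sets for illustration only; the actual
# script may apply different, more detailed rules per language.
TERMINATORS = {
    "es": ".!?…",
    "en": ".!?…",
    "de": ".!?…",
    "zh": "。！？…",
}

# Closing quotes and brackets that may directly follow a terminator.
CLOSERS = "\"'»”’)」』）"


def split_sentences(text: str, lang: str) -> list:
    """Split text into sentences using simple per-language punctuation rules."""
    enders = re.escape(TERMINATORS[lang])
    closers = re.escape(CLOSERS)
    # Chinese has no spaces between sentences, so no whitespace is
    # required after the terminator; the other languages require it,
    # which keeps decimals such as "3.5" in one piece.
    lookahead = "" if lang == "zh" else r"(?=\s|$)"
    pattern = re.compile(
        rf".+?[{enders}]+[{closers}]*{lookahead}|.+?$", re.S
    )
    # Strip surrounding whitespace and drop empty fragments.
    return [m.strip() for m in pattern.findall(text) if m.strip()]


if __name__ == "__main__":
    print(split_sentences("¿Dónde estás? No lo sé. Ven aquí.", "es"))
    print(split_sentences("今天下雨。我们不出门！你呢？", "zh"))
```

A rule set this simple still splits after abbreviations such as "Sr." and after mid-sentence ellipses, which is the kind of segmentation error that tends to propagate into misalignments; a production splitter would need per-language exception lists.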
We evaluated the performance of each aligner using standard metrics: precision, recall, and F1 score (the harmonic mean of precision and recall). These metrics were calculated for each of the three language pairs.
References
(1) Spanish/German: www.corpuspages.eu
    Spanish/English: www.corpuspaens.eu
    Spanish/Chinese: www.corpuspaches.eu
(2) https://github.com/michaeljlang/PaCorEs-Splitter
(3) LF-Aligner: https://sourceforge.net/projects/aligner/
    Vecalign: https://github.com/thompsonb/vecalign/
    Bertalign: https://github.com/bfsujason/bertalign/