UD Tree Alignment

May 2024

An automated appraoch to translation aligment using syntax trees

This project started as a class final and evolved into a section of my master’s thesis. During my work at the Perseus Project, I’d come up against the limits of automated translation alignment models for low-resource languages. Most state-of-the-art alignment models demanded large parallel training copora between the languages being aligned, something that simply doesn’t exist for many low-resource language pairs. At the same time, coverage of some of those languages was expanding for Universal Dependencies (UD) parsing models.

The central thesis of UD is that there is universal parallelism between all human languages, which can be modelled using a finite set of tags and dependency relationships. So I figured, if we took that assumption at face value, we would expect the dependency trees for the same sentence in two languages to have some amount of structural parallelism, given a fairly literal translation. They wouldn’t be identical, of course, but if the UD assumption were true, there ought to be something that could be exploited to map tokens from one language onto the other. This approach was both a means of testing the UD hypothesis, and also, if it worked, a potentially new method to support the automated alignment of low-resource language pairs, since UD parsers can be trained on monolingual data sets.

My algorithm was ultimately fairly simple. I created two trees from parallel sentences then recursively mapped the nodes in each tree based on topographical similarity. I then tested the algorithm on a gold-standard aligned English translation of the first book of the Iliad. You can read all about my findings in the second chapter of my thesis and examine the code here

Topics

Syntax Trees
Translation Alignment
Low Resource Languages

Technologies Used

Stanza
NLTK
NetworkX

Challenges & Learnings

While the version of the algorithm I tested wasn’t accurate enough to be really useful, it did match the performance of the state-of-the-art language-agnostic model, despite being much less complex. This finding suggests that a more sophisticated iteration, perhaps involving a more nuanced measure of node similarity could potentially improve upon its results. Additionally, that fact that it worked at all suggests that some amount of structural parallelism was exploitable in the UD trees, though admittedly further testing should be done on language pairs more distantly related then Ancient Greek and English.

← Back to Portfolio