Pipeline for Enriching Digital Arabic

September 2023

A proposed architecture linking together advances in Arabic NLP

This work grew directly out of my earlier effort to produce an enriched digital edition of the Autobiography of Omar ibn Said. During that project, I realized that several key tool advances from the previous couple of years had brought us very close to automating the production of all the annotations I had painstakingly produced by hand. At the time, though, no one had strung those tools together into a single pipeline. The motivation was the massive online repositories of digital Arabic generated by the OpenITI and KITAB projects. Those collections could serve as the backbone of a huge expansion of the Perseus Project into Arabic sources, but they are too large for the kinds of annotations we wanted to support to be produced manually.

I presented a poster on this project at the Tufts Graduate Student Symposium. You can check that out here.

Topics

  • Digital Humanities
  • Digital Libraries
  • Arabic NLP

Technologies Used

  • Kraken
  • ArSummarizer
  • CAMeL Tools
  • CamelParser
  • Stanza
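
To give a rough sense of what "stringing the tools together" means, here is a minimal Python sketch of a staged pipeline. Everything in it is a hypothetical stand-in: the `Document` class, the stage names, and the placeholder logic are my own illustration and do not call the real APIs of Kraken, CAMeL Tools, or any of the other tools listed above. In a real version, each placeholder would presumably wrap one of those tools.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Document:
    """A text moving through the pipeline, accumulating annotations."""
    source: str                      # raw input (assumed here to already be text)
    text: str = ""
    annotations: Dict[str, Any] = field(default_factory=dict)


# A stage takes a document and returns it with more information attached.
Stage = Callable[[Document], Document]


def transcribe(doc: Document) -> Document:
    # Hypothetical placeholder for a transcription/OCR step: copies the source.
    doc.text = doc.source
    return doc


def tokenize(doc: Document) -> Document:
    # Hypothetical placeholder for tokenization: whitespace split only.
    doc.annotations["tokens"] = doc.text.split()
    return doc


def summarize(doc: Document) -> Document:
    # Hypothetical placeholder for summarization: keeps the first three tokens.
    doc.annotations["summary"] = " ".join(doc.annotations["tokens"][:3])
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    # Apply each stage in order, passing the enriched document along.
    for stage in stages:
        doc = stage(doc)
    return doc


sample = Document(source="ذهب الولد إلى المدرسة اليوم")
result = run_pipeline(sample, [transcribe, tokenize, summarize])
print(result.annotations["tokens"])
```

The point of the design is that each stage only reads and writes the shared `annotations` dictionary, so a stage can be swapped out for a better tool without touching the others.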

Challenges & Learnings

My big realization from this project was that there still wasn’t a good model for the translation alignment task, which in turn inspired some of my thesis work.