Pipeline for Enriching Digital Arabic
September 2023
A proposed architecture linking recent advances in Arabic NLP
This work grew directly out of my earlier effort to produce an enriched digital edition of the Autobiography of Omar ibn Said. During that project, I realized that several key tools released in the previous couple of years had brought us very close to automating all of the annotations I had painstakingly produced by hand. At the time, though, no one had strung them together into a single pipeline. The motivation was the massive online repositories of digital Arabic generated by the OpenITI and KITAB projects. Those collections could serve as the backbone of a huge expansion of the Perseus Project into Arabic sources, but they are too large for the kinds of annotations we wanted to support to be produced manually.
I presented a poster on this project at the Tufts Graduate Student Symposium. You can check that out here.
Topics
- Digital Humanities
- Digital Libraries
- Arabic NLP
Technologies Used
- Kraken
- ArSummarizer
- CAMeL Tools
- CamelParser
- Stanza
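The idea of stringing separate tools into a single pipeline can be illustrated with a minimal sketch. It is not the project's actual code: the `Document` class, `run_pipeline`, and both placeholder stages are hypothetical names introduced here. In a real pipeline, each stage would presumably wrap one of the tools listed above (for example, OCR with Kraken or morphological analysis with CAMeL Tools), and the stage order would depend on what each tool needs as input.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """A text plus the annotation layers accumulated so far (hypothetical type)."""
    text: str
    layers: Dict[str, object] = field(default_factory=dict)


# A stage takes a document, adds one annotation layer, and returns the document.
Stage = Callable[[Document], Document]


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order, so later stages can read earlier layers."""
    for stage in stages:
        doc = stage(doc)
    return doc


# Placeholder stages: stand-ins for real tools, using only whitespace splitting.
def tokenize(doc: Document) -> Document:
    doc.layers["tokens"] = doc.text.split()
    return doc


def count_tokens(doc: Document) -> Document:
    doc.layers["token_count"] = len(doc.layers["tokens"])
    return doc


enriched = run_pipeline(Document("بسم الله الرحمن الرحيم"), [tokenize, count_tokens])
print(enriched.layers)
```

Keeping every stage behind the same function signature means a tool can be swapped or reordered without touching the rest of the pipeline.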
Challenges & Learnings
My big realization from this project was that there still wasn’t a good model for the translation alignment task, which in turn inspired some of my thesis work.