Natural Language Processing in Ottoman Turkish: Opportunities and Drawbacks
Ahmet Yusuf Yüksek
New York University
Natural Language Processing (NLP) has gained increasing popularity thanks to its growing presence in our daily lives, from speech recognition (e.g., Siri) to machine translation (e.g., Google Translate). In humanities disciplines, scholars, especially linguists, have been applying various methods to conduct large-scale text analysis for different purposes such as topic modeling, text mining, and sentiment analysis. As researchers develop and improve NLP techniques and methods, our ability to analyze texts at scale has been dramatically transformed since the late 1990s. Despite the increasing role of NLP tools in humanities research, computational textual analysis is still primarily limited to Western languages, especially English and German. This post will briefly discuss the possibilities and drawbacks of NLP for Ottoman Turkish and offer a few suggestions for further projects and studies applying NLP to Ottoman Turkish texts.
Digital Humanities is still a fledgling field in Ottoman Studies, consisting of a small community of scholars despite recent growing attention. Tools like Lexiqamus (an online tool for deciphering illegible words in Ottoman Turkish) help make research in Ottoman Turkish faster, easier, and more accessible.
Considering the massive corpora of documents in Ottoman archives and manuscript libraries, one would expect more DH-based research on these documents using computational text analysis methods. And indeed, wide-ranging investment in NLP tools and techniques for Ottoman Turkish would save labor, time, and money in the long run and drastically change the nature of historical and linguistic research in Ottoman studies by enabling large-scale textual analysis.
However, two significant issues can cause major drawbacks for NLP in Ottoman Turkish.
The first difficulty is related to the language itself. (Ottoman) Turkish is an agglutinative language, meaning suffixes play a vital role in phrasal construction and word inflection. Therefore, additional procedures such as stemming and lemmatization (processes that reduce inflected words to their stem or dictionary form) are required before conducting large-scale computational textual analysis. Despite the growing attention to NLP for Modern Turkish and the increasing number of available open-source tools and libraries, current NLP technology is not compatible with Ottoman Turkish documents. Additionally, the grammatical and syntactic differences between Turkish and Western languages make it even more challenging to apply widely used NLP methods such as TF-IDF (Term Frequency-Inverse Document Frequency) and topic modeling to Ottoman Turkish texts. Therefore, a significant technology investment is needed to develop tools that work properly with Ottoman Turkish.
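To see why stemming matters so much for an agglutinative language, consider the following toy sketch in Python. It uses Modern Turkish inflections of "kitap" (book) purely for illustration; the suffix list and stripping rule are hypothetical simplifications, not a real Ottoman Turkish stemmer, and the example deliberately leaves the consonant alternation p → b (kitabı) unresolved to show that naive suffix stripping is not enough.

```python
from collections import Counter

# Without stemming, inflected forms of the same word are counted
# as distinct terms, diluting frequency-based measures like TF-IDF.
tokens = ["kitap", "kitaplar", "kitaplarda", "kitabı", "kitap"]
print(Counter(tokens))  # five surface forms, four distinct "terms"

# A naive suffix-stripping rule (purely illustrative -- a real stemmer
# must also handle vowel harmony, consonant alternation, suffix chains, etc.)
SUFFIXES = ["larda", "lar", "ı"]

def naive_stem(word: str) -> str:
    # Strip the longest matching suffix, once.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

stems = [naive_stem(t) for t in tokens]
print(Counter(stems))
# Counter({'kitap': 4, 'kitab': 1}) -- most forms now merge, but
# "kitabı" still yields "kitab" because of the p -> b alternation.
```

Even this tiny example shows both the payoff of stemming (four counts collapse into one term) and the residual morphological problems that make building a proper stemmer for (Ottoman) Turkish a substantial research task in its own right.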
There is another issue that, in my opinion, causes an even more significant drawback: how scholars generally handle texts in Ottoman Turkish. It is common practice among Ottomanists to transliterate and transcribe Ottoman Turkish documents from Arabic script into the Latin alphabet. However, there is no commonly used standard transliteration scheme: words and characters are rendered according to how they sound in Modern Turkish rather than how they are written in the original script, which produces inconsistent transliterations even for the same word.
For instance, the word "ايوب" can be transliterated as Eyüp, Eyüb, Eyyüb, Eyub, etc. A human reader can recognize the similarities among these variations, but unlike humans, computers treat each of these versions as a distinct word. Similarly, the letters "s, h, k, t, n" in Turkish each have more than one corresponding character in Ottoman Turkish, leading to inconsistency in any computational textual analysis. Therefore, I firmly believe that as long as the scholarship primarily relies on transliterated documents, any NLP-based project in Ottoman Turkish is bound to produce incorrect and inconsistent results.
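The problem can be made concrete in a few lines of Python. The variant list comes from the example above; the lookup-table workaround is a hypothetical sketch, and hand-curating such a table is only feasible at a very small scale, which is precisely why relying on the original Arabic script is preferable for computational work.

```python
# A computer compares strings exactly: each transliteration variant of
# the Arabic-script word "ايوب" counts as a different token.
variants = ["Eyüp", "Eyüb", "Eyyüb", "Eyub"]
print(len(set(variants)))  # 4 -- four "different" words to the machine

# Hypothetical workaround: map known variants back to a canonical form
# (here, the original Arabic-script spelling) via a hand-curated table.
CANONICAL = {
    "Eyüp": "ايوب",
    "Eyüb": "ايوب",
    "Eyyüb": "ايوب",
    "Eyub": "ايوب",
}

normalized = {CANONICAL.get(v, v) for v in variants}
print(len(normalized))  # 1 -- all variants collapse to one form
```

A lookup table like this must be built by hand for every ambiguous word, which does not scale to millions of archival documents; working from the original script avoids the problem at the source.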
Regarding these two major issues, I offer the following short suggestions for future work:
The Directorate of Ottoman Archives in Istanbul alone holds around 100 million documents. Investing in computational textual methods will contribute to Ottoman studies in the long run and revolutionize academic research.
The first task should be building a scholarly network among historians, archivists, and computer scientists to develop technology compatible with these documents and the language.
In the short run, an open-source stemmer for Ottoman Turkish is essential for performing any meaningful textual analysis.
The scholarship should rely more on Ottoman Turkish documents in their original written form. Even though transliteration makes Ottoman Turkish documents more legible and accessible to Turkish speakers, it introduces flaws into computer-based research. Therefore, OCR (optical character recognition) and HTR (handwritten text recognition) tools for Ottoman Turkish are essential for preparing these texts for computational processing.