Ottoman Turkish and AI Applications: The example of osmanlica.com
The sheer size of the Ottoman Turkish archives (archival documents, manuscripts, printed books, and periodicals) makes it important to be able to read and access texts in this language, to make them searchable with Optical Character Recognition (OCR), and to transcribe them into Latin script. Moreover, the accessibility and intelligibility of Ottoman Turkish texts concern researchers not only in Turkish literature and history but also in many other disciplines, working with different themes and methods.
With the growth of interdisciplinary studies and the acceleration of Digital Humanities, the methods and tools for reading and processing Ottoman Turkish texts have also multiplied. For example, Optical Character Recognition (OCR), the most familiar of these methods, enables computers to process Ottoman texts. Even apart from advanced processing and analysis techniques, making the enormous Ottoman archives searchable will significantly facilitate and accelerate research on the Ottoman Turkish corpus, both in the time required and in the number of documents that can be examined. Commercial ventures such as Muteferriqa and Wikilala provide an important service in this field by using OCR technology to make texts searchable, especially printed Ottoman texts. These applications, which we can call digital libraries, allow word-based searches within the printed Ottoman corpus, and they are constantly expanding their databases.
Osmanlica.com has recently joined these commercial applications that promise to make Ottoman texts readable, searchable, and processable by using OCR and AI. The application has been developed by Assoc. Prof. Atakan Kurt of the Engineering Faculty of Istanbul University and his Ph.D. student İshak Dölek. It promises not only OCR but also the transcription of Ottoman Turkish documents into Latin script and their translation into Modern Turkish. The application is an output of the project "AI Supported End-to-End Transfer from Ottoman Turkish to Contemporary Turkish", supported by TUBITAK. It should be noted, however, that the application has not yet reached its final goal and that development is still in progress.
Whereas other applications build their own digital libraries and enable searching within them, osmanlica.com has developed an infrastructure that lets users upload their own documents, run OCR on them, and Latinize them.
The first and essential step is to convert image files containing Ottoman text into text files with OCR. Once an image file is uploaded to the system via the website, researchers receive editable text output. A virtual keyboard is also provided to make it easier to edit the text in Arabic script. The next step is to Latinize this Arabic-script text, that is, to transcribe it. Because the result can be displayed side by side with the original text, researchers can easily edit the text transcribed into Latin script. A distinctive feature of the application is that it lets users see and select alternative transcriptions. The Language Translation option, which is not yet available to users, is intended to translate the text into contemporary Turkish. When the application successfully completes all these stages, printed Ottoman Turkish texts will become accessible to researchers who are not experts in Ottoman Turkish.
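The Latinization step and the choice among alternative readings can be illustrated with a small sketch. It is not osmanlica.com's implementation: the names `Token`, `transliterate`, `select_alternative`, and `render` are hypothetical, the lexicon is toy data, and the OCR step is omitted, so the input is assumed to be already-recognized Arabic-script text. The sketch only shows why alternatives matter: since Ottoman spelling leaves most vowels unwritten, one written form can correspond to several Latin-script words.

```python
from dataclasses import dataclass
from typing import Dict, List

# Toy lexicon (illustrative data only, not taken from osmanlica.com):
# one Arabic-script spelling can have several Latin-script readings.
LEXICON: Dict[str, List[str]] = {
    "كل": ["gel", "gül", "kel", "kül"],
    "اون": ["on", "un", "ün"],
    "كتاب": ["kitab"],
}


@dataclass
class Token:
    source: str            # word as recognized in Arabic script
    candidates: List[str]  # possible Latin-script readings (order is arbitrary here)
    chosen: int = 0        # index of the reading currently selected

    @property
    def text(self) -> str:
        return self.candidates[self.chosen]


def transliterate(ocr_text: str) -> List[Token]:
    """Map each whitespace-separated word to its candidate readings."""
    tokens = []
    for word in ocr_text.split():
        # Words missing from the lexicon are kept unchanged for manual editing.
        tokens.append(Token(word, LEXICON.get(word, [word])))
    return tokens


def select_alternative(token: Token, index: int) -> None:
    """Let the editor pick a different reading for one token."""
    if not 0 <= index < len(token.candidates):
        raise IndexError("no such alternative")
    token.chosen = index


def render(tokens: List[Token]) -> str:
    """Join the currently selected readings into Latin-script text."""
    return " ".join(token.text for token in tokens)
```

In this sketch, a researcher who sees the first candidate for a word and recognizes from context that another reading is meant would call `select_alternative` on that token and render the text again. A real system would presumably rank candidates using context rather than a fixed list.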
While digitization services for printed Ottoman texts are increasing rapidly, it is also necessary to discuss how valuable these initiatives are for researchers. The success and accuracy rates of these applications and databases, which have been prepared with considerable effort, should be shared with researchers transparently. For example, the accuracy of the list of results returned for a searched word is vital to the quality of an academic study. For this reason, alongside their infrastructure and technical aspects, these initiatives should be offered to scholars with a transparent policy.
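One common way to report such an accuracy figure is the character error rate (CER): the edit distance between the system output and a manually verified reference, divided by the length of the reference. The sketch below is a minimal illustration of that metric, not a description of how any of the services mentioned here evaluate themselves; the function names are chosen for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep when equal)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For instance, comparing the reference "kitab" with the output "kitap" yields one substitution over five characters, a CER of 0.2. Publishing figures of this kind for a representative sample of documents would let researchers judge how far search results can be trusted.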