TY - GEN
T1 - Linguistic Knowledge Within Handwritten Text Recognition Models
T2 - 17th International Conference on Document Analysis and Recognition, ICDAR 2023
AU - Londner, Samuel
AU - Phillips, Yoav
AU - Miller, Hadar
AU - Dershowitz, Nachum
AU - Kuflik, Tsvi
AU - Lavee, Moshe
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023/1/1
Y1 - 2023/1/1
N2 - State-of-the-art handwritten text recognition models make frequent use of deep neural networks, with recurrent and connectionist temporal classification layers, which perform recognition over sequences of characters. This architecture may lead to the model learning statistical linguistic features of the training corpus, over and above graphic features. This in turn could lead to degraded performance if the evaluation dataset language differs from the training corpus language. We present a fundamental study aiming to understand the inner workings of OCR models and to further our understanding of the use of RNNs as decoders. We examine a real-world example of two graphically similar medieval documents in different languages: rabbinical Hebrew and Judeo-Arabic. We analyze, computationally and linguistically, the cross-language performance of the models over these documents, so as to gain some insight into the implicit language knowledge the models may have acquired. We find that the implicit language model impacts the final word error by around 10%. A combined qualitative and quantitative analysis allows us to isolate manifest linguistic hallucinations. However, we show that leveraging a pretrained (Hebrew, in our case) model allows one to boost the OCR accuracy for a resource-scarce language (such as Judeo-Arabic). All our data, code, and models are openly available at https://github.com/anutkk/ilmja.
AB - State-of-the-art handwritten text recognition models make frequent use of deep neural networks, with recurrent and connectionist temporal classification layers, which perform recognition over sequences of characters. This architecture may lead to the model learning statistical linguistic features of the training corpus, over and above graphic features. This in turn could lead to degraded performance if the evaluation dataset language differs from the training corpus language. We present a fundamental study aiming to understand the inner workings of OCR models and to further our understanding of the use of RNNs as decoders. We examine a real-world example of two graphically similar medieval documents in different languages: rabbinical Hebrew and Judeo-Arabic. We analyze, computationally and linguistically, the cross-language performance of the models over these documents, so as to gain some insight into the implicit language knowledge the models may have acquired. We find that the implicit language model impacts the final word error by around 10%. A combined qualitative and quantitative analysis allows us to isolate manifest linguistic hallucinations. However, we show that leveraging a pretrained (Hebrew, in our case) model allows one to boost the OCR accuracy for a resource-scarce language (such as Judeo-Arabic). All our data, code, and models are openly available at https://github.com/anutkk/ilmja.
KW - Handwritten text recognition
KW - Hebrew manuscripts
KW - Language model
KW - Optical character recognition
KW - Transfer learning
UR - http://www.scopus.com/inward/record.url?scp=85173582840&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-41685-9_10
DO - 10.1007/978-3-031-41685-9_10
M3 - Conference contribution
AN - SCOPUS:85173582840
SN - 9783031416842
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 147
EP - 164
BT - Document Analysis and Recognition – ICDAR 2023 - 17th International Conference, Proceedings
A2 - Fink, Gernot A.
A2 - Jain, Rajiv
A2 - Kise, Koichi
A2 - Zanibbi, Richard
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 21 August 2023 through 26 August 2023
ER -