Abstract
In this research, we apply deep-learning techniques to Hebrew paleography to automatically classify and process medieval Hebrew manuscripts. Our work is based on contemporary Hebrew paleography (Malachi Beit-Arié, Colette Sirat, Norman Golb, Ada Yardeni, Benjamin Richler), which recognizes fifteen subtypes of medieval Hebrew script. Automatic recognition of these scripts makes it possible to determine the approximate origin and date of writing of undated, fragmentary, and damaged manuscripts. To train the deep neural network, we compiled the Visual Media Lab – Hebrew Paleography (VML-HP) dataset, which contains 537 high-resolution manuscript page images. The images were hand-picked from the SfarData (http://sfardata.nli.org.il/) dataset; in rare cases, we also included pages from other manuscript collections. For testing the model, we define typical and blind test sets. The typical test set consists of unseen pages of the manuscripts used in training. The blind test set, by contrast, consists of pages from unseen manuscripts and thus provides a real-life scenario. To train the model, we used patches extracted from the document pages. To filter out irrelevant patches (empty patches or patches that contain decorations), we developed a clean patch generation algorithm that generates patches containing pure text regions (for the VML-HP dataset, we generated 150K training patches). In all experiments, we trained the network on the training set and tested it on both the typical and blind test sets. The training objective was cross-entropy loss, minimized using the Adam optimizer. Training continued until the validation loss showed no improvement for five epochs (the patience). The model with the lowest validation loss was used for testing.
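The typical/blind split and the patience-based stopping rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(manuscript_id, page_id)` representation, the `typical_fraction` value, and the choice of which manuscripts are held out as blind are all assumptions made for illustration. Only the patience of five epochs and the selection of the lowest validation loss come from the text.

```python
from collections import defaultdict
import random


def split_typical_blind(pages, blind_manuscripts, typical_fraction=0.2, seed=0):
    """Split (manuscript_id, page_id) pairs into train, typical and blind sets.

    blind_manuscripts: manuscripts held out entirely (hypothetical choice).
    typical_fraction: share of pages held out from each training manuscript
    (an assumed value, not taken from the source).
    """
    rng = random.Random(seed)
    by_manuscript = defaultdict(list)
    for manuscript_id, page_id in pages:
        by_manuscript[manuscript_id].append(page_id)

    train, typical, blind = [], [], []
    for manuscript_id in sorted(by_manuscript):
        page_ids = sorted(by_manuscript[manuscript_id])
        if manuscript_id in blind_manuscripts:
            # Blind set: every page of a manuscript never seen in training.
            blind.extend((manuscript_id, p) for p in page_ids)
            continue
        rng.shuffle(page_ids)
        # Hold out at least one page when possible, always leaving one to train on.
        n_test = min(len(page_ids) - 1, max(1, round(len(page_ids) * typical_fraction)))
        # Typical set: unseen pages of manuscripts that are used in training.
        typical.extend((manuscript_id, p) for p in page_ids[:n_test])
        train.extend((manuscript_id, p) for p in page_ids[n_test:])
    return train, typical, blind


def best_epoch_with_patience(val_losses, patience=5):
    """Return (best_epoch, stop_epoch) for early stopping on validation loss.

    Training stops once `patience` consecutive epochs pass without a new
    minimum; the epoch with the lowest validation loss is the one kept.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1
```

In a real pipeline the split would be applied to pages before patch extraction, so that patches from a blind manuscript never reach the training set, and `best_epoch_with_patience` would be replaced by checkpointing inside the training loop.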
Original language | English |
---|---|
Title of host publication | Jewish Studies in the Digital Age |
Publisher | De Gruyter Oldenbourg |
Pages | 349-362 |
Number of pages | 14 |
ISBN (Electronic) | 978-3-11-074482-8, 978-3-11-074488-0 |
ISBN (Print) | 9783110744699 |
State | Published - 2022 |