TY - GEN
T1 - The HHD Dataset
AU - Rabaev, Irina
AU - Kurar Barakat, Berat
AU - Churkin, Alexander
AU - El-Sana, Jihad
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/9/1
Y1 - 2020/9/1
AB - Benchmark datasets are important in the document image processing field, as they allow researchers to analyze different approaches and compare their performance in a fair manner. Benchmark datasets exist for several alphabets, such as Latin, Arabic and Chinese, but not for the Hebrew alphabet. In this paper, a handwritten Hebrew dataset, HHD, is introduced. The HHD dataset is collected from hand-filled forms and is accompanied by ground truth at the character, word and text line levels. Presently, the dataset contains around 1000 document images, and we continue to enlarge it. To the best of our knowledge, this is the first comprehensive corpus of Hebrew handwritten documents, and we believe it will help advance Hebrew document processing and document processing in general. The dataset can be useful for various research applications, such as word spotting, word recognition, text line alignment, and writer identification. The initial small subset of the HHD for character classification can be downloaded from https://www.cs.bgu.ac.illr-vberatldatalhhd-dataset.zip together with the training and test set subdivisions. We also provide baseline results for character classification on this initial subset. In the near future, the full HHD dataset will be made freely available to the research community.
KW - Ground truth
KW - Handwritten document image dataset
KW - Hebrew handwritten documents
UR - http://www.scopus.com/inward/record.url?scp=85097790693&partnerID=8YFLogxK
U2 - 10.1109/ICFHR2020.2020.00050
DO - 10.1109/ICFHR2020.2020.00050
M3 - Conference contribution
AN - SCOPUS:85097790693
T3 - Proceedings of International Conference on Frontiers in Handwriting Recognition, ICFHR
SP - 228
EP - 233
BT - Proceedings - 2020 17th International Conference on Frontiers in Handwriting Recognition, ICFHR 2020
PB - Institute of Electrical and Electronics Engineers
T2 - 17th International Conference on Frontiers in Handwriting Recognition, ICFHR 2020
Y2 - 7 September 2020 through 10 September 2020
ER -