TY - GEN
T1 - The Pinkas dataset
AU - Barakat, Berat Kurar
AU - El-Sana, Jihad
AU - Rabaev, Irina
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/9/1
Y1 - 2019/9/1
AB - In historical document image processing, datasets are a significant part of any research effort. They are crucial for the diversity and abundance of experimental results, which contribute to the development of new algorithms that meet new challenges, and they are important for benchmarking processing algorithms. Numerous publicly available document image datasets in different languages have emerged. However, current segmentation and recognition performance is nearly saturated on the publicly available datasets, and collecting and labelling new historical document images is a burden on researchers in the field. This paper introduces a public historical document image dataset, the Pinkas dataset, which poses new challenges that open room for improvement and help identify the strengths and weaknesses of available processing algorithms. It is the first dataset of medieval handwritten Hebrew and is fully labeled at the word, line and page levels by an expert in historical Hebrew manuscripts. The Pinkas dataset contributes to the diversity of benchmarking standards. In this paper we present meta features of the Pinkas dataset and apply recent word spotting algorithms to analyze the room for improvement in terms of performance.
KW - Handwritten dataset
KW - Handwritten Hebrew dataset
KW - Historical document image analysis
UR - http://www.scopus.com/inward/record.url?scp=85079857621&partnerID=8YFLogxK
U2 - 10.1109/ICDAR.2019.00122
DO - 10.1109/ICDAR.2019.00122
M3 - Conference contribution
AN - SCOPUS:85079857621
T3 - Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
SP - 732
EP - 737
BT - Proceedings - 15th IAPR International Conference on Document Analysis and Recognition, ICDAR 2019
PB - Institute of Electrical and Electronics Engineers
T2 - 15th IAPR International Conference on Document Analysis and Recognition, ICDAR 2019
Y2 - 20 September 2019 through 25 September 2019
ER -