TY - GEN
T1 - Automatic generation of composite image descriptions
AU - Liu, Chang
AU - Shmilovici, Armin
AU - Last, Mark
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2018/6/21
Y1 - 2018/6/21
N2 - Automatic generation of natural language descriptions for images has recently become an important research topic. In this paper, we propose a frame-based algorithm for generating a composite natural language description for a given image. The goal of this algorithm is to describe not only the objects appearing in the image but also the main activities happening in the image and the objects participating in those activities. The algorithm builds upon a pre-trained Conditional Random Field (CRF)-based structured prediction model, which generates a set of alternative frames for a given image. We use imSitu, a situation recognition dataset with 126,102 images, 504 activities, 11,538 objects, and 1,788 roles, as a test bed for our algorithm. We ask human evaluators to rate the quality of the descriptions for 20 images from the imSitu dataset. The results demonstrate that our composite descriptions contain on average 16% more visual elements than those of the baseline method and receive a significantly higher accuracy score from the human evaluators.
AB - Automatic generation of natural language descriptions for images has recently become an important research topic. In this paper, we propose a frame-based algorithm for generating a composite natural language description for a given image. The goal of this algorithm is to describe not only the objects appearing in the image but also the main activities happening in the image and the objects participating in those activities. The algorithm builds upon a pre-trained Conditional Random Field (CRF)-based structured prediction model, which generates a set of alternative frames for a given image. We use imSitu, a situation recognition dataset with 126,102 images, 504 activities, 11,538 objects, and 1,788 roles, as a test bed for our algorithm. We ask human evaluators to rate the quality of the descriptions for 20 images from the imSitu dataset. The results demonstrate that our composite descriptions contain on average 16% more visual elements than those of the baseline method and receive a significantly higher accuracy score from the human evaluators.
KW - composite image descriptions
KW - frames
KW - natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85050204485&partnerID=8YFLogxK
U2 - 10.1109/FSKD.2017.8393188
DO - 10.1109/FSKD.2017.8393188
M3 - Conference contribution
AN - SCOPUS:85050204485
T3 - ICNC-FSKD 2017 - 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery
SP - 2612
EP - 2618
BT - ICNC-FSKD 2017 - 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery
A2 - Zhao, Liang
A2 - Wang, Lipo
A2 - Cai, Guoyong
A2 - Li, Kenli
A2 - Liu, Yong
A2 - Xiao, Guoqing
PB - Institute of Electrical and Electronics Engineers
T2 - 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2017
Y2 - 29 July 2017 through 31 July 2017
ER -