(1) Robust Scene Text Recognition With Automatic Rectification
RARE
Xiang Bai's group, Huazhong University of Science and Technology
Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, Xiang Bai; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4168-4176
@InProceedings{Shi_2016_CVPR,
author = {Shi, Baoguang and Wang, Xinggang and Lyu, Pengyuan and Yao, Cong and Bai, Xiang},
title = {Robust Scene Text Recognition With Automatic Rectification},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2016}
}
Paper links:
https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Shi_Robust_Scene_Text_CVPR_2016_paper.html
https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Shi_Robust_Scene_Text_CVPR_2016_paper.pdf
Datasets
- IIIT 5K-Words [25] (IIIT5K) contains 3000 cropped word images for testing. The images are collected from the Internet. For each image, there is a 50-word lexicon and a 1000-word lexicon. All lexicons consist of a ground truth word and some randomly picked words.
- Street View Text [35] (SVT) is collected from Google Street View. Its test dataset consists of 647 word images. Many images in SVT are severely corrupted by noise and blur, or have very low resolutions. Each sample is associated with a 50-word lexicon.
- ICDAR 2003 [24] (IC03) contains 860 cropped word images, each associated with a 50-word lexicon defined by Wang et al. [35]. Following [35], we discard images that contain non-alphanumeric characters or have fewer than three characters. Besides, there is a "full lexicon" which contains all lexicon words, and the Hunspell [1] lexicon which has 50k words.
- ICDAR 2013 [20] (IC13) inherits most of its samples from IC03. After filtering samples as done in IC03, the dataset contains 857 samples.
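The lexicons attached to these test sets are typically used to constrain the recognizer's output at test time. The sketch below shows one common, simplified way to do that: replace the raw prediction with the lexicon word at the smallest edit distance. This is an illustrative assumption, not necessarily the scoring used in the paper, and the function names and the toy lexicon are made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def constrain_to_lexicon(prediction: str, lexicon: list) -> str:
    """Return the lexicon word closest to the raw prediction (case-insensitive)."""
    pred = prediction.lower()
    return min(lexicon, key=lambda w: edit_distance(pred, w.lower()))


# Made-up stand-in for a per-image 50-word lexicon.
lexicon = ["coffee", "office", "toffee", "street"]
corrected = constrain_to_lexicon("cofiee", lexicon)
```

With a 50-word lexicon a linear scan like this is cheap; for the 50k-word Hunspell lexicon a more efficient search structure would likely be needed.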
Code: https://github.com/guojm14/TPS-SRN-tensorflow
(2) AON: Towards Arbitrarily-Oriented Text Recognition
Fudan University
Zhanzhan Cheng, Yangliu Xu, Fan Bai, Yi Niu, Shiliang Pu, Shuigeng Zhou; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5571-5579
@InProceedings{Cheng_2018_CVPR,
author = {Cheng, Zhanzhan and Xu, Yangliu and Bai, Fan and Niu, Yi and Pu, Shiliang and Zhou, Shuigeng},
title = {AON: Towards Arbitrarily-Oriented Text Recognition},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2018}
}
Link:
https://openaccess.thecvf.com/content_cvpr_2018/html/Cheng_AON_Towards_Arbitrarily-Oriented_CVPR_2018_paper.html
Datasets
- SVT-Perspective [28] contains 639 cropped images for testing. Images are picked from side-view angle snapshots in Google Street View, therefore one may observe severe perspective distortions. Each image is associated with a 50-word lexicon and a full lexicon.
- CUTE80 (CT80 in short) [29] is collected for evaluating curved text recognition. It contains 288 cropped natural images for testing. No lexicon is associated.
- ICDAR 2015 (IC15 in short) [21] contains 2077 cropped images, of which more than 200 are irregular (arbitrarily-oriented, perspective or curved). No lexicon is associated.
- IIIT5K-Words (IIIT5K in short) [26] is collected from the Internet, containing 3000 cropped word images in its test set. Each image specifies a 50-word lexicon and a 1k-word lexicon, both of which contain the ground truth words as well as other randomly picked words.
- Street View Text (SVT in short) [35] is collected from Google Street View and consists of 647 word images in its test set. Many images are severely corrupted by noise and blur, or have very low resolutions. Each image is associated with a 50-word lexicon.
- ICDAR 2003 (IC03 in short) [24] contains 251 scene images, labeled with text bounding boxes. Each image is associated with a 50-word lexicon defined by Wang et al. [35]. For fair comparison, we discard images that contain non-alphanumeric characters or have less than three characters, following [35]. The resulting dataset contains 867 cropped images. The lexicons include the 50-word lexicons and the full lexicon that combines all lexicon words.
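The filtering rule quoted for IC03 (drop words that contain non-alphanumeric characters or have fewer than three characters) can be sketched as a simple predicate over ground-truth labels. The sample file names and labels below are invented for illustration, and restricting the check to ASCII is an assumption based on these being English benchmarks.

```python
def keep_word(label: str, min_length: int = 3) -> bool:
    """True if the label is ASCII alphanumeric and has at least min_length characters."""
    return len(label) >= min_length and label.isascii() and label.isalnum()


# Hypothetical (image path, ground-truth word) pairs.
samples = [
    ("img_001.png", "Hotel"),
    ("img_002.png", "A&B"),   # contains a non-alphanumeric character
    ("img_003.png", "of"),    # fewer than three characters
    ("img_004.png", "2016"),
]
filtered = [(path, label) for path, label in samples if keep_word(label)]
```

Small differences in how this rule is applied are one plausible reason the notes above report slightly different IC03 test-set sizes across papers.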
Code: https://github.com/huizhang0110/AON
(3) ASTER: An Attentional Scene Text Recognizer with Flexible Rectification
B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao and X. Bai, "ASTER: An Attentional Scene Text Recognizer with Flexible Rectification," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 9, pp. 2035-2048, 1 Sept. 2019, doi: 10.1109/TPAMI.2018.2848939.
Datasets
- Synth90k is the synthetic text dataset proposed in [24]. The dataset contains 9 million images generated from a set of 90k common English words. Words are rendered onto natural images with random transformations and effects. Every image in Synth90k is annotated with a groundtruth word. All of the images in this dataset are taken for training.
- SynthText is the synthetic text dataset proposed in [19]. The generation process is similar to that of [24]. But unlike [24], SynthText is targeted for text detection. Therefore, words are rendered onto full images. We crop the words using the groundtruth word bounding boxes.
- IIIT5k-Words (IIIT5k) [44] contains 3000 test images collected from the web. Each image is associated with a short, 50-word lexicon and a long, 1000-word lexicon. A lexicon consists of the groundtruth word and other random words.
- Street View Text (SVT) [60] is collected from Google Street View. The test set contains 647 images of cropped words. Many images in SVT are severely corrupted by noise, blur, and low resolution. Each image is associated with a 50-word lexicon.
- ICDAR 2003 (IC03) [42] contains 860 images of cropped words after filtering. Following [60], we discard words that contain non-alphanumeric characters or have fewer than three characters. Each image has a 50-word lexicon defined in [60].
- ICDAR 2013 (IC13) [32] inherits most images from IC03 and extends it with new images. The dataset is filtered by removing words that contain non-alphanumeric characters. The dataset contains 1015 images. No lexicon is provided.
- ICDAR 2015 Incidental Text (IC15) is Challenge 4 of the ICDAR 2015 Robust Reading Competition [31]. This challenge features incidental text images, which are taken with Google Glass without careful positioning and focusing. Consequently, the dataset contains a lot of irregular text. Testing images are obtained by cropping the words using the groundtruth word bounding boxes.
- SVT-Perspective (SVTP) is proposed in [49] for evaluating the performance of recognizing perspective text. Images in SVTP are picked from the side-view images in Google Street View. Many of them are heavily distorted by the non-frontal view angle. The dataset consists of 639 cropped images for testing, each with a 50-word lexicon inherited from the SVT dataset.
- CUTE80 (CUTE) is proposed in [51]. The dataset focuses on curved text. It contains 80 high-resolution images taken in natural scenes. CUTE80 is originally proposed for detection tasks. We crop the annotated words and get a test set of 288 images. No lexicon is provided.
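Several of the test sets above (SynthText, IC15, CUTE80) are built by cropping words out of full images using ground-truth annotations. The sketch below illustrates the idea under simplifying assumptions: the image is a plain list of pixel rows, the annotation is a polygon of (x, y) corners, and the crop is the axis-aligned rectangle enclosing it. The function names and the toy data are hypothetical; real pipelines would normally operate on image arrays and may handle rotated boxes differently.

```python
def quad_to_box(points):
    """Axis-aligned box (left, top, right, bottom) enclosing a polygon.

    right and bottom are exclusive, so they can be used directly as slice ends.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1


def crop(image, box):
    """Crop a row-major image (a list of pixel rows), clamping the box to the image."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = box
    left, top = max(0, left), max(0, top)
    right, bottom = min(width, right), min(height, bottom)
    return [row[left:right] for row in image[top:bottom]]


# Toy 6x8 "image" whose pixel value encodes its (row, column) position.
image = [[10 * r + c for c in range(8)] for r in range(6)]
quad = [(2, 1), (5, 1), (6, 3), (3, 3)]  # hypothetical word quadrilateral
word = crop(image, quad_to_box(quad))
```

Clamping the box keeps the crop valid when an annotation extends slightly beyond the image border.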
Code:
https://github.com/bgshih/aster
https://github.com/ayumiymk/aster.pytorch