Display-Semantic Transformer for Scene Text Recognition.

Xinqi Yang, Wushour Silamu, Miaomiao Xu, Yanbing Li
Author Information
  1. Xinqi Yang: College of Computer Science and Technology, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China.
  2. Wushour Silamu: College of Computer Science and Technology, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China.
  3. Miaomiao Xu: College of Computer Science and Technology, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China.
  4. Yanbing Li: College of Computer Science and Technology, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China.

Abstract

Linguistic knowledge greatly benefits scene text recognition by providing semantic information that refines the character sequence. A purely visual model focuses only on the visual texture of characters without actively learning linguistic information, which leads to poor recognition rates on noisy (e.g., distorted or blurry) images. To address this issue, this study builds on recent findings from the Vision Transformer and proposes the Display-Semantic Transformer (DST), which constructs a masked language model and a semantic visual interaction module. The model mines deep semantic information from images to assist scene text recognition and improve robustness. The semantic visual interaction module improves the interaction between semantic information and visual features, so that the visual features are enhanced by the semantic information and the model recognizes text more accurately. Experimental results show that our model improves the average recognition accuracy on six benchmark test sets by nearly 2% compared to the baseline. It also retains a small parameter count and fast inference speed, and so attains a better balance between accuracy and speed.
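To make the idea of visual features being enhanced by semantic information more concrete, the following is a minimal, dependency-free sketch of one common way such an interaction can be realized: single-head cross-attention in which visual tokens act as queries over semantic tokens, followed by a residual addition. This is an illustration only. The abstract does not specify the layer design, so the function name `cross_attention`, the projection matrices, the token counts, and the feature dimension are all assumptions, not the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual, semantic, w_q, w_k, w_v):
    """Hypothetical interaction step: visual tokens attend to semantic tokens.

    visual:   (n_vis, d) visual features, used as queries
    semantic: (n_sem, d) semantic features, used as keys and values
    Returns the enhanced visual features and the attention weights.
    """
    q = matmul(visual, w_q)
    k = matmul(semantic, w_k)
    v = matmul(semantic, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # (n_vis, n_sem)
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)  # semantic context per visual token
    # Residual connection: keep the visual features, add semantic context.
    enhanced = [
        [x + a for x, a in zip(vis_row, att_row)]
        for vis_row, att_row in zip(visual, attended)
    ]
    return enhanced, weights


def random_matrix(rows, cols, rng, scale=1.0):
    """Random Gaussian matrix standing in for learned parameters or features."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM = 8  # assumed feature dimension, chosen only for the demo
visual = random_matrix(6, DIM, rng)    # 6 visual tokens (assumed)
semantic = random_matrix(4, DIM, rng)  # 4 semantic tokens (assumed)
w_q, w_k, w_v = (random_matrix(DIM, DIM, rng, 1 / math.sqrt(DIM)) for _ in range(3))

enhanced, weights = cross_attention(visual, semantic, w_q, w_k, w_v)
print(len(enhanced), len(enhanced[0]), len(weights[0]))
```

Each row of `weights` is a probability distribution over the semantic tokens, so every visual token receives a weighted mix of semantic context. A trained model would learn the projection matrices and would typically use multiple heads and normalization layers, which are omitted here for brevity.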

Keywords

Scene Text Recognition; cross-modal attention; transformer
