Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

Maoyuan Ye, Jing Zhang, Juhua Liu, Chenyu Liu, Baocai Yin, Cong Liu, Bo Du, Dacheng Tao

Abstract

The Segment Anything Model (SAM), a powerful vision foundation model pretrained on a large-scale dataset, breaks the boundaries of general segmentation and has sparked various downstream applications. This paper introduces Hi-SAM, a unified model that leverages SAM for hierarchical text segmentation. Hi-SAM excels at segmentation across four hierarchies (pixel-level text, word, text-line, and paragraph) while also performing layout analysis. Specifically, we first turn SAM into a high-quality pixel-level text segmentation (TS) model through a parameter-efficient fine-tuning approach. We use this TS model to iteratively generate pixel-level text labels in a semi-automatic manner, unifying labels across the four text hierarchies in the HierText dataset. Subsequently, with these complete labels, we build the end-to-end trainable Hi-SAM on the TS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both an automatic mask generation (AMG) mode and a promptable segmentation (PS) mode. In the AMG mode, Hi-SAM first segments pixel-level text foreground masks, then samples foreground points for hierarchical text mask generation, achieving layout analysis in passing. In the PS mode, Hi-SAM provides word, text-line, and paragraph masks from a single point click. Experimental results show the state-of-the-art performance of our TS model for pixel-level text segmentation: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements of 4.73% PQ and 5.39% F1 at the text-line level and 5.49% PQ and 7.39% F1 for paragraph-level layout analysis, while requiring fewer training epochs.
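The abstract reports pixel-level results in fgIOU and hierarchy-level results in PQ and F1. The sketch below shows how such scores are commonly computed; it is not the official Total-Text, TextSeg, or HierText evaluation code. Assumptions made only for illustration: masks are sets of (row, col) pixel coordinates, fgIOU pools intersection and union over the whole dataset, PQ and F1 are computed for a single image with the standard panoptic rule that a prediction and a ground truth match when their IoU exceeds 0.5, and masks within each set do not overlap.

```python
from typing import List, Sequence, Set, Tuple

# Assumption for this sketch: a binary mask is a set of (row, col) foreground pixels.
Mask = Set[Tuple[int, int]]


def fg_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Foreground IoU pooled over a dataset (assumed protocol):
    total intersection divided by total union of text-foreground pixels."""
    inter = sum(len(p & g) for p, g in zip(preds, gts))
    union = sum(len(p | g) for p, g in zip(preds, gts))
    # Convention for this sketch: no foreground anywhere counts as perfect agreement.
    return inter / union if union else 1.0


def iou(a: Mask, b: Mask) -> float:
    """IoU of two instance masks."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pq_f1(preds: List[Mask], gts: List[Mask], thr: float = 0.5) -> Tuple[float, float]:
    """Panoptic Quality and F1 for one image's instance masks.

    With non-overlapping masks and thr = 0.5, each prediction can exceed the
    threshold with at most one ground truth, so greedy matching is unambiguous.
    """
    matched_gt: Set[int] = set()
    tp_ious: List[float] = []
    for p in preds:
        for j, g in enumerate(gts):
            if j in matched_gt:
                continue  # each ground truth is matched at most once
            v = iou(p, g)
            if v > thr:
                matched_gt.add(j)
                tp_ious.append(v)
                break
    tp = len(tp_ious)
    fp = len(preds) - tp  # unmatched predictions
    fn = len(gts) - tp    # unmatched ground truths
    denom = tp + 0.5 * fp + 0.5 * fn
    if denom == 0:
        return 0.0, 0.0  # no instances at all: arbitrary convention for this sketch
    f1 = tp / denom            # recognition quality (RQ), identical to F1
    pq = sum(tp_ious) / denom  # PQ = SQ * RQ, with SQ = mean IoU of matched pairs
    return pq, f1
```

PQ factors into segmentation quality (mean IoU of matched pairs) times recognition quality, and the recognition term TP / (TP + FP/2 + FN/2) equals F1, which is one reason the two are often reported side by side. A dataset-level score would pool TP, FP, FN, and the matched IoU sums across images before dividing.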
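In the AMG mode described above, point prompts are drawn from the predicted text-foreground mask rather than from the whole image. The abstract does not say how the points are chosen, so the following is a hypothetical sketch, not Hi-SAM's actual procedure: it keeps the points of a regular grid (spacing `stride`, a made-up parameter) that land on foreground pixels. Each kept point could then serve as a prompt for word, text-line, and paragraph masks.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) in pixel coordinates


def sample_foreground_points(mask: List[List[bool]], stride: int) -> List[Point]:
    """Hypothetical grid sampling: keep grid points that fall on text foreground.

    `mask[y][x]` is True for text-foreground pixels; `stride` is an assumed
    grid spacing in pixels, with grid points centred in each stride cell.
    """
    points: List[Point] = []
    for y in range(stride // 2, len(mask), stride):
        row = mask[y]
        for x in range(stride // 2, len(row), stride):
            if row[x]:
                points.append((x, y))
    return points
```

Restricting prompts to foreground pixels avoids spending decoder calls on background regions; masks returned for nearby points of the same word or line would still need to be deduplicated, a step this sketch leaves out.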
