A unified view of density-based methods for semi-supervised clustering and classification.

Jadson Castro Gertrudes, Arthur Zimek, Jörg Sander, Ricardo J G B Campello
Author Information
  1. Jadson Castro Gertrudes: SCC/ICMC/USP, University of São Paulo, Avenue Trabalhador São-carlense, 400 - Center, São Carlos, SP 13566-590 Brazil.
  2. Arthur Zimek: IMADA, University of Southern Denmark, Campusvej 55, 5230 Odense M, Denmark.
  3. Jörg Sander: Department of Computing Science, University of Alberta, 1-001 CCIS, Edmonton, AB T6G-2E9 Canada.
  4. Ricardo J G B Campello: School of Mathematical and Physical Sciences, University of Newcastle, University Drive, Callaghan, NSW 2308 Australia.

Abstract

Semi-supervised learning is drawing increasing attention in the era of big data, as the gap widens dramatically between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data, which are laborious and expensive to obtain. In this paper, we first introduce a unified view of density-based clustering algorithms. We then build upon this view and bridge the areas of semi-supervised clustering and classification under a common umbrella of density-based techniques. We show that there are close relations between density-based clustering algorithms and the graph-based approach for transductive classification. These relations are then used as a basis for a new framework for semi-supervised classification based on building blocks from density-based clustering. This framework is not only efficient and effective, but also statistically sound. In addition, we generalize the core algorithm in our framework, HDBSCAN*, so that it can also perform semi-supervised clustering by directly taking advantage of any fraction of labeled data that may be available. Experimental results on a large collection of datasets show the advantages of the proposed approach for both semi-supervised classification and semi-supervised clustering.
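
The abstract relates density-based clustering to graph-based transductive classification without giving algorithmic detail. As a rough illustration of that general idea only, and not the authors' framework, the Python sketch below computes the mutual reachability distance used by HDBSCAN*-style methods (the larger of the two core distances and the pairwise distance) and then spreads a few known labels along the lightest edges of the resulting graph, never merging groups that carry different labels. The function names, the Euclidean metric, the choice to count a point as its own first neighbor, and the Kruskal-style propagation rule are all assumptions made for this example.

```python
import math
from itertools import combinations


def core_distances(points, min_pts):
    """Distance from each point to its min_pts-th nearest neighbor.

    Assumption: the point itself counts as its own first neighbor.
    """
    cores = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        cores.append(dists[min_pts - 1])
    return cores


def mutual_reachability(points, min_pts):
    """Map each pair (i, j), i < j, to max(core_i, core_j, d(i, j))."""
    cores = core_distances(points, min_pts)
    return {
        (i, j): max(cores[i], cores[j], math.dist(points[i], points[j]))
        for i, j in combinations(range(len(points)), 2)
    }


def propagate_labels(points, labels, min_pts=3):
    """Hypothetical helper: spread the given labels over all points.

    labels maps a point index to its known class. Edges are visited in
    increasing mutual reachability distance (Kruskal-style, with a
    union-find structure); two groups are merged unless they already
    carry different labels.
    """
    n = len(points)
    parent = list(range(n))
    group_label = [labels.get(i) for i in range(n)]

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(mutual_reachability(points, min_pts).items(),
                   key=lambda item: item[1])
    for (i, j), _ in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue
        li, lj = group_label[ri], group_label[rj]
        if li is not None and lj is not None and li != lj:
            continue  # conflicting labels: keep the groups apart
        parent[ri] = rj
        group_label[rj] = li if li is not None else lj
    return [group_label[find(i)] for i in range(n)]


if __name__ == "__main__":
    # Two well-separated square blobs, one labeled point in each.
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
    print(propagate_labels(pts, {0: "a", 7: "b"}, min_pts=2))
    # -> ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
```

This brute-force version compares all pairs of points, so it is only suitable for small toy inputs; it is meant to show how a density-aware distance and a graph traversal can be combined, not to reproduce the efficiency or statistical properties claimed in the abstract.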



