Umaer Hanif, Anis Aloulou, Flynn Crosbie, Paul Bouchequet, Mounir Chennaoui, Thomas Andrillon, Damien Leger
Polysomnography (PSG) is essential for diagnosing sleep disorders, but its manual interpretation is labor-intensive. Automated sleep staging algorithms are promising, yet their utility in complex sleep disorders such as insomnia remains uncertain. This study evaluates five of the most recognised sleep staging classifiers (U-Sleep, STAGES, GSSC, Luna and YASA) on PSG data from 904 patients with chronic insomnia. Performance was assessed using F1 scores, confusion matrices and predicted sleep metrics. The effect of demographics, sleepiness and PSG metrics on each classifier's performance was assessed using linear regression. Across all sleep stages, GSSC performed best (macro F1 score = 0.66), followed by U-Sleep (0.62), Luna (0.56), STAGES (0.54) and YASA (0.52). GSSC achieved the highest F1 scores in Wake (0.83), N1 (0.22), N2 (0.80), N3 (0.71) and REM (0.76), while U-Sleep matched its performance in N1 and REM and Luna in N3. STAGES performed poorest in N3 (0.39) and YASA in REM (0.35). Common misclassifications included N1 vs. Wake/N2 and N3 vs. N2, with REM misclassified as Wake/N1/N2 by STAGES, Luna and YASA. GSSC and U-Sleep exhibited minimal demographic bias, while STAGES and Luna had more. No performance difference was observed between chronic insomnia patients with and without abnormal PSG. Sleep metric accuracy was highest for U-Sleep (TST, R = 0.88), STAGES (SOL, R = 0.82) and GSSC (WASO, R = 0.82). These findings underscore the solid yet variable performance of the classifiers and highlight GSSC and U-Sleep as leading tools for sleep staging in patients with chronic insomnia.
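As an illustration of the evaluation metrics named above, the following minimal sketch computes a confusion matrix, per-stage F1 scores and the macro F1 score for a five-stage hypnogram. It is not the study's pipeline: the epoch labels are invented toy data, the helper names (`confusion_matrix`, `per_class_f1`) are hypothetical, and it assumes the macro F1 is the unweighted mean of the per-stage F1 scores. Whether the study pooled epochs across recordings or averaged per recording is not stated here.

```python
import numpy as np

# Stage order used for the integer labels below (an assumption for this sketch).
STAGE_NAMES = ["Wake", "N1", "N2", "N3", "REM"]


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are the scored (reference) stage, columns the predicted stage."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def per_class_f1(cm):
    """F1 per stage: 2*TP / (2*TP + FP + FN); 0 when a stage never occurs."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as the stage, scored as another
    fn = cm.sum(axis=1) - tp  # scored as the stage, predicted as another
    denom = 2 * tp + fp + fn
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)


# Toy data: ten 30-second epochs, manual scoring vs. one classifier's output.
y_true = np.array([0, 0, 1, 2, 2, 2, 3, 3, 4, 4])
y_pred = np.array([0, 1, 2, 2, 2, 3, 3, 3, 4, 0])

cm = confusion_matrix(y_true, y_pred, len(STAGE_NAMES))
f1 = per_class_f1(cm)
macro_f1 = f1.mean()  # unweighted mean over the five stages

for name, score in zip(STAGE_NAMES, f1):
    print(f"{name}: F1 = {score:.2f}")
print(f"Macro F1 = {macro_f1:.2f}")
```

Because the macro average weights every stage equally, a rarely occurring and hard-to-score stage such as N1 lowers the overall score as much as a common stage would, which is consistent with the low N1 values reported above pulling down each classifier's macro F1.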