In this study, we investigated cultural differences in the multisensory perception of emotion between Chinese and Japanese participants, focusing on the mutual interference of visual and auditory emotional information. The face-voice pairs expressed either congruent emotions (e.g., a happy face with a happy voice, or an angry face with an angry voice) or incongruent emotions (e.g., a happy face with an angry voice, or an angry face with a happy voice). Participants were asked to judge the emotion of the target by attending to either the face or the voice while ignoring the information in the other modality. In the voice-focus condition, the effect of the to-be-ignored facial information was smaller in Japanese than in Chinese participants, but only when the participant and the target belonged to the same culture (in-group). This indicates that Japanese people are more likely to rely on vocal information in the multisensory perception of emotion expressed by in-group members. Our study shows that although both Japanese and Chinese people belong to Eastern culture, they differ in how they perceive emotion from visual and auditory cues.