YouZum

Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection

arXiv:2505.19010v2 Announce Type: replace-cross
Abstract: Multi-modal learning has emerged as a crucial research direction, as integrating textual and visual information can substantially enhance performance in tasks such as classification, retrieval, and scene understanding. Despite advances with large pre-trained models, existing approaches often suffer from insufficient cross-modal interactions and rigid fusion strategies, failing to fully harness the complementary strengths of different modalities. To address these limitations, we propose Co-AttenDWG, co-attention with dimension-wise gating, and expert fusion. Our approach first projects textual and visual features into a shared embedding space, where a dedicated co-attention mechanism enables simultaneous, fine-grained interactions between modalities. This is further strengthened by a dimension-wise gating network, which adaptively modulates feature contributions at the channel level to emphasize salient information. In parallel, dual-path encoders independently refine modality-specific representations, while an additional cross-attention layer aligns the modalities further. The resulting features are aggregated via an expert fusion module that integrates learned gating and self-attention, yielding a robust unified representation. Experimental results on the MIMIC and SemEval Memotion 1.0 datasets show that Co-AttenDWG achieves state-of-the-art performance and superior cross-modal alignment, highlighting its effectiveness for diverse multi-modal applications.

We use cookies to improve your experience and performance on our website. You can learn more at Datenschutzrichtlinie and manage your privacy settings by clicking Settings.

Privacy Preferences

You can choose your cookie settings by turning on/off each type of cookie as you wish, except for essential cookies.

Allow All
Manage Consent Preferences
  • Always Active

Save
de_DE