Submitted: 09 January 2026

Countering multimodal misinformation requires robust Cross-modal Entity Consistency (CEC) verification, which checks that the entities named in a text match their visual depictions. Current large vision-language models (LVLMs) struggle with fine-grained entity verification, especially in complex "contextual mismatch" scenarios, where they fail to capture intricate relationships or to leverage auxiliary information.