| Motivated by the growing demand for computational storytelling systems in domains such as digital entertainment and accessibility for visually impaired readers, ComicX converts static sequential art into dynamic, immersive media. Starting from heterogeneous PDF-based comic inputs, the framework first applies panel detection and segmentation to isolate visual primitives, then panel sequencing to preserve discourse continuity. A dual-stage detection process localizes character instances, after which character re-identification assigns consistent identity embeddings across panels, keeping voice assignment and temporal tracking coherent across the narrative. Speech bubble regions and contextual explanation regions are localized simultaneously, enabling robust OCR-based text extraction for dialogue retrieval. A character–speech mapping module uses contextual cues to map utterances to speaker identities, while the text-to-speech module supports prosody transfer. In addition, onomatopoeic tokens are identified to provide realistic sound effects. By combining computer vision, natural language processing, and generative speech synthesis, ComicX enables multimodal alignment and audiovisual reconstruction of sequential art. |
*** Title, author list and abstract as submitted during Camera-Ready version delivery. Small changes that may have occurred during processing by Springer may not appear in this window.