| This paper investigates the performance of multimodal large language models (MLLMs) and the effect of low-resource fine-tuning on domain-specific understanding of an internal norm database in an industrial context. In such settings, general-purpose MLLMs often struggle to deliver responses that are sufficiently precise, concise, and reliable for interpreting technical documents. To examine the trade-off between response quality and computational efficiency, state-of-the-art open-source MLLMs of multiple sizes were evaluated in both base and fine-tuned variants. The models were tested on benchmark tasks that involved interpreting technical norm content in a Danish enterprise context of tooling equipment manufacturing. The evaluation considered both qualitative response quality and computational characteristics, including inference time, token generation, and GPU memory allocation. The experiments were performed on an Nvidia DGX Spark 128 GB AI server. The results show a clear trade-off between model size and performance. The 4B fine-tuned model achieved the strongest overall response quality with the most favorable balance between accuracy and computational cost. In contrast, the 2B models produced overly minimal answers, and the 0.8B models showed limited robustness, including output degeneration on more difficult tasks. Although the 27B and 9B models performed well, their performance did not justify their computational cost compared to the 4B fine-tuned model. Fine-tuned variants preserved the correct behavior in most cases but slightly increased memory usage. The results suggest that model capacity remains a key factor in document-based norm interpretation tasks and that parameter-efficient fine-tuning alone is insufficient to overcome the limitations of very small models, although it appears to improve quality in larger models. 
Overall, the study highlights the importance of jointly evaluating answer quality, stability, and computational efficiency when selecting MLLMs for norm-oriented industrial chatbot systems. Furthermore, this paper contributes a framework for dataset preprocessing and fine-tuning aimed at domain-specific knowledge performance. |
*** Title, author list and abstract as submitted during Camera-Ready version delivery. Small changes that may have occurred during processing by Springer may not appear in this window.