This study introduces SHEA-SLM, a retrieval-augmented architecture designed to improve the performance of small language models that run locally under limited computing resources. The system combines a fixed Wikipedia corpus, dense retrieval, reranking, and grounded response generation to support factual question answering. The study evaluates three instruction-tuned Qwen2.5 models on a manually curated benchmark of 200 questions covering a wide variety of question scenarios. The results show that SHEA-SLM improves answer quality for the smaller models, but these gains come with higher latency, a clear trade-off between answer quality and computational efficiency. The benefits are less stable for the largest model, suggesting that larger models may rely less on external retrieval. Notably, in some settings SHEA-SLM enables a smaller model to outperform a larger one, highlighting the value of retrieval augmentation under resource constraints. Overall, the findings indicate that retrieve-and-rerank methods can make small local language models more accurate and more useful for offline, privacy-sensitive, and resource-constrained applications.
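The retrieve, rerank, and grounded-generation flow described in the abstract can be illustrated with a minimal sketch. This is not the SHEA-SLM implementation: the term-frequency "embedding", the term-overlap reranker, the four-passage corpus, and every function name below are hypothetical stand-ins chosen so the example runs with the Python standard library alone. A real system would presumably use a neural dense encoder, a learned reranker, and pass the final prompt to a local language model.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for the fixed Wikipedia snapshot.
CORPUS = [
    "Paris is the capital and most populous city of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "Mount Everest is the highest mountain above sea level.",
]
QUERY = "What is the capital of France?"


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Assumption: a term-frequency vector stands in for a dense encoder."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=3):
    """First stage: rank all passages by vector similarity, keep the top k."""
    q_vec = embed(query)
    ranked = sorted(corpus, key=lambda p: cosine(q_vec, embed(p)), reverse=True)
    return ranked[:k]


def rerank(query, passages, k=2):
    """Second stage (assumption): score each candidate by the fraction of
    query terms it contains, as a stand-in for a learned reranker."""
    q_terms = set(tokenize(query))

    def score(passage):
        if not q_terms:
            return 0.0
        return len(q_terms & set(tokenize(passage))) / len(q_terms)

    return sorted(passages, key=score, reverse=True)[:k]


def build_grounded_prompt(query, passages):
    """Assemble a prompt that asks the model to answer from the passages only."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


if __name__ == "__main__":
    candidates = retrieve(QUERY, CORPUS, k=3)
    evidence = rerank(QUERY, candidates, k=2)
    print(build_grounded_prompt(QUERY, evidence))
```

On this toy data the first stage ranks the Berlin passage above the Paris passage, because the shorter passage has higher cosine similarity to the query, and the second stage promotes the Paris passage to the top. The second pass over the candidates is also where a two-stage design adds latency, which is consistent with the quality versus efficiency trade-off the abstract reports.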
*** Title, author list and abstract as submitted during Camera-Ready version delivery. Small changes that may have occurred during processing by Springer may not appear in this window.