Large Language Models (LLMs) are celebrated for their extended context capabilities, but questions remain regarding their effective use of this capacity. While ‘needle-in-the-haystack’ tests have become standard benchmarks for in-context retrieval performance, they often fall short in reflecting the challenges of real-world applications. This study evaluates the performance of six open-weight LLMs and one proprietary model on a complex reasoning task. The task requires locating relevant passages within variable-length product descriptions and assessing their compliance with a specified cease-and-desist declaration. We tested context lengths drawn from {64 × 2^n | n = 0, 1, 2, …, 9} across needle placement depths of 0%, 25%, 50%, 75%, and 100% of the document. Our findings show that model performance tends to decline with longer contexts, with variation across models. Performance was often poorer when the key information appeared early or in the middle of the input. Some models maintained more consistent performance, while others showed significant degradation. These results suggest the need for improved LLM architectures to handle extended contexts and complex reasoning tasks effectively.
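For concreteness, the sketch below enumerates the evaluation grid implied by the abstract: ten context lengths of 64 × 2^n tokens (n = 0, …, 9) crossed with five needle placement depths. The variable names, function, and output format are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the evaluation grid described in the abstract:
# context lengths of 64 * 2**n tokens for n = 0..9, crossed with
# needle placement depths of 0%, 25%, 50%, 75%, and 100%.
# All names here are hypothetical; the paper does not specify them.

CONTEXT_LENGTHS = [64 * 2**n for n in range(10)]   # 64, 128, ..., 32768 tokens
NEEDLE_DEPTHS = [0.0, 0.25, 0.50, 0.75, 1.0]       # relative position of the needle

def evaluation_grid():
    """Yield every (context_length, needle_depth) configuration to test."""
    for length in CONTEXT_LENGTHS:
        for depth in NEEDLE_DEPTHS:
            yield length, depth

if __name__ == "__main__":
    # 10 context lengths x 5 depths = 50 configurations per model
    for length, depth in evaluation_grid():
        print(f"context={length:5d} tokens, needle at {depth:.0%} depth")
```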