In recent years, the convergence of natural language processing (NLP) and large language models (LLMs) has propelled the development of solutions enabling users to interact seamlessly with structured databases using natural language queries (NLQs). Existing NLQ-to-SQL models primarily approach this as a translation problem, converting NLQs into SQL queries for database interaction. However, challenges arise when dealing with extensive databases containing numerous tables, necessitating a robust approach for table selection to improve the efficiency of downstream NLQ-to-SQL models. This paper introduces a classification-based method for table selection, addressing limitations in existing embedding-based approaches. By predicting the necessity of tables in query formulation, the proposed approach offers a more meaningful interpretation of model scores, facilitating the determination of a universal threshold for table selection. To validate this approach, a custom dataset was curated, leveraging the Spider dataset for NLQ-to-SQL tasks, and a comprehensive set of experiments was conducted using various language models, including GPT-4, GPT-3.5, and DeBERTa. Results demonstrate the effectiveness of the fine-tuned DeBERTa model in consistently outperforming other models across key metrics, showcasing its advancements in table selection tasks. This research not only addresses the challenge of context length in NLQ-to-SQL models but also highlights the potential of smaller LLMs when fine-tuned for specific tasks. The proposed classification-based approach offers a practical solution for improving the accuracy and efficiency of NLQ-to-SQL models, paving the way for enhanced interactions between users and structured databases.
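The core idea above — scoring each table as "needed" or "not needed" for a query and keeping those above a universal threshold — can be sketched as follows. This is an illustrative sketch only: the paper's actual scorer is a fine-tuned DeBERTa classifier, whereas the `toy_score` function and the `select_tables` helper below are hypothetical stand-ins, and the threshold value is an assumption.

```python
# Sketch of classification-based table selection with a universal threshold.
# The real scorer in the paper is a fine-tuned DeBERTa classifier; toy_score
# below is a keyword-overlap stand-in used purely for illustration.

def select_tables(nlq, tables, score_fn, threshold=0.2):
    """Keep tables whose predicted relevance score meets the threshold."""
    return [t for t in tables if score_fn(nlq, t) >= threshold]

def toy_score(nlq, table):
    """Hypothetical scorer: fraction of schema words appearing in the query."""
    query_words = set(nlq.lower().split())
    schema_words = set(table["schema"].lower().split())
    return len(query_words & schema_words) / max(len(schema_words), 1)

# Toy schema in the spirit of the Spider dataset (values are illustrative).
tables = [
    {"name": "singer", "schema": "singer id name country age"},
    {"name": "concert", "schema": "concert id venue year"},
]

selected = select_tables(
    "how many singers are from each country", tables, toy_score
)
# Only the "singer" table clears the threshold for this query.
```

Because the score is a per-table relevance prediction rather than an embedding-similarity rank, a single fixed threshold can be applied across databases, which is the interpretability advantage the abstract highlights over embedding-based retrieval.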