SQL and Graph Query Generation: Natural Language Processing in Data Exploration
Organizations face the challenge of converting natural language into SQL and Cypher queries accurately and efficiently, a capability that is crucial for effective data exploration.
Solution Approach
By leveraging an LLM, organizations can fine-tune the model with labeled SQL and Cypher queries to improve performance. This enhances the system’s ability to generate precise queries from natural language inputs.
Example Implementation
Starting from a baseline accuracy of 60–70%, fine-tuning the LLM with labeled data significantly boosts performance. This involves training the model with labeled SQL or Cypher queries, though unconventional naming conventions might necessitate adjustments to the data model. For comparison, human users achieve an accuracy rate of 92%.
For example, assume you run a website and want to analyze its performance through a text interface. The data is stored in a SQL database with a schema like:
- base_table_aggregated_wo_search_terms (page views per page per day)
- gsc_data (impressions, ctr, clicks per page per day)
- fact_table_w_session_aggregated (visits, session count per page per day)
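To make the example concrete, here is a minimal sketch of the kind of join query a text interface would need to generate against this schema. The column names are assumptions for illustration; the article only specifies the table names.

```python
import sqlite3

# In-memory database with the three tables from the schema above.
# Column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE base_table_aggregated_wo_search_terms (page TEXT, day TEXT, page_views INTEGER);
CREATE TABLE gsc_data (page TEXT, day TEXT, impressions INTEGER, ctr REAL, clicks INTEGER);
CREATE TABLE fact_table_w_session_aggregated (page TEXT, day TEXT, visits INTEGER, session_count INTEGER);

INSERT INTO base_table_aggregated_wo_search_terms VALUES ('/home', '2024-06-01', 120);
INSERT INTO gsc_data VALUES ('/home', '2024-06-01', 900, 0.05, 45);
""")

# "How many page views and clicks did each page get per day?"
rows = conn.execute("""
    SELECT b.page, b.day, b.page_views, g.clicks
    FROM base_table_aggregated_wo_search_terms AS b
    JOIN gsc_data AS g ON g.page = b.page AND g.day = b.day
""").fetchall()
print(rows)  # [('/home', '2024-06-01', 120, 45)]
```

Answering even this simple question requires the model to join two tables with unusual names correctly, which is exactly where accuracy drops.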
Transformer-based LLMs trained on vast amounts of data predict the next most likely token. Suppose the most common examples the LLM was trained on used table names like:
- web_statistics
- search_statistics
- session_data
Given these common table names, the LLM will prefer them, even if the correct schema is provided in the session context. This explains why the accuracy you can expect from an out-of-the-box LLM is around 60%, and why fine-tuning can improve it. Accuracy can only be enhanced to an extent, though: the knowledge base of the LLM is restricted to the most common general and domain knowledge in its training data.
Text to Query (SQL/Cypher)
Text to SQL turns a text prompt into a SQL statement. The prompt could be used within a chat application to return data. The prompt might contain the database schema, either provided statically or fetched from the target database; the schema should then be used to create a working SQL query.
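A minimal sketch of this pattern, fetching the live schema and embedding it in the prompt. The helper name and prompt wording are assumptions; a real system would send the resulting prompt to an LLM.

```python
import sqlite3

def build_text_to_sql_prompt(conn: sqlite3.Connection, question: str) -> str:
    """Fetch the current schema from the database and embed it in the prompt."""
    ddl = [row[0] for row in conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL")]
    return (
        "Given the following SQLite schema:\n\n"
        + "\n".join(ddl)
        + f"\n\nWrite a single SQL query that answers: {question}\n"
        "Use only the tables and columns above."
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gsc_data (page TEXT, day TEXT, impressions INTEGER, clicks INTEGER)")
prompt = build_text_to_sql_prompt(conn, "Which page had the most clicks yesterday?")
print(prompt)
```

Fetching the schema at request time keeps the prompt in sync with the database, at the cost of a slightly larger context.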
As mentioned, the LLM prefers specific table names based on its training data. If you use uncommon table names, the LLM might create nonworking SQL statements that cannot be executed.
The LLM might struggle to create complex queries, like correctly joining tables, or it might not create SQL that correctly implements business logic. The LLM might also be unable to create SQL queries that correctly filter data, such as restricting access by user or organization. Graph RAG has the same limitations as SQL RAG.
Function Calling
Function-calling involves predefined rules to convert natural language into SQL or Cypher queries, effectively handling 80-90% of user requests with 20-25 functions. By leveraging these predefined functions, the system can efficiently interpret and execute complex database queries. Consequently, function-calling enhances user experience by providing accurate and prompt responses to various data retrieval needs.
Functions make restricting access to data or business logic straightforward; they can be provided as function parameters or obtained from the application security context. Implementing complex business logic or integrating existing applications, not just databases, is also relatively simple.
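A sketch of this idea: a registry of predefined functions, with the organization filter taken from the application security context rather than from the model. The `org_id` column and function names are hypothetical, added here for illustration.

```python
from dataclasses import dataclass

@dataclass
class SecurityContext:
    org_id: int  # taken from the application session, never from the LLM

def get_page_views(ctx: SecurityContext, page: str, day: str) -> str:
    # The org filter is injected server-side; the model cannot override it.
    return (f"SELECT page_views FROM base_table_aggregated_wo_search_terms "
            f"WHERE org_id = {ctx.org_id} AND page = ? AND day = ?")

# Registry of predefined functions the LLM may "call"; per the figures above,
# 20-25 such functions can cover 80-90% of user requests.
FUNCTIONS = {"get_page_views": get_page_views}

def dispatch(ctx: SecurityContext, name: str, **kwargs) -> str:
    return FUNCTIONS[name](ctx, **kwargs)

sql = dispatch(SecurityContext(org_id=7), "get_page_views", page="/home", day="2024-06-01")
print(sql)
```

Because the SQL is hand-written per function, correctness and access control are decided in code review, not at generation time.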
Comparing Approaches
LLM Fine-Tuning:
Pros
- Adaptability: Can handle a wide range of query complexities and variations.
- Performance in Complex Queries: Excels in generating SQL and Cypher queries for intricate natural language inputs.
Cons
- Periodic Retraining: Retraining is required to maintain accuracy, which can be resource-intensive.
- Sensitivity to Naming Conventions: May struggle with unconventional naming conventions, requiring adjustments or additional training data.
SQL/Graph RAG
Pros
- Flexible, can handle a broad range of queries
- Can create any query based on a given schema, reducing maintenance
Cons
- The accuracy might be too low without fine-tuning
- Performance suffers from repeated attempts when generated queries fail
- Security is not guaranteed
Function Calling:
Pros
- Security: User and organization data can be obtained from the application security context
- Integration of existing applications
- Stability: Uses predefined rules, ensuring consistent performance for routine queries.
- No Retraining Required: Predefined functions do not need periodic retraining
Cons
- Flexibility: Less flexible and may struggle with queries outside the scope of predefined functions.
- Handling Varied Queries: May not perform as effectively for a diverse range of queries as fine-tuned LLMs and RAG.
Applying a fine-tuning approach to Graph RAG
Organizations can apply this hybrid approach to graph databases, using Cypher for query generation. Here’s how:
Fine-Tuning LLMs for Cypher:
- Adaptability: Fine-tuned LLMs can handle various complex graph queries.
- Performance in Complex Queries: LLMs can generate Cypher queries from natural language inputs, accommodating the intricacies of graph data relationships.
Function Calling for Cypher:
- Stability: Predefined Cypher query templates can cover frequent, straightforward graph query tasks.
- No Retraining Required: Function-calling for routine graph queries ensures stability and consistency.
Hybrid Approach:
- Routine Tasks: Use function-calling for common graph queries, ensuring reliable performance without retraining.
- Complex Queries: Employ fine-tuned LLMs to generate Cypher queries from natural language and effectively handle complex and varied queries.
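The routing logic of this hybrid approach can be sketched as follows. The Cypher template, its pattern, and the LLM stub are illustrative assumptions; a real system would call the fine-tuned model in the fallback branch.

```python
import re

# Predefined Cypher templates for routine graph queries (illustrative).
TEMPLATES = {
    r"pages linking to (?P<page>\S+)":
        "MATCH (p:Page)-[:LINKS_TO]->(t:Page {url: $page}) RETURN p.url",
}

def generate_with_llm(question: str) -> str:
    """Stub for the fine-tuned LLM; a real system would call the model here."""
    return f"// LLM-generated Cypher for: {question}"

def route(question: str) -> str:
    # Routine tasks: match a predefined template (stable, no retraining).
    for pattern, cypher in TEMPLATES.items():
        if re.search(pattern, question, re.IGNORECASE):
            return cypher
    # Complex or unseen questions: fall back to the fine-tuned LLM.
    return generate_with_llm(question)

print(route("Show pages linking to /home"))    # hits the template
print(route("Find unusual navigation paths"))  # falls back to the LLM
```

The same router works unchanged for SQL: only the template bodies and the fallback generator differ.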
By integrating both methods, organizations can optimize their natural language to SQL and Cypher query generation processes, ensuring robust performance and adaptability for relational and graph database interactions. This synthesis leverages the strengths of both approaches, enhancing overall efficiency and accuracy in data exploration and retrieval.
Synthesis of Approaches
Combining fine-tuning, RAG, and function-calling can leverage their respective strengths. Using function-calling for routine tasks and fine-tuned LLMs for complex queries creates a robust system. This hybrid approach handles frequent questions with function-calling and relies on the fine-tuned LLM for more complicated scenarios, enhancing overall performance and adaptability.
Given that function calls allow the integration of existing applications, endpoints, and complex business logic, you could combine them with RAG and LLM fine-tuning. The approaches do not exclude each other.
Conclusion
Organizations can optimize their natural language to SQL and Cypher query generation processes by integrating fine-tuning and function-calling methods. This hybrid approach leverages the adaptability and performance of fine-tuned LLMs for complex queries while utilizing the stability and security of function-calling for routine tasks. The combined strengths of these methods ensure robust performance and adaptability for both relational and graph database interactions.
Function-calling has one big advantage: it can integrate existing applications. It also allows agents to operate safely on complex tasks.
Sources:
- https://aws.amazon.com/de/blogs/machine-learning/imperva-optimizes-sql-generation-from-natural-language-using-amazon-bedrock/
- https://medium.com/singapore-gds/from-conventional-rag-to-graph-rag-a0202a1aaca7
- https://thenewstack.io/using-sql-powered-rag-to-better-analyze-database-data-with-genai/
- https://medium.com/@manojkotary/exploring-function-calling-capabilities-with-groq-a-step-by-step-guide-586ab7a165aa
- https://mychen76.medium.com/state-of-function-calling-in-llm-bc3aa37decb4
- https://www.linkedin.com/pulse/llms-text-to-sql-problems-benchmark-vs-real-world-ilya-fastovets-fwfgc/
- https://bird-bench.github.io
- https://www.superannotate.com/blog/llm-fine-tuning