Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LLM isn't still following pydantic format for SQL #1906

Closed
Emekaborisama opened this issue Dec 8, 2024 · 4 comments
Closed

LLM isn't still following pydantic format for SQL #1906

Emekaborisama opened this issue Dec 8, 2024 · 4 comments

Comments

@Emekaborisama
Copy link

Emekaborisama commented Dec 8, 2024

I have made several attempts, but the inconsistency is concerning. Is there another way to ensure the LLM adheres to the guided pedantic format in Dspy?

import dspy
import os
from dotenv import load_dotenv
load_dotenv()

class SQLResponse(BaseModel):
    """Model for SQL query responses"""
    query: str = Field(description="The generated SQL query")
    is_safe: bool = Field(
        default=True,
        description="Whether the query is considered safe to execute"
    )

lm = dspy.LM(model=os.getenv("AZURE_LLAMA_MODEL_NAME", None), api_key=os.getenv("AZURE_LLAMA_API_KEY", None), api_base=os.getenv("AZURE_LLAMA_ENDPOINT", None))
dspy.settings.configure(lm=lm)

class DivideConquerAgent(dspy.Signature):
    """
    # SQL Query Generator

    You are a Divide-and-Conquer Senior SQL Programmer at Bombe (customer behavior analytics platform at postcode level), You decompose a question into a set of sub-questions. Your task is to transform natural language queries into simple optimized SQL while following these guidelines:

      ## Query Development Strategy
      - Break down complex queries into smaller components:
        - Keep your query simple.
        - Identify core tables and relationships.
        - Build base queries first.
        - Add filters and aggregations incrementally.

      ## Technical Standards.
        - Use indexes where available.
        - Minimize subqueries.
        - LIMIT 100 for all queries.
        - Use documented columns only.
        - Only use pattern match with LIKE.
        - No dynamic parameters.
        - Maintain proper key relationships in joins.

      ## JOIN RULES:
        - Default to LEFT JOIN to keep all left table records
        - Use INNER JOIN only for mandatory matches or to exclude NULLs
      
      RETURN RULE:
        - Just the sql query without any comments or markdown formatting.


      <examples>

      - question: Postcodes with the top number of conservative shoppers
      - SQL query:
      WITH shopper_base AS (
      -- Step 1: Get base shopper data
      SELECT
        external_id,
        value
      FROM persona_geography_data.geography_data_rows
      WHERE segment IN (
        SELECT code
        FROM personas
        WHERE label LIKE '%Conservative%'
      )
    ),
    postcode_aggregation AS (
      -- Step 2: Aggregate by postcode
      SELECT
        g.name AS postcode,
        SUM(sb.value) AS total_conservative
      FROM shopper_base sb
      LEFT JOIN geographies g ON sb.external_id = g.external_id
      GROUP BY g.name
    )
    -- Step 3: Final output with ranking
    SELECT
      postcode,
      total_conservative,
      RANK() OVER (ORDER BY total_conservative DESC) as conservative_rank
    FROM postcode_aggregation
    ORDER BY total_conservative DESC
    LIMIT 100;

      - question: Get persona data for postcode E11 3QA
      - query:
        SELECT *
        FROM postal_code_uks pc
        LEFT JOIN geography_data_rows gdr
        ON pc.pcd = 'E11 3QA'
        AND pc.oa01 = gdr.external_id
        LEFT JOIN personas p
        ON gdr.segment = p.code;
    </examples>
    """

    context: str = dspy.InputField(desc="Will contain the table names and its columns names, data types, can be multiple tables")
    question: str = dspy.InputField()
    answer: SQLResponse = dspy.OutputField(desc="SQL query that will work always, without syntax errors" )


class DivideConquerSqlProgrammerAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought(DivideConquerAgent)


    def forward(self, question: str, re_tables: list[str], relevant_table_data):
        context =f"""

        <relevant_tables>
        {re_tables}

        </relevant_tables>

        <schemas>

        {relevant_table_data}

        </schemas>

        <notes>:
        - Use date({str(datetime.now())}) for current date.

        </notes>
        """
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)



@okhat
Copy link
Collaborator

okhat commented Dec 8, 2024

Hey @Emekaborisama ! How are you setting up your LM and which LM is it?

What is the pydantic element here? I don't see one. Is it SQLResponse? What's the definition of this?

I don't really see much of DSPy in the example code above. This is more or less a hand-written prompt.

@okhat okhat changed the title I am tired of using Dspy and the llm isn't still following the guided output pydantic format. LLM isn't still following pydantic format for SQL Dec 8, 2024
@Emekaborisama
Copy link
Author

Emekaborisama commented Dec 8, 2024

Hi @okhat Thanks alot for your response. really appreciate

just updated the code. interesting.

How would you do this differently?

@thomasahle
Copy link
Collaborator

Have you tried adding a pydantic validator for sql?
e.g.

class SQLValidator(BaseModel):
    sql_statement: str

    @validator('sql_statement')
    def validate_sql(cls, v):
        if not isinstance(v, str):
            raise ValueError('SQL statement must be a string')
            
        try:
            parsed = sqlparse.parse(v)
            if not parsed:
                raise ValueError('Empty SQL statement')
            return v
        except Exception as e:
            raise ValueError(f'Invalid SQL syntax: {str(e)}')

@Emekaborisama
Copy link
Author

yes, i did

@okhat okhat closed this as completed Jan 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants