Comparing NER results of GPT-4, Claude, and Le Chat/Mistral
This blog post compares the performance of Large Language Models (LLMs) in handling Named Entity Recognition (NER) tasks. Although LLMs can identify entities, their ability to classify them accurately and consistently varies.
The decision to use an LLM or a dedicated NER model should depend on the trade-offs between performance, efficiency, and specific requirements of the AI-driven data pipeline.
Why compare the results of different LLMs?
The motivation for comparing different LLMs, besides leaderboards, is to evaluate their concrete usefulness for your own tasks.
NER is a critical NLP task that enables other NLP and NLU tasks in your AI-driven data pipeline. So, it is crucial to know how LLM-based NLP will perform.
Using LLMs for NLP tasks is attractive because it allows you to create results quickly. Some questions arise:
- Is the quality of the result reliable?
- Does it make sense to use a dedicated NER Model?
When deciding between LLMs and dedicated NER models, it’s essential to consider factors such as the computational resources required, the ease of integration into your existing pipeline, and the potential for fine-tuning or adapting the models to your specific use case.
Ultimately, the decision to use an LLM or a dedicated NER model will depend on the trade-offs between performance, efficiency, and the specific requirements of your AI-driven data pipeline.
Relevance of NER
Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP) that involves identifying and classifying named entities in text into predefined categories such as:
- person names
- organizations
- locations
- medical codes
- time expressions
- quantities
- monetary values
- percentages
- etc.
NER is a fundamental component of many NLP pipelines, enabling other NLP and Natural Language Understanding (NLU) tasks. Therefore, evaluating the performance of different Large Language Models (LLMs) on NER tasks is essential.
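To make these categories concrete: a classical NER library such as spaCy tags entities with exactly this kind of label set. A minimal sketch, assuming the en_core_web_sm model is installed:

import spacy

# Small English pipeline that includes an NER component
nlp = spacy.load("en_core_web_sm")

doc = nlp("Tim Cook presented the new iPhone in Cupertino in September.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Tim Cook PERSON", "Cupertino GPE"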
In this analysis, I test the performance of several LLMs on the NER task. Before discussing the results, a quick reminder of the basics: LLMs are deep learning models trained on vast amounts of text data to generate human-like text.
Test Prompt
Running an NLP task like NER with an LLM is pretty straightforward. It just involves formulating a prompt:
Find and classify the entities in this text: camera positions for stable diffusion, return as JSON:
{
  "entities": [
    {
      "entity": "<name>",
      "label": "<NER label>"
    }
  ]
}
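Such a prompt can also be sent programmatically. The following is a minimal sketch using the OpenAI Python client; the client setup and model name are assumptions, since the experiments here were run through the chat interfaces and workbenches:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "Find and classify the entities in this text: "
    "camera positions for stable diffusion, return as JSON: "
    '{"entities": [{"entity": "<name>", "label": "<NER label>"}]}'
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)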
Test Results
LLMs return not only JSON but also surrounding text, so for an analytical pipeline you need to extract the JSON from the result. In the following, I will focus on the JSON results; a small extraction sketch is shown below.
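A minimal extraction sketch; the greedy regex below is an assumption and not robust against multiple or malformed JSON objects:

import json
import re

def extract_json(response_text):
    """Return the first {...} block in the response that parses as JSON."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

raw = 'Here are the entities: {"entities": [{"entity": "stable diffusion", "label": "PRODUCT"}]}'
print(extract_json(raw))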
GPT-4 (GPT-4, ChatGPT)
First Iteration
{
  "entities": [
    {
      "entity": "camera positions",
      "label": "NORP"
    },
    {
      "entity": "stable diffusion",
      "label": "EVENT"
    }
  ]
}
Second Iteration
{
  "entities": [
    {
      "entity": "camera positions",
      "label": "Concept"
    },
    {
      "entity": "stable diffusion",
      "label": "Technology"
    }
  ]
}
Observation
GPT-4 recognizes the entities but returns completely different labels for precisely the same prompt between runs. It also delivers a different amount of surrounding text per iteration, sometimes short and other times rather elaborate.
Claude (claude-3-opus-20240229, Workbench)
Temperature: 0; setting temperature to 1 does not change the result.
First Iteration, temperature=0:
{
  "entities": [
    {
      "entity": "camera",
      "label": "PRODUCT"
    },
    {
      "entity": "stable diffusion",
      "label": "PRODUCT"
    }
  ]
}
Second Iteration, temperature=0:
{
  "entities": [
    {
      "entity": "camera",
      "label": "PRODUCT"
    },
    {
      "entity": "stable diffusion",
      "label": "PRODUCT"
    }
  ]
}
Observation
Claude returns the same result at temperature=0, and even raising the temperature does not change it. The output is perfectly consistent, but "PRODUCT" is not a helpful label here.
Mistral / Le Chat
First Iteration
{
  "entities": [
    {
      "entity": "camera positions",
      "label": "Equipment"
    },
    {
      "entity": "stable diffusion",
      "label": "Process"
    }
  ]
}
Second Iteration
{
  "entities": [
    {
      "entity": "camera positions",
      "label": "Equipment"
    },
    {
      "entity": "stable diffusion",
      "label": "Process"
    }
  ]
}
Observation
Mistral returns just the JSON, without any surrounding text. The labels ("Equipment" and "Process") stay the same between iterations.
Hugging Face Model
Using the NER model (dslim/bert-base-NER) hosted on Hugging Face yields no complete result because "camera positions for stable diffusion" is not a proper sentence (result formatted):
[
  {
    "entity": "MISC",
    "word": "Stable Diffusion"
  }
]
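For reproduction, the model can also be run locally via the transformers pipeline API; aggregation_strategy="simple" merges sub-word tokens back into whole entities:

from transformers import pipeline

# Token-classification pipeline with the same model as above
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(ner("camera positions for stable diffusion"))
# e.g. a single MISC entity for "stable diffusion"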
Role of temperature
In the context of Large Language Models (LLMs), temperature is a hyperparameter that controls the randomness of the model’s output. A temperature value of 0 means the model will always return the most likely output, while a value closer to 1 makes the output more diverse and less predictable.
The optimal temperature value depends on the specific use case and the desired level of randomness in the model’s output.
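As an illustration, a sketch of how the temperature is set via the Anthropic Python SDK; the surrounding setup is an assumption:

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

prompt = "Find and classify the entities in this text: camera positions for stable diffusion, return as JSON"

for temperature in (0.0, 1.0):
    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=512,
        temperature=temperature,  # 0 = most deterministic, 1 = more diverse
        messages=[{"role": "user", "content": prompt}],
    )
    print(temperature, message.content[0].text)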
Interestingly, the temperature did not influence Claude's result: one would expect a higher temperature to deliver more diverse labels, but it did not here.
Potential architectures
A hybrid approach that fuses the results of different LLMs and local AI models could produce a more stable result, as sketched below. However, this would require harmonizing the labels and defining an evaluation method.
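As a sketch of the fusion idea, a simple majority vote over the per-entity labels of several models; the inputs are shaped like the results above, and with fully disjoint label sets the vote degenerates into a tie, which is exactly why label harmonization comes first:

from collections import Counter

def fuse_labels(results):
    """Pick the most frequent label per entity across several model outputs."""
    votes = {}
    for result in results:
        for item in result["entities"]:
            votes.setdefault(item["entity"], Counter())[item["label"]] += 1
    return {entity: counter.most_common(1)[0][0] for entity, counter in votes.items()}

gpt = {"entities": [{"entity": "stable diffusion", "label": "Technology"}]}
claude = {"entities": [{"entity": "stable diffusion", "label": "PRODUCT"}]}
mistral = {"entities": [{"entity": "stable diffusion", "label": "Process"}]}

# Ties (as here) are resolved arbitrarily; harmonized labels would fix this.
print(fuse_labels([gpt, claude, mistral]))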
Conclusion
All LLMs identify the entities reliably, but they do not classify them correctly and consistently in all cases.
Using a dedicated NER model could make sense if the input consists of whole sentences. However, new or unknown terms might always be classified as miscellaneous.
From a performance perspective, Mistral was the most responsive, while the performance of a local model depends on the hardware available.
To assess the LLMs’ generalization capabilities, testing them on a larger dataset with more diverse entities and contexts would be beneficial. This would provide a more comprehensive analysis of their performance on NER tasks and their potential usefulness in AI-driven data pipelines.
In conclusion, while LLMs can help quickly generate results on NER tasks, their performance may not be reliable or consistent across different contexts and entity types. Practitioners and researchers working on AI-driven data pipelines should consider the trade-offs between performance, efficiency, and specific requirements when deciding whether to use an LLM or a dedicated NER model.
Sources:
- https://arxiv.org/abs/2304.10428
- https://arxiv.org/abs/2402.10573
- https://www.tandfonline.com/doi/full/10.1080/19312458.2024.2324789
- https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocad259/7590607?login=false
- https://discuss.huggingface.co/t/extract-data-from-text-and-parse-it-as-a-json/64971
- https://pub.aimind.so/prompts-masterclass-output-formatting-json-5-3a5c177a9095