Two advances in language models that have recently become mainstream are image analysis and structured output. We decided to run a Halloween-themed evaluation of these two modes for OpenAI's and Anthropic's models.
Structured output modes constrain language models to generate output that conforms to a schema. This is great when using models to call functions, or to extract data according to a template. Image analysis lets models such as Claude and GPT provide detailed textual responses based on input images. To test these capabilities, we built LLM-powered pumpkin carving judges using Anthropic's Claude 3.5 Sonnet and OpenAI's GPT-4o, and evaluated them on some AI-generated pumpkin images.
Images
We built a pipeline using Claude (for prompt generation) and the FLUX.1-schnell image generation model1 to generate images of carved pumpkins. Claude was asked to generate a detailed description of a pumpkin with the scariness, creativity, and skill specified as in the table below:
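As a sketch, the prompt-generation step looks something like the following, assuming the anthropic Python SDK; the model ID, helper name, and prompt wording here are illustrative rather than our exact setup:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def describe_pumpkin(scariness: str, creativity: str, skill: str) -> str:
    """Ask Claude for a detailed image prompt for one attribute combination."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                "Write a detailed visual description of a carved Halloween "
                f"pumpkin with {scariness} scariness, {creativity} creativity, "
                f"and {skill} carving skill. Describe only what is visible."
            ),
        }],
    )
    return response.content[0].text
```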
The detailed descriptions were passed to the Flux model to generate images. Two examples are shown below, along with the prompts used.
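The generation step itself is only a few lines with the diffusers FluxPipeline; a sketch, using typical FLUX.1-schnell settings rather than our exact parameters:

```python
import torch
from diffusers import FluxPipeline

# Load the distilled "schnell" variant of FLUX.1 (footnote 1)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

prompt = describe_pumpkin("high", "low", "high")  # from the sketch above
image = pipe(
    prompt,
    guidance_scale=0.0,      # schnell is guidance-distilled, so this stays at 0
    num_inference_steps=4,   # schnell is tuned for very few denoising steps
    max_sequence_length=256,
).images[0]
image.save("pumpkin_high_low_high.png")
```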
We generated images for all possible combinations of skill, creativity, and scariness, for a total of 32 images.
Rating
Both GPT-4o and Claude 3.5 Sonnet can generate output based on a provided schema. We created the following class to represent the rating template. We ask the model to provide a score for originality, skill, and scariness, each with an explanation, and also to generate a short story about the pumpkin.
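A sketch of the schema as a Pydantic model; our exact field names and descriptions may differ:

```python
from pydantic import BaseModel, Field

class PumpkinRating(BaseModel):
    """Rating template for one carved-pumpkin image."""
    originality_score: int = Field(ge=1, le=5, description="1 = derivative, 5 = highly original")
    originality_explanation: str = Field(description="Why this originality score was given")
    skill_score: int = Field(ge=1, le=5, description="1 = crude carving, 5 = expert carving")
    skill_explanation: str = Field(description="Why this skill score was given")
    scariness_score: int = Field(ge=1, le=5, description="1 = cute, 5 = terrifying")
    scariness_explanation: str = Field(description="Why this scariness score was given")
    story: str = Field(description="A short story about the pumpkin")
```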
With this schema and the images, we can follow the model providers' examples for data extraction2,3 (they sometimes call it tool use, to try to make it sound like intelligence) and image analysis4,5.
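Putting the schema and an image together, the call against the Anthropic API looks roughly like this; the tool name and prompt text are illustrative:

```python
import base64

def rate_pumpkin(image_path: str) -> PumpkinRating:
    """Have Claude rate one pumpkin image, returning a schema-shaped result."""
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode()

    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        tools=[{
            "name": "record_rating",
            "description": "Record a structured rating of a carved pumpkin.",
            "input_schema": PumpkinRating.model_json_schema(),
        }],
        # Force the model to answer via the tool, guaranteeing schema-shaped output
        tool_choice={"type": "tool", "name": "record_rating"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                {"type": "text", "text": "Rate this pumpkin carving."},
            ],
        }],
    )
    # With a forced tool call, the first content block holds the tool input
    return PumpkinRating.model_validate(response.content[0].input)
```

On the OpenAI side, `client.beta.chat.completions.parse(..., response_format=PumpkinRating)` achieves the same thing with less ceremony.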
Below is another example of an image and the output JSON structure that one of the models generates:
INPUT image:
OUTPUT Rating for an example pumpkin:
This output format is very convenient because it is rigidly structured. Reading the responses, it is clear that the model (in this case Claude) picks up subtle details from the image and "sees" it much as a person would.
Evaluation
For a production system it might be desirable to do a full analysis covering robustness, security, bias, and other concerns. Here we just want to run some simple checks to understand how the systems perform at their task.
Accuracy
The first check compares the model ratings to a "ground truth" rating to see how accurate each system is. During the data generation process, we specified the scariness, creativity, and skill of each design (creativity serving as the reference for the models' originality scores). These become the ground truth: we assign a 1/5 rating to "low" and a 5/5 to "high" and then look at the correlation with the scores the models generate for the 32 images. The correlation values for the three criteria are shown below:
The correlation value ranges from -1 to 1, with 1 being a perfect match between ground truth and prediction. The GPT- and Claude-based systems perform almost identically here, with Claude marginally ahead on two of the three criteria, but not materially. For both models, the degree of correlation depends on the aspect of the pumpkin being judged: scariness correlates highly (~0.9), originality sits around 0.5, indicating some predictive power, and skill around 0.25, indicating a much weaker relationship.
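For reference, each of these numbers is just a Pearson correlation over the 32 images; a minimal sketch, with illustrative placeholder data in place of the levels and ratings collected in the earlier steps:

```python
from scipy.stats import pearsonr

LEVEL_SCORE = {"low": 1, "high": 5}

# One entry per image: the level we asked for, and the model's score.
# Illustrative values; in practice these come from the generation and rating steps.
requested_levels = ["low", "high", "high", "low"]
model_scores = [2, 5, 4, 1]

ground_truth = [LEVEL_SCORE[level] for level in requested_levels]
r, _ = pearsonr(ground_truth, model_scores)
print(f"scariness correlation: {r:.2f}")
```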
One reason for the difference could be ambiguity in the meaning of originality and skill. Anecdotally, some of the "low skill" images were still rated highly because they contained "clean lines," even if the overall design was misshapen. Also, since we used synthetic data, we are really evaluating the consistency between the image generation model's (Flux's) interpretation of these qualities and that of the LLMs.
Hallucination
A hot topic that's relevant to pumpkin judging is whether a model hallucinates or confabulates (makes up) details in its response. Since the models generate comments alongside their ratings, we can inspect these to see whether they take any liberties.
To run this check, we separately asked Claude 3.5 Sonnet to generate a detailed description of each image. Here is an example for a high-skill, not scary, not creative pumpkin:
Image:
Description:
These descriptions serve as a reference for comparison with the explanations given in the rating. They are of course LLM-generated themselves, and so may be imperfect, but they can still serve as a useful reference.
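Generating the reference descriptions mirrors the earlier extraction call, minus the schema; a sketch, with an illustrative prompt and the `client` from the sketches above:

```python
def reference_description(image_b64: str) -> str:
    """Ask Claude for a literal, detailed description of one pumpkin image."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                {"type": "text",
                 "text": "Describe this image in exhaustive, literal detail. "
                         "Mention only what is actually visible."},
            ],
        }],
    )
    return response.content[0].text
```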
To evaluate the truthfulness of the explanations, we use the RAGAS faithfulness score6. The score has limitations that we won't get into here, but it measures whether a statement (here, a rating comment) is supported by a provided context passage (here, the detailed image description), which makes it a serviceable basis for comparison.
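Scoring a single comment looks roughly like this, assuming the 0.1-style ragas API and the helpers sketched above; the question phrasing is an illustrative placeholder:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

# `rating` and `reference_description` come from the earlier sketches
data = Dataset.from_dict({
    "question": ["How skillfully is this pumpkin carved?"],  # illustrative phrasing
    "answer": [rating.skill_explanation],                    # rating comment under test
    "contexts": [[reference_description(image_b64)]],        # detailed image description
})
scores = evaluate(data, metrics=[faithfulness])
print(scores)  # dict-like result, e.g. {'faithfulness': ...}
```

The chart below compares how the two systems did.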
Here, GPT-4o edges out Claude, except on skill, where there is a material difference between GPT's 0.63 and Claude's 0.72. Overall the scores range from about 0.63 to 0.77. Although a score of 1.0 technically indicates that the comments are fully supported by the description, in practice we typically find scores come in lower, often for pedantic reasons, so it is more useful to compare relative scores. All told, the rating comments are generally well supported, and we don't see any concerns about their truthfulness.
Conclusions
We built and evaluated data extraction systems using the structured output and vision capabilities of two LLMs. Both performed similarly at rating pumpkins, and in our judgement there are no major issues with their truthfulness. Both are skilled judges of how scary a pumpkin is; when it comes to originality and skill, their abilities are more questionable. Such issues often arise from ambiguity in the requests made to the models, so it may be worth refining the prompts to spell out what constitutes high and low skill and creativity.
Check out our video walkthrough with Andrew!
1 https://huggingface.co/black-forest-labs/FLUX.1-schnell
2 https://platform.openai.com/docs/guides/function-calling
3 https://docs.anthropic.com/en/docs/build-with-claude/tool-use
4 https://platform.openai.com/docs/guides/vision
5 https://docs.anthropic.com/en/docs/build-with-claude/vision
6 https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html