Introduction:
In today’s rapidly evolving technological landscape, Large Language Models (LLMs) such as GPT-4 are becoming increasingly sophisticated. However, these models are not without limitations; one of the most significant is confabulation, where the model generates information that is not factual or accurate. To combat this, we used the OpenAI Evals framework, a tool designed for systematically testing and improving the performance of LLMs. In this post, we will walk through our process of creating an evaluation that pairs unsolvable questions with contextual information, aimed at reducing confabulation in GPT-4.
1. Defining Confabulation and Unsolvable Questions:
Confabulation refers to the generation of information by the language model that is not factual or substantiated. Our primary objective with the “Unsolvable Questions Evaluation” was to assess GPT-4’s ability to discern and respond appropriately to unsolvable questions presented alongside contextual information, thereby exposing its tendency to confabulate. As more people build context-based integrations on top of GPT, it becomes important that the model answers questions truthfully with respect to the provided context.
2. Harnessing the Power of OpenAI Evals:
To assess GPT-4’s performance effectively, we turned to the OpenAI Evals framework. It gave our evaluation a structured footing, ensuring we targeted and measured key performance indicators systematically.
3. Building the Dataset:
The dataset used for our evaluation was a modified version of the Stanford Question Answering Dataset (SQuAD), a collection of over 150,000 crowd-sourced questions based on Wikipedia articles. Our focus was on the SQuAD2.0 dataset, which includes over 50,000 unsolvable questions. In order to adapt this dataset for our purposes, we reformatted it into a chat format that GPT-4 could understand, concentrating on 318 samples containing unsolvable questions.
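For illustration, a single converted sample might look roughly like the following. The `input`/`ideal` keys follow the OpenAI Evals sample format, but the system prompt and the “Unanswerable” label shown here are assumptions rather than the exact strings in our dataset:

```js
// Illustrative shape of one line in samples.jsonl. The system prompt and the
// "Unanswerable" label are assumptions used for this example only.
const sample = {
  input: [
    {
      role: "system",
      content:
        "Answer the question using only the provided context. " +
        "If the context does not contain the answer, reply with \"Unanswerable\".",
    },
    {
      role: "user",
      content: "Context: <paragraph from Wikipedia>\n\nQuestion: <question>",
    },
  ],
  ideal: "Unanswerable",
};
```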
This section will detail the technical process of how we extracted a unique set of questions from the larger SQuAD2.0 dataset. We’ll provide a walkthrough of the Node.js script used to parse through the dataset and curate our subset of examples.
In the initial part of the script (the SQuAD2.0 data converter), we load the `fs` (file system) and `stream` modules from Node.js, read the input file (`train.json`), and prepare an output file stream (`samples.jsonl`) for writing.
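A minimal sketch of that setup, assuming the converter lives alongside `train.json` (names other than the two file names are illustrative):

```js
// convert.js — initial setup (sketch)
const fs = require("fs");
const { Transform } = require("stream");

// Read the raw SQuAD2.0 training data and open the output stream for samples.
const inputFile = fs.readFileSync("train.json", "utf8");
const outputStream = fs.createWriteStream("samples.jsonl");
```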
The next chunk of code is our `Transform` stream, `processLine`, which is used to parse and reformat each line of the input data.
The `transform` function parses each line into JSON and checks for the question’s solvability. It then formats this information into a new JSON object that’s suitable for our chat model.
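A sketch of what such a transform might look like. The shape of each incoming line (a JSON string carrying `context`, `question`, `is_impossible`, and `answers`) is an assumption based on SQuAD2.0’s schema, and the output mirrors the chat format described above:

```js
const { Transform } = require("stream");

// Sketch of the processLine Transform. Each incoming line is assumed to be a
// JSON string holding the paragraph context plus one SQuAD2.0 question.
const processLine = new Transform({
  transform(chunk, encoding, callback) {
    const { context, question, is_impossible, answers } = JSON.parse(
      chunk.toString()
    );

    // Reformat into a chat-style sample for the eval.
    // (System message omitted here for brevity.)
    const sample = {
      input: [
        {
          role: "user",
          content: `Context: ${context}\n\nQuestion: ${question}`,
        },
      ],
      ideal: is_impossible ? "Unanswerable" : answers[0].text,
    };

    callback(null, JSON.stringify(sample) + "\n");
  },
});
```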
Finally, we parse the entire input file (`const parsedFile = JSON.parse(inputFile);`), iterate over its contents, and selectively write the questions we’re interested in to our output file.
The loop traverses each document, paragraph by paragraph, with a focus on creating a balanced and diverse set of questions for the model. We only consider context lengths between 500 and 1500 characters to keep prompts manageable, and we impose a randomness factor to promote diversity in the final selection. For each paragraph that meets these criteria, we look for one solvable and one unsolvable question, if available, and write both to our output file, thereby creating a balanced subset of solvable and unsolvable questions.
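Roughly, the loop could look like this sketch, which reuses `inputFile`, `processLine`, and `outputStream` from the snippets above; the 10% sampling rate is a stand-in for the randomness factor and is purely illustrative:

```js
// Sketch of the selection loop, following SQuAD2.0's data -> paragraphs -> qas
// structure. Selected questions are written into processLine, which is piped
// to the samples.jsonl output stream (an assumed wiring).
const parsedFile = JSON.parse(inputFile);
processLine.pipe(outputStream);

for (const article of parsedFile.data) {
  for (const paragraph of article.paragraphs) {
    const { context, qas } = paragraph;

    // Keep contexts between 500 and 1500 characters, then thin the selection randomly.
    if (context.length < 500 || context.length > 1500) continue;
    if (Math.random() > 0.1) continue; // assumed sampling rate

    // One solvable and one unsolvable question per qualifying paragraph, if available.
    const solvable = qas.find((qa) => !qa.is_impossible);
    const unsolvable = qas.find((qa) => qa.is_impossible);

    for (const qa of [solvable, unsolvable]) {
      if (qa) processLine.write(JSON.stringify({ context, ...qa }) + "\n");
    }
  }
}
processLine.end();
```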
4. Conducting the Evaluation and Highlighting Failures:
After preparing the dataset, we ran the evaluation against GPT-3.5 for efficiency and cost-effectiveness. Through this process, we documented instances where the model was unable to provide accurate answers.
Taking these failure logs, we devised a new script that parses them, identifies the instances where the model could not deliver the right answer, and isolates those cases into a separate file. Running the script across multiple evaluation runs produced a larger dataset covering a broader range of failure cases the model struggled with.
Here’s how the script accomplishes this, step by step.
The script begins by importing the required modules and setting up the streams for the input and output files.
Then we create a `Transform` stream, `processLine`, similar to the one in `convert.js`; here it essentially copies each line from the input file to the output file.
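A compact sketch of that setup and pass-through stream; the input log path is a placeholder, and only the `failure-samples.jsonl` output name comes from the actual script:

```js
// Setup for the failure parser (sketch). The input log path is hypothetical.
const fs = require("fs");
const readline = require("readline");
const { Transform } = require("stream");

const inputStream = fs.createReadStream("eval-log.jsonl"); // hypothetical path
const outputStream = fs.createWriteStream("failure-samples.jsonl");

// Pass-through Transform: every chunk we feed in is written out unchanged.
const processLine = new Transform({
  transform(chunk, encoding, callback) {
    callback(null, chunk);
  },
});
processLine.pipe(outputStream);
```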
Next, we create a function `parseLines(line, previousLine)` that identifies and processes the failures from the log. The function looks for the ‘match’ lines where `correct` is false (meaning the model got the answer wrong), and pairs each one with its preceding ‘prompt’ line to preserve the context of the failed question.
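One way `parseLines` might be written, reusing the `processLine` stream above. The log schema used here (a `type` field plus a nested `data` object with `correct`, `prompt`, and `expected`) is an assumption about the eval log format, not a documented contract:

```js
// Sketch of parseLines. The log line schema is an assumed format.
const parseLines = (line, previousLine) => {
  const current = JSON.parse(line);

  // Only failed 'match' lines are interesting.
  if (current.type !== "match" || current.data.correct) return;

  // The preceding 'prompt' line carries the context of the failed question.
  const previous = previousLine ? JSON.parse(previousLine) : null;
  if (!previous || !previous.data || !previous.data.prompt) return;

  const failureSample = {
    input: previous.data.prompt,
    ideal: current.data.expected, // assumed field name for the expected answer
  };

  processLine.write(JSON.stringify(failureSample) + "\n");
};
```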
Finally, we set a `lineLimit` of 1500 and use Node’s `readline` interface to read the input file line by line, calling `parseLines` for each one. The results are piped to the output file.
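A sketch of the read loop, reusing the pieces above; exactly how `lineLimit` caps the processing is an assumption:

```js
// Sketch of the read loop: walk the log line by line, up to lineLimit lines,
// handing each line and its predecessor to parseLines.
const lineLimit = 1500;
let previousLine = null;
let lineCount = 0;

const rl = readline.createInterface({ input: inputStream });

rl.on("line", (line) => {
  lineCount += 1;
  if (lineCount > lineLimit) return rl.close();

  parseLines(line, previousLine);
  previousLine = line;
});

// End the pass-through stream (and the output file) once reading finishes.
rl.on("close", () => processLine.end());
```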
This script produces a file named `failure-samples.jsonl`, which contains all the failure cases from the logs. These examples can be further combined with others from additional runs to create a robust set of challenging samples for the model to improve upon.
5. Understanding the Importance of the Evaluation:
GPT-4, with its advanced capabilities, is shaping up to be a potent learning assistant. Evaluating its ability to discern solvable from unsolvable questions based on the provided context is crucial, especially as we develop more advanced applications on top of this technology. This evaluation not only surfaces potential shortcomings but also measures GPT-4’s ability to recognize trick questions that cannot be answered from the provided context, instead of confabulating an answer.
As GPT-4 becomes accessible to individual developers and organizations, it is being deployed in more complex, context-based workflows. An evaluation like ours provides valuable insights and learnings, benefiting the wider community.
6. Sharing the Results of the Evaluation:
Upon discovering questions that stumped GPT-3.5, we set out to test GPT-4’s performance on them and submit the results to OpenAI to help improve the model. Our initial evaluation showed GPT-4 outperforming its predecessor, GPT-3.5, with an accuracy of 0.61 compared to 0.01.
Our “Unsolvable Questions Evaluation” highlighted both GPT-4’s progress in discerning difficult questions and its lingering tendency to confabulate. Using OpenAI Evals, we gained valuable insight into how large language models can be improved.
We look forward to OpenAI’s response to our experiment and to contributing to the evolution of robust AI systems. For a closer look at our evaluation and its updates, see the open pull request. Together, let’s drive AI innovation responsibly.