Can you ace this Science Exam written by GPT-3.5?

Inspired by the ability of LLMs to self-benchmark, this project explores the intriguing question of whether smaller question-answering models can outperform the much larger models that wrote the questions.


Question:

Which of the following statements accurately describes the impact of Modified Newtonian Dynamics (MOND) on the observed "missing baryonic mass" discrepancy in galaxy clusters?

A. MOND is a theory that reduces the observed missing baryonic mass in galaxy clusters by postulating the existence of a new form of matter called "fuzzy dark matter."

B. MOND is a theory that increases the discrepancy between the observed missing baryonic mass in galaxy clusters and the measured velocity dispersions from a factor of around 10 to a factor of about 20.

C. MOND is a theory that explains the missing baryonic mass in galaxy clusters that was previously considered dark matter by demonstrating that the mass is in the form of neutrinos and axions.

D. MOND is a theory that reduces the discrepancy between the observed missing baryonic mass in galaxy clusters and the measured velocity dispersions from a factor of around 10 to a factor of about 2.

E. MOND is a theory that eliminates the observed missing baryonic mass in galaxy clusters by imposing a new mathematical formulation of gravity that does not require the existence of dark matter.


Project Description

Outline

Inspired by the remarkable self-benchmarking potential of Large Language Models (LLMs), this project explores a captivating question: can smaller question-answering models surpass the significantly larger models that specialize in writing the questions?

Dataset

The dataset for this project was generated by giving GPT-3.5 snippets of text on a range of scientific topics pulled from Wikipedia and asking it to write multiple-choice questions along with their correct answers.
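For concreteness, the generation step might look like the sketch below. This is illustrative only: the exact prompt wording, model parameters, and client version used in the project are not documented here.

```python
# Minimal question-generation sketch. Assumptions: the `openai` Python
# client (v1 API); the prompt wording is hypothetical, not the project's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def write_question(snippet: str) -> str:
    prompt = (
        "Using only the text below, write one five-option multiple-choice "
        "science question (options A-E) and state the correct answer.\n\n"
        f"Text: {snippet}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```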

To further enhance the performance of the smaller model, additional datasets were collected and used for fine-tuning. These additional datasets significantly expanded the pool of sample questions, from the initial 200 to 142,338 in total. This extensive dataset allowed for more comprehensive training and refinement of the smaller model's ability to answer multiple-choice questions across various scientific domains.
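A fine-tuning setup along these lines could use the Hugging Face multiple-choice head, as sketched below. The model name, file layout, and hyperparameters are assumptions; the page does not specify them.

```python
# Fine-tuning sketch. Assumptions (not specified on this page): the smaller
# model is microsoft/deberta-v3-base, and train.csv has columns
# prompt, A, B, C, D, E, answer.
from datasets import load_dataset
from transformers import (AutoModelForMultipleChoice, AutoTokenizer,
                          Trainer, TrainingArguments)

OPTIONS = ["A", "B", "C", "D", "E"]
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def preprocess(example):
    # Encode the question paired with each of its five candidate answers;
    # fixed-length padding keeps every example the same shape for batching.
    enc = tokenizer([example["prompt"]] * 5,
                    [example[o] for o in OPTIONS],
                    truncation=True, padding="max_length", max_length=256)
    enc["label"] = OPTIONS.index(example["answer"])
    return enc

dataset = load_dataset("csv", data_files="train.csv")["train"]
dataset = dataset.map(preprocess, remove_columns=dataset.column_names)

model = AutoModelForMultipleChoice.from_pretrained("microsoft/deberta-v3-base")
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mcq-model",
                           per_device_train_batch_size=8,
                           learning_rate=2e-5,
                           num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()
```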

Performance Metric

The evaluation metric is Mean Average Precision at 3 (MAP@3). For each question, the model submits a ranked list of up to three answers. In the formula below, U is the number of questions, n is the number of predictions per question, P(k) is the precision at cutoff k, and rel(k) is 1 if the answer at rank k is correct and 0 otherwise. Because each question has exactly one correct answer, a question contributes 1, 1/2, or 1/3 depending on where the correct answer appears, and 0 if it is missed entirely.

\[
\text{MAP@3} = \frac{1}{U} \sum_{u=1}^{U} \sum_{k=1}^{\min(n,3)} P(k) \times \mathrm{rel}(k)
\]
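Since every question has a single correct answer, the double sum collapses to 1/rank of the first correct prediction. A small reference implementation (illustrative, not the official scorer):

```python
# MAP@3 sketch: with one correct answer per question, the score for a
# question is 1/k where k is the rank of the correct answer (k <= 3), else 0.
def map_at_3(predictions, answers):
    """predictions: ranked option lists, e.g. [["D", "E", "A"], ...]
       answers: correct options, e.g. ["D", ...]"""
    total = 0.0
    for preds, truth in zip(predictions, answers):
        for k, p in enumerate(preds[:3], start=1):
            if p == truth:
                total += 1.0 / k  # P(k) * rel(k) reduces to 1/k
                break             # only the first hit scores
    return total / len(answers)

print(map_at_3([["D", "E", "A"], ["B", "A", "C"]], ["D", "A"]))  # (1 + 1/2) / 2 = 0.75
```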

Project Conclusion

Results

The smaller question-answering model achieved a MAP@3 score of 0.757, demonstrating strong performance at ranking the correct answer within the top three positions. This result suggests that the knowledge and capabilities of larger models can be distilled into more compact versions without sacrificing performance or generalisability.
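The ranked top-3 predictions that MAP@3 scores can be read directly off the multiple-choice head's logits, roughly as follows (assuming the fine-tuned model and tokenizer sketched above):

```python
# Produce a ranked top-3 prediction for one question; the helper name and
# parameters are hypothetical, mirroring the fine-tuning sketch above.
import torch

def top3(model, tokenizer, prompt, choices, options="ABCDE"):
    enc = tokenizer([prompt] * len(choices), choices, truncation=True,
                    max_length=256, padding=True, return_tensors="pt")
    # AutoModelForMultipleChoice expects (batch, num_choices, seq_len).
    enc = {k: v.unsqueeze(0) for k, v in enc.items()}
    with torch.no_grad():
        logits = model(**enc).logits[0]           # one score per choice
    ranked = logits.argsort(descending=True)[:3]  # best three choices
    return [options[i] for i in ranked]
```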

The most important takeaway from this project is that LLMs are capable of self-benchmarking. It should therefore be possible to continuously improve the performance of LLMs by using adaptive self-evaluation in model training.

Future Work

The achievement of the smaller question-answering model represents a significant milestone in our exploration of model compression and self-benchmarking, and suggests that the future of LLM development lies in harnessing the power of adaptive self-evaluation. Moving forward, our research will focus on refining and automating the self-benchmarking process, enabling models to continuously enhance their performance across diverse tasks and domains. We aim to investigate how self-benchmarking can be applied not only to question-answering models, but also to a broader spectrum of natural language processing tasks. Ultimately, we hope to develop an AI agent that can address a wide range of real-world challenges while minimizing human intervention.