Responsible Use of LLMs in Research: Moving Beyond the Hype
13 January 2025
Ryan Daniels & Catherine Breslin, Accelerate Programme Machine Learning Engineering Team
24 September 2024
The technology and use cases for Large Language Models (LLMs) are advancing at a rapid pace, providing new opportunities in research. Understanding how they work and how to use this game-changing technology can be daunting.
As part of our work to accelerate the use of machine learning in research, the Accelerate Programme for Scientific Discovery has published a suite of resources, based on an initial AI and LLM study group and workshop designed to give researchers the knowledge and tools to use LLMs with confidence.
Getting started
The first step for researchers who want to use an LLM in their research is to carefully define what they would like it to do, whether that is debugging code or classifying text. There are plenty of models to choose from, ranging from well-known LLMs such as ChatGPT and Gemini to smaller, niche models.
Experimenting with prompting an LLM in different ways is a good way to understand its capabilities and whether it can handle tasks at a high enough standard to be useful for research. Many models are released with model cards, which are a great resource for understanding the details of specific models. With some iteration, it is possible to produce specific inputs or prompts that give helpful outputs. For example, a researcher might ask a model to summarise a longer document or translate text into another language.
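One way to make this kind of prompt iteration systematic is to keep the candidate prompts as templates and fill them with the same input, so that the model's responses can be compared side by side. The sketch below is illustrative only: the template wording, the names PROMPT_VARIANTS and build_prompts, and the max_sentences parameter are assumptions, not taken from the workshop material, and the step of sending each prompt to a model is left out because it depends on the model chosen.

```python
# Hypothetical prompt templates for a summarisation task; the wording is
# illustrative and should be adapted to the research question.
PROMPT_VARIANTS = {
    "plain": "Summarise the following text:\n\n{text}",
    "constrained": (
        "Summarise the following text in at most {max_sentences} sentences "
        "for a non-specialist reader:\n\n{text}"
    ),
}


def build_prompts(text: str, max_sentences: int = 3) -> dict[str, str]:
    """Fill every template with the same input text.

    Returns a mapping from variant name to the finished prompt, ready to be
    sent to whichever model is being evaluated.
    """
    return {
        name: template.format(text=text, max_sentences=max_sentences)
        for name, template in PROMPT_VARIANTS.items()
    }


if __name__ == "__main__":
    for name, prompt in build_prompts("LLMs are advancing quickly.").items():
        print(f"--- {name} ---\n{prompt}\n")
```

Keeping the variants in one place makes it easier to record which wording produced which output.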
Researchers may experiment with several LLMs before finding one that best suits their needs. There is a wide range available and plenty of factors to consider, including:
If a scientist’s research involves fine-tuning LLMs, there are additional challenges to understand, such as how to structure training data to get the most out of the model.
With a single task and a starting model in hand, researchers can experiment to see how well the model performs. It’s always advisable to start by prompting the model, or by using a Retrieval Augmented Generation (RAG) approach, which is designed to improve the efficacy of LLMs by retrieving data or documents relevant to a task and providing them as context for the LLM. These options do not involve training the model, so they are easier to get started with.
Pre-trained models can be run on local machines and adapted to specific tasks, but if these approaches do not provide the performance needed, researchers can fine-tune a model to perform a specific task. There are resources in our GitHub repository to help, including prompting examples and tips for fine-tuning, such as how to set up data.
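The RAG idea can be sketched in a few lines: pick the documents most relevant to a question, then place them in the prompt as context. The example below is a minimal sketch under simplifying assumptions. It ranks documents by simple word overlap, whereas practical systems usually use embedding-based similarity search, and the function names and prompt wording are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, document: str) -> int:
    """Count the word occurrences shared by the query and the document."""
    shared = Counter(tokenize(query)) & Counter(tokenize(document))
    return sum(shared.values())


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents with the highest overlap score.

    Word overlap stands in for the embedding similarity a real system
    would typically use.
    """
    ranked = sorted(documents, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:k]


def build_rag_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Assemble a prompt that supplies the retrieved documents as context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    docs = [
        "Model cards describe the intended use of a model.",
        "PyTorch is a deep learning library.",
    ]
    print(build_rag_prompt("What do model cards describe?", docs, k=1))
```

The resulting prompt is then passed to the LLM in the usual way, and no model weights are changed at any point.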
Evaluating an LLM
To thoroughly evaluate how well a model is doing, researchers will need to settle on a formal evaluation method. For some tasks, that can be straightforward. For example, if examples have a fixed output format, as with a maths question, researchers can automatically run a dataset of examples through the model and compute the accuracy of the LLM’s responses. However, researchers must ensure that none of the data used for evaluating the model was also used in training; otherwise it is a little like seeing the exam questions while revising.
Other tasks, such as translating text from one language to another, have many possible valid outputs, making it harder to evaluate whether a response is correct or not. In this case, it may be necessary to see how well the LLM output works for some other downstream task, or to manually judge outputs to ensure a model is not hallucinating.
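For fixed-format tasks, both checks described above are easy to automate. The following is a minimal sketch, assuming exact-match scoring after light normalisation; the function names are hypothetical, and the leakage check only catches verbatim (case- and whitespace-insensitive) duplicates, not paraphrases.

```python
def normalise(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(answer.strip().lower().split())


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(normalise(p) == normalise(r) for p, r in zip(predictions, references))
    return correct / len(references)


def leaked_examples(train_inputs: list[str], eval_inputs: list[str]) -> list[str]:
    """Return evaluation inputs that also appear in the training data."""
    train_set = {normalise(x) for x in train_inputs}
    return [x for x in eval_inputs if normalise(x) in train_set]


if __name__ == "__main__":
    model_answers = ["4", " 9 ", "12"]   # hypothetical model outputs
    gold_answers = ["4", "9", "13"]      # hypothetical reference answers
    print(f"accuracy: {accuracy(model_answers, gold_answers):.2f}")
    print(leaked_examples(["What is 2+2?"], ["what is 2+2?", "What is 3+3?"]))
```

Any evaluation example flagged by the leakage check should be removed before the accuracy is reported.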
Responsible innovation
Using LLMs responsibly in research and innovation was a hot topic in the workshop. It is important to consider some of the limitations of LLMs, such as their tendency to exhibit bias or to hallucinate. Hallucination, or confabulation, is when a model generates a response that is not based on real-world data, for example generating a fictitious journal article title in response to a literature search query. Human oversight in safety-critical areas is always necessary, and thoughtful curation of data can help reduce bias. As with all research, it is desirable to involve end users, gather diverse perspectives, and think about accountability and human oversight of your work.
Responsible use of LLMs is linked to good software engineering practices, such as test-driven development. Good code can make research easier to reproduce and give others more confidence in your work. Open-source libraries such as those from Hugging Face offer LLMs that can be a good base for researchers to build upon, while careful testing helps ensure they are working as desired. There are resources on good software practices in our GitHub repository.
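As a small illustration of test-driven development around an LLM, consider a classification task where the model is expected to reply with one of a fixed set of labels. The tests pin down how raw model output should be handled before any downstream analysis relies on it. This is a sketch under assumptions: the label set, the parse_label helper and the test names are hypothetical and not part of the published code.

```python
import unittest

# Hypothetical label set for a sentiment classification task.
ALLOWED_LABELS = ("positive", "negative", "neutral")


def parse_label(raw_output: str) -> str:
    """Turn raw model output into one of the allowed labels.

    Surrounding whitespace, letter case and a trailing full stop are
    ignored; anything else raises ValueError so that unexpected output is
    noticed rather than silently counted.
    """
    cleaned = raw_output.strip().lower().rstrip(".")
    if cleaned not in ALLOWED_LABELS:
        raise ValueError(f"Unexpected model output: {raw_output!r}")
    return cleaned


class ParseLabelTests(unittest.TestCase):
    def test_accepts_label_with_noise(self):
        self.assertEqual(parse_label(" Positive.\n"), "positive")

    def test_rejects_free_text(self):
        with self.assertRaises(ValueError):
            parse_label("I think this review is mostly positive")
```

Saved in a file, the tests can be run with python -m unittest. Writing them first makes the expected behaviour explicit before a model is ever called.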
Code for progress
To build upon the workshop and help researchers get started, we have packaged and published some Python code that they can use as a starting point for developing their own projects if they wish.
The code is distributed under the MIT license, allowing researchers to freely take the code and modify it in any way. Certain components of the code, such as running a training loop in PyTorch, are easy to apply to many projects, so users need only swap in their own model. This is relatively simple because the Hugging Face library makes it easy to swap between different types of models without changing much of the other code. We hope this core material will be useful and that researchers will share their modifications to benefit others beginning to use LLMs. The code is available on GitHub for anyone to access, and we hope to spread the word so it can be used widely.
In addition to the resources available on GitHub, the AI Clinic is available to all researchers at the University of Cambridge and can offer support with approaches, methodology and troubleshooting at all stages of the research pipeline. Get in touch (accelerate-mle@cst.cam.ac.uk) if you would like to speak to one of our experts. You can also sign up to be notified of the next LLMs course here.