Responsible Use of LLMs in Research: Moving Beyond the Hype
13 January 2025
Ryan Daniels, Machine Learning Engineer - Accelerate Programme
25 April 2023
Machine learning
In the age of data-driven research, modern science inevitably has a heavy computational component. Researchers are often working with terabytes of data, which needs to be collected, processed, and run through their models. In many cases, the research conducted leads to papers published in academic journals. But what about the code that is used to process experimental data? What about the machine learning models that are used to transform data and make inferences that could have long-term impacts in their respective fields?
When mathematicians release a proof of a conjecture, we expect to see the actual proof, not a hand-waving statement of, “Just trust us!” So of course, it is reasonable to expect that the authors of an article with a computational element can demonstrate that their code works. But how do you do this? And furthermore, how can you be sure that your code is doing what you expect? The last thing you want is to have to retract a paper because of a software bug. Testing, packaging and distributing research software can be a complex and time-consuming process, and it can be confusing to figure out exactly which path to take across the packaging terrain.
But what if you didn’t have to travel this path alone? There has been a tidal wave of media coverage, both positive and negative, of OpenAI’s new generative pre-trained transformer, GPT-4. The papers released by OpenAI and Microsoft Research show some seriously impressive capabilities. In my day-to-day job as a machine learning engineer at Cambridge, I assist researchers in solving engineering issues they encounter. For example, our team has recently been working with researcher Caroline Nettekoven and her Netts software, resolving issues related to installation and packaging. So, given the hype surrounding GPT, I wanted to find out two things: firstly, whether large language models (LLMs) like GPT could help researchers solve certain issues without needing the assistance of a software engineer; and secondly, whether they could boost the productivity of software engineers like me who are helping researchers to solve some of these problems.
To investigate this, I first developed a simple Python package to perform linear regression on some data. I then set out to write some test cases, and to package and publish the software. I provided GPT-4 with a series of prompts that a researcher without much experience in software engineering might give it. For a full transcript and the Python package, check out our GitHub page.
Testing
I first tested GPT’s ability to generate test functions for the package using unittest. Overall, the test cases were not terrible, but before addressing the actual tests, it is important to state: the code produced by GPT did not run in its entirety. A number of tests produced errors: when testing a method that splits data into training and testing sets, GPT tried to get values from a function that returned None. And then it had the nerve to tell me to change my function! However, a different error did bring to my attention a mistake in my code (I was performing a transpose operation twice), although I likely would have discovered this when writing my own test cases.
There are two ways to load data in my little application: via a NumPy array, or via a CSV file. When testing the load functions, GPT opts to build an array of data by hand rather than generating it randomly, and the hand-built dataset is quite small. This means that when it comes time to test the function designed to split the data into training and testing sets, there really isn’t a lot of choice about how to split the data, and GPT only tests a single 50/50 split. It doesn’t test the default arguments for the split function, and it only tests the train-test split using a NumPy array, not an array loaded from a CSV file.
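As a rough sketch of what broader coverage for the split might look like, the tests below generate the data randomly, exercise the default arguments, try several split fractions, and repeat the check on data that has been round-tripped through CSV. Everything here is an assumption made for illustration: split_data, make_rows and rows_from_csv are hypothetical stand-ins rather than the functions in my package, and plain Python lists take the place of NumPy arrays so that the example has no dependencies.

```python
import csv
import io
import random
import unittest


def split_data(rows, test_fraction=0.25, seed=0):
    """Hypothetical stand-in for a split function (not the package's own).

    Shuffles a copy of `rows` and returns (train, test).
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def make_rows(n_rows, seed=42):
    """Randomly generate (x, y) rows instead of hand-writing a tiny dataset."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_rows)]


def rows_from_csv(text):
    """Parse two-column CSV text with a header row into tuples of floats."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [tuple(float(value) for value in row) for row in reader]


class TestSplit(unittest.TestCase):
    def setUp(self):
        self.rows = make_rows(200)

    def test_default_arguments(self):
        # The (assumed) default holds out a quarter of the rows.
        train, test = split_data(self.rows)
        self.assertEqual(len(train), 150)
        self.assertEqual(len(test), 50)

    def test_several_fractions(self):
        for fraction in (0.1, 0.2, 0.5, 0.8):
            with self.subTest(fraction=fraction):
                train, test = split_data(self.rows, test_fraction=fraction)
                self.assertEqual(len(test), round(200 * fraction))
                # No row should be lost or duplicated by the split.
                self.assertEqual(sorted(train + test), sorted(self.rows))

    def test_rows_loaded_from_csv(self):
        text = "x,y\n" + "\n".join(f"{x},{y}" for x, y in self.rows)
        loaded = rows_from_csv(text)
        self.assertEqual(loaded, self.rows)
        train, test = split_data(loaded, test_fraction=0.5)
        self.assertEqual(len(train), 100)
        self.assertEqual(len(test), 100)

    def test_invalid_fraction_raises(self):
        for fraction in (0.0, 1.0, -0.5):
            with self.subTest(fraction=fraction):
                with self.assertRaises(ValueError):
                    split_data(self.rows, test_fraction=fraction)
```

Saved to a file, the tests can be run with python -m unittest. The use of subTest means that one failing fraction does not hide the results for the others.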
Similarly, the test cases for the actual linear regression model are simply not comprehensive enough for me to feel that edge cases are covered (more details can be found in the GitHub repo). But overall, it’s a pretty good start, and there is always the option of zeroing in on a specific test case and asking GPT to refine it. In my interactions with GPT-4, I have been impressed by its ability to read error messages and apply changes to the guilty code. However, I have also found that if GPT cannot solve the problem after 2 or 3 errors, it becomes more likely that GPT will produce incoherent output. In particular, GPT will begin going around in circles, “correcting” its code to an earlier version that I had already established as being incorrect.
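To give a flavour of the kind of edge cases I have in mind for the regression model, here is a minimal sketch. It assumes a hypothetical fit_line function that performs a one-variable ordinary least squares fit; it is not the model in my package, and the particular edge cases (mismatched lengths, too few points, identical x values) are my own illustrative choices rather than a list taken from the repo.

```python
import unittest


def fit_line(xs, ys):
    """Hypothetical least squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if len(xs) < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


class TestFitLine(unittest.TestCase):
    def test_recovers_known_line(self):
        # Noise-free data from y = 2x + 1 should give back those coefficients.
        xs = [0.0, 1.0, 2.0, 3.0, 4.0]
        ys = [2.0 * x + 1.0 for x in xs]
        slope, intercept = fit_line(xs, ys)
        self.assertAlmostEqual(slope, 2.0)
        self.assertAlmostEqual(intercept, 1.0)

    def test_mismatched_lengths(self):
        with self.assertRaises(ValueError):
            fit_line([1.0, 2.0], [1.0])

    def test_too_few_points(self):
        with self.assertRaises(ValueError):
            fit_line([1.0], [1.0])

    def test_identical_x_values(self):
        # A vertical stack of points has no well-defined slope.
        with self.assertRaises(ValueError):
            fit_line([3.0, 3.0, 3.0], [1.0, 2.0, 3.0])
```

Checking against data with a known answer, and then against inputs that should be rejected, is a cheap way to catch mistakes like my double transpose before they reach a paper.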
Packaging
I then tested GPT’s ability to publish the above package. GPT was able to provide me with a step-by-step guide that was a little ragged, but good enough to get the package onto TestPyPI. Quite frankly, I was astonished that this actually worked. GPT’s method is not perfect. It uses an older method of packaging Python applications (perhaps a result of the training data cutoff of September 2021), instead of using popular industry tools such as Poetry or build. Interestingly, if prompted to do so, GPT is capable of providing a packaging method that utilises Poetry. Despite some pedantic issues, GPT gets the job done, and is also able to respond to prompts to clarify certain aspects of the process. So if you know absolutely nothing about packaging, it at least gives you a starting point.
Bear in mind that GPT is a stochastic beast, meaning that its output varies, even with identical input. In other words, just because it worked for me does not mean it will work for you. The method that I used to prompt GPT here is just one of many. The standard
Overall I have been immensely impressed with GPT-4’s capabilities, but also urge caution: GPT-4 has some serious limitations, and it is not easy to effectively prompt GPT-4 to provide meaningful output. I have also found a ‘positivity bias’ where GPT tends to assume that your problem has a tractable answer, and then proceeds to give you a comprehensive (and utterly incorrect) breakdown of how to solve it. Having said this, GPT does shine in some areas: solving well documented and commonly occurring coding problems (such as packaging up an application, or producing small functions for a specific task); acting as a type of ‘rubber duck’ that you can bounce ideas off of; and critiquing pieces of code that you have written. It is important to note here that there may be security implications in sharing confidential information with GPT, and you should familiarise yourself with OpenAI’s Terms of use, and the National Cyber Security Centre’s guidelines on using ChatGPT and LLMs. There is also a wider context for the development of these tools, which affects the risks and benefits associated with their use.
I’ll leave you with one final thought. Suppose all research on LLMs stopped today and we made no further progress, so that this is as good as it gets. With the growth of prompt engineering, it seems plausible that we would still become better at providing effective prompts to LLMs, such that we can tease out more accurate information and explanations. So whether or not you think that GPT is reliable, or that it truly “understands”, GPT-4 has the potential to be a genuine force-multiplier for responsible researchers and educators.
For a full transcript and the Python package, check out the GitHub page.
If you are experiencing any issues with your machine learning pipelines or software, please get in touch with the team.