Evaluation of Large Language Models for Discovery of Gene Set Function


Academic Background

In functional genomics, gene set enrichment analysis is a critical methodology for understanding gene functions and their associated biological processes. However, existing enrichment analyses rely heavily on curated gene function databases, such as the Gene Ontology (GO), which have inherent limitations: incomplete coverage and slow update cycles. As a result, many gene sets are inadequately characterized by traditional tools, leaving untapped potential for novel biological insight in such unclassified sets.

In this context, the emergence of generative artificial intelligence (AI), especially Large Language Models (LLMs) like GPT-4, has introduced new opportunities for functional genomics. These models can capture deep semantic relationships from vast textual data and are potentially capable of identifying and summarizing shared functions within gene sets. However, a critical question remains: are these AI models scientifically reliable and robust enough to handle such complex biological challenges? This study aims to address this question.

Source of the Paper

The paper, titled “Evaluation of Large Language Models for Discovery of Gene Set Function,” was authored by Mengzhou Hu, Sahar Alkhairy, Ingoo Lee, Rudolf T. Pillich, and colleagues at the University of California, San Diego, spanning the departments of Medicine, Computer Science and Engineering, and Physics. Published in the January 2025 issue of *Nature Methods*, the study evaluates five major LLMs for their effectiveness in uncovering gene set functions, focusing on how well the models recover functional annotations on validation datasets and how well they gauge their own confidence.


Research Workflow

a) Research Design and Workflow

The team developed a fully automated gene set functional analysis pipeline based on LLMs. This pipeline takes a list of user-provided genes or proteins as input and generates the following outputs:

  1. Proposed Name: A concise biological descriptor summarizing the primary function of the gene set.
  2. Analysis Essay: An explanatory report that justifies the naming, detailing specific gene functions or biological relevance.
  3. Confidence Score: The model’s level of confidence in the results, ranging from 0 to 1.
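The pipeline's core step — prompting a model with a gene list and parsing its structured reply into these three outputs — can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the reply format, and the mock model response are all assumptions (the real study used a seven-section prompt template and live API calls).

```python
import re

def build_prompt(genes):
    """Assemble a simplified prompt asking the model to name a gene set.

    Hypothetical, condensed stand-in for the study's seven-section template.
    """
    return (
        "Propose a concise name for the biological process most represented "
        "by these genes, write a short analysis, and give a confidence score "
        "between 0 and 1.\n"
        "Genes: " + ", ".join(genes) + "\n"
        "Reply in the form:\n"
        "Name: <name>\nAnalysis: <essay>\nConfidence: <score>"
    )

def parse_reply(reply):
    """Extract the proposed name, analysis essay, and confidence score."""
    name = re.search(r"Name:\s*(.+)", reply).group(1).strip()
    analysis = re.search(r"Analysis:\s*(.+)", reply).group(1).strip()
    confidence = float(re.search(r"Confidence:\s*([\d.]+)", reply).group(1))
    return {"name": name, "analysis": analysis, "confidence": confidence}

# A mock model reply, standing in for a real LLM API call:
mock_reply = (
    "Name: DNA damage response\n"
    "Analysis: ATM, BRCA1 and TP53 act in DNA repair signaling.\n"
    "Confidence: 0.85"
)
result = parse_reply(mock_reply)
```

In practice the parsed confidence score is what later allows the pipeline to flag low-certainty or unrelated gene sets automatically.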

The study evaluated five LLMs: GPT-4 and GPT-3.5 (OpenAI), Gemini Pro (Google), Mixtral Instruct (Mistral AI), and Llama 2 70B (Meta). Additionally, standardized prompt templates were carefully designed for each model to improve consistency. These prompts were structured into seven major sections, including task instructions, confidence level guidance, and example references.

To test the performance of these models, the authors curated two types of gene set datasets:

  1. Literature-curated Gene Sets: 1,000 gene sets randomly sampled from the GO biological process (GO-BP) branch.
  2. Omics-derived Gene Sets: A total of 300 gene sets derived from transcriptomics and proteomics datasets.

b) Experimental Methods

  1. Semantic Similarity Measurement:
    The study used the SapBERT model to calculate the semantic similarity between LLM-generated names and GO term names. This metric ranges from 0 to 1, with higher values indicating closer semantic alignment.
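The comparison itself reduces to cosine similarity between embedding vectors: SapBERT maps each name to a dense vector, and closer vectors mean closer meaning. The toy vectors below are placeholders (real SapBERT embeddings are 768-dimensional and come from the transformer model itself); only the similarity computation is shown.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors: the dot
    product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Placeholder embeddings standing in for SapBERT output for an
# LLM-proposed name and a GO term name.
llm_name_vec = [0.8, 0.1, 0.6]
go_term_vec = [0.7, 0.2, 0.5]
similarity = cosine_similarity(llm_name_vec, go_term_vec)
```

Identical directions yield a similarity of 1, so semantically aligned names score near the top of the 0-to-1 range.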

  2. Confidence and Data Contamination Testing:
    To evaluate the LLMs’ ability to identify unrelated gene sets, “contaminated” gene sets (half-randomized) and fully random gene sets were introduced, tracking how model confidence scores varied.
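Constructing such a half-randomized control set is straightforward: keep half of the real genes and swap the rest for genes drawn at random from a background pool. The sketch below is an assumed reconstruction of that procedure — the gene names and pool are illustrative placeholders, not the study's data.

```python
import random

def contaminate(gene_set, gene_pool, fraction=0.5, seed=0):
    """Replace a fraction of a gene set with random genes from a pool.

    Mirrors the idea of the study's "contaminated" (half-randomized)
    controls; setting fraction=1.0 gives a fully random set.
    """
    rng = random.Random(seed)
    n_replace = int(len(gene_set) * fraction)
    kept = gene_set[n_replace:]
    # Draw replacements only from genes not already in the set.
    candidates = [g for g in gene_pool if g not in gene_set]
    random_genes = rng.sample(candidates, n_replace)
    return random_genes + kept

real_set = ["ATM", "BRCA1", "BRCA2", "TP53"]
pool = ["GENE%d" % i for i in range(100)]  # placeholder background genes
contaminated = contaminate(real_set, pool)
```

Feeding such sets back through the pipeline and tracking the reported confidence is what reveals whether a model's scores actually fall as the biological signal is diluted.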

  3. Exploratory Functional Analysis of Omics Data:
    The study compared GPT-4’s results on 300 omics-derived gene sets with those produced by a traditional enrichment analysis tool, g:Profiler.
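Traditional enrichment tools such as g:Profiler score a query gene set against each annotated category using variants of the cumulative hypergeometric (Fisher's exact) test, followed by multiple-testing correction. The sketch below shows only that underlying statistic with made-up toy counts — it is the textbook test, not g:Profiler's exact procedure.

```python
from math import comb

def hypergeom_pvalue(k, K, n, N):
    """Upper-tail hypergeometric test: the probability that at least k
    of n query genes fall in an annotated category of size K, given a
    background of N genes total.
    """
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / comb(N, n)

# Toy numbers: 5 of 10 query genes hit a 50-gene GO category in a
# 20,000-gene background — a highly significant overlap.
p = hypergeom_pvalue(k=5, K=50, n=10, N=20000)
```

The contrast with the LLM pipeline is that this test can only return categories already present in the annotation database, whereas GPT-4's free-text names are not bound to existing GO terms.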


Key Findings

a) Performance on Curated Gene Sets

For the GO-based gene set validation task, the study yielded the following results:

- Outstanding Performance of GPT-4: GPT-4 generated names with high semantic similarity to GO terms in 73% of cases, with its confidence scores strongly correlating with actual accuracy (correlation coefficient r = 0.92).
- Comparison with Other LLMs: GPT-4, Gemini Pro, GPT-3.5, and Mixtral performed comparably (median semantic similarity ~0.45-0.50), while Llama 2 significantly lagged behind (median similarity 0.40).
- Scientific Consistency: A manual review of GPT-4's analysis essays revealed that 88% of its descriptions could be verified in the literature, indicating a high level of scientific reliability.

b) Discovering Functions in Omics Data

For the 300 experimentally derived gene sets from omics datasets:

- Ability to Capture Precise Functions: GPT-4 confidently proposed names for 135 gene sets (45%), while g:Profiler annotated 229 sets, though with lower specificity. Furthermore, GPT-4 avoided generating misleading names for random gene sets.
- Creativity and Logical Depth: GPT-4 not only provided precise functional summaries but also generated insightful analyses connecting multiple gene roles. For instance, in the Nest:2-105 protein network data, it proposed the name “Regulation of Cullin–Ring Ubiquitin Ligase (CRL) Complexes” and thoroughly discussed supporting evidence for key genes in the cluster.

c) Handling of Random Gene Sets

One of GPT-4's distinguishing features was its willingness to proactively decline to name unrelated gene sets. In 87% of tests with fully random gene sets, GPT-4 returned the label “System of Unrelated Proteins” with a confidence score of 0. This conservative behavior was markedly superior to that of GPT-3.5 and the other models.


Conclusions and Significance

a) Key Highlights

  1. Deep Biological Knowledge:
    GPT-4 demonstrated strong performance in function discovery, literature referencing, and logical reasoning, affirming its value in functional genomics.

  2. Novel Methods and Tools:
    The introduction of confidence scoring, carefully engineered prompts, and a citation verification module for literature references supports reproducibility and reliability in future research.

  3. Potential for Novel Function Discovery:
    Beyond GO terms, GPT-4 integrated literature and unstructured data to uncover functions not accounted for by GO, showcasing its potential to assist scientists in unexplored domains.

b) Scientific and Applied Value

This study demonstrated that LLMs can serve as powerful assistants in functional genomics research, offering meaningful applications in omics data interpretation and the discovery of novel gene functions. Additionally, GPT-4’s confidence scoring provides a strong mechanism to handle noisy data and unrelated analyses, opening innovative avenues for life sciences research.

The integration of next-generation language models into the life sciences represents a transformative step forward, addressing bottlenecks in empirical research and unveiling opportunities in biomedical exploration.