MostPlas: A Self-Correction Multi-Label Learning Model for Plasmid Host Range Prediction

Plasmids are small, circular, double-stranded DNA molecules that exist independently of chromosomal DNA in bacteria. They facilitate horizontal gene transfer, enabling host bacteria to acquire beneficial traits such as antibiotic resistance and metal resistance. Plasmids that can transfer, replicate, or persist in multiple microorganisms are known as broad-host-range (BHR) plasmids. Accurately predicting the host range of BHR plasmids is crucial for understanding how plasmids promote bacterial evolution and spread resistance genes, and for developing recombinant vectors. However, no database provides detailed host range labels for BHR plasmids, which poses a significant challenge for machine learning-based host range prediction. Without sufficient annotated samples, models struggle to extract effective feature representations, and prediction accuracy remains limited.

To address this issue, a team from the Department of Electrical Engineering at the City University of Hong Kong, comprising Wei Zou, Yongxin Ji, Jiaojiao Guan, and Yanni Sun, proposed a self-correction multi-label learning model named MostPlas for plasmid host range prediction. The study was published on February 17, 2025, in the journal Bioinformatics under the title “MostPlas: A Self-Correction Multi-Label Learning Model for Plasmid Host Range Prediction.”

Research Process and Methodology

1. Research Objectives and Challenges

The goal of MostPlas is to predict the host range of plasmids, particularly BHR plasmids, using a multi-label learning model. The main challenges faced by the study include:

  • Incomplete Data Annotation: Existing databases (e.g., NCBI RefSeq) only provide labels for the host from which the plasmid was isolated, lacking comprehensive host range information.
  • Label Imbalance: The number of non-host bacteria for each plasmid far exceeds the number of actual hosts, leading to an overemphasis on negative labels during model training while neglecting the recognition of positive labels.
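
The two challenges above can be made concrete with a small sketch. It encodes each plasmid's known host as a multi-hot vector over candidate genera and counts positive versus negative labels. The plasmid names, the genus list, and the helper `multi_hot` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

# Hypothetical toy data: each plasmid is annotated only with the genus
# it was isolated from, as in single-label database records.
genera = ["Escherichia", "Klebsiella", "Salmonella", "Pseudomonas", "Acinetobacter"]
isolation_host = {
    "plasmid_A": ["Escherichia"],
    "plasmid_B": ["Klebsiella"],
    "plasmid_C": ["Pseudomonas"],
}

def multi_hot(labels, classes):
    """Return a 0/1 vector with a 1 for every class named in `labels`."""
    index = {name: i for i, name in enumerate(classes)}
    vec = np.zeros(len(classes), dtype=np.int8)
    for name in labels:
        vec[index[name]] = 1
    return vec

# One row per plasmid, one column per candidate host genus.
Y = np.stack([multi_hot(hosts, genera) for hosts in isolation_host.values()])

positives = int(Y.sum())
negatives = int(Y.size - positives)
print(f"positive labels: {positives}, negative labels: {negatives}")
```

Even in this tiny example the negatives outnumber the positives four to one, and some of those zeros may be true hosts that were never recorded. With a realistic number of candidate genera the ratio grows much larger.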

2. MostPlas Model Design

The core innovations of MostPlas lie in the design of a pseudo-label generation algorithm and a self-correction asymmetric loss function to address the aforementioned challenges.

2.1 Pseudo-Label Generation Algorithm

The pseudo-label generation algorithm assigns additional credible host labels to training samples by mining the distribution information of plasmid-encoded proteins. The specific steps are as follows:

  1. Data Preparation: Download all plasmid sequences from the NCBI RefSeq database, filter sequences at the complete genome level, and remove non-bacterial hosts and genera with fewer than 10 samples.
  2. Protein Clustering: Use Prodigal for gene prediction and translation, then cluster protein sequences using CD-HIT (similarity threshold of 0.9) to generate protein clusters (PCs).
  3. Significance Scoring: Design an improved scoring method based on TF-IDF, called TF-IDFpro, to evaluate the significance of each PC to different host genera.
  4. Pseudo-Label Assignment: Assign additional host labels to training samples based on the TF-IDFpro scores of plasmid-encoded proteins.
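
Steps 3 and 4 can be sketched in a few lines. The exact TF-IDFpro formula is not given here, so the sketch below substitutes plain TF-IDF, treating each genus as a "document" and each PC as a "term". The function names, the mean-score rule, and the threshold of 0.1 are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter, defaultdict

def tfidf_scores(plasmid_pcs, plasmid_genus):
    """Score each (genus, PC) pair with plain TF-IDF (a stand-in for TF-IDFpro)."""
    pc_counts = defaultdict(Counter)  # genus -> PC -> number of occurrences
    for pid, pcs in plasmid_pcs.items():
        pc_counts[plasmid_genus[pid]].update(pcs)
    n_genera = len(pc_counts)
    # Document frequency: in how many genera each PC is observed.
    df = Counter(pc for counts in pc_counts.values() for pc in counts)
    scores = {}
    for genus, counts in pc_counts.items():
        total = sum(counts.values())
        for pc, c in counts.items():
            scores[(genus, pc)] = (c / total) * math.log(n_genera / df[pc])
    return scores

def assign_pseudo_labels(plasmid_pcs, plasmid_genus, scores, threshold=0.1):
    """Add a genus as a pseudo label when the plasmid's PCs score highly for it."""
    genera = set(plasmid_genus.values())
    labels = {}
    for pid, pcs in plasmid_pcs.items():
        hosts = {plasmid_genus[pid]}  # the isolation host is always kept
        for genus in genera - hosts:
            mean_score = sum(scores.get((genus, pc), 0.0) for pc in pcs) / max(len(pcs), 1)
            if mean_score >= threshold:  # hypothetical cut-off
                hosts.add(genus)
        labels[pid] = hosts
    return labels

# Hypothetical toy input: PCs carried by each plasmid and its isolation genus.
plasmid_pcs = {
    "p1": ["PC1", "PC2"],
    "p2": ["PC1", "PC2", "PC3"],
    "p3": ["PC4", "PC5"],
    "p4": ["PC4"],
}
plasmid_genus = {"p1": "Escherichia", "p2": "Klebsiella",
                 "p3": "Pseudomonas", "p4": "Pseudomonas"}

scores = tfidf_scores(plasmid_pcs, plasmid_genus)
labels = assign_pseudo_labels(plasmid_pcs, plasmid_genus, scores)
print(labels)
```

In this toy data, p1 and p2 share protein clusters across Escherichia and Klebsiella, so each gains the other genus as a pseudo label, while the Pseudomonas plasmids keep a single label. Note that plain TF-IDF gives a score of zero to any PC found in every genus; whether TF-IDFpro behaves the same way is not stated in this summary.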

2.2 Self-Correction Asymmetric Loss Function

Traditional binary cross-entropy loss treats the contributions of positive and negative labels equally during training, while the self-correction asymmetric loss function adjusts model training in the following ways:

  • Positive Label Dominance: Increase the weight of positive labels and reduce the impact of negative labels.
  • Adaptive Recognition of Missing Labels: Adaptively identify potential missing positive labels during model training to optimize the model’s decision boundaries.
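
The contrast with binary cross-entropy can be illustrated with a minimal NumPy sketch. It follows the general form of an asymmetric focal-style loss, in which easy negatives are down-weighted by a factor `p ** gamma_neg`. The self-correction rule shown here, which drops negatives predicted above a cut-off `correct_above`, is only an assumed stand-in for the paper's mechanism. All parameter names and default values are hypothetical.

```python
import numpy as np

def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    correct_above=None, eps=1e-8):
    """Asymmetric multi-label loss; gamma_pos = gamma_neg = 0 recovers plain BCE."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=float)
    # Positive labels keep (almost) full weight.
    pos_term = targets * (1.0 - probs) ** gamma_pos * np.log(probs)
    # Negative labels are down-weighted, most strongly when they are easy (low p).
    neg_weight = (1.0 - targets) * probs ** gamma_neg
    if correct_above is not None:
        # Assumed self-correction rule: a "negative" that the model predicts
        # with high confidence is treated as a possibly missing positive and
        # excluded from the negative part of the loss.
        neg_weight = np.where(probs > correct_above, 0.0, neg_weight)
    neg_term = neg_weight * np.log(1.0 - probs)
    return float(-(pos_term + neg_term).sum())

# Hypothetical predictions for one plasmid over four genera; only the first
# genus is annotated as a host, but the model is also confident in the second.
probs = [0.9, 0.8, 0.1, 0.05]
targets = [1, 0, 0, 0]

bce = asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=0.0)
asl = asymmetric_loss(probs, targets)
corrected = asymmetric_loss(probs, targets, correct_above=0.7)
print(f"BCE={bce:.4f}  asymmetric={asl:.4f}  with correction={corrected:.4f}")
```

In the example, the confident prediction on an unlabeled genus dominates the plain cross-entropy. The asymmetric weighting shrinks that penalty, and the correction step removes it entirely, so the model is no longer pushed away from a host that may simply be unannotated.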

3. Experiments and Results

The research team conducted experiments on multiple datasets, including the NCBI RefSeq database, the PLSDB 2025 database, plasmid sequences with experimentally determined host ranges, a Hi-C dataset, and the DoriC dataset. The results showed that MostPlas could identify more host labels while maintaining high precision.

3.1 Multi-Host Plasmid Test Set

On the NCBI RefSeq and PLSDB 2025 databases, MostPlas significantly outperformed other tools in recall and F1 score. For example, on the RefSeq dataset, MostPlas improved recall by 5.7% and F1 score by 5.0%.
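
For readers unfamiliar with these metrics in the multi-label setting, the sketch below computes precision, recall, and F1 over sets of predicted and true host genera. It uses micro-averaging, which is an assumption; the summary does not state which averaging scheme the study used. The function name and the toy label sets are hypothetical.

```python
def micro_prf(true_sets, pred_sets):
    """Micro-averaged precision, recall, and F1 over per-plasmid label sets."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))  # correct labels
    n_pred = sum(len(p) for p in pred_sets)   # all predicted labels
    n_true = sum(len(t) for t in true_sets)   # all true labels
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical example: two plasmids with true and predicted host genera.
true_sets = [{"Escherichia", "Klebsiella"}, {"Pseudomonas"}]
pred_sets = [{"Escherichia"}, {"Pseudomonas", "Acinetobacter"}]

precision, recall, f1 = micro_prf(true_sets, pred_sets)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Recall rises when a model recovers more of a plasmid's true hosts, which is the behavior a multi-host test set is designed to reward.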

3.2 Experimentally Determined Host Range Plasmids

On the MOB-suite dataset, MostPlas’s predictions overlapped with those of other tools by up to 89.2%, indicating high reliability of its predictions.

3.3 Biological Characteristics Analysis

By analyzing the DoriC dataset, the study found that plasmids with multiple host genus labels often have multiple replicons, providing insights into the mechanisms of plasmid host adaptation.

Research Conclusions and Significance

The authors describe MostPlas as the first multi-label learning model applied to plasmid host range prediction. Its innovations lie in addressing incomplete data annotation and label imbalance through the pseudo-label generation algorithm and the self-correction asymmetric loss function. The experimental results show that MostPlas performs well across multiple datasets, particularly in identifying multi-host plasmids.

Scientific and Application Value

  • Scientific Value: MostPlas provides new tools and methods for studying plasmid host adaptation mechanisms, horizontal gene transfer, and the spread of resistance genes.
  • Application Value: The model can be used to predict the host range of newly discovered plasmids, aid in the development of plasmid-based recombinant vectors, and analyze environmental microbiomes.

Research Highlights

  • Pseudo-Label Generation Algorithm: By mining the distribution information of plasmid-encoded proteins, it generates high-quality pseudo labels, significantly improving model performance.
  • Self-Correction Asymmetric Loss Function: By adjusting the weights of positive and negative labels, it addresses label imbalance and adaptively identifies missing labels.
  • Validation on Multiple Datasets: Extensive validation on multiple public datasets demonstrates the model’s robustness and generalizability.

Future Research Directions

Although MostPlas has made significant progress in plasmid host range prediction, there is still room for improvement. For example, future research could explore how plasmid origins of replication, transposons, and other mobile genetic elements influence plasmid host adaptation, which could further enhance prediction accuracy. Additionally, applying MostPlas to incomplete plasmid sequences (e.g., plasmid contigs) is a promising direction for future exploration.