Regulatory T Cells-Related Gene in Primary Sclerosing Cholangitis: Evidence from Mendelian Randomization and Transcriptome Data

Regulatory T Cells-Related Gene in Primary Sclerosing Cholangitis: Evidence from Mendelian Randomization and Transcriptome Data

Authors: Jianlan Hu1, Youxing Wu1, Danxia Zhang1, Xiaoyang Wang1, Yaohui Sheng1, Hui Liao1, Yangpeng Ou2✉, Zhen Chen3, Baolian Shu1✉, and Ruohu Gui1✉
Journal: Genes & Immunity (2024) 25:492–513
DOI: https://doi.org/10.1038/s41435-024-00304-4

Introduction

Primary sclerosing cholangitis (PSC) is a chronic, progressive liver disease caused by a combination of immunological, inflammatory, and genetic factors, ultimately leading to liver failure. The incidence and prevalence of PSC vary globally, with rates ranging from 0.07 (Spain) to 1.3 (Norway) cases per 100,000 people per year for incidence, and from 0.2 (Spain) to 13.6 (US) cases per 100,000 people for prevalence. Approximately 70–80% of PSC patients also suffer from inflammatory bowel disease, which is a risk factor for cholangiocarcinoma and colorectal cancer. The clinical manifestations and course of PSC are varied, and the diagnosis relies mainly on bile duct imaging and liver histopathology after excluding other causes. While PSC has a more indolent course in some patients, the diagnosis carries significant long-term health implications, with a median transplant-free survival of 13.2 years. Advances in imaging technology have led to less invasive diagnostic methods utilizing magnetic resonance (MR), replacing invasive procedures like endoscopic retrograde cholangiopancreatography and liver biopsies for the safe and accurate identification of PSC. Further optimization of diagnostic methods may be achieved by identifying cost-effective PSC-specific biological markers in the future. Currently, there are no effective drugs to treat PSC, and liver transplantation is the only effective treatment. Hence, there is an urgent need to identify new biomarkers for diagnosing PSC, determining individualized risk, discovering treatment targets, and improving our understanding of its pathophysiology.

Several theories exist about the etiology of PSC, but the most popular one is that it is an immune-mediated illness initiated by an environmental stimulus, ultimately resulting in hepatocyte damage and inflammation in genetically susceptible individuals. A significant correlation has been observed between immune cell infiltration and the progression of fibrosis and damage to cholangiocytes. Various types of immune cells, such as neutrophils and macrophages, have been found close to the bile ducts of individuals with PSC. One of the most distinguishing features of PSC is the presence of T cell infiltration; however, the composition and roles of these infiltrating T cells differ. For instance, CD4+ T cells from PSC patients exhibit decreased apoptosis sensitivity, but not CD8+ T cells in peripheral blood. Patients with PSC also exhibit a greater proportion of CD8+ T cells positive for CXCR3 but have fewer CD25-positive CD4+ T cells. However, these observational studies are prone to reverse causation and confounding bias. Randomized controlled trials (RCTs) are important in determining if targeting specific immune cells can lead to the discovery of novel treatment avenues. In the absence of RCTs, Mendelian randomization (MR) has emerged as a robust technique to assess causal inferences between exposure and clinical outcomes by utilizing genetic variants as instrumental variables (IVs).

Research on PSC is limited due to the difficulty in obtaining cholangiocytes, the relatively low number of these cells in the liver, unstable in vitro culture techniques, and samples typically taken from patients with advanced illness. The availability of high-throughput RNA sequencing datasets provides an unprecedented opportunity to discover novel biomarkers. With the development of bioinformatics, machine learning approaches have become a routine tool for selecting feature variables and constructing predictive models. Currently, Lasso-Cox is the mainstream algorithm used for generating predictive signatures, but the uniqueness and inappropriateness of certain modeling methodologies have resulted in models with significant inadequacies, limiting their use in clinical contexts. The discovery of novel biomarkers for PSC using transcriptome data combined with advanced machine learning (ML) algorithms is limited. An integrated approach based on a combination of various ML algorithms that can fit a consensus model for diagnosing PSC has not yet been exploited.

In this study, we first performed MR analyses to establish the causal relationships between 731 immune cells and PSC using extensive genome-wide association studies (GWAS) summary data. We replicated the findings in another PSC GWAS and pooled the results with meta-analysis. Through a search of the Gene Expression Omnibus (GEO) database, we downloaded the expression profiles of two PSC mRNA datasets (GSE119600 and GSE159676). We used IOBR (CIBERSORT, TIMER, xCell, MCPcounter, ESTIMATE, EPIC, quantiseq) and ssGSEA to quantify immune cell infiltration levels. Weighted gene co-expression network analysis (WGCNA) was applied to identify significant Treg modules in two GEO cohorts. Then, we developed a novel machine learning framework that incorporated 12 machine learning algorithms and their 107 combinations to construct a consensus Tregs classifier and validate it in another cohort. The hub Tregs-related genes’ expression in the PSC mice model was confirmed using qRT-PCR. Subsequently, we explored the crucial immune cell infiltration and molecular pathways implicated in the initiation of PSC and their association with hub Tregs-related genes. Lastly, we used the Shapley additive explanations (SHAP) and XGBoost algorithm to identify optimal Tregs-related genes and performed the Mantel test to investigate the link between pivotal molecular pathways. We hypothesize that the developed Tregs classifier may effectively diagnose PSC. Moreover, the discovery of the hub Tregs-related gene, pivotal molecular pathways, and immune cells will provide insights into the pathogenesis mechanisms of PSC, uncovering druggable targets for its treatment.

Materials and Methods

GWAS Data Sources for PSC and Immune Cell Traits

Three large-scale GWAS data on PSC (GWAS ID: ieu-a-1112, finn-b-k11_cholangi_strict, finn-b-k11_cholangi) were obtained from the IEU OpenGWAS project. Two PSC GWAS datasets (GWAS ID: ieu-a-1112, and finn-b-k11_cholangi_strict) were set as discovery cohorts, and another PSC GWAS dataset (GWAS ID: finn-b-k11_cholangi) was used as a validation cohort. Table 1 highlights the basic data extracted from PSC GWAS datasets.

GWAS summary statistics of immunological traits are publicly available from the GWAS catalog (accession numbers gcst90001391 to gcst90002121). This dataset includes 731 distinct immune phenotypes, such as morphological parameters (MP) (n=32), relative cell (RC) counts (n=192), absolute cell (AC) counts (n=118), and median fluorescence intensities (MFI) expressing surface antigen levels (n=389). In particular, the RC, AC, and MFI features include B cells, myeloid cells, dendritic cells (DCs), monocytes, mature stages of T cells, TBNK (T cells, B cells, natural killer cells), and Treg panels, whereas MP includes TBNK and DC panels. Table 1 highlights the basic data collected via GWAS on immune cells.

The STROBE-MR (Strengthening the Reporting of Observational Studies in Epidemiology using Mendelian Randomization) checklist was completed for this observational study (Supplementary Material 1).

Selection of IVs and Data Harmonization

In MR, the IVs indicate genetic variations that are highly related to the exposure and are not confounded by other factors that impact the outcome, which may adequately suppress the effect of confounders. For an MR study to be valid, the IVs ought to fulfill three essential criteria: (1) SNPs markedly (p < 5 × 10−8 or p < 1 × 10−5) linked to exposures are utilized as IVs; to eliminate the impact of weak IV bias, we incorporated SNPs whose F-statistic was >10; (2) Independence assumption: significant confounding factors, such as those related to the exposure and the corresponding outcome, exhibit no link to SNPs (IVs); (3) Exclusivity assumption: SNPs (IVs) influence outcome directly via exposure and are not otherwise linked to outcome.

SNPs linked to immunological traits were determined at the p < 1 × 10−5, as used in previous MR studies (given that only a small number of SNPs attained a degree of genome-wide significance (p < 5 × 10−8)). Further, the examination of the linkage disequilibrium (LD) among SNPs relied on 1000 Genomes Project-obtained European ancestry reference information. We picked SNPs utilizing an LD coefficient (r2 < 0.001) and situated over 10 Mb, to identify SNPs with independent genetic effects.

Additionally, in cases where there were no common SNPs between the outcome and exposure, proxy SNPs in LD (r2 ≥ 0.8) were added. To eliminate the impact of weak IV bias, we incorporated SNPs whose F-statistic was <10 (a measurement of these IVs’ strength).

Primary MR Analysis

MR analysis with the package TwoSampleMR (version 0.5.6) was applied to ascertain the causal relationships of immunological traits with PSC in discovery and validation cohorts. Regarding characteristics influenced by two or more SNPs, the random-effects inverse variance weighted (IVW) models, which have stable, as well as, balanced pleiotropic impacts, were utilized as the primary method to determine the causal effects. Additionally, the effect estimates for traits controlled by a single SNP were derived utilizing the Wald ratio. The MR estimates are presented as odds ratio (OR) with a 95% CI for the dichotomous data or beta value with standard error (SE) for the continuous variables. To improve the reliability and strength of the causal relationship, we took the intersection of the results of two discovery cohorts.

MR Sensitivity and Heterogeneity Analysis

To ascertain if MR impact estimates are resilient to possible invalid genetic variants, we executed MR-Egger regression, weighted median (WM), simple mode, and weight mode as sensitivity analyses. In comparison to the IVW approach, which presumes that all the SNPs are valid IVs when the instrument strength must be independent of the direct effect (inside) assumption holds, the MR-Egger regression test could generate a consistent estimate even if all the genetic instruments are invalid. The WM model stands out as a robust method, capable of availing consistent estimate results when over half of the genetic instruments are deemed valid. For any possible heterogeneity, we applied Cochran Q statistic derived from the MR-Egger regression and IVW approaches. They were visualized using funnel plots, and MR-Egger intercept was utilized to examine horizontal pleiotropy; a threshold of p < 0.05 for both was utilized. In addition, MR Steiger filtering was utilized to ascertain the direction of causation between exposure and outcome. Finally, we performed a leave-one-out (LOO) analysis to re-calibrate the overall effect size and explored whether the association can be affected by a single SNP, by eliminating one exposure-linked SNP at a time.

Meta-Analysis Pooled the Results of MR

To further improve the reliability and strength of the causal relationship, we performed a meta-analysis to pool the significant results of MR in discovery and validation cohorts. OR with a 95% CI was reported to estimate the association between immunological traits and PSC risk. Heterogeneity was assessed through the utilization of three methods: visually inspecting the forest plots, assessing the diversity (D2) estimates, and computing the inconsistency statistics (I2). We employed a fixed effects model when I2 was equal to zero, and a combination of fixed and random effects models when I2 was greater than zero. We reported the most conservative estimate being the point estimate closest to no effect or the estimate with the widest CI.

Transcriptome Dataset Collection, Sample Selection, and Preprocessing

The GEO (http://www.ncbi.nlm.nih.gov/geo) database was systematically searched to find PSC-relevant transcriptome datasets from February 2001 through February 2024. The transcriptomic profiling datasets needed to fulfill the following criteria: (a) organism: Homo sapiens; (b) expression profiling through high-throughput sequencing or array. Lastly, two GEO datasets (GSE119600: 47 healthy controls and 45 PSC samples; and GSE159676: 6 healthy tissues and 12 PSC samples) were included for quantitative and qualitative analyses. For GEO datasets, a robust multi-array average analysis was implemented, including quantile normalization, background correction, and summarization.

Evaluation of Immune Cell Infiltration

We used eight different algorithms to infer PSC patients’ immune cell infiltration in two GEO cohorts. These algorithms including CIBERSORT, TIMER, xCell, MCPcounter, ESTIMATE, EPIC, and quantiseq were implemented using the R package ‘IOBR’; while ssGSEA was performed to quantify the infiltrating immune cells. Additionally, the relative abundance of Tregs and B cells was compared between healthy/control and PSC groups.

WGCNA

By constructing a weighted gene co-expression network in two GEO cohorts utilizing the R package “WGCNA” (v 1.71), it was feasible to identify functional gene modules that correlated strongly with the proportion of Tregs and were thus appropriate for additional screening. Before establishing a scale-free topology, the soft thresholding power (β=1–20) was ascertained. Following the generation of the weighted adjacency matrix, it was transformed into a topological overlap matrix (TOM). Furthermore, the dynamic tree-cutting method was implemented to examine different modules clustered according to gene similarity, after obtaining the disstom for hierarchical clustering. Subsequently, we related the identified modules to two traits (B cells and Tregs proportion). For further analysis, genes from the module that exhibited a significant correlation with Tregs were selected.

Identification of Tregs-Related Genes and Functional Enrichment Analysis

Standardized data including GSE119600 and GSE159676 datasets between healthy and PSC samples were subjected to a variance analysis utilizing the NetworkAnalyst online gene expression table. Then, we took the overlap of differentially expressed genes (DEGs) and significant Tregs module genes in two PSC cohorts, defined as Tregs-related genes, and visualized with the UpSetR package. GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment analyses were conducted on the Treg-related genes utilizing the clusterProfiler R package.

Identification and Construction of Tregs Classifier through Multiple Machine Learning Methods

To find robust hub Tregs-related genes, ML techniques with 5-fold cross-validation based on disease status were applied to two PSC cohorts. Among these methods were support vector machine (SVM) for feature variable selection and modified Lasso penalized regression. The hub Tregs-related genes were derived from the overlapping of the results obtained from SVM and Lasso.

Additionally, the following procedure was carried out to develop a consensus diagnostic model for PSC: (1) We began by combining 12 classical algorithms: stepGLM, gradient boosting machine (GBM), linear discriminant analysis (LDA), extreme gradient boosting (XGBoost), NaiveBayes and SVM, random forest (RF), glmBoost, Lasso, partial least squares regression for generalized linear models (PLSRglm), ridge regression, elastic network (enet). Of these, Lasso, RF, stepGLM, and glmBoost have feature selection capabilities. In addition, 107 algorithm combinations were constructed as prediction models utilizing the leave-one-out cross-validation (LOOCV) framework. (2) Next, for the training set, we utilized the GSE119600 in PSC and employed these 107 combinations to generate classifiers independently using hub Tregs-related genes. (3) Lastly, in the testing cohorts (GSE159676), the Tregs score was calculated for each cohort by employing the model that was acquired in the training cohort. The best consensus diagnostic model for PSC was ultimately determined by averaging the area under the curve (AUC) of the two cohorts.

Diagnostic Value, and Clinical Usefulness of Tregs Classifier

Receiver operating characteristic (ROC) analysis was executed to examine the diagnostic capability of the Tregs classifier in two PSC datasets. Principal component analysis (PCA) utilizing hub Tregs-related gene expression levels was executed on two PSC datasets. Ultimately, the clinical applicability of the Tregs classifier was determined via decision curve analysis (DCA).

PSC Animals Model

Hunan Slack Jingda Laboratory Animal Co. Ltd. supplied the BALB/c mice that were 6–8 weeks old. BALB/c mice were randomly assigned to one of the three groups: the control group (n=6), the 3,5-diethoxycarbonyl-1,4-dihydrocollidine (DDC)-induced PSC group (n=6). All animal experiments received approval from the Ethical Committee on Animal Experimentation of the Central Hospital of Hengyang City (approval number: 2023-10-40). Animal experiments were carried out under specific pathogen-free conditions. We chose the DDC diet to develop a PSC mice model, considering that it well mimics the human PSC and reproduces the gradual progression of the disease. During the experiments, the mice were anesthetized with 2% isofluorane in 95% oxygen. Upon completion of each study, the mice were anesthetized by with 2% isofluorane in 95% oxygen and died of spinal dislocation. All the methods were in accordance with relevant guidelines and regulations.

Tissue Samples and Quantitative Real-Time Polymerase Chain Reaction (qRT-PCR)

Liver tissue samples were acquired from PSC and control groups and preserved at −80 °C for subsequent in vitro experiments. Total RNA was isolated as per the provided guidelines using TRIzol (Invitrogen). Reverse transcription of RNA was executed utilizing the RevertAid RT Reverse Transcription Kit (Thermo Scientific). Quantitative PCR was conducted utilizing PowerUp™ SYBR™ Green Master Mix (Thermo Scientific), with GAPDH serving as the internal standard. Quantitative reverse transcription-PCR was executed utilizing the ABI 7500 Real-Time PCR System (Applied Biosystems, Foster City, CA, USA). Subsequently, fold change in gene expression was examined using the 2-ΔΔCt method. Gene-specific PCR primers are provided in Supplementary Material 2.

Expression Levels, and Correlation Pattern of Hub Tregs-Related Genes

The hub Tregs-related gene mRNA expression levels in PSC and control groups were compared and verified in two GEO datasets and the PSC mice model. Additionally, hub Tregs-related genes in two GEO datasets were subjected to correlation analysis.

Analyses of Immune Cell Infiltration in PSC and Its Relationship with Hub Tregs-Related Genes

Based on the results of the previous eight different algorithms which calculate the level of immune cell infiltration, we compared the scores of immune, stromal, ESTIMATE, and microenvironment in two PSC cohorts. In addition, Spearman correlation analyses were conducted to examine the association of hub Tregs-related genes with immune cells.

Discovery of Potential Mechanisms in the Development and Progression of PSC and Its Relationship with Hub Tregs-Related Genes

ssGSEA analysis was performed on several representative gene sets (immune/inflammatory/cell death-related pathways) in two PSC cohorts. Then,