Gastric Cancer Modeling with TCGA and GEO Data (Building a Disease Diagnosis Model with Machine Learning)

A bioinformatics modeling hodgepodge...

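It has been a while since I last wrote anything for the community O(∩_∩)O. My earlier posts on Jianshu got fairly good feedback, and even someone as rational as I am gets more motivated when the feedback is good.

I also want to open up a new reader channel for 桑格助手.

Jianshu has recently been going through a content cleanup and is not allowing new posts. Posts still need to be written, though, and luckily we have plenty of channels.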

The paper I am reading today follows a very standard workflow. It is a classic, and well worth studying and borrowing from.

Abstract

Current studies indicate that long non-coding RNAs (lncRNAs) are frequently aberrantly expressed in cancers and implicated with prognosis in gastric cancer (GC).

We intended to generate a multi-lncRNA signature to improve prognostic prediction of GC. By analyzing ten paired GC and adjacent normal mucosa tissues, 339 differentially expressed lncRNAs were identified as the candidate prognostic biomarkers in GC.

Then we used LASSO Cox regression method to build a 12-lncRNA signature and validated it in another independent GEO dataset. An innovative 12-lncRNA signature was established, and it was significantly associated with the disease free survival (DFS) in the training dataset.

By applying the 12-lncRNA signature, the training cohort patients could be categorized into high-risk or low-risk subgroup with significantly different DFS (HR = 4.52, 95%CI= 2.49-8.20, P < 0.0001). Similar results were obtained in another independent GEO dataset (HR=1.58, 95%CI=1.05 - 2.38, P=0.0270). Further analysis showed that the prognostic value of this 12-lncRNA signature was independent of AJCC stage and postoperative chemotherapy.

Receiver operating characteristic (ROC) analysis showed that the area under receiver operating characteristic curve (AUC) of combined model reached 0.869. Additionally, a well-performed nomogram was constructed for clinicians. Moreover, single-sample gene-set enrichment analysis (ssGSEA) showed that a group of pathways related to drug resistance and cancer metastasis significantly enriched in the high risk patients.

A useful innovative 12-lncRNA signature was established for prognostic evaluation of GC.

It might complement clinicopathological features and facilitate personalized management of GC.

The abstract is essentially a summary of the paper's main methods; for its main significance, see the last two sentences.
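
The risk stratification described in the abstract (each patient gets a score from the signature and is then assigned to a high-risk or low-risk subgroup) can be sketched as follows. This is only an illustration: the three lncRNA names, their coefficients, and the expression values are made up, and the median cutoff is an assumption, since the abstract does not say how the threshold was chosen.

```python
import statistics

# Hypothetical coefficients for three lncRNAs. The real signature has 12
# lncRNAs, and its coefficients are not given in this post.
coefficients = {"LNC_A": 0.8, "LNC_B": -0.5, "LNC_C": 0.3}

# Hypothetical log2 expression values per patient.
patients = {
    "P1": {"LNC_A": 2.0, "LNC_B": 1.0, "LNC_C": 0.0},
    "P2": {"LNC_A": 0.5, "LNC_B": 2.0, "LNC_C": 1.0},
    "P3": {"LNC_A": 3.0, "LNC_B": 0.0, "LNC_C": 2.0},
    "P4": {"LNC_A": 1.0, "LNC_B": 1.0, "LNC_C": 1.0},
}


def risk_score(expression, coefficients):
    """Linear predictor: sum of coefficient * expression over the signature."""
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}

# Assumed cutoff: the cohort median of the risk score.
cutoff = statistics.median(scores.values())
groups = {pid: "high" if score > cutoff else "low" for pid, score in scores.items()}

for pid in patients:
    print(pid, round(scores[pid], 2), groups[pid])
```

A positive coefficient pushes a patient toward the high-risk group as expression rises, while a negative coefficient does the opposite.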

Introduction

GC is a common and highly lethal malignancy, being the fourth most common cancer and the second leading cause of cancer death in the world 1. Although the tendency of incidence rates declines, it is still concerned worldwide with the highest estimated mortality rates in Eastern Asian 2.

Surgery is the only curative treatment strategy and conventional chemotherapy has shown limited efficacy. Despite the recent therapeutic advances, the overall outcome of GC remains undesirable 3, 4. For the risk stratification of GC, the TNM Staging System has been widely used, which is developed and maintained by American Joint Committee on Cancer (AJCC) and adopted by the Union International Committee on Cancer (UICC).

Although TNM staging system is of great value clinically, it has not adequate prognostic and predictive capabilities to guide patient management 5, 6. Thus, new biomarkers are needed to discriminate the high-risk patients with GC and consequently improve personalized cancer care.

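This background section is quite well written. After reading enough papers I feel like I could write one myself; my head is full of these phrases, you know~

Let's set the rest of the introduction aside and look at the data first.

Data selection: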

The dataset GSE79973, which contains 10 paired GC and normal mucosa tissues (Affymetrix Human Genome U133 Plus 2.0 array), was used to identify differentially expressed lncRNAs.
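
To make the paired tumor-versus-normal comparison concrete, here is a minimal sketch. The post does not say which differential expression method or thresholds the paper used, so this swaps in an exact sign-flip permutation test on the paired log2 differences (with 10 pairs there are only 2^10 = 1024 sign patterns to enumerate). The lncRNA names, the expression values, and the cutoffs (|log2FC| >= 1, p < 0.05) are all assumptions, and no multiple-testing correction is applied, which a real analysis of thousands of probes would need.

```python
import itertools


def sign_flip_pvalue(differences):
    """Exact two-sided sign-flip permutation p-value for paired differences."""
    n = len(differences)
    observed = abs(sum(differences)) / n
    hits = 0
    # Under the null, each tumor-minus-normal difference is equally likely
    # to be positive or negative, so enumerate every sign pattern.
    for signs in itertools.product((1, -1), repeat=n):
        permuted = abs(sum(s * d for s, d in zip(signs, differences))) / n
        if permuted >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / 2 ** n


def call_differential(tumor, normal, min_abs_log2fc=1.0, alpha=0.05):
    """Label one lncRNA as 'up', 'down', or 'unchanged' (assumed thresholds)."""
    differences = [t - m for t, m in zip(tumor, normal)]
    log2fc = sum(differences) / len(differences)
    p_value = sign_flip_pvalue(differences)
    if abs(log2fc) >= min_abs_log2fc and p_value < alpha:
        return "up" if log2fc > 0 else "down"
    return "unchanged"


# Made-up log2 expression values for 10 paired samples.
normal = [6.0 + 0.1 * i for i in range(10)]
tumor_by_lncrna = {
    "LNC_UP": [v + 2.0 + 0.05 * ((i % 3) - 1) for i, v in enumerate(normal)],
    "LNC_DOWN": [v - 1.5 + 0.05 * ((i % 3) - 1) for i, v in enumerate(normal)],
    "LNC_FLAT": [v + (0.2 if i % 2 == 0 else -0.2) for i, v in enumerate(normal)],
}

calls = {name: call_differential(tumor, normal) for name, tumor in tumor_by_lncrna.items()}
print(calls)
```

Pairing matters here: each tumor sample is compared only with the normal mucosa from the same patient, so patient-to-patient baseline differences cancel out.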

After filtering out samples without clinical survival information, 491 samples remained in total: 300 from GSE62254 and 191 from GSE15459.
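
The filtering step might look like the sketch below. The field names (`time_months`, `event`) and the five toy records are hypothetical; the real GEO clinical tables use their own column names.

```python
from collections import Counter

# Toy clinical records; None marks a missing value.
samples = [
    {"id": "S1", "dataset": "GSE62254", "time_months": 34.0, "event": 1},
    {"id": "S2", "dataset": "GSE62254", "time_months": None, "event": None},
    {"id": "S3", "dataset": "GSE15459", "time_months": 12.5, "event": 0},
    {"id": "S4", "dataset": "GSE15459", "time_months": 8.0, "event": None},
    {"id": "S5", "dataset": "GSE62254", "time_months": 60.2, "event": 0},
]


def has_survival_info(sample):
    """A sample is usable only if both follow-up time and event status exist."""
    return sample["time_months"] is not None and sample["event"] is not None


kept = [s for s in samples if has_survival_info(s)]
counts = Counter(s["dataset"] for s in kept)

print("kept:", [s["id"] for s in kept])
print("per dataset:", dict(counts))
```

Both fields are required because a survival model needs the follow-up time and whether the event was observed or censored.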

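In the training set (GSE62254), a LASSO Cox regression model was used to select a prognostic multi-lncRNA signature from the differentially expressed lncRNAs.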
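
For intuition on how LASSO Cox regression picks a handful of lncRNAs out of many candidates, here is a small from-scratch sketch using proximal gradient descent on the Cox partial likelihood with an L1 penalty. It is not the paper's code. The data are simulated, the penalty `lam` is fixed by hand (in practice it is usually chosen by cross-validation), the features are assumed to be already standardized, and tied event times are not handled.

```python
import math
import random


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def cox_gradient(X, times, events, beta):
    """Gradient of the mean negative log partial likelihood (no tied times)."""
    n, p = len(X), len(beta)
    # Visit patients from longest to shortest follow-up so the running sums
    # always describe the risk set of the patient being visited.
    order = sorted(range(n), key=lambda i: times[i], reverse=True)
    grad = [0.0] * p
    weight_sum = 0.0
    weighted_x = [0.0] * p
    for i in order:
        w = math.exp(sum(b * x for b, x in zip(beta, X[i])))
        weight_sum += w
        for j in range(p):
            weighted_x[j] += w * X[i][j]
        if events[i]:  # censored patients only contribute to risk sets
            for j in range(p):
                grad[j] -= (X[i][j] - weighted_x[j] / weight_sum) / n
    return grad


def fit_lasso_cox(X, times, events, lam, step=0.1, n_iter=500):
    """Gradient step on the likelihood, then soft-threshold each coefficient."""
    beta = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = cox_gradient(X, times, events, beta)
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta


# Simulated cohort: only the first two of six features affect the hazard.
random.seed(0)
n_patients, n_lncrnas = 120, 6
true_beta = [1.2, -1.0, 0.0, 0.0, 0.0, 0.0]
X = [[random.gauss(0.0, 1.0) for _ in range(n_lncrnas)] for _ in range(n_patients)]
times, events = [], []
for row in X:
    hazard = math.exp(sum(b * x for b, x in zip(true_beta, row)))
    event_time = random.expovariate(hazard)
    censor_time = random.expovariate(0.3)  # independent random censoring
    times.append(min(event_time, censor_time))
    events.append(event_time <= censor_time)

beta = fit_lasso_cox(X, times, events, lam=0.05)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("coefficients:", [round(b, 3) for b in beta])
print("selected feature indices:", selected)
```

The soft-thresholding step is what sets coefficients exactly to zero, so the features that survive form the signature. A larger `lam` keeps fewer features, and a large enough value removes all of them.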

GSE15459 served as the independent validation set.
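
Validation on an independent cohort usually means applying the frozen signature to the new patients and checking whether the resulting high-risk and low-risk groups really differ in survival. The post does not say which test the paper used for this comparison, so the sketch below assumes a standard two-group log-rank test; the follow-up times and event flags are invented.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    records = [(t, e, 0) for t, e in zip(times_a, events_a)]
    records += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in records if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for u, _, g in records if u >= t and g == 0)
        at_risk_b = sum(1 for u, _, g in records if u >= t and g == 1)
        deaths_a = sum(1 for u, e, g in records if u == t and e and g == 0)
        deaths_b = sum(1 for u, e, g in records if u == t and e and g == 1)
        n = at_risk_a + at_risk_b
        d = deaths_a + deaths_b
        observed_a += deaths_a
        expected_a += d * at_risk_a / n  # expected events in group A under the null
        if n > 1:
            variance += d * (at_risk_a / n) * (at_risk_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Invented validation cohort, already split by a signature-based risk score.
high_times = [2, 3, 4, 5, 6, 7]
high_events = [1, 1, 1, 1, 1, 1]
low_times = [10, 12, 14, 16, 18, 20]
low_events = [1, 0, 1, 0, 0, 0]

chi_square, p_value = logrank_test(high_times, high_events, low_times, low_events)
print("chi-square:", round(chi_square, 2), "p-value:", round(p_value, 4))
```

The coefficients and the cutoff have to be fixed before looking at the validation cohort; refitting them there would defeat the purpose of an independent test.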

Abbreviations

AJCC	American Joint Committee on Cancer
AUC	area under the receiver operating characteristic curve
CI	confidence interval
DCA	decision curve analysis
DFS	disease free survival
GC	gastric cancer
GEO	Gene Expression Omnibus
HR	hazard ratio
lincRNA	large intergenic non-coding RNAs
lncRNAs	long non-coding RNAs
ncRNAs	non-coding RNAs
OS	overall survival
ROC	receiver operating characteristic
ssGSEA	single-sample gene-set enrichment analysis
UICC	Union International Committee on Cancer