Gastric Cancer Modeling with TCGA and GEO Data (Building a Disease Diagnostic Model with Machine Learning)

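A bioinformatics modeling potpourri...

It has been a while since I last wrote anything for the community O(∩_∩)O. My earlier posts on Jianshu got fairly good feedback, and even someone as rational as me gets more motivated when the feedback is good.

I also want to open up a new reader channel for 桑格助手.

Jianshu is currently going through a content clean-up and is not allowing new posts, but posts still need to be written. Fortunately we have plenty of channels.

The paper I read today is a very standard, classic piece of work, well worth studying and borrowing from.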


Abstract

Current studies indicate that long non-coding RNAs (lncRNAs) are frequently aberrantly expressed in cancers and implicated with prognosis in gastric cancer (GC).

We intended to generate a multi-lncRNA signature to improve prognostic prediction of GC. By analyzing ten paired GC and adjacent normal mucosa tissues, 339 differentially expressed lncRNAs were identified as the candidate prognostic biomarkers in GC.

Then we used LASSO Cox regression method to build a 12-lncRNA signature and validated it in another independent GEO dataset. An innovative 12-lncRNA signature was established, and it was significantly associated with the disease free survival (DFS) in the training dataset.

By applying the 12-lncRNA signature, the training cohort patients could be categorized into high-risk or low-risk subgroup with significantly different DFS (HR = 4.52, 95%CI= 2.49-8.20, P < 0.0001). Similar results were obtained in another independent GEO dataset (HR=1.58, 95%CI=1.05 - 2.38, P=0.0270). Further analysis showed that the prognostic value of this 12-lncRNA signature was independent of AJCC stage and postoperative chemotherapy.

Receiver operating characteristic (ROC) analysis showed that the area under receiver operating characteristic curve (AUC) of combined model reached 0.869. Additionally, a well-performed nomogram was constructed for clinicians. Moreover, single-sample gene-set enrichment analysis (ssGSEA) showed that a group of pathways related to drug resistance and cancer metastasis significantly enriched in the high risk patients.

A useful innovative 12-lncRNA signature was established for prognostic evaluation of GC.

It might complement clinicopathological features and facilitate personalized management of GC.


The abstract is essentially a summary of the paper's main methods; for its main significance, see the last two sentences.



Introduction

GC is a common and highly lethal malignancy, being the fourth most common cancer and the second leading cause of cancer death in the world [1]. Although the tendency of incidence rates declines, it is still concerned worldwide with the highest estimated mortality rates in Eastern Asian [2].

Surgery is the only curative treatment strategy and conventional chemotherapy has shown limited efficacy. Despite the recent therapeutic advances, the overall outcome of GC remains undesirable [3,4]. For the risk stratification of GC, the TNM Staging System has been widely used, which is developed and maintained by American Joint Committee on Cancer (AJCC) and adopted by the Union International Committee on Cancer (UICC).

Although TNM staging system is of great value clinically, it has not adequate prognostic and predictive capabilities to guide patient management [5,6]. Thus, new biomarkers are needed to discriminate the high-risk patients with GC and consequently improve personalized cancer care.


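This background section is well written. After reading papers this often, I feel like I could write one myself; my head is full of the vocabulary~

Let's set the rest of the introduction aside and look at the data first.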


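Data selection:

The dataset GSE79973, which contains 10 paired GC and normal mucosa tissues (Affymetrix Human Genome U133 Plus 2.0 array), was used to identify differentially expressed lncRNAs.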
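To make the paired design concrete, here is a minimal sketch of how one lncRNA could be scored across matched tumor and normal samples. It uses a plain paired t statistic on log2 expression values, which is an assumption for illustration; the paper's actual differential expression method is not described in this post. The function name `paired_t_statistic` and the example values are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(tumor, normal):
    """Return (mean log2 difference, paired t statistic) for one lncRNA.

    tumor and normal hold log2 expression values for the same patients,
    listed in the same order (assumed input layout).
    """
    if len(tumor) != len(normal) or len(tumor) < 2:
        raise ValueError("need at least two matched tumor/normal pairs")
    # Work on within-patient differences, which is what makes the test paired.
    diffs = [t - n for t, n in zip(tumor, normal)]
    mean_diff = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t_stat = mean_diff / (sd / math.sqrt(len(diffs)))
    return mean_diff, t_stat


# Hypothetical log2 values for one lncRNA in four matched pairs.
example_tumor = [3.0, 4.0, 5.0, 6.0]
example_normal = [1.0, 3.0, 3.0, 5.0]
example_mean_diff, example_t = paired_t_statistic(example_tumor, example_normal)
```

In practice a candidate list would come from applying a fold-change and adjusted p-value cutoff across all lncRNA probes, usually with a dedicated package rather than a hand-rolled test.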

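After filtering out samples without clinical survival information, 491 samples remained: 300 from GSE62254 and 191 from GSE15459.

In the training set (GSE62254), a LASSO Cox regression model was used to select a prognostic multi-lncRNA signature from the differentially expressed lncRNAs.

GSE15459 served as the independent validation set.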
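Once LASSO Cox has picked the lncRNAs and their coefficients, applying the signature is simple arithmetic: each patient's risk score is the coefficient-weighted sum of expression values, and patients are split into high-risk and low-risk groups at a cutoff. The sketch below assumes a median cutoff, which this post does not confirm as the paper's choice. The lncRNA names, coefficients, and patient values are hypothetical placeholders, not the paper's 12-lncRNA signature.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Weighted sum of expression values over the signature lncRNAs."""
    return sum(coefficients[g] * expression[g] for g in coefficients)


def split_by_cutoff(scores, cutoff):
    """Label each patient 'high' if the score exceeds the cutoff, else 'low'."""
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical coefficients, standing in for a fitted LASSO Cox model.
coefficients = {"lncA": 0.5, "lncB": -0.25}

# Hypothetical training-set expression values.
training = {
    "P1": {"lncA": 2.0, "lncB": 4.0},
    "P2": {"lncA": 4.0, "lncB": 0.0},
    "P3": {"lncA": 1.0, "lncB": 2.0},
    "P4": {"lncA": 6.0, "lncB": 4.0},
}

train_scores = {pid: risk_score(expr, coefficients) for pid, expr in training.items()}
# Assumption: the cutoff is the median training score.
cutoff = median(train_scores.values())
train_groups = split_by_cutoff(train_scores, cutoff)
```

A sensible design choice is to fix both the coefficients and the cutoff on the training set and reuse them unchanged on the validation set, so the validation result is not tuned to its own data.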


Abbreviations
AJCC: American Joint Committee on Cancer
AUC: area under the receiver operating characteristic curve
CI: confidence interval
DCA: decision curve analysis
DFS: disease free survival
GC: gastric cancer
GEO: Gene Expression Omnibus
HR: hazard ratio
lincRNA: large intergenic non-coding RNAs
lncRNAs: long non-coding RNAs
ncRNAs: non-coding RNAs
OS: overall survival
ROC: receiver operating characteristic
ssGSEA: single-sample gene-set enrichment analysis
UICC: Union International Committee on Cancer

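I used to pay little attention to abbreviations, but I have run into this problem often when submitting manuscripts recently: the journal finds the abbreviation list incomplete. It is worth keeping track of them as you go.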


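The main work of the paper is building a 12-lncRNA signature, but the construction process is scattered across Methods and Results. Here is a summary; for the details, go back to the paper itself, which is not long.

1. The first step is, of course, data acquisition and cleaning.

2. Re-annotate the array probes to obtain the lncRNA expression profile.

3. Normalize the data, remove batch effects, and impute missing values.

4. Run an initial screen for differentially expressed lncRNAs.

5. Split the samples into a training set and a test set using an appropriate method.

6. Apply SVM-RFE to the training set for feature selection to pick the signature lncRNAs.

7. Predict the target genes of the lncRNAs.

8. Run functional enrichment analysis on the target genes.

9. Build a diagnostic classification model for the healthy and disease groups with a support vector machine.

10. Plot ROC curves to check how well the model classifies.

11. Run functional enrichment analysis on the classified samples (GSEA).

12. Validate the model's reliability on the validation set.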
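For the ROC step, the area under the curve has a handy interpretation: it is the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, with ties counted as one half. The minimal sketch below computes AUC directly from that definition; the function name `roc_auc` and the example labels and scores are hypothetical, and a real analysis would normally use an established library to also draw the curve.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class (e.g. disease), 0 for the negative class.
    scores: model outputs where a higher value means "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # pair ranked correctly
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores for four samples.
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.5, 0.1]
example_auc = roc_auc(example_labels, example_scores)  # 3 of 4 pairs correct
```

This pairwise version is quadratic in the number of samples, which is fine for cohorts of a few hundred patients but would be replaced by a rank-based calculation on larger data.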



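Finally, we have put together a lot of material on modeling. For details, see:

A summary of machine learning and modeling/classification methods

Introduction to a Deep Learning liver cancer modeling paper (IF 9)

Hands-on (5): replicating a gastric cancer subtype prognosis paper that once had an IF above 10

Hands-on series (4): an immune microenvironment study from Aging (IF 5)

A walkthrough of an SVM risk classification model paper

Reading a thyroid cancer subtyping paper (IF 5)

Hands-on series (2): working through a CCR data-mining modeling paper

The Cox proportional hazards regression model

A classic WGCNA analysis workflow

A 20-gene prognostic model for predicting survival in lung adenocarcinoma

Hands-on series (3): replicating a paper (IF 4) on gastric cancer pathogenesis and key prognostic genes (finding prognostic key genes is the most common approach to prognostic modeling)

Hands-on series (1): step-by-step reproduction of a classic short lncRNA paper (IF 3) (it uses a classification model)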



One last thing

Yes, we can do all of the above (●'◡'●). If you don't know how to do it, or can't do it well, you can come to us. That's all.


  • Published 2019-09-03 11:27
  • Category: Transcriptomics

Author: 生信分析流 (不写代码的码农)
