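The package has been updated since this was written and some things have changed. For current usage, see the documentation: https://shixiangwang.github.io/UCSCXenaTools/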
------------------------------------------------------------------------
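The XenaR package provides a simple interface to UCSC Xena for retrieving information stored there, covering more than a thousand datasets from databases such as GDC, TCGA, ICGC, GTEx, and CCLE. In particular, UCSC has normalized part of the TCGA data (hg19) very well, so it is ready to use right after download. Recently I wanted to download these data through code instead of clicking through web pages every time. Since the original author of XenaR has not updated it for three years, I fixed it to work with the Hub API that UCSC Xena currently provides, so the original package's functionality works again (see https://github.com/DataGeeker/XenaR). Building on that package, I am now developing the UCSCXenaTools package.
The datasets currently provided by Xena can be browsed online.
The package can currently search datasets, download them, and import them into R. Below is a brief walkthrough of its usage. I have not had time to write documentation yet, so this article is the main reference for using the package.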
Usage
Installation
Install from GitHub by running the following code:
    # Install devtools first if it is not available
    if (!require(devtools)) {
      install.packages("devtools", dependencies = TRUE)
    }
    # Install UCSCXenaTools from GitHub
    devtools::install_github("ShixiangWang/UCSCXenaTools")
Loading
    library(UCSCXenaTools)
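Exploring
XenaHub() retrieves all available resources. Its arguments let you restrict the result to what you are interested in, including hosts, cohorts, and datasets.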
    xe <- XenaHub()
    xe
    ## class: XenaHub
    ## hosts():
    ##   https://ucscpublic.xenahubs.net
    ##   https://tcga.xenahubs.net
    ##   https://gdc.xenahubs.net
    ##   https://icgc.xenahubs.net
    ##   https://toil.xenahubs.net
    ## cohorts() (137 total):
    ##   (unassigned)
    ##   1000_genomes
    ##   Acute lymphoblastic leukemia (Mullighan 2008)
    ##   ...
    ##   TCGA Pan-Cancer (PANCAN)
    ##   TCGA TARGET GTEx
    ## datasets() (1521 total):
    ##   parsons2008cgh_public/parsons2008cgh_genomicMatrix
    ##   parsons2008cgh_public/parsons2008cgh_public_clinicalMatrix
    ##   vijver2002_public/vijver2002_genomicMatrix
    ##   ...
    ##   TCGA_survival_data
    ##   mc3.v0.2.8.PUBLIC.toil.xena
    head(cohorts(xe))
    ## [1] "(unassigned)"
    ## [2] "1000_genomes"
    ## [3] "Acute lymphoblastic leukemia (Mullighan 2008)"
    ## [4] "B cells (Basso 2005)"
    ## [5] "Breast Cancer (Caldas 2007)"
    ## [6] "Breast Cancer (Chin 2006)"
The result is a XenaHub object.
To avoid typing full host URLs, we can use hostName to say that we only want to search TCGA content, as follows:
    XenaHub(hostName = "TCGA")
    ## class: XenaHub
    ## hosts():
    ##   https://tcga.xenahubs.net
    ## cohorts() (39 total):
    ##   (unassigned)
    ##   TCGA Acute Myeloid Leukemia (LAML)
    ##   TCGA Adrenocortical Cancer (ACC)
    ##   ...
    ##   TCGA Thyroid Cancer (THCA)
    ##   TCGA Uterine Carcinosarcoma (UCS)
    ## datasets() (879 total):
    ##   TCGA.OV.sampleMap/HumanMethylation27
    ##   TCGA.OV.sampleMap/HumanMethylation450
    ##   TCGA.OV.sampleMap/Gistic2_CopyNumber_Gistic2_all_data_by_genes
    ##   ...
    ##   TCGA.MESO.sampleMap/MESO_clinicalMatrix
    ##   TCGA.MESO.sampleMap/Pathway_Paradigm_RNASeq_And_Copy_Number
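To make the idea of a short host name concrete, here is a minimal sketch of a name-to-URL lookup in base R. It is not the package's implementation: the table `host_urls` and the helper `resolve_host` are hypothetical, and only the "TCGA" entry is confirmed by the output above. The other short names are assumptions attached to the host URLs printed earlier.

```r
# Hypothetical lookup table. Only the "TCGA" -> URL pair is confirmed by the
# article; the remaining short names are illustrative assumptions.
host_urls <- c(
  UCSC_Public = "https://ucscpublic.xenahubs.net",
  TCGA        = "https://tcga.xenahubs.net",
  GDC         = "https://gdc.xenahubs.net",
  ICGC        = "https://icgc.xenahubs.net",
  Toil        = "https://toil.xenahubs.net"
)

# Hypothetical helper: resolve one short name to its hub URL,
# failing loudly when the name is unknown.
resolve_host <- function(host_name) {
  if (!host_name %in% names(host_urls)) {
    stop("Unknown host name: ", host_name)
  }
  unname(host_urls[host_name])
}

resolve_host("TCGA")
```

A named character vector keeps the mapping in one place, so adding a hub only means adding one entry.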
The functions hosts(), cohorts(), datasets(), and samples() return the corresponding content; each takes a XenaHub object as input.
    hosts(xe)
    ## [1] "https://ucscpublic.xenahubs.net" "https://tcga.xenahubs.net"
    ## [3] "https://gdc.xenahubs.net"        "https://icgc.xenahubs.net"
    ## [5] "https://toil.xenahubs.net"
    cohorts(xe)
    ##   [1] "(unassigned)"
    ##   [2] "1000_genomes"
    ##   [3] "Acute lymphoblastic leukemia (Mullighan 2008)"
    ##   [4] "B cells (Basso 2005)"
    ##   [5] "Breast Cancer (Caldas 2007)"
    ##   [6] "Breast Cancer (Chin 2006)"
    ##   [7] "Breast Cancer (Haverty 2008)"
    ##   [8] "Breast Cancer (Hess 2006)"
    ##   [9] "Breast Cancer (Miller 2005)"
    ##  [10] "Breast Cancer (vantVeer 2002)"
    ##  [11] "Breast Cancer (Vijver 2002)"
    ##  [12] "Breast Cancer (Yau 2010)"
    ##  [13] "Breast Cancer Cell Lines (Heiser 2012)"
    ##  [14] "Breast Cancer Cell Lines (Neve 2006)"
    ##  [15] "Cancer Cell Line Encyclopedia (Breast)"
    ##  [16] "Cancer Cell Line Encyclopedia (CCLE)"
    ##  [17] "Connectivity Map"
    ##  [18] "DIPG and Pediatric Non-Brainstem High-Grade Glioma (Wu 2014, St Jude)"
    ##  [19] "Ewing Sarcoma Family of Tumors (Brohl 2014)"
    ##  [20] "GBM (Parsons 2008)"
    ##  [21] "Glioma (Kotliarov 2006)"
    ##  [22] "Inbred mouse (Cutler 2007)"
    ##  [23] "Lung Adenocarcinoma (Ding 2008)"
    ##  [24] "Lung Cancer (Raponi 2006)"
    ##  [25] "Lung Cancer CGH (Weir 2007)"
    ##  [26] "lymph-node-negative breast cancer (Wang 2005)"
    ##  [27] "MAGIC"
    ##  [28] "Melanoma (Lin 2008)"
    ##  [29] "Mouse and Human Colon Tumors (Kaiser 2007)"
    ##  [30] "Mouse pancreatic adenocarcinoma (Bardeesy 2006)"
    ##  [31] "Mouse Tumors (Maser 2007)"
    ##  [32] "NCI60"
    ##  [33] "Neuroblastoma (Khan)"
    ##  [34] "Neuroblastoma (Sausen 2013)"
    ##  [35] "Node-negative breast cancer (Desmedt 2007)"
    ##  [36] "Ovarian Cancer (Etemadmoghadam 2009)"
    ##  [37] "Pancreatic Cancer (Balagurunathan 2008)"
    ##  [38] "Pancreatic Cancer (Harada 2008)"
    ##  [39] "Pancreatic Cancer (Jones 2008)"
    ##  [40] "Pediatric diffuse intrinsic pontine gliomas (Puget 2012)"
    ##  [41] "Pediatric tumor (Khan)"
    ##  [42] "POG TCGA TARGET_NBL"
    ##  [43] "Single-cell RNA-seq mouse cortex (Zeisel)"
    ##  [44] "St Jude PCGP pan-cancer"
    ##  [45] "TARGET Acute Lymphoblastic Leukemia"
    ##  [46] "TARGET neuroblastoma"
    ##  [47] "(unassigned)"
    ##  [48] "TCGA Acute Myeloid Leukemia (LAML)"
    ##  [49] "TCGA Adrenocortical Cancer (ACC)"
    ##  [50] "TCGA Bile Duct Cancer (CHOL)"
    ##  [51] "TCGA Bladder Cancer (BLCA)"
    ##  [52] "TCGA Breast Cancer (BRCA)"
    ##  [53] "TCGA Cervical Cancer (CESC)"
    ##  [54] "TCGA Colon and Rectal Cancer (COADREAD)"
    ##  [55] "TCGA Colon Cancer (COAD)"
    ##  [56] "TCGA Endometrioid Cancer (UCEC)"
    ##  [57] "TCGA Esophageal Cancer (ESCA)"
    ##  [58] "TCGA Formalin Fixed Paraffin-Embedded Pilot Phase II (FPPP)"
    ##  [59] "TCGA Glioblastoma (GBM)"
    ##  [60] "TCGA Head and Neck Cancer (HNSC)"
    ##  [61] "TCGA Kidney Chromophobe (KICH)"
    ##  [62] "TCGA Kidney Clear Cell Carcinoma (KIRC)"
    ##  [63] "TCGA Kidney Papillary Cell Carcinoma (KIRP)"
    ##  [64] "TCGA Large B-cell Lymphoma (DLBC)"
    ##  [65] "TCGA Liver Cancer (LIHC)"
    ##  [66] "TCGA Lower Grade Glioma (LGG)"
    ##  [67] "TCGA lower grade glioma and glioblastoma (GBMLGG)"
    ##  [68] "TCGA Lung Adenocarcinoma (LUAD)"
    ##  [69] "TCGA Lung Cancer (LUNG)"
    ##  [70] "TCGA Lung Squamous Cell Carcinoma (LUSC)"
    ##  [71] "TCGA Melanoma (SKCM)"
    ##  [72] "TCGA Mesothelioma (MESO)"
    ##  [73] "TCGA Ocular melanomas (UVM)"
    ##  [74] "TCGA Ovarian Cancer (OV)"
    ##  [75] "TCGA Pan-Cancer (PANCAN)"
    ##  [76] "TCGA Pancreatic Cancer (PAAD)"
    ##  [77] "TCGA Pheochromocytoma & Paraganglioma (PCPG)"
    ##  [78] "TCGA Prostate Cancer (PRAD)"
    ##  [79] "TCGA Rectal Cancer (READ)"
    ##  [80] "TCGA Sarcoma (SARC)"
    ##  [81] "TCGA Stomach Cancer (STAD)"
    ##  [82] "TCGA Testicular Cancer (TGCT)"
    ##  [83] "TCGA Thymoma (THYM)"
    ##  [84] "TCGA Thyroid Cancer (THCA)"
    ##  [85] "TCGA Uterine Carcinosarcoma (UCS)"
    ##  [86] "(unassigned)"
    ##  [87] "GDC Pan-Cancer (PANCAN)"
    ##  [88] "GDC TARGET-AML"
    ##  [89] "GDC TARGET-CCSK"
    ##  [90] "GDC TARGET-NBL"
    ##  [91] "GDC TARGET-OS"
    ##  [92] "GDC TARGET-RT"
    ##  [93] "GDC TARGET-WT"
    ##  [94] "GDC TCGA Acute Myeloid Leukemia (LAML)"
    ##  [95] "GDC TCGA Adrenocortical Cancer (ACC)"
    ##  [96] "GDC TCGA Bile Duct Cancer (CHOL)"
    ##  [97] "GDC TCGA Bladder Cancer (BLCA)"
    ##  [98] "GDC TCGA Breast Cancer (BRCA)"
    ##  [99] "GDC TCGA Cervical Cancer (CESC)"
    ## [100] "GDC TCGA Colon Cancer (COAD)"
    ## [101] "GDC TCGA Endometrioid Cancer (UCEC)"
    ## [102] "GDC TCGA Esophageal Cancer (ESCA)"
    ## [103] "GDC TCGA Glioblastoma (GBM)"
    ## [104] "GDC TCGA Head and Neck Cancer (HNSC)"
    ## [105] "GDC TCGA Kidney Chromophobe (KICH)"
    ## [106] "GDC TCGA Kidney Clear Cell Carcinoma (KIRC)"
    ## [107] "GDC TCGA Kidney Papillary Cell Carcinoma (KIRP)"
    ## [108] "GDC TCGA Large B-cell Lymphoma (DLBC)"
    ## [109] "GDC TCGA Liver Cancer (LIHC)"
    ## [110] "GDC TCGA Lower Grade Glioma (LGG)"
    ## [111] "GDC TCGA Lung Adenocarcinoma (LUAD)"
    ## [112] "GDC TCGA Lung Squamous Cell Carcinoma (LUSC)"
    ## [113] "GDC TCGA Melanoma (SKCM)"
    ## [114] "GDC TCGA Mesothelioma (MESO)"
    ## [115] "GDC TCGA Ocular melanomas (UVM)"
    ## [116] "GDC TCGA Ovarian Cancer (OV)"
    ## [117] "GDC TCGA Pancreatic Cancer (PAAD)"
    ## [118] "GDC TCGA Pheochromocytoma & Paraganglioma (PCPG)"
    ## [119] "GDC TCGA Prostate Cancer (PRAD)"
    ## [120] "GDC TCGA Rectal Cancer (READ)"
    ## [121] "GDC TCGA Sarcoma (SARC)"
    ## [122] "GDC TCGA Stomach Cancer (STAD)"
    ## [123] "GDC TCGA Testicular Cancer (TGCT)"
    ## [124] "GDC TCGA Thymoma (THYM)"
    ## [125] "GDC TCGA Thyroid Cancer (THCA)"
    ## [126] "GDC TCGA Uterine Carcinosarcoma (UCS)"
    ## [127] "(unassigned)"
    ## [128] "ICGC (donor centric)"
    ## [129] "ICGC (specimen centric)"
    ## [130] "ICGC (US donors with both RNA and SNV data)"
    ## [131] "PACA-AU"
    ## [132] "(unassigned)"
    ## [133] "GTEX"
    ## [134] "TARGET Pan-Cancer (PANCAN)"
    ## [135] "TCGA and TARGET Pan-Cancer (PANCAN)"
    ## [136] "TCGA Pan-Cancer (PANCAN)"
    ## [137] "TCGA TARGET GTEx"
    datasets(xe)[1:10]
    ##  [1] "parsons2008cgh_public/parsons2008cgh_genomicMatrix"
    ##  [2] "parsons2008cgh_public/parsons2008cgh_public_clinicalMatrix"
    ##  [3] "vijver2002_public/vijver2002_genomicMatrix"
    ##  [4] "vijver2002_public/vijver2002_public_clinicalMatrix"
    ##  [5] "chin2006_public/chin2006Exp_genomicMatrix"
    ##  [6] "chin2006_public/ucsfChinCGH2006_genomicMatrix"
    ##  [7] "chin2006_public/chin2006_public_clinicalMatrix"
    ##  [8] "Treehouse/Treehouse_Khan_neuroblastoma/expression"
    ##  [9] "Treehouse/Treehouse_Khan_neuroblastoma/neuroblastoma_affy_clinicalMatrix"
    ## [10] "Treehouse/NBL_Sausen_et_al_2013_SNV.tsv"
    # samples(xe)[1:10]
    # For how to use samples(), see <https://github.com/DataGeeker/XenaR/blob/master/inst/README.Rmd>
    # The output is too long to show here, and it is not the focus of this package
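Downloading and importing data
To let you choose exactly which data to download, the package provides a three-step workflow: XenaQuery, XenaDownload, and XenaPrepare.
The walkthrough below downloads and imports TCGA clinical data as an example; other data types work the same way.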
filter
First, look at the datasets of interest:
    xe = XenaHub(hostName = "TCGA")
    xe
    ## class: XenaHub
    ## hosts():
    ##   https://tcga.xenahubs.net
    ## cohorts() (39 total):
    ##   (unassigned)
    ##   TCGA Acute Myeloid Leukemia (LAML)
    ##   TCGA Adrenocortical Cancer (ACC)
    ##   ...
    ##   TCGA Thyroid Cancer (THCA)
    ##   TCGA Uterine Carcinosarcoma (UCS)
    ## datasets() (879 total):
    ##   TCGA.OV.sampleMap/HumanMethylation27
    ##   TCGA.OV.sampleMap/HumanMethylation450
    ##   TCGA.OV.sampleMap/Gistic2_CopyNumber_Gistic2_all_data_by_genes
    ##   ...
    ##   TCGA.MESO.sampleMap/MESO_clinicalMatrix
    ##   TCGA.MESO.sampleMap/Pathway_Paradigm_RNASeq_And_Copy_Number
There are 800+ datasets, which is too many. Next we narrow them down with the filterXena() function. You can filter by full name or by regular expression.
    (filterXena(xe, filterDatasets = "clinical") -> xe2)
    ## class: XenaHub
    ## hosts():
    ##   https://tcga.xenahubs.net
    ## cohorts() (39 total):
    ##   (unassigned)
    ##   TCGA Acute Myeloid Leukemia (LAML)
    ##   TCGA Adrenocortical Cancer (ACC)
    ##   ...
    ##   TCGA Thyroid Cancer (THCA)
    ##   TCGA Uterine Carcinosarcoma (UCS)
    ## datasets() (37 total):
    ##   TCGA.OV.sampleMap/OV_clinicalMatrix
    ##   TCGA.DLBC.sampleMap/DLBC_clinicalMatrix
    ##   TCGA.KIRC.sampleMap/KIRC_clinicalMatrix
    ##   ...
    ##   TCGA.READ.sampleMap/READ_clinicalMatrix
    ##   TCGA.MESO.sampleMap/MESO_clinicalMatrix
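Not so many now, right? Note that the function's two arguments, filterCohorts and filterDatasets, are independent of each other: the underlying XenaR has no mechanism for updating one when the other changes, which is why the output above still lists all 39 cohorts. I will look for a way to address this later. Since the focus here is on downloading and using datasets, cohorts can be ignored.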
    datasets(xe2)
    ##  [1] "TCGA.OV.sampleMap/OV_clinicalMatrix"
    ##  [2] "TCGA.DLBC.sampleMap/DLBC_clinicalMatrix"
    ##  [3] "TCGA.KIRC.sampleMap/KIRC_clinicalMatrix"
    ##  [4] "TCGA.SARC.sampleMap/SARC_clinicalMatrix"
    ##  [5] "TCGA.COAD.sampleMap/COAD_clinicalMatrix"
    ##  [6] "TCGA.PRAD.sampleMap/PRAD_clinicalMatrix"
    ##  [7] "TCGA.LUSC.sampleMap/LUSC_clinicalMatrix"
    ##  [8] "TCGA.ACC.sampleMap/ACC_clinicalMatrix"
    ##  [9] "TCGA.KICH.sampleMap/KICH_clinicalMatrix"
    ## [10] "TCGA.UCS.sampleMap/UCS_clinicalMatrix"
    ## [11] "TCGA.COADREAD.sampleMap/COADREAD_clinicalMatrix"
    ## [12] "TCGA.LUNG.sampleMap/LUNG_clinicalMatrix"
    ## [13] "TCGA.LUAD.sampleMap/LUAD_clinicalMatrix"
    ## [14] "TCGA.FPPP.sampleMap/FPPP_clinicalMatrix"
    ## [15] "TCGA.LAML.sampleMap/LAML_clinicalMatrix"
    ## [16] "TCGA.GBM.sampleMap/GBM_clinicalMatrix"
    ## [17] "TCGA.KIRP.sampleMap/KIRP_clinicalMatrix"
    ## [18] "TCGA.PAAD.sampleMap/PAAD_clinicalMatrix"
    ## [19] "TCGA.CHOL.sampleMap/CHOL_clinicalMatrix"
    ## [20] "TCGA.CESC.sampleMap/CESC_clinicalMatrix"
    ## [21] "TCGA.SKCM.sampleMap/SKCM_clinicalMatrix"
    ## [22] "TCGA.LGG.sampleMap/LGG_clinicalMatrix"
    ## [23] "TCGA.PCPG.sampleMap/PCPG_clinicalMatrix"
    ## [24] "TCGA.TGCT.sampleMap/TGCT_clinicalMatrix"
    ## [25] "TCGA.BLCA.sampleMap/BLCA_clinicalMatrix"
    ## [26] "TCGA.THYM.sampleMap/THYM_clinicalMatrix"
    ## [27] "TCGA.BRCA.sampleMap/BRCA_clinicalMatrix"
    ## [28] "TCGA.UVM.sampleMap/UVM_clinicalMatrix"
    ## [29] "TCGA.UCEC.sampleMap/UCEC_clinicalMatrix"
    ## [30] "TCGA.LIHC.sampleMap/LIHC_clinicalMatrix"
    ## [31] "TCGA.GBMLGG.sampleMap/GBMLGG_clinicalMatrix"
    ## [32] "TCGA.THCA.sampleMap/THCA_clinicalMatrix"
    ## [33] "TCGA.HNSC.sampleMap/HNSC_clinicalMatrix"
    ## [34] "TCGA.ESCA.sampleMap/ESCA_clinicalMatrix"
    ## [35] "TCGA.STAD.sampleMap/STAD_clinicalMatrix"
    ## [36] "TCGA.READ.sampleMap/READ_clinicalMatrix"
    ## [37] "TCGA.MESO.sampleMap/MESO_clinicalMatrix"
I only want the lung cancer datasets, so I add another condition:
    (filterXena(xe2, filterDatasets = "LUAD|LUSC|LUNG")) -> xe2
If you know exactly what you want, you can chain the filters with the pipe operator from dplyr; otherwise I recommend selecting step by step.
    suppressMessages(require(dplyr))
    ## Warning: package 'dplyr' was built under R version 3.5.1
    xe %>%
      filterXena(filterDatasets = "clinical") %>%
      filterXena(filterDatasets = "luad|lusc|lung")
    ## class: XenaHub
    ## hosts():
    ##   https://tcga.xenahubs.net
    ## cohorts() (39 total):
    ##   (unassigned)
    ##   TCGA Acute Myeloid Leukemia (LAML)
    ##   TCGA Adrenocortical Cancer (ACC)
    ##   ...
    ##   TCGA Thyroid Cancer (THCA)
    ##   TCGA Uterine Carcinosarcoma (UCS)
    ## datasets() (3 total):
    ##   TCGA.LUSC.sampleMap/LUSC_clinicalMatrix
    ##   TCGA.LUNG.sampleMap/LUNG_clinicalMatrix
    ##   TCGA.LUAD.sampleMap/LUAD_clinicalMatrix
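The filtering behaviour shown above can be imitated in base R. The sketch below is not how filterXena() is written: `toy_hub` is a plain list standing in for a XenaHub object, and `filter_datasets` is a hypothetical helper. Case-insensitive matching is an assumption on my part; it would explain why the lower-case pattern "luad|lusc|lung" matched upper-case dataset names in the piped example.

```r
# Toy stand-in for a XenaHub object: a plain list, not the real class.
toy_hub <- list(
  cohorts = c("TCGA Lung Adenocarcinoma (LUAD)", "TCGA Ovarian Cancer (OV)"),
  datasets = c(
    "TCGA.OV.sampleMap/OV_clinicalMatrix",
    "TCGA.OV.sampleMap/HumanMethylation27",
    "TCGA.LUAD.sampleMap/LUAD_clinicalMatrix",
    "TCGA.LUSC.sampleMap/LUSC_clinicalMatrix"
  )
)

# Hypothetical helper: keep only datasets whose name matches a regular
# expression. ignore.case = TRUE is an assumption, not confirmed behaviour.
filter_datasets <- function(hub, pattern) {
  hub$datasets <- grep(pattern, hub$datasets, ignore.case = TRUE, value = TRUE)
  hub  # cohorts are deliberately left untouched
}

# Filter step by step: first clinical matrices, then lung cancer projects.
step1 <- filter_datasets(toy_hub, "clinical")
step2 <- filter_datasets(step1, "luad|lusc|lung")

step2$datasets
step2$cohorts
```

Because the helper only rewrites the datasets element, the cohorts stay as they were, which mirrors the independence of filterCohorts and filterDatasets described earlier.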
The filtered result is still a XenaHub object.
query
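Next we prepare to download the 3 selected datasets.
First build a query object (not yet wrapped in a class); it is simply a data frame that stores the host address, the download URL, and so on.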
    xe2_query = XenaQuery(xe2)
    xe2_query
    ##                       hosts                                datasets
    ## 1 https://tcga.xenahubs.net TCGA.LUSC.sampleMap/LUSC_clinicalMatrix
    ## 2 https://tcga.xenahubs.net TCGA.LUNG.sampleMap/LUNG_clinicalMatrix
    ## 3 https://tcga.xenahubs.net TCGA.LUAD.sampleMap/LUAD_clinicalMatrix
    ##                                                                             url
    ## 1 https://tcga.xenahubs.net/download/TCGA.LUSC.sampleMap/LUSC_clinicalMatrix.gz
    ## 2 https://tcga.xenahubs.net/download/TCGA.LUNG.sampleMap/LUNG_clinicalMatrix.gz
    ## 3 https://tcga.xenahubs.net/download/TCGA.LUAD.sampleMap/LUAD_clinicalMatrix.gz
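Judging from the printed data frame, each URL looks like the host, then "/download/", then the dataset name with a ".gz" suffix. The sketch below reproduces that pattern in base R. `build_query` is a hypothetical helper written for illustration, and the URL rule is inferred from the three rows above rather than taken from the package source.

```r
# Hypothetical helper: assemble a query-like data frame from a host and
# dataset names. The URL pattern is inferred from the output shown above.
build_query <- function(hosts, datasets) {
  data.frame(
    hosts = hosts,
    datasets = datasets,
    url = paste0(hosts, "/download/", datasets, ".gz"),
    stringsAsFactors = FALSE
  )
}

query <- build_query(
  hosts = "https://tcga.xenahubs.net",
  datasets = c(
    "TCGA.LUSC.sampleMap/LUSC_clinicalMatrix",
    "TCGA.LUNG.sampleMap/LUNG_clinicalMatrix",
    "TCGA.LUAD.sampleMap/LUAD_clinicalMatrix"
  )
)

query$url
```

Keeping the query as an ordinary data frame makes it easy to inspect or subset rows before anything is downloaded.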
download
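By default, XenaDownload saves the data to the Xena_Data directory under the current working directory. If a file has already been downloaded, the function reports this and skips it; use force=TRUE to force a fresh download. It also accepts some arguments that are passed on to download.file.
Note that the function returns a value, which can be used to import the data afterwards.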
    xe2_download = XenaDownload(xe2_query, destdir = "E:/Github/XenaData/test/")
    ## We will download files to directory E:/Github/XenaData/test/.
    ## E:/Github/XenaData/test//TCGA.LUSC.sampleMap__LUSC_clinicalMatrix.gz, the file has been download!
    ## E:/Github/XenaData/test//TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz, the file has been download!
    ## E:/Github/XenaData/test//TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz, the file has been download!
    ## Note fileNames transfromed from datasets name and / chracter all changed to __ character.
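The file naming and skip-if-present behaviour described above can be sketched without touching the network. `plan_downloads` below is a hypothetical helper, not part of the package: it turns "/" in dataset names into "__", appends ".gz", and marks which files would be fetched. The directory name under tempdir() is made up for the demo.

```r
# Hypothetical helper: work out local file names and decide which files
# would need downloading. Nothing is actually downloaded here.
plan_downloads <- function(datasets, destdir, force = FALSE) {
  # Replace every "/" with "__", as the note in the output above describes
  file_names <- paste0(gsub("/", "__", datasets, fixed = TRUE), ".gz")
  dest_files <- file.path(destdir, file_names)
  data.frame(
    datasets = datasets,
    destfile = dest_files,
    # Download when forced, or when the file is not there yet
    download = force | !file.exists(dest_files),
    stringsAsFactors = FALSE
  )
}

# Demo directory inside the session's temporary directory
destdir <- file.path(tempdir(), "Xena_Data_demo")
dir.create(destdir, showWarnings = FALSE)

datasets <- c(
  "TCGA.LUSC.sampleMap/LUSC_clinicalMatrix",
  "TCGA.LUNG.sampleMap/LUNG_clinicalMatrix"
)

# Pretend the first file was downloaded in an earlier session
file.create(file.path(destdir, "TCGA.LUSC.sampleMap__LUSC_clinicalMatrix.gz"))

plan <- plan_downloads(datasets, destdir)
plan_forced <- plan_downloads(datasets, destdir, force = TRUE)
plan[, c("datasets", "download")]
```

With force = TRUE every row is marked for download again, which corresponds to the overwrite behaviour mentioned in the text.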
prepare
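Once the data are downloaded, they can be imported into R; under the hood this uses the read_tsv function from the readr package.
Four ways of importing are supported. Whenever more than one file is read, the result is a list: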
Method 1:
    # way1: directory
    cli1 = XenaPrepare("E:/Github/XenaData/test/")
    names(cli1)
    ## [1] "TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz"
    ## [2] "TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"
    ## [3] "TCGA.LUSC.sampleMap__LUSC_clinicalMatrix.gz"
Method 2:
    # way2: local files
    cli2 = XenaPrepare("E:/Github/XenaData/test/TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz")
    class(cli2)
    ## [1] "tbl_df"     "tbl"        "data.frame"
    cli2 = XenaPrepare(c("E:/Github/XenaData/test/TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz",
                         "E:/Github/XenaData/test/TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"))
    class(cli2)
    ## [1] "list"
    names(cli2)
    ## [1] "TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz"
    ## [2] "TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"
Method 3:
    # way3: urls
    cli3 = XenaPrepare(xe2_download$url[1:2])
    names(cli3)
    ## [1] "LUSC_clinicalMatrix.gz" "LUNG_clinicalMatrix.gz"
Method 4:
    # way4: xenadownload object
    cli4 = XenaPrepare(xe2_download)
    names(cli4)
    ## [1] "TCGA.LUSC.sampleMap__LUSC_clinicalMatrix.gz"
    ## [2] "TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"
    ## [3] "TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz"
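The "one file gives a table, several files give a named list" rule can be illustrated with a small self-contained sketch. This is not XenaPrepare itself: `prepare_files` and `write_demo` are hypothetical helpers, the demo files and their contents are made up, and base R's read.delim stands in for readr's read_tsv so the example runs without extra packages.

```r
# Create two tiny gzip-compressed, tab-separated demo files (made-up content)
demo_dir <- file.path(tempdir(), "xena_prepare_demo")
dir.create(demo_dir, showWarnings = FALSE)

write_demo <- function(name) {
  path <- file.path(demo_dir, name)
  con <- gzfile(path, "w")
  writeLines(c("sampleID\tgender", "S1\tMALE", "S2\tFEMALE"), con)
  close(con)
  path
}

f1 <- write_demo("A_clinicalMatrix.gz")
f2 <- write_demo("B_clinicalMatrix.gz")

# Hypothetical helper: read each file as a tab-separated table.
# One path returns a data frame; several paths return a list named by file.
prepare_files <- function(paths) {
  tables <- lapply(paths, function(p) {
    read.delim(gzfile(p), stringsAsFactors = FALSE)
  })
  if (length(tables) == 1) {
    return(tables[[1]])
  }
  names(tables) <- basename(paths)
  tables
}

single <- prepare_files(f1)
several <- prepare_files(c(f1, f2))

class(single)
names(several)
```

Naming the list elements after the files makes it straightforward to pick out one table later, for example `several[["A_clinicalMatrix.gz"]]`.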