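Bioconductor's GEOquery package provides a few commonly used functions for downloading GEO data. Sometimes, though, I want to download directly from the terminal rather than open RStudio and run a script, so I wrote simple shell wrappers around GEOquery's two download functions, getGEO() and getGEOSuppFiles().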
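Installation
Use the clone command:
git clone https://github.com/ShixiangWang/mytoolkit/
Or click the "Clone or download" button at the top right of the repository page.
Prerequisites and help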
On Linux, install R first. If the GEOquery package is not installed, the script detects this and downloads and installs it automatically.
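The automatic check described above might look roughly like the sketch below. It is not the code from the repository: the function names are hypothetical, and it assumes BiocManager as the installer (the session log later in this post shows the older biocLite being used instead).

```shell
# Hypothetical helper: print R code that installs a package only when it is missing.
# Assumption: BiocManager is the installer; the real scripts may do this differently.
r_install_if_missing() {
    pkg="$1"
    cat <<EOF
if (!requireNamespace("$pkg", quietly = TRUE)) {
    message("Package $pkg not available. Attempting to install it.")
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
        install.packages("BiocManager", repos = "https://cloud.r-project.org")
    }
    BiocManager::install("$pkg", ask = FALSE, update = FALSE)
}
EOF
}

# Hypothetical wrapper: write the R code to a temporary file and run it with Rscript.
ensure_r_package() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found; please install R first." >&2
        return 1
    fi
    tmp="$(mktemp)" || return 1
    r_install_if_missing "$1" > "$tmp"
    Rscript "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Calling ensure_r_package GEOquery would run the check; r_install_if_missing GEOquery on its own only prints the R code, which is handy for inspecting it first.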
View each script's help:
./getGEOSuppFiles.sh -h
./getGEO.sh -h
./bulkGEO.sh -h
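Downloading GEO supplementary files
GEO supplementary files are usually the raw array data.
Usage:
Usage: ./getGEOSuppFiles.sh -n GEO -d directory
GEO: a GEO accession number, e.g. GPL1073 or GSM1137
directory: the directory to download into; defaults to your current directory.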
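To make the mapping from the two options to GEOquery concrete, here is a minimal sketch of how such a wrapper could hand them to getGEOSuppFiles(). The function name download_supp and the DRY_RUN switch are my own inventions for illustration; baseDir is the getGEOSuppFiles() argument that sets the download directory.

```shell
# Hypothetical wrapper around GEOquery::getGEOSuppFiles().
# $1 = GEO accession, $2 = target directory (defaults to the current directory).
# Set DRY_RUN=1 to print the R call instead of running it.
download_supp() {
    geo="$1"
    dir="${2:-$PWD}"
    expr="GEOquery::getGEOSuppFiles(\"$geo\", baseDir = \"$dir\")"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$expr"
    else
        mkdir -p "$dir" && Rscript -e "$expr"
    fi
}
```

For example, DRY_RUN=1 download_supp GSE76730 /tmp/geo only shows the R call that would be executed, without touching the network.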
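Downloading GEO expression matrix files
This is the most commonly used feature: it downloads the expression matrix file for an array dataset. The data have already been preprocessed by the submitting researchers and can be used directly.
Usage:
Usage: ./getGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -P getGPL
Detail of Options
==================
-n GEO: a character string naming the GEO object (e.g. 'GDS505','GSE2','GSM2','GPL96')
-d destdir: the destination directory for downloads; defaults to the current directory.
-M Logical TRUE or FALSE: whether to download the GSE Series Matrix file. Defaults to TRUE.
-A Logical TRUE or FALSE: whether to use the Annotation GPL files (they will be downloaded). These files contain the most recently remapped Gene IDs and other basic information, but they do not exist for every GPL. Defaults to TRUE.
-P Logical TRUE or FALSE: whether to download the GPL information along with the GSEMatrix file. If you know you will use a Bioconductor annotation package instead, you can set this to FALSE. Defaults to TRUE.
Minimal Use Method
==================
If you do not know how to use these options, just set -n option is OK
Like
./getGEO.sh -n GEO
change the 'GEO' above to name of GSE you want to download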
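The option handling described above can be sketched with POSIX getopts. This is an illustration under my own assumptions rather than the script's actual source: parse_getgeo_opts and getgeo_call are hypothetical names, and the defaults are simply the ones listed in the usage text. The arguments destdir, GSEMatrix, AnnotGPL and getGPL are real getGEO() parameters.

```shell
# Hypothetical option parser; defaults follow the usage text above.
parse_getgeo_opts() {
    GEO=""; DESTDIR="$PWD"; GSEMATRIX=TRUE; ANNOTGPL=TRUE; GETGPL=TRUE
    OPTIND=1   # reset so the function can be called more than once
    while getopts "n:d:M:A:P:" opt; do
        case "$opt" in
            n) GEO="$OPTARG" ;;
            d) DESTDIR="$OPTARG" ;;
            M) GSEMATRIX="$OPTARG" ;;
            A) ANNOTGPL="$OPTARG" ;;
            P) GETGPL="$OPTARG" ;;
            *) echo "Usage: ./getGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -P getGPL" >&2
               return 1 ;;
        esac
    done
    [ -n "$GEO" ] || { echo "-n GEO is required" >&2; return 1; }
}

# Turn the parsed options into the R call that GEOquery would receive.
getgeo_call() {
    printf 'GEOquery::getGEO("%s", destdir = "%s", GSEMatrix = %s, AnnotGPL = %s, getGPL = %s)\n' \
        "$GEO" "$DESTDIR" "$GSEMATRIX" "$ANNOTGPL" "$GETGPL"
}

# Example: override only -A and keep the other defaults.
parse_getgeo_opts -n GSE76730 -d /tmp/geo -A FALSE && getgeo_call
# prints: GEOquery::getGEO("GSE76730", destdir = "/tmp/geo", GSEMatrix = TRUE, AnnotGPL = FALSE, getGPL = TRUE)
```

Passing the printed string to Rscript -e would perform the actual download; printing it first makes it easy to see which defaults are in effect.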
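Bulk downloading expression matrix files and raw files
This feature builds on the previous two scripts and calls them in a loop.
Usage:
Usage: ./bulkGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -f filename -s supp
Detail of Options
==================
-n GEO: a character string naming the GEO object (e.g. 'GDS505','GSE2','GSM2','GPL96')
-d destdir: the destination directory for downloads; defaults to the current directory.
-M Logical TRUE or FALSE: whether to download the GSE Series Matrix file. Defaults to TRUE.
-A Logical TRUE or FALSE: whether to use the Annotation GPL files (they will be downloaded). These files contain the most recently remapped Gene IDs and other basic information, but they do not exist for every GPL. Defaults to TRUE.
-f filename: you can put the names of the GEO objects to download in a file and point to it with this option. Note: if you use -f, do not also set -n, because the -n value will be overridden.
-s supp: Logical TRUE or FALSE: whether to download the raw supplementary files.
Minimal Use Method
==================
If you do not know how to use these options, just set -n option is OK
Like
./bulkGEO.sh -n 'GEO1 GEO2 GEO3'
change the 'GEO' above to name of GSE you want to download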
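The looping idea can be sketched as follows. The sketch only prints the commands it would run, so it is safe to try. Everything here is an assumption made for illustration: the function names are hypothetical, the names file is assumed to hold whitespace-separated accessions (with blank lines and lines starting with # skipped), and the file is assumed to take precedence over the -n string.

```shell
# Hypothetical helper: print one accession per line.
# $1 = space-separated accession string (like -n), $2 = names file (like -f).
list_accessions() {
    names="$1"
    file="$2"
    if [ -n "$file" ]; then
        grep -v '^[[:space:]]*#' "$file" | tr -s '[:space:]' '\n' | grep -v '^$'
    elif [ -n "$names" ]; then
        # Unquoted on purpose so the shell splits the string on spaces.
        printf '%s\n' $names
    fi
}

# Hypothetical planner: show which script calls the loop would make.
# $3 = destination directory, $4 = TRUE to also fetch supplementary files.
bulk_plan() {
    destdir="${3:-$PWD}"
    supp="${4:-FALSE}"
    list_accessions "$1" "$2" | while read -r acc; do
        echo "./getGEO.sh -n $acc -d $destdir"
        if [ "$supp" = "TRUE" ]; then
            echo "./getGEOSuppFiles.sh -n $acc -d $destdir"
        fi
    done
}

bulk_plan 'GSE2 GSE76730' '' /tmp/geo TRUE
```

Replacing each echo with the command itself would turn the plan into a real download loop.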
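I wrote these scripts yesterday to save myself what felt like a tedious download routine. I am not yet very proficient at Linux shell scripting, so the scripts may contain problems. Basic downloads should work without errors, since I have tested them. If you run into issues or would like other features, feel free to ask and I will try to sort them out.
Thanks for reading~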
------------------------------------------------------------------------------------------------------------
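Today I happened to be downloading GEO data on a new machine with only a few basic R packages installed, so here is what the scripts look like in action.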
[wangshx@HPC-login mytoolkit]$ ./bulkGEO.sh -h
Usage: ./bulkGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -f filename -s supp
Detail of Options
==================
-n GEO:A character string representing GEO objects for download and parsing. (eg., 'GDS505','GSE2','GSM2','GPL96'), you can use space to seperate multiple objects. Or you can use the -f option to locate the file where you put names of GEO object.
-d destdir:The destination directory for any downloads. Defaults to the current directory. You may want to specify a different directory if you want to save the file for later use. Doing so is a good idea if you have a slow connection, as some of the GEO files are HUGE!
-M A boolean telling GEOquery whether or not to use GSE Series Matrix files from GEO. The parsing of these files can be many orders-of-magnitude faster than parsing the GSE SOFT format files. Defaults to TRUE, meaning that the SOFT format parsing will not occur; set to FALSE if you for some reason need other columns from the GSE records.
-A A boolean defaulting to TRUE as to whether or not to use the Annotation GPL information. These files are nice to use because they contain up-to-date information remapped from Entrez Gene on a regular basis. However, they do not exist for all GPLs; in general, they are only available for GPLs referenced by a GDS
-f filename: a character string specify the filename where GEO names stored.
-s supp: A boolean defaulting to FALSE as to whether or not to download supplementary files.
Minimal Use Method
==================
If you do not know how to use these options, just set -n option is OK
Like
./bulkGEO.sh -n 'GEO1 GEO2 GEO3'
change the 'GEO*' above to name of GSE you want to download
[wangshx@HPC-login mytoolkit]$ ./bulkGEO.sh -d ~/workspace/GEO_data/igcc_cnv/ -f ~/workspace/GEO_data/igcc_cnv/geo_names.txt
Package GEOquery not available. Atempting to install it.
Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.2 (2017-09-28).
Installing package(s) ‘GEOquery’
also installing the dependencies ‘BiocGenerics’, ‘Biobase’
trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/BiocGenerics_0.24.0.tar.gz'
Content type 'application/x-gzip' length 43393 bytes (42 KB)
==================================================
downloaded 42 KB
trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/Biobase_2.38.0.tar.gz'
Content type 'application/x-gzip' length 1656734 bytes (1.6 MB)
==================================================
downloaded 1.6 MB
trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/GEOquery_2.46.13.tar.gz'
Content type 'application/x-gzip' length 13745245 bytes (13.1 MB)
==================================================
downloaded 13.1 MB
* installing *source* package ‘BiocGenerics’ ...
** R
** inst
** preparing package for lazy loading
Creating a new generic function for ‘append’ in package ‘BiocGenerics’
Creating a new generic function for ‘as.data.frame’ in package ‘BiocGenerics’
Creating a new generic function for ‘cbind’ in package ‘BiocGenerics’
Creating a new generic function for ‘rbind’ in package ‘BiocGenerics’
Creating a new generic function for ‘do.call’ in package ‘BiocGenerics’
Creating a new generic function for ‘duplicated’ in package ‘BiocGenerics’
Creating a new generic function for ‘anyDuplicated’ in package ‘BiocGenerics’
Creating a new generic function for ‘eval’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmax’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmin’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmax.int’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmin.int’ in package ‘BiocGenerics’
Creating a new generic function for ‘Reduce’ in package ‘BiocGenerics’
Creating a new generic function for ‘Filter’ in package ‘BiocGenerics’
Creating a new generic function for ‘Find’ in package ‘BiocGenerics’
Creating a new generic function for ‘Map’ in package ‘BiocGenerics’
Creating a new generic function for ‘Position’ in package ‘BiocGenerics’
Creating a new generic function for ‘get’ in package ‘BiocGenerics’
Creating a new generic function for ‘mget’ in package ‘BiocGenerics’
Creating a new generic function for ‘grep’ in package ‘BiocGenerics’
Creating a new generic function for ‘grepl’ in package ‘BiocGenerics’
Creating a new generic function for ‘is.unsorted’ in package ‘BiocGenerics’
Creating a new generic function for ‘lapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘sapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘lengths’ in package ‘BiocGenerics’
Creating a new generic function for ‘mapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘match’ in package ‘BiocGenerics’
Creating a new generic function for ‘rowSums’ in package ‘BiocGenerics’
Creating a new generic function for ‘colSums’ in package ‘BiocGenerics’
Creating a new generic function for ‘rowMeans’ in package ‘BiocGenerics’
Creating a new generic function for ‘colMeans’ in package ‘BiocGenerics’
Creating a new generic function for ‘order’ in package ‘BiocGenerics’
Creating a new generic function for ‘paste’ in package ‘BiocGenerics’
Creating a new generic function for ‘rank’ in package ‘BiocGenerics’
Creating a new generic function for ‘rownames’ in package ‘BiocGenerics’
Creating a new generic function for ‘colnames’ in package ‘BiocGenerics’
Creating a new generic function for ‘union’ in package ‘BiocGenerics’
Creating a new generic function for ‘intersect’ in package ‘BiocGenerics’
Creating a new generic function for ‘setdiff’ in package ‘BiocGenerics’
Creating a new generic function for ‘sort’ in package ‘BiocGenerics’
Creating a new generic function for ‘table’ in package ‘BiocGenerics’
Creating a new generic function for ‘tapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘unique’ in package ‘BiocGenerics’
Creating a new generic function for ‘unsplit’ in package ‘BiocGenerics’
Creating a new generic function for ‘var’ in package ‘BiocGenerics’
Creating a new generic function for ‘sd’ in package ‘BiocGenerics’
Creating a new generic function for ‘which’ in package ‘BiocGenerics’
Creating a new generic function for ‘which.max’ in package ‘BiocGenerics’
Creating a new generic function for ‘which.min’ in package ‘BiocGenerics’
Creating a new generic function for ‘IQR’ in package ‘BiocGenerics’
Creating a new generic function for ‘mad’ in package ‘BiocGenerics’
Creating a new generic function for ‘xtabs’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterCall’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterApply’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterApplyLB’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterEvalQ’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterExport’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterMap’ in package ‘BiocGenerics’
Creating a new generic function for ‘parLapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parSapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parApply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parRapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parCapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parLapplyLB’ in package ‘BiocGenerics’
Creating a new generic function for ‘parSapplyLB’ in package ‘BiocGenerics’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (BiocGenerics)
* installing *source* package ‘Biobase’ ...
** libs
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c Rinit.c -o Rinit.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c anyMissing.c -o anyMissing.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c envir.c -o envir.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c matchpt.c -o matchpt.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c rowMedians.c -o rowMedians.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c sublist_extract.c -o sublist_extract.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -shared -L/public/home/wangshx/anaconda3/lib/R/lib -Wl,-O2,--sort-common,--as-needed,-z,relro,-z,now -L/public/home/wangshx/anaconda3/lib -o Biobase.so Rinit.o anyMissing.o envir.o matchpt.o rowMedians.o sublist_extract.o -L/public/home/wangshx/anaconda3/lib/R/lib -lR
installing to /public/home/wangshx/anaconda3/lib/R/library/Biobase/libs
** R
** data
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (Biobase)
* installing *source* package ‘GEOquery’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (GEOquery)
The downloaded source packages are in
‘/tmp/Rtmptc9bgw/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Old packages: 'hms', 'limma', 'Rcpp', 'tibble', 'xml2'
GEO: GSE76730
destdir: /public/home/wangshx/workspace/GEO_data/igcc_cnv/
GSEMatrix: TRUE
AnnotGPL: TRUE
getGPL: TRUE
Found 1 file(s)
GSE76730_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE76nnn/GSE76730/matrix/GSE76730_series_matrix.txt.gz'
Content type 'application/x-gzip' length 262447098 bytes (250.3 MB)
==================================================
downloaded 250.3 MB
Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)
See spec(...) for full column specifications.
Annotation GPL not available, so will use submitter GPL instead
File stored at:
/public/home/wangshx/workspace/GEO_data/igcc_cnv//GPL3718.soft
$GSE76730_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 261981 features, 190 samples
  element names: exprs
protocolData: none
phenoData
  sampleNames: GSM2036728 GSM2036729 ... GSM2036917 (190 total)
  varLabels: title geo_accession ... who performance status:ch1 (61 total)
  varMetadata: labelDescription
featureData
  featureNames: SNP_A-1780270 SNP_A-1780272 ... SNP_A-4241299 (261981 total)
  fvarLabels: ID Affy SNP ID ... SPOT_ID (27 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL3718
Warning message:
In download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :
  cannot open URL 'https://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL3nnn/GPL3718/annot/GPL3718.annot.gz': HTTP status was '404 Not Found'
The files of GSE76730 download successfully!
The files of GSE76730 download successfully!
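The series matrix URL in the log follows a regular pattern: the accession's last three digits are replaced by "nnn" to form the parent directory. As a small aside, the sketch below rebuilds that URL in shell, which can be useful for checking a file's size before a bulk run. The function name is hypothetical, the handling of accessions with three or fewer digits is my assumption, and the file name pattern assumes a series with a single matrix file, as in the log.

```shell
# Hypothetical helper: rebuild the series matrix URL seen in the log above.
geo_series_matrix_url() {
    acc="$1"
    prefix="${acc%%[0-9]*}"      # letters before the first digit, e.g. GSE
    num="${acc#"$prefix"}"       # the numeric part, e.g. 76730
    if [ "${#num}" -gt 3 ]; then
        stub="${prefix}${num%???}nnn"   # GSE76730 -> GSE76nnn
    else
        stub="${prefix}nnn"             # assumption for short accessions such as GSE2
    fi
    printf 'https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/%s_series_matrix.txt.gz\n' \
        "$stub" "$acc" "$acc"
}

geo_series_matrix_url GSE76730
# prints: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE76nnn/GSE76730/matrix/GSE76730_series_matrix.txt.gz
```

For GSE76730 the result matches the URL that GEOquery reported while downloading.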