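Downloading GEO data from the Linux terminal

Several commonly used functions in Bioconductor's GEOquery package can download GEO data, but sometimes I want to download directly from the terminal instead of opening RStudio and running a script. So I wrote simple shell-script wrappers around GEOquery's two download functions, getGEO() and getGEOSuppFiles().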

Installation

Use the clone command:

git clone https://github.com/ShixiangWang/mytoolkit/

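Alternatively, click the Clone or download button at the top right of the repository page.

Prerequisites and help

R must be installed on your Linux system. If the GEOquery package is not installed, the scripts detect this and download and install it automatically.

To view the help for each script: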

./getGEOSuppFiles.sh -h
./getGEO.sh -h
./bulkGEO.sh -h
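
The automatic GEOquery check mentioned above could look roughly like the sketch below. This is not the code from the repository: the function names are made up for illustration, and it uses BiocManager (the current Bioconductor installer) with an assumed CRAN mirror, whereas the log further down shows the older biocLite route.

```shell
#!/bin/sh
# Hypothetical sketch of an "install GEOquery if missing" check.
# geoquery_check_expr and ensure_geoquery are illustrative names only.

# Print the R code that installs GEOquery when it cannot be loaded.
geoquery_check_expr() {
    cat <<'EOF'
if (!requireNamespace("GEOquery", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager", repos = "https://cloud.r-project.org")
  BiocManager::install("GEOquery")
}
EOF
}

# Run the check through Rscript, failing early when R is absent.
ensure_geoquery() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found: install R first" >&2
        return 1
    fi
    Rscript -e "$(geoquery_check_expr)"
}

# Only print the R code here; call ensure_geoquery to actually run it.
geoquery_check_expr
```

Keeping the R code in its own function makes it easy to inspect before anything is installed.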

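Downloading GEO supplementary files

GEO supplementary files are usually the raw microarray data.

How to use:

Usage: ./getGEOSuppFiles.sh -n GEO -d directory
GEO:  a GEO accession, e.g. GPL1073 or GSM1137
directory: the directory to download into; defaults to your current directory.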
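
For readers curious how a wrapper like this might read its two options, here is a minimal getopts sketch. It only mirrors the usage line above (-n, -d, -h, with the current directory as the default); the function name and the printed summary are assumptions, not the actual contents of getGEOSuppFiles.sh.

```shell
#!/bin/sh
# Hypothetical sketch: parse -n (accession) and -d (directory) as the
# usage line describes. This is not the real getGEOSuppFiles.sh.
parse_supp_args() {
    geo=""
    dir="$PWD"      # default: the current directory
    OPTIND=1        # reset so the function can be called more than once
    while getopts "n:d:h" opt "$@"; do
        case "$opt" in
            n) geo="$OPTARG" ;;
            d) dir="$OPTARG" ;;
            h) echo "Usage: ./getGEOSuppFiles.sh -n GEO -d directory"
               return 0 ;;
            *) return 2 ;;
        esac
    done
    if [ -z "$geo" ]; then
        echo "error: -n GEO is required" >&2
        return 2
    fi
    # Report what would be passed on to getGEOSuppFiles().
    echo "GEO=$geo DIR=$dir"
}

parse_supp_args -n GSE76730 -d /tmp/geo
# prints: GEO=GSE76730 DIR=/tmp/geo
```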

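Downloading GEO expression matrix files

This is the most commonly used feature: downloading a microarray's expression matrix file. The data have already been preprocessed by the submitting researchers and can be used directly.

How to use: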

Usage: ./getGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -P getGPL

Detail of Options
==================
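-n      GEO: a character string naming the GEO object (e.g. 'GDS505','GSE2','GSM2','GPL96')
-d      destdir: the destination directory for downloads; defaults to the current directory.
-M      Logical TRUE or FALSE: whether to download the GSE Series Matrix file. Defaults to TRUE.
-A      Logical TRUE or FALSE: whether to use the Annotation GPL files (they will be downloaded). These files contain up-to-date Gene ID mappings and other basic information, but they do not exist for every platform. Defaults to TRUE.
-P      Logical TRUE or FALSE: whether to download the GPL information along with the GSEMatrix file. If you know you will use a Bioconductor annotation package instead, you can set this to FALSE. Defaults to TRUE.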

Minimal Use Method
==================
If you do not know how to use these options, just set -n option is OK
                Like
             ./getGEO.sh -n GEO
        change the 'GEO' above to name of GSE you want to download
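
To make the mapping between these options and GEOquery concrete, the sketch below builds the R call that such a wrapper could hand to Rscript. The function name build_getgeo_expr is hypothetical and the layout is an assumption; only the getGEO() argument names (destdir, GSEMatrix, AnnotGPL, getGPL) and the TRUE defaults come from the option list above.

```shell
#!/bin/sh
# Hypothetical sketch: turn shell arguments into one GEOquery::getGEO()
# call. Not the actual getGEO.sh.
build_getgeo_expr() {
    # $1 accession, $2 destdir, $3 GSEMatrix, $4 AnnotGPL, $5 getGPL
    geo="$1"
    destdir="${2:-.}"        # default: current directory
    gsematrix="${3:-TRUE}"   # defaults follow the option list above
    annotgpl="${4:-TRUE}"
    getgpl="${5:-TRUE}"
    printf 'GEOquery::getGEO("%s", destdir = "%s", GSEMatrix = %s, AnnotGPL = %s, getGPL = %s)\n' \
        "$geo" "$destdir" "$gsematrix" "$annotgpl" "$getgpl"
}

r_call=$(build_getgeo_expr GSE76730 /tmp/geo)
echo "$r_call"
# To actually download (needs R, GEOquery and network access):
#   Rscript -e "$r_call"
```

Printing the call first gives a cheap dry run before a large download starts.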

Bulk download of expression matrix files and raw files

This feature builds on the two scripts above and calls them in a loop.

How to use:

Usage: ./bulkGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -f filename -s supp


Detail of Options
==================
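-n      GEO: a character string naming the GEO objects (e.g. 'GDS505','GSE2','GSM2','GPL96')
-d      destdir: the destination directory for downloads; defaults to the current directory.
-M      Logical TRUE or FALSE: whether to download the GSE Series Matrix file. Defaults to TRUE.
-A      Logical TRUE or FALSE: whether to use the Annotation GPL files (they will be downloaded). These files contain up-to-date Gene ID mappings and other basic information, but they do not exist for every platform. Defaults to TRUE.
-f      filename: you can put the names of the GEO objects to download into a file and point to it with this option. Note: if you use -f, do not also set -n, or it will be overridden.
-s      supp: Logical TRUE or FALSE: whether to also download the raw supplementary files. Defaults to FALSE.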

Minimal Use Method
==================
If you do not know how to use these options, just set -n option is OK
                Like
             ./bulkGEO.sh -n 'GEO1 GEO2 GEO3'
        change the 'GEO' above to name of GSE you want to download
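
The looping idea can be sketched as follows. This is a dry run that only prints the per-accession commands, and it is an illustration rather than the real bulkGEO.sh: the function names are invented, and letting the file win when both a name string and a file are given is my assumption.

```shell
#!/bin/sh
# Hypothetical sketch of the bulk loop: collect accessions from a
# space-separated string or from a file, then handle them one at a time.
list_accessions() {
    # $1: space-separated names (may be empty); $2: optional file of names
    if [ -n "$2" ]; then
        names=$(cat "$2")    # assumption: the file takes precedence
    else
        names="$1"
    fi
    for acc in $names; do    # unquoted on purpose: split on whitespace
        echo "$acc"
    done
}

bulk_dry_run() {
    # $3: destination directory (defaults to the current directory).
    # Print the per-accession command instead of running it.
    list_accessions "$1" "$2" | while read -r acc; do
        echo "./getGEO.sh -n $acc -d ${3:-.}"
    done
}

bulk_dry_run 'GSE2 GSE76730' '' /tmp/geo
# prints:
#   ./getGEO.sh -n GSE2 -d /tmp/geo
#   ./getGEO.sh -n GSE76730 -d /tmp/geo
```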

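I wrote this code yesterday to spare myself what felt like a tedious download routine. I am not yet very proficient in Linux shell scripting, so the scripts may contain problems. Basic downloads work; I have tested them. If you hit a problem or want another feature, feel free to ask and I will try to address it.

Thanks for reading~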

------------------------------------------------------------------------------------------------------------

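Today I happened to be downloading GEO data on a new machine with only a few basic R packages installed, so here is what a run looks like.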

[wangshx@HPC-login mytoolkit]$ ./bulkGEO.sh -h


Usage: ./bulkGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -f filename -s supp

Detail of Options
==================
-n	GEO:A character string representing GEO objects for download and parsing. (eg., 'GDS505','GSE2','GSM2','GPL96'), you can use space to seperate multiple objects. Or you can use the -f option to locate the file where you put names of GEO object.
-d	destdir:The destination directory for any downloads. Defaults to the current directory. You may want to specify a different directory if you want to save the file for later use. Doing so is a good idea if you have a slow connection, as some of the GEO files are HUGE!
-M	A boolean telling GEOquery whether or not to use GSE Series Matrix files from GEO. The parsing of these files can be many orders-of-magnitude faster than parsing the GSE SOFT format files. Defaults to TRUE, meaning that the SOFT format parsing will not occur; set to FALSE if you for some reason need other columns from the GSE records.
-A	A boolean defaulting to TRUE as to whether or not to use the Annotation GPL information. These files are nice to use because they contain up-to-date information remapped from Entrez Gene on a regular basis. However, they do not exist for all GPLs; in general, they are only available for GPLs referenced by a GDS
-f	filename: a character string specify the filename where GEO names stored.
-s	supp: A boolean defaulting to FALSE as to whether or not to download supplementary files. 

Minimal Use Method
==================
If you do not know how to use these options, just set -n option is OK
		Like
             ./bulkGEO.sh -n 'GEO1 GEO2 GEO3'                
	change the 'GEO*' above to name of GSE you want to download

[wangshx@HPC-login mytoolkit]$ ./bulkGEO.sh -d ~/workspace/GEO_data/igcc_cnv/ -f ~/workspace/GEO_data/igcc_cnv/geo_names.txt 



Package GEOquery not available. Atempting to install it.
Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.2 (2017-09-28).
Installing package(s) ‘GEOquery’
also installing the dependencies ‘BiocGenerics’, ‘Biobase’

trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/BiocGenerics_0.24.0.tar.gz'
Content type 'application/x-gzip' length 43393 bytes (42 KB)
==================================================
downloaded 42 KB

trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/Biobase_2.38.0.tar.gz'
Content type 'application/x-gzip' length 1656734 bytes (1.6 MB)
==================================================
downloaded 1.6 MB

trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/GEOquery_2.46.13.tar.gz'
Content type 'application/x-gzip' length 13745245 bytes (13.1 MB)
==================================================
downloaded 13.1 MB

* installing *source* package ‘BiocGenerics’ ...
** R
** inst
** preparing package for lazy loading
Creating a new generic function for ‘append’ in package ‘BiocGenerics’
Creating a new generic function for ‘as.data.frame’ in package ‘BiocGenerics’
Creating a new generic function for ‘cbind’ in package ‘BiocGenerics’
Creating a new generic function for ‘rbind’ in package ‘BiocGenerics’
Creating a new generic function for ‘do.call’ in package ‘BiocGenerics’
Creating a new generic function for ‘duplicated’ in package ‘BiocGenerics’
Creating a new generic function for ‘anyDuplicated’ in package ‘BiocGenerics’
Creating a new generic function for ‘eval’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmax’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmin’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmax.int’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmin.int’ in package ‘BiocGenerics’
Creating a new generic function for ‘Reduce’ in package ‘BiocGenerics’
Creating a new generic function for ‘Filter’ in package ‘BiocGenerics’
Creating a new generic function for ‘Find’ in package ‘BiocGenerics’
Creating a new generic function for ‘Map’ in package ‘BiocGenerics’
Creating a new generic function for ‘Position’ in package ‘BiocGenerics’
Creating a new generic function for ‘get’ in package ‘BiocGenerics’
Creating a new generic function for ‘mget’ in package ‘BiocGenerics’
Creating a new generic function for ‘grep’ in package ‘BiocGenerics’
Creating a new generic function for ‘grepl’ in package ‘BiocGenerics’
Creating a new generic function for ‘is.unsorted’ in package ‘BiocGenerics’
Creating a new generic function for ‘lapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘sapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘lengths’ in package ‘BiocGenerics’
Creating a new generic function for ‘mapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘match’ in package ‘BiocGenerics’
Creating a new generic function for ‘rowSums’ in package ‘BiocGenerics’
Creating a new generic function for ‘colSums’ in package ‘BiocGenerics’
Creating a new generic function for ‘rowMeans’ in package ‘BiocGenerics’
Creating a new generic function for ‘colMeans’ in package ‘BiocGenerics’
Creating a new generic function for ‘order’ in package ‘BiocGenerics’
Creating a new generic function for ‘paste’ in package ‘BiocGenerics’
Creating a new generic function for ‘rank’ in package ‘BiocGenerics’
Creating a new generic function for ‘rownames’ in package ‘BiocGenerics’
Creating a new generic function for ‘colnames’ in package ‘BiocGenerics’
Creating a new generic function for ‘union’ in package ‘BiocGenerics’
Creating a new generic function for ‘intersect’ in package ‘BiocGenerics’
Creating a new generic function for ‘setdiff’ in package ‘BiocGenerics’
Creating a new generic function for ‘sort’ in package ‘BiocGenerics’
Creating a new generic function for ‘table’ in package ‘BiocGenerics’
Creating a new generic function for ‘tapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘unique’ in package ‘BiocGenerics’
Creating a new generic function for ‘unsplit’ in package ‘BiocGenerics’
Creating a new generic function for ‘var’ in package ‘BiocGenerics’
Creating a new generic function for ‘sd’ in package ‘BiocGenerics’
Creating a new generic function for ‘which’ in package ‘BiocGenerics’
Creating a new generic function for ‘which.max’ in package ‘BiocGenerics’
Creating a new generic function for ‘which.min’ in package ‘BiocGenerics’
Creating a new generic function for ‘IQR’ in package ‘BiocGenerics’
Creating a new generic function for ‘mad’ in package ‘BiocGenerics’
Creating a new generic function for ‘xtabs’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterCall’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterApply’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterApplyLB’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterEvalQ’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterExport’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterMap’ in package ‘BiocGenerics’
Creating a new generic function for ‘parLapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parSapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parApply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parRapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parCapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parLapplyLB’ in package ‘BiocGenerics’
Creating a new generic function for ‘parSapplyLB’ in package ‘BiocGenerics’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (BiocGenerics)
* installing *source* package ‘Biobase’ ...
** libs
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG   -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include  -c Rinit.c -o Rinit.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG   -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include  -c anyMissing.c -o anyMissing.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG   -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include  -c envir.c -o envir.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG   -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include  -c matchpt.c -o matchpt.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG   -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include  -c rowMedians.c -o rowMedians.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG   -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include  -c sublist_extract.c -o sublist_extract.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -shared -L/public/home/wangshx/anaconda3/lib/R/lib -Wl,-O2,--sort-common,--as-needed,-z,relro,-z,now -L/public/home/wangshx/anaconda3/lib -o Biobase.so Rinit.o anyMissing.o envir.o matchpt.o rowMedians.o sublist_extract.o -L/public/home/wangshx/anaconda3/lib/R/lib -lR
installing to /public/home/wangshx/anaconda3/lib/R/library/Biobase/libs
** R
** data
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (Biobase)
* installing *source* package ‘GEOquery’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (GEOquery)

The downloaded source packages are in
	‘/tmp/Rtmptc9bgw/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Old packages: 'hms', 'limma', 'Rcpp', 'tibble', 'xml2'
GEO: GSE76730
destdir: /public/home/wangshx/workspace/GEO_data/igcc_cnv/
GSEMatrix: TRUE
AnnotGPL: TRUE
getGPL: TRUE

Found 1 file(s)
GSE76730_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE76nnn/GSE76730/matrix/GSE76730_series_matrix.txt.gz'
Content type 'application/x-gzip' length 262447098 bytes (250.3 MB)
==================================================
downloaded 250.3 MB

Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)
See spec(...) for full column specifications.
Annotation GPL not available, so will use submitter GPL instead
File stored at: 
/public/home/wangshx/workspace/GEO_data/igcc_cnv//GPL3718.soft
$GSE76730_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 261981 features, 190 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM2036728 GSM2036729 ... GSM2036917 (190 total)
  varLabels: title geo_accession ... who performance status:ch1 (61
    total)
  varMetadata: labelDescription
featureData
  featureNames: SNP_A-1780270 SNP_A-1780272 ... SNP_A-4241299 (261981
    total)
  fvarLabels: ID Affy SNP ID ... SPOT_ID (27 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL3718 

Warning message:
In download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :
  cannot open URL 'https://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL3nnn/GPL3718/annot/GPL3718.annot.gz': HTTP status was '404 Not Found'

The files of GSE76730 download successfully!


The files of GSE76730 download successfully!
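
The series-matrix URL in the log follows a regular pattern, which the sketch below rebuilds from an accession. The function name is hypothetical, and the folder rule is an assumption based on the URL above: the last three digits become "nnn", and accessions with three or fewer digits sit under "GSEnnn". Series that span several platforms may use different file names, so treat this as a starting point only.

```shell
#!/bin/sh
# Sketch: rebuild the series-matrix URL seen in the log from a GSE accession.
# geo_series_matrix_url is an illustrative name, not part of the toolkit.
geo_series_matrix_url() {
    acc="$1"
    num="${acc#GSE}"                 # digits after the "GSE" prefix
    if [ "${#num}" -le 3 ]; then
        stub="GSEnnn"                # assumed folder for short accessions
    else
        stub="GSE${num%???}nnn"      # drop the last three digits
    fi
    echo "https://ftp.ncbi.nlm.nih.gov/geo/series/$stub/$acc/matrix/${acc}_series_matrix.txt.gz"
}

geo_series_matrix_url GSE76730
# prints: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE76nnn/GSE76730/matrix/GSE76730_series_matrix.txt.gz
```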

Posted on 2018-01-16 13:50