Downloading GEO data from the Linux terminal

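Several commonly used functions in Bioconductor's GEOquery package can download GEO data, but sometimes I want to download directly from the terminal rather than open RStudio and run a script. So I wrote shell scripts that thinly wrap GEOquery's two download functions, getGEO() and getGEOSuppFiles().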


Installation

Use the clone command:

git clone https://github.com/ShixiangWang/mytoolkit/

Or click the "Clone or download" button at the top right of the repository page.

Prerequisites and help

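R must be installed on your Linux system. If the GEOquery package is not installed, the scripts detect this and download and install it automatically.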
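The automatic check described above could look roughly like the following. This is a minimal sketch, not the code in the repository: the function names `pkg_check_expr` and `ensure_r_pkg` are hypothetical, and it installs through BiocManager, which is an assumption on my part (the log further down shows the older biocLite installer).

```shell
#!/bin/sh
# Hypothetical sketch of the "install GEOquery if missing" check.

# Print an R one-liner that exits with status 0 if package $1 is installed,
# and status 1 otherwise.
pkg_check_expr() {
    printf 'quit(save = "no", status = as.integer(!requireNamespace("%s", quietly = TRUE)))' "$1"
}

# Make sure R package $1 is available, installing it if necessary.
# Assumption: BiocManager is an acceptable installer on this machine.
ensure_r_pkg() {
    pkg=$1
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found; please install R first." >&2
        return 127
    fi
    if Rscript -e "$(pkg_check_expr "$pkg")" >/dev/null 2>&1; then
        echo "Package $pkg is available."
    else
        echo "Package $pkg not available. Attempting to install it."
        Rscript -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager', repos = 'https://cloud.r-project.org'); BiocManager::install('$pkg', ask = FALSE)"
    fi
}

# Usage (not executed here): ensure_r_pkg GEOquery
```

Keeping the check as a separate R expression makes it possible to test the shell logic without touching the network.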

View each script's help:

./getGEOSuppFiles.sh -h
./getGEO.sh -h
./bulkGEO.sh -h

Downloading GEO supplementary files

GEO supplementary files are generally the raw microarray data.

Usage

Usage: ./getGEOSuppFiles.sh -n GEO -d directory
GEO:  a GEO accession number, e.g. GPL1073 or GSM1137
directory: the directory to download into; defaults to your current directory.
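To make the wrapping idea concrete, here is a small sketch of how a shell function might validate the accession and hand it to getGEOSuppFiles(). It is an illustration under assumptions, not the contents of getGEOSuppFiles.sh: the helper names are hypothetical, and the accession check only accepts GSE, GSM and GPL prefixes.

```shell
#!/bin/sh
# Hypothetical sketch of a getGEOSuppFiles() wrapper.

# Succeed if $1 looks like GSE/GSM/GPL followed only by digits.
is_geo_accession() {
    case "$1" in
        GSE*|GSM*|GPL*) ;;
        *) return 1 ;;
    esac
    digits=${1#???}
    case "$digits" in
        ''|*[!0-9]*) return 1 ;;
    esac
    return 0
}

# Print the R expression that downloads supplementary files for
# accession $1 into base directory $2.
supp_r_expr() {
    printf 'GEOquery::getGEOSuppFiles("%s", baseDir = "%s")' "$1" "$2"
}

# Download supplementary files for $1 into $2 (default: current directory).
get_supp() {
    acc=$1
    dir=${2:-.}
    if ! is_geo_accession "$acc"; then
        echo "Invalid GEO accession: $acc" >&2
        return 2
    fi
    mkdir -p "$dir" && Rscript -e "$(supp_r_expr "$acc" "$dir")"
}

# Usage (not executed here): get_supp GSM1137 ~/geo_raw
```

Rejecting malformed accessions before starting R saves the startup time of an R session that would fail anyway.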

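Downloading GEO expression matrix files

This is the most commonly used feature: it downloads the expression matrix file of a microarray dataset. The data have already been preprocessed by the researchers who submitted them and can be used directly.

Usage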

Usage: ./getGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -P getGPL

Detail of Options
==================
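-n      GEO: a character string naming the GEO object (e.g. 'GDS505', 'GSE2', 'GSM2', 'GPL96')
-d      destdir: the destination directory for downloads; defaults to the current directory.
-M      Logical (TRUE/FALSE): whether to download the GSE Series Matrix file. Defaults to TRUE.
-A      Logical (TRUE/FALSE): whether to use the Annotation GPL file (it will be downloaded). These files contain up-to-date Gene ID mappings and other basic information, but they are not available for every platform. Defaults to TRUE.
-P      Logical (TRUE/FALSE): whether to download the GPL information along with the GSEMatrix file. If you know you will use Bioconductor annotation packages, you can choose FALSE. Defaults to TRUE.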

Minimal Use Method
==================
If you do not know how to use these options, just set -n option is OK
                Like
             ./getGEO.sh -n GEO
        change the 'GEO' above to name of GSE you want to download
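The option table above maps naturally onto the arguments of getGEO(). The sketch below shows one way to do that mapping with getopts; it is an assumption about how such a wrapper could be written, not a copy of getGEO.sh, and `build_getgeo_expr` is a hypothetical name. The defaults follow the table above.

```shell
#!/bin/sh
# Hypothetical sketch: translate -n/-d/-M/-A/-P into a getGEO() call.

build_getgeo_expr() {
    geo=""
    destdir="."
    matrix=TRUE
    annot=TRUE
    gpl=TRUE
    OPTIND=1
    while getopts "n:d:M:A:P:" opt "$@"; do
        case "$opt" in
            n) geo=$OPTARG ;;
            d) destdir=$OPTARG ;;
            M) matrix=$OPTARG ;;
            A) annot=$OPTARG ;;
            P) gpl=$OPTARG ;;
            *) return 2 ;;
        esac
    done
    if [ -z "$geo" ]; then
        echo "option -n is required" >&2
        return 2
    fi
    # The three logical options must be literal R booleans.
    for flag in "$matrix" "$annot" "$gpl"; do
        case "$flag" in
            TRUE|FALSE) ;;
            *) echo "expected TRUE or FALSE, got: $flag" >&2; return 2 ;;
        esac
    done
    printf 'GEOquery::getGEO("%s", destdir = "%s", GSEMatrix = %s, AnnotGPL = %s, getGPL = %s)\n' \
        "$geo" "$destdir" "$matrix" "$annot" "$gpl"
}

# Usage (not executed here):
#   Rscript -e "$(build_getgeo_expr -n GSE76730 -d "$HOME/geo")"
```

Validating the logical flags in the shell gives a clearer error than letting R fail on an unknown symbol.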

Bulk downloading expression matrix files and raw files

This feature builds on the previous two scripts and calls them in a loop.

Usage

Usage: ./bulkGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -f filename -s supp


Detail of Options
==================
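-n      GEO: a character string naming the GEO object (e.g. 'GDS505', 'GSE2', 'GSM2', 'GPL96')
-d      destdir: the destination directory for downloads; defaults to the current directory.
-M      Logical (TRUE/FALSE): whether to download the GSE Series Matrix file. Defaults to TRUE.
-A      Logical (TRUE/FALSE): whether to use the Annotation GPL file (it will be downloaded). These files contain up-to-date Gene ID mappings and other basic information, but they are not available for every platform. Defaults to TRUE.
-f      filename: you can put the names of the GEO objects to download in a file and point to it with this option. Note: if you use it, do not set the -n option as well, otherwise it will be overridden.
-s      supp: Logical (TRUE/FALSE): whether to download the raw supplementary files.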

Minimal Use Method
==================
If you do not know how to use these options, just set -n option is OK
                Like
             ./bulkGEO.sh -n 'GEO1 GEO2 GEO3'
        change the 'GEO' above to name of GSE you want to download
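The loop over accessions can be sketched as follows. This is an illustrative outline rather than the actual bulkGEO.sh. It assumes the names file holds whitespace-separated accessions with optional `#` comments (the post does not specify the file format), and the `GETGEO` variable is a hypothetical hook for choosing which downloader to call.

```shell
#!/bin/sh
# Hypothetical sketch of the bulk download loop.

# Print one accession per line from file $1.
# Assumption: names are separated by spaces, tabs or newlines; text after '#' is ignored.
read_accessions() {
    sed 's/#.*//' "$1" | tr -s ' \t' '\n\n' | sed '/^$/d'
}

# Call the single-accession downloader once per accession.
# $1 is the destination directory, the remaining arguments are accessions.
# Returns non-zero if at least one download failed.
bulk_download() {
    destdir=$1
    shift
    failed=0
    for acc in "$@"; do
        if "${GETGEO:-./getGEO.sh}" -n "$acc" -d "$destdir"; then
            echo "OK   $acc"
        else
            echo "FAIL $acc"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# Usage (not executed here):
#   bulk_download ~/geo $(read_accessions geo_names.txt)
```

Continuing after a failed accession, instead of aborting, means one bad name does not block the rest of a long list.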

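I wrote this code yesterday to spare myself what felt like a tedious download process. I am not yet very proficient in Linux shell scripting, so the scripts may contain problems. Basic downloads work without errors; I have tested them. If you run into problems or want other features, feel free to ask and I will try to address them.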

Thanks for reading~


------------------------------------------------------------------------------------------------------------

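Today I happened to be downloading GEO data on a new machine with only a few basic R packages installed, so here is what it looks like in practice.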

[wangshx@HPC-login mytoolkit]$ ./bulkGEO.sh -h


Usage: ./bulkGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -f filename -s supp

Detail of Options
==================
-n	GEO:A character string representing GEO objects for download and parsing. (eg., 'GDS505','GSE2','GSM2','GPL96'), you can use space to seperate multiple objects. Or you can use the -f option to locate the file where you put names of GEO object.
-d	destdir:The destination directory for any downloads. Defaults to the current directory. You may want to specify a different directory if you want to save the file for later use. Doing so is a good idea if you have a slow connection, as some of the GEO files are HUGE!
-M	A boolean telling GEOquery whether or not to use GSE Series Matrix files from GEO. The parsing of these files can be many orders-of-magnitude faster than parsing the GSE SOFT format files. Defaults to TRUE, meaning that the SOFT format parsing will not occur; set to FALSE if you for some reason need other columns from the GSE records.
-A	A boolean defaulting to TRUE as to whether or not to use the Annotation GPL information. These files are nice to use because they contain up-to-date information remapped from Entrez Gene on a regular basis. However, they do not exist for all GPLs; in general, they are only available for GPLs referenced by a GDS
-f	filename: a character string specify the filename where GEO names stored.
-s	supp: A boolean defaulting to FALSE as to whether or not to download supplementary files. 

Minimal Use Method
==================
If you do not know how to use these options, just set -n option is OK
		Like
             ./bulkGEO.sh -n 'GEO1 GEO2 GEO3'                
	change the 'GEO*' above to name of GSE you want to download

[wangshx@HPC-login mytoolkit]$ ./bulkGEO.sh -d ~/workspace/GEO_data/igcc_cnv/ -f ~/workspace/GEO_data/igcc_cnv/geo_names.txt 



Package GEOquery not available. Atempting to install it.
Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.2 (2017-09-28).
Installing package(s) ‘GEOquery’
also installing the dependencies ‘BiocGenerics’, ‘Biobase’

trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/BiocGenerics_0.24.0.tar.gz'
Content type 'application/x-gzip' length 43393 bytes (42 KB)
==================================================
downloaded 42 KB

trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/Biobase_2.38.0.tar.gz'
Content type 'application/x-gzip' length 1656734 bytes (1.6 MB)
==================================================
downloaded 1.6 MB

trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/GEOquery_2.46.13.tar.gz'
Content type 'application/x-gzip' length 13745245 bytes (13.1 MB)
==================================================
downloaded 13.1 MB

* installing *source* package ‘BiocGenerics’ ...
** R
** inst
** preparing package for lazy loading
Creating a new generic function for ‘append’ in package ‘BiocGenerics’
Creating a new generic function for ‘as.data.frame’ in package ‘BiocGenerics’
Creating a new generic function for ‘cbind’ in package ‘BiocGenerics’
Creating a new generic function for ‘rbind’ in package ‘BiocGenerics’
Creating a new generic function for ‘do.call’ in package ‘BiocGenerics’
Creating a new generic function for ‘duplicated’ in package ‘BiocGenerics’
Creating a new generic function for ‘anyDuplicated’ in package ‘BiocGenerics’
Creating a new generic function for ‘eval’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmax’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmin’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmax.int’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmin.int’ in package ‘BiocGenerics’
Creating a new generic function for ‘Reduce’ in package ‘BiocGenerics’
Creating a new generic function for ‘Filter’ in package ‘BiocGenerics’
Creating a new generic function for ‘Find’ in package ‘BiocGenerics’
Creating a new generic function for ‘Map’ in package ‘BiocGenerics’
Creating a new generic function for ‘Position’ in package ‘BiocGenerics’
Creating a new generic function for ‘get’ in package ‘BiocGenerics’
Creating a new generic function for ‘mget’ in package ‘BiocGenerics’
Creating a new generic function for ‘grep’ in package ‘BiocGenerics’
Creating a new generic function for ‘grepl’ in package ‘BiocGenerics’
Creating a new generic function for ‘is.unsorted’ in package ‘BiocGenerics’
Creating a new generic function for ‘lapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘sapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘lengths’ in package ‘BiocGenerics’
Creating a new generic function for ‘mapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘match’ in package ‘BiocGenerics’
Creating a new generic function for ‘rowSums’ in package ‘BiocGenerics’
Creating a new generic function for ‘colSums’ in package ‘BiocGenerics’
Creating a new generic function for ‘rowMeans’ in package ‘BiocGenerics’
Creating a new generic function for ‘colMeans’ in package ‘BiocGenerics’
Creating a new generic function for ‘order’ in package ‘BiocGenerics’
Creating a new generic function for ‘paste’ in package ‘BiocGenerics’
Creating a new generic function for ‘rank’ in package ‘BiocGenerics’
Creating a new generic function for ‘rownames’ in package ‘BiocGenerics’
Creating a new generic function for ‘colnames’ in package ‘BiocGenerics’
Creating a new generic function for ‘union’ in package ‘BiocGenerics’
Creating a new generic function for ‘intersect’ in package ‘BiocGenerics’
Creating a new generic function for ‘setdiff’ in package ‘BiocGenerics’
Creating a new generic function for ‘sort’ in package ‘BiocGenerics’
Creating a new generic function for ‘table’ in package ‘BiocGenerics’
Creating a new generic function for ‘tapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘unique’ in package ‘BiocGenerics’
Creating a new generic function for ‘unsplit’ in package ‘BiocGenerics’
Creating a new generic function for ‘var’ in package ‘BiocGenerics’
Creating a new generic function for ‘sd’ in package ‘BiocGenerics’
Creating a new generic function for ‘which’ in package ‘BiocGenerics’
Creating a new generic function for ‘which.max’ in package ‘BiocGenerics’
Creating a new generic function for ‘which.min’ in package ‘BiocGenerics’
Creating a new generic function for ‘IQR’ in package ‘BiocGenerics’
Creating a new generic function for ‘mad’ in package ‘BiocGenerics’
Creating a new generic function for ‘xtabs’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterCall’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterApply’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterApplyLB’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterEvalQ’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterExport’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterMap’ in package ‘BiocGenerics’
Creating a new generic function for ‘parLapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parSapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parApply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parRapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parCapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parLapplyLB’ in package ‘BiocGenerics’
Creating a new generic function for ‘parSapplyLB’ in package ‘BiocGenerics’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (BiocGenerics)
* installing *source* package ‘Biobase’ ...
** libs
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG   -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include  -c Rinit.c -o Rinit.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG   -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include  -c anyMissing.c -o anyMissing.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG   -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include  -c envir.c -o envir.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG   -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include  -c matchpt.c -o matchpt.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG   -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include  -c rowMedians.c -o rowMedians.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG   -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include   -fpic  -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include  -c sublist_extract.c -o sublist_extract.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -shared -L/public/home/wangshx/anaconda3/lib/R/lib -Wl,-O2,--sort-common,--as-needed,-z,relro,-z,now -L/public/home/wangshx/anaconda3/lib -o Biobase.so Rinit.o anyMissing.o envir.o matchpt.o rowMedians.o sublist_extract.o -L/public/home/wangshx/anaconda3/lib/R/lib -lR
installing to /public/home/wangshx/anaconda3/lib/R/library/Biobase/libs
** R
** data
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (Biobase)
* installing *source* package ‘GEOquery’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (GEOquery)

The downloaded source packages are in
	‘/tmp/Rtmptc9bgw/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Old packages: 'hms', 'limma', 'Rcpp', 'tibble', 'xml2'
GEO: GSE76730
destdir: /public/home/wangshx/workspace/GEO_data/igcc_cnv/
GSEMatrix: TRUE
AnnotGPL: TRUE
getGPL: TRUE

Found 1 file(s)
GSE76730_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE76nnn/GSE76730/matrix/GSE76730_series_matrix.txt.gz'
Content type 'application/x-gzip' length 262447098 bytes (250.3 MB)
==================================================
downloaded 250.3 MB

Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)
See spec(...) for full column specifications.
Annotation GPL not available, so will use submitter GPL instead
File stored at: 
/public/home/wangshx/workspace/GEO_data/igcc_cnv//GPL3718.soft
$GSE76730_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 261981 features, 190 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM2036728 GSM2036729 ... GSM2036917 (190 total)
  varLabels: title geo_accession ... who performance status:ch1 (61
    total)
  varMetadata: labelDescription
featureData
  featureNames: SNP_A-1780270 SNP_A-1780272 ... SNP_A-4241299 (261981
    total)
  fvarLabels: ID Affy SNP ID ... SPOT_ID (27 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL3718 

Warning message:
In download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :
  cannot open URL 'https://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL3nnn/GPL3718/annot/GPL3718.annot.gz': HTTP status was '404 Not Found'

The files of GSE76730 download successfully!


The files of GSE76730 download successfully!
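The URLs in the log above follow a regular pattern: the last three digits of the accession are replaced by `nnn` to form the parent directory. The sketch below rebuilds those paths, which can help when checking by hand whether an annotation file exists (the 404 warning above is what triggers the fallback to the submitter GPL). The function names are hypothetical, and the pattern is inferred from the two URLs in the log plus my understanding of the GEO FTP layout.

```shell
#!/bin/sh
# Hypothetical helpers that rebuild GEO FTP paths from an accession.

# GSE76730 -> GSE76nnn, GPL3718 -> GPL3nnn, GSE2 -> GSEnnn
geo_stub() {
    digits=${1#???}
    prefix=${1%"$digits"}
    if [ "${#digits}" -gt 3 ]; then
        stem=${digits%???}
    else
        stem=""
    fi
    printf '%snnn\n' "$prefix$stem"
}

# Directory that holds the series matrix file(s) of a GSE accession.
geo_matrix_dir_url() {
    printf 'https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/\n' "$(geo_stub "$1")" "$1"
}

# Location of the annotation file of a GPL accession, if one exists.
geo_annot_url() {
    printf 'https://ftp.ncbi.nlm.nih.gov/geo/platforms/%s/%s/annot/%s.annot.gz\n' "$(geo_stub "$1")" "$1" "$1"
}

# Succeed if the annotation file exists (HEAD request; needs curl and network).
annot_exists() {
    curl -sfIL -o /dev/null "$(geo_annot_url "$1")"
}

# Usage (not executed here):
#   annot_exists GPL3718 || echo "no annotation GPL, submitter GPL will be used"
```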


  • Posted: 2018-01-16 13:50
  • Category: Software tools

Author: 王诗翔 (Shixiang Wang)
