Notes about open source software, computers, other stuff.

# Tag: Science (Page 1 of 2)

This morning version 0.9-6 of the DatABEL R package was published on CRAN. This is only a minor update that consists of a few small changes and one bug fix. See the official announcement for more information.

DatABEL is an R package that allows users to access files with large matrices (of several gigabytes or more in size) in a fast and efficient manner. The package is mainly used for genome-wide association analyses using e.g. ProbABEL or OmicABEL.

It was quite a long time in the making and then a bunch of other stuff came in between, but I finally managed to release v0.4.4 of ProbABEL!

ProbABEL is a toolset for doing fast, memory (RAM) efficient genome-wide regression tests.

This is a bugfix release, but a major one for those who use the Cox proportional hazards regression module. Thanks to some of our users on the GenABEL forum, a serious bug leading to way too many NaNs in the output was discovered, fixed and tested. This is one of the best examples of community collaboration I have seen in the GenABEL project.

Another bug fixed in this release is one that caused a failed install on MacOS X and FreeBSD. Again a bug reported on the forum by one of our users. Great work!

Uploads to Debian and the Ubuntu PPA are coming ASAP.

Now, let’s get ready for a new feature release, which will include p-value calculation (a long-standing feature request) and major speed-ups (implemented by former colleague Maarten Kooyman). Time to get to work ;-)!

Sun Grid Engine (SGE) is a batch queue system that can be used to distribute computation-intensive tasks across one or more servers/CPUs. SGE has a graphical configuration utility called qmon, but when you start it on a remote machine (using SSH), you may end up with errors like this:

Warning: Cannot convert string "-adobe-courier-medium-r-*--14-*-*-*-m-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-courier-bold-r-*--14-*-*-*-m-*-*-*" to type FontStruct
Warning: Cannot convert string "-adobe-courier-medium-r-*--12-*-*-*-m-*-*-*" to type FontStruct
X Error of failed request:  BadName (named color or font does not exist)
Major opcode of failed request:  45 (X_OpenFont)
Serial number of failed request:  329
Current serial number in output stream:  340


The warnings are not really a problem, but the error is. It can be solved by running the following on the client (i.e. your local) machine (assuming it runs Debian or Ubuntu):

    sudo apt-get install xfonts-75dpi
    xset +fp /usr/share/fonts/X11/75dpi
    xset fp rehash

I took some time today to configure my R experience. I’m mostly using R from Emacs using ESS (Emacs Speaks Statistics), which means I had to configure some settings there as well.

Previously, my settings consisted only of a customised directory in which to install my packages and an alias to start R without being asked to save the history when quitting. I did this by setting the following environment variable, as well as an alias, in my .bashrc and/or .zshrc:

    # Set the library path for R
    export R_LIBS_USER=${R_LIBS_USER}:~/Programmeren/R/lib
    if [ -n "$(/usr/bin/which R 2>/dev/null)" ]; then
        alias R="$(/usr/bin/which R) --no-save"
    fi

However, Emacs didn’t pick up either of these settings, so it was high time to fix that. This meant creating two files with the following content:

~/.Rprofile:

    # Set the default CRAN repository used by install.packages()
    options("repos" = c(CRAN = "http://cran-mirror.cs.uu.nl/"))

~/.Renviron:

    R_LIBS="~/Programmeren/R/lib"

I added the following to my .emacs file to start R with the --no-save option:

    (setq inferior-R-args "--no-save ")

Additionally, I have the following in there to turn on a spelling checker and enable line wrapping:

    (add-hook 'ess-mode-hook
              (lambda ()
                ;; Set pdflatex as the default command for Sweave (default: texi2pdf)
                (setq ess-swv-pdflatex-commands (quote ("pdflatex" "texi2pdf" "make")))
                (auto-fill-mode t)
                (flyspell-prog-mode)))

During the Christmas holidays I released a new version of ProbABEL (v0.4.2). The official release announcement can be found here.

ProbABEL is a toolset that allows running GWAS (Genome-Wide Association Studies) in a fast and efficient manner. It implements regression using the linear, logistic or Cox proportional hazards models.

This version is mostly a bug fix release. The most important user-visible change is that the wrapper script that runs a GWAS over a range of chromosomes is now officially called probabel instead of probabel.pl. This change was induced by my attempts to get ProbABEL packaged in the Debian Linux repositories. One of the warnings that occurred during the package creation process was a Lintian warning that said that scripts with ‘language extensions’ are not allowed. There are several reasons for that, but the one I found most compelling was that the user shouldn’t be concerned with the programming/scripting language we used to write it in. Moreover, being ‘agnostic’ in this matter also allows us to rewrite such a script in a different language later on. Of course, we have left the original name in place (via a symlink) in order not to disrupt any current pipelines. If the user runs the script under the old name, a warning appears that urges him/her to start using the new name and says that the old name will be deprecated in the future.

In the meantime, ProbABEL v0.4.1 has been accepted in Debian (unstable) and as of today it is also available in Debian ‘testing’. Lots of thanks to the Debian Med team, who helped me a lot in preparing the .deb package. Note that the package has been split into probabel (architecture-dependent files) and probabel-examples (architecture-independent files: the examples). See the Debian Package Tracking System page for ProbABEL for more details of the package.

From Debian the package has trickled down to Ubuntu as well (Launchpad page here), so it will be available by default in the next Ubuntu release (14.04, a.k.a. Trusty Tahr).

This is a quick example of how to do a fixed-effect meta-analysis using the R package rmeta, just so I don’t have to look it up again next time:

    ## Load the rmeta package (provides meta.summaries() and meta.colors())
    library(rmeta)

    ## Create data frame containing betas and standard errors
    df <- data.frame()
    df <- rbind(df, c(2., 0.2))
    df <- rbind(df, c(2.5, 0.4))
    df <- rbind(df, c(2.2, 0.2))

    ## Add study names
    df <- cbind(df, c("study 1", "study 2", "study 3"))
    colnames(df) <- c("beta", "se_beta", "name")

    ## Do the meta-analysis
    ms <- meta.summaries(df$beta, df$se_beta, names=df$name)

    ## Add some colors
    mc <- meta.colors(summary="darkgreen", zero="red")

    ## Make a forest plot
    plot(ms, xlab=expression(beta ~ " (mmol/l)"), ylab="Study", colors=mc, zero=2.6)

The resulting plot looks like this:
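Separately from the plot, it can be useful to check the summary estimate by hand. The sketch below is my own Python illustration, not code from the R package: it computes the standard inverse-variance weighted (fixed-effect) summary for the same three betas and standard errors as in the R example. As far as I know this is what the default fixed method of meta.summaries computes, but treat that as an assumption and compare the numbers yourself. The function name fixed_effect_meta is made up for this example.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted (fixed-effect) summary of study estimates."""
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    # Each study is weighted by the inverse of its variance (1 / se^2)
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    # The variance of the summary estimate is 1 / (sum of the weights)
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se


# Same betas and standard errors as in the R example
betas = [2.0, 2.5, 2.2]
ses = [0.2, 0.4, 0.2]
pooled_beta, pooled_se = fixed_effect_meta(betas, ses)
print(f"summary beta = {pooled_beta:.4f}, se = {pooled_se:.4f}")
```

With these inputs the weights are 25, 6.25 and 25, so the summary beta is 120.625 / 56.25 (about 2.14) with a standard error of 1 / 7.5 (about 0.13). The imprecise second study contributes relatively little.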

Last week I released v0.4.1 of ProbABEL, just a few days after releasing v0.4.0, which contained a small, but irritating bug.

This release took quite some time to create, but features quite a few bug fixes, including a major one: for the first time since the filevector format was introduced somewhere in 2009/2010, the Cox proportional hazards regression module works with filevector/DatABEL files. This is a major step forward, because up till now we had to maintain two branches of code: trunk, with all the regular updates and improvements, and the old branch that contained the Cox PH module that was only capable of reading text files.

Another notable change is the incorporation of $\chi^2$ values in the output files. At the moment these are based on the LRT (likelihood ratio test), except where that doesn’t make sense (e.g. when using the --mmscore option). The implementation was relatively easy, because part of the code was still there from previous versions; it was disabled, however, because it didn’t deal with missing genotype data. Now it does. Using the LRT is also easier in the case of the 2df (or genotypic) genetic model, where using the Wald test is not straightforward.
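For readers who want to see what an LRT-based $\chi^2$ value amounts to, here is a minimal Python sketch. It is not ProbABEL code, the log-likelihood values are made up, and the conversion to a p-value is my own addition for illustration. The statistic is twice the difference in log-likelihood between the model with and the model without the SNP term(s); it has 1 degree of freedom for a single SNP term and 2 for the genotypic model.

```python
import math


def lrt_chi2(loglik_null, loglik_alt):
    """Likelihood ratio test statistic: twice the gain in log-likelihood."""
    return 2.0 * (loglik_alt - loglik_null)


def chi2_pvalue(chi2, df):
    """Upper-tail p-value of a chi-square statistic for df = 1 or df = 2."""
    if chi2 < 0:
        raise ValueError("chi2 must be non-negative")
    if df == 1:
        # For 1 df: P(X > x) = erfc(sqrt(x / 2))
        return math.erfc(math.sqrt(chi2 / 2.0))
    if df == 2:
        # For 2 df the chi-square distribution is exponential with mean 2
        return math.exp(-chi2 / 2.0)
    raise ValueError("only df = 1 and df = 2 are handled in this sketch")


# Hypothetical log-likelihoods, made up for illustration
loglik_null = -1234.5       # model without the SNP
loglik_additive = -1230.0   # model with one SNP term (1 df)
loglik_genotypic = -1229.2  # model with two SNP terms (2 df)

chi2_1df = lrt_chi2(loglik_null, loglik_additive)
chi2_2df = lrt_chi2(loglik_null, loglik_genotypic)
print("1 df:", chi2_1df, chi2_pvalue(chi2_1df, 1))
print("2 df:", chi2_2df, chi2_pvalue(chi2_2df, 2))
```

The closed-form tail probabilities only exist for these two small df values, which happen to be the ones that matter here; for anything else you would want a proper statistics library.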

The third user-visible change is in the probabel.pl script, which hides some of the details of running a regression (e.g. the location of the files with genotype data) from the user. Previously, using the -o option meant that the output file name was constructed from the name of the phenotype file, the argument of the -o option and a fixed extension that depends on the model(s) being run. Starting with v0.4.0 this behaviour has changed. If the -o option is specified, its argument is used as the start of the output file name, with only the fixed extension appended to it. This allows users to write output to a different directory than the one where the phenotype file is located.

Packages for Ubuntu Linux (or one of its derivatives and probably also Debian) can be found in the GenABEL PPA (personal package archive). Previously we also released pre-compiled Windows binaries, but I’ve stopped doing that. They were never tested, and I don’t think there is much demand for them: most people who do genome-wide association studies use Linux servers anyway.

Development of ProbABEL (and other members of the GenABEL suite) takes place on this R-forge page. If you are in search of an open source project to contribute to, feel free to contact us!

User support for the GenABEL suite can be found at our forum.

I use Emacs org-mode a lot for writing notes, todo lists, presentations and short reports. Recently I started writing a larger report which I normally would have done in LaTeX. This time, since the notes related to the project were already in org format, I decided to write the whole report in org-mode. The one thing I needed for that was using BibTeX bibliographies (and RefTeX) from org-mode. A quick web search revealed that this can easily be done by adding the following to your .emacs file:

    ;; Configure RefTeX for use with org-mode. At the end of your
    ;; org-mode file you need to insert your style and bib file:
    ;; \bibliographystyle{plain}
    ;; \bibliography{ProbePosition}
    ;; See http://www.mfasold.net/blog/2009/02/using-emacs-org-mode-to-draft-papers/
    (defun org-mode-reftex-setup ()
      (load-library "reftex")
      (and (buffer-file-name)
           (file-exists-p (buffer-file-name))
           (reftex-parse-all))
      (define-key org-mode-map (kbd "C-c )") 'reftex-citation))
    (add-hook 'org-mode-hook 'org-mode-reftex-setup)

After that, RefTeX worked, but the PDF exported from the org document (via LaTeX) didn’t include the bibliography entries. A quick look at the error log showed that bibtex hadn’t been run, so the question was how to tell org-mode to do that too when exporting. The answer is to tell org-mode to use the latexmk Perl script (on Debian/Ubuntu it is easily installed from the package repositories) when exporting to PDF. I added the following lines to my .emacs file:

    ;; Use latexmk for PDF export
    (setq org-latex-to-pdf-process (list "latexmk -pdf -bibtex %f"))

One thing we do regularly at work is taking a look at aligned sequences of human DNA as generated by what is called “next-generation sequencing”. This data is stored in so-called .bam files, which can get pretty large. For example, the .bam file for an individual whose whole genome is sequenced at 12x coverage is approximately 60GB.
To view these files, e.g. to check the alignment or look at the coverage of a specific region, people typically use graphical browsers like IGV or Savant. However, these require you either to run the tool on the server (which means relatively slow X-forwarding over SSH) or to copy the BAM file to your local machine, which also takes a lot of time, especially if you want to take a look at a single region for a bunch of people.

For jobs like that I’ve found the text-based viewer integrated in SamTools to be very convenient. It’s a matter of running

samtools tview sample.bam /path/to/reference.genome.fasta


after which you get a view like this:

1000821   1000831   1000841   1000851   1000861   1000871   1000881   1000891   1000901
GGCCAGGCAGGGCTTCTGGGTGGAGTTCAAGGTGCATCCTGACCGCTGTCACCTTCAGACTCTGTCCCCTGGGGCTGGGGCAAGTGCCCGATGGGAGCGCA
.....................................................................................................
..........          ......................A.......................T...............G........A........C
...........                                     .....................................................
............                                           ..............................................
..........................................................C...........      .......................A.
...................................................................................        ..........
..........


Using g followed by 1:23000000 you will jump to the given position on the given chromosome.
If the 1:23000000 doesn’t work, check the header of the BAM file to see how the chromosome is specified (sometimes it is chr1:23000000, for example):

samtools view -H sample.bam
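If you need to do this check for many files, the header can also be inspected programmatically. The following Python sketch is only an illustration under a few assumptions: it expects the header text as printed by the command above, it relies on the reference names being stored in the SN field of the @SQ lines (which is how the SAM format defines them), and the example header and the helper names sequence_names and region_for are made up.

```python
def sequence_names(header_text):
    """Return the reference sequence names (SN fields) from a SAM header."""
    names = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@SQ":
            continue
        for field in fields[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def region_for(names, chromosome, position):
    """Build a 'name:position' string using the naming found in the header."""
    for candidate in (chromosome, "chr" + chromosome):
        if candidate in names:
            return f"{candidate}:{position}"
    raise KeyError(f"chromosome {chromosome!r} not found in header")


# Hypothetical header, in the form printed by `samtools view -H sample.bam`
example_header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:249250621\n"
    "@SQ\tSN:chr2\tLN:243199373\n"
)
names = sequence_names(example_header)
print(region_for(names, "1", 23000000))
```

The string it prints is what you would type after pressing g in the viewer.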


In the tview output above, the dots indicate nucleotides that are identical to the reference (shown in the second line), while letters mark reads in which a different base was called. In this example all of them are probably sequencing or alignment errors, because no more than one discordant read is observed at any position. A column in which many of the reads show the same letter, on the other hand, suggests that the position really does differ from the reference. Also notice how the various reads are aligned and that the coverage doesn’t seem to be very high in this case.
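The "one discordant read per column" reasoning can be made explicit with a few lines of Python. This is a toy sketch on made-up read rows, not a replacement for a real variant caller. It assumes the rows have been copied from the viewer as plain text, where a dot means "same as reference", a space means "no read here" and a letter means a discordant base.

```python
def column_summary(read_rows):
    """Return per-column coverage and number of discordant bases.

    A space means no read covers the column, a letter is a base that
    differs from the reference, anything else counts as a match.
    """
    width = max(len(row) for row in read_rows)
    coverage = [0] * width
    discordant = [0] * width
    for row in read_rows:
        for column, symbol in enumerate(row):
            if symbol == " ":
                continue
            coverage[column] += 1
            if symbol.isalpha():
                discordant[column] += 1
    return coverage, discordant


# Hypothetical read rows in the style of the tview display
rows = [
    "..........",
    "...A......",
    "   .......",
    "......C.  ",
]
coverage, discordant = column_summary(rows)
for column, (depth, mismatches) in enumerate(zip(coverage, discordant)):
    if mismatches:
        print(f"column {column}: {mismatches} of {depth} reads differ")
```

A column where one of four reads differs is most likely an error; a column where most or roughly half of the reads show the same letter would be worth a closer look.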

In my line of work it is not uncommon to have to find out whether a given term is present in a long list. For example, you may need to look up whether a set of, say, 10 SNPs is present in a (possibly annotated) list of the SNPs on a genotyping array (with, for example, 240k SNPs).
My first instinct in such cases is to use grep, and it’s a good instinct that has served me well over the years.

Recently we had a case that involved considerably larger files. We needed to see whether a set of genomic positions was present in a genome-wide list of such positions. Of course we split the files up per chromosome, but it still took ~24 hours per chromosome when using

grep -w -f short_list long_file > results

I was convinced this could be done faster, so I googled a bit and read the grep man page, where I found that the -F option makes grep treat the search strings as fixed strings rather than as (regexp) patterns. This meant an enormous speed improvement: instead of having to wait 24 hours, we got the output in under a minute!

I did a quick performance comparison: looking up ten items in a ~415MB file with 247,871 rows and 136 columns took ~2 minutes, 3 seconds without -F and less than a second with the -F option:

    $ time grep -w -f shortlist.txt largefile.tsv > out_withoutF

    real    2m3.181s
    user    2m0.780s
    sys     0m2.196s

    $ time grep -wF -f shortlist.txt largefile.tsv > out_withF

    real    0m0.568s
    user    0m0.500s
    sys     0m0.060s
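To get a feeling for why fixed strings are so much cheaper, here is a small Python sketch of the same kind of lookup. It is not how grep works internally; it only illustrates that once the search terms are known to be literal words, each word in a line can be checked with a simple set lookup instead of a regular expression match. It assumes the terms consist only of letters, digits and underscores, and the data and the function name lookup_fixed_words are made up.

```python
import re


def lookup_fixed_words(wanted_terms, lines):
    """Return the lines that contain at least one wanted term as a whole word.

    Roughly the effect of `grep -wF -f short_list long_file` for terms
    made up of letters, digits and underscores only.
    """
    wanted = set(wanted_terms)
    hits = []
    for line in lines:
        # Split the line into words; each set lookup takes constant time
        words = re.findall(r"\w+", line)
        if any(word in wanted for word in words):
            hits.append(line)
    return hits


# Hypothetical data standing in for shortlist.txt and largefile.tsv
shortlist = ["rs123", "rs4567"]
large_file_lines = [
    "rs123\t1\t1000821\tA\tG",
    "rs1234\t1\t1000831\tC\tT",
    "rs999\t2\t5000\tG\tA",
    "rs4567\t3\t7000\tT\tC",
]
for hit in lookup_fixed_words(shortlist, large_file_lines):
    print(hit)
```

Note that rs1234 is not reported when looking for rs123, which is the whole-word behaviour the -w option gives you.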