Notes about open source software, computers, other stuff.

Category: Programming (Page 2 of 3)

ProbABEL v0.4.1 released

Last week I released v0.4.1 of ProbABEL, just a few days after releasing v0.4.0, which contained a small, but irritating bug.

This release took quite some time to create, but features quite a few bug fixes, including a major one: for the first time since the filevector format was introduced somewhere in 2009/2010, the Cox proportional hazards regression module works with filevector/DatABEL files. This is a major step forward, because up till now we had to maintain two branches of code: trunk, with all the regular updates and improvements, and the old branch that contained the Cox PH module that was only capable of reading text files.

Another notable change is the incorporation of \chi^2 values in the output files. At the moment these are based on the LRT (likelihood ratio test), except where that doesn’t make sense (e.g. when using the --mmscore option. The implementation was relatively easy, because part of the code was still there from previous versions; it was disabled however, because it didn’t deal with missing genotype data. Now it does. Using the LRT is also easier in the case of the 2df (or genotypic) genetic model, where using the Wald test is not straightforward.

The third user-visible change was a change in the [code]probabel.pl[/code] script that hides some of the details (e.g. the location of the files with genotype data) of running a regression for the user. Previously, using the -o option meant that the output file name was constructed from the name of the phenotype file, the argument of the -o option and a fixed extension that depends on the model(s) being run. Starting with v.0.4.0 this behaviour has changed. If the -o option is specified its argument is used as the start of the output file name, with only the fixed extension appended to it. This allows users to specify output in a different directory than the one where the phenotype file was created.

Packages for Ubuntu Linux (or one of its derivatives and probably also Debian) can be found in the GenABEL PPA (personal package archive). Previously we also released pre-compiled Windows binaries, but I’ve stopped doing that. They were never tested anyway, and I think there isn’t much demand for them anyway. Most people who do genome-wide association studies use Linux servers anyway.

Development of ProbABEL (and other members of the GenABEL suite) takes place on this R-forge page. If you are in search of an open source project to contribute to, feel free to contact us!

User support for the GenABEL suite can be found at our forum.

Using BibTeX from org-mode

I use Emacsorg-mode a lot for writing notes, todo lists, presentations and writing short reports. Recently I started writing a larger report which I normally would have done in LaTeX. This time, since the notes related to the project were already in org format, I decided to write the whole report in org-mode. The one thing I needed for that was using BibTeX bibliographies (and RefTeX) from org-mode. A quick web search revealed that that can easily be done by adding the following to your .emacs file:

;; Configure RefTeX for use with org-mode. At the end of your
;; org-mode file you need to insert your style and bib file:
;; \bibliographystyle{plain}
;; \bibliography{ProbePosition}
;; See http://www.mfasold.net/blog/2009/02/using-emacs-org-mode-to-draft-papers/
(defun org-mode-reftex-setup ()
  (load-library "reftex")
  (and (buffer-file-name)
       (file-exists-p (buffer-file-name))
       (reftex-parse-all))
  (define-key org-mode-map (kbd "C-c )") 'reftex-citation)
  )
(add-hook 'org-mode-hook 'org-mode-reftex-setup)

After that, RefTeX works, but exporting the org document to PDF (via LaTeX) didn’t include the bibliography entries. A quick look at the error log showed that bibtex hadn’t been run, so the question was: how to tell org-mode to do that too when exporting. The answer is to tell org-mode to use the latexmk Perl script (on Debian/Ubuntu it is easily installed from the package repositories) when exporting to PDF. I added the following lines to my .emacs file:

;; Use latexmk for PDF export
(setq org-latex-to-pdf-process (list "latexmk -pdf -bibtex %f"))

Exit a Bash script if an error occurs

Last week I found out that a Bash script I wrote to do some data QC gave me a false sense of security: a script continues even if one (or more) of the statements in the scripts fails (with an exit status not equal to 0). It turned out that for some of the data sets the QC wasn’t done correctly because I didn’t check the exit status after each step.

My first thought was: oh boy, that means I have to check $? for every step. That means a lot of repetitive code to write! Luckily my colleague came with the answer: add

set -e

at the top of you Bash script and the script will fail if one of its statements fails (for the fine print see the top answer in this is StackOverflow post).

Speeding up grep when looking in large files

In my line of work it is not uncommon to have to find out whether a given term is present in a long list. Say, for example you need to look up whether a set of, say 10, SNPs is present in a (possibly annotated) list of SNPs present on a genotyping array (having for example 240k SNPs).
My first instinct in such cases is to use grep, and it’s a good instinct that has served me well over the years.

Recently we had a case that involved quite some larger files. We needed to see whether a set of genomic positions was present in a genome-wide list of such positions. Of course we split the files up per chromosome, but still this took ~ 24 hours for a chromosome when using

grep -w -f short_list long_file > results

I was convinced this could be done faster and googled a bit, read the grep man page to find out that the -F option of grep ensures that the search string is not seen as a (regexp) pattern, but as fixed. This meant an enormous speed improvement. Instead of having to wait for 24 hours we got the output in under a minute!

I did a quick performance comparison: looking up ten items in a ~415MB file with 247,871 rows and 136 columns took ~2 minutes, 3 seconds with out -F and less than a second with the -F option:

$ time grep -w -f shortlist.txt largefile.tsv > out_withoutF
 
real    2m3.181s
user    2m0.780s
sys     0m2.196s
$ time grep -wF -f shortlist.txt largefile.tsv > out_withF
 
real    0m0.568s
user    0m0.500s
sys     0m0.060s

Importing a git repo into another one (keeping all history)

Today I wanted to create a new git repository that should contain several subdirectories that each were initially stored as separate git repos. Of course I didn’t want to lose the history. Thanks to user ebneter‘s answer at StackOverflow I was able to do so. These are the steps I took:

mkdir new_combined_repo
git init                           # Make empty new 'container' repo (no need to create a subdir at this point yet)
git remote add oldrepo /path/to/oldrepo
git fetch oldrepo
git checkout -b olddir oldrepo/master
mkdir olddir
git mv stuff olddir/stuff          # as necessary
git commit -m "Moved stuff to olddir"
git checkout master                
git merge olddir                   # should add olddir/ to master
git commit
git remote rm oldrepo
git branch -d olddir               # to get rid of the extra branch before pushing
git push                           # if you have a remote, that is

Converting from bzr to git

I’m in the process of moving several of my projects that used Bazaar (bzr) for revision control to Git. Converting a repository from bzr to git is very easy when using the fastimport package. In a Debian-based distribution run the following command to install the package (don’t be fooled by its name, it also contains the fastexport option):

sudo aptitude install bzr-fastimport

The go into the directory that contains your bzr repo and run:

git init
bzr fast-export `pwd` | git fast-import 

You can now check a few things, e.g. running git log to see whether the change log was imported correctly. This is also the moment to move the content of your .bzrignore file to a .gitignore file.

If all is well, let’s clean up:

rm -r .bzr 
git reset HEAD

Thanks to Ron DuPlain for his post here, from which I got most of this info.

Using CUA selection mode in Emacs to edit rectangles

Today Planet Emacsen brought me Irreal’s second blog post in a short time on CUA mode in Emacs. So far I’ve always ignored it because as far as I knew CUA mode is about getting the Windows keyboard shortcuts of Ctrl-c, Ctrl-x and Ctrl-v for copying and pasting to Emacs. The thing is, I date back to the DOS era when Shift-Del and Shift-Ins were used for that, so back in my Windows days I never got used to those ‘new’ keyboard shortcut. Now that I’ve been an Emacs user for more than a decade I’m so used to C-w and C-y and I see no reason for having the Windows shortcuts work in Emacs.

Back to Irreal. In his recent blog posts he writes about a subset of cua-mode: cua-selection-mode. The video by Mark Mansour that we writes about says it all (it’s short, so go and watch it!). What cua-selection-mode is all about is rectangle editting. So far I’ve been using the regular Emacs keys for rectangle selection and editing (basically C-space to select a rectangle and C-r-k to cut it, C-r-t to insert text and C-r-y to paste a rectangle). By setting

(cua-selection-mode 1)

in your ~/.emacs file you only enable the rectangle features of CUA mode.

So, for those that didn’t watch the video, what does the rectangle editing mean? It means that you can for example simply insert a list of increasing numbers in a text (this may come in handy in an org-mode table for example), or you can insert the same text in front of and/or behind a selected column of text.

Key combos to remember are:

  • C-return: Start selection
  • return: move the cursor to top-left, top-right, bottom-left and bottom-right corner of the selected rectangle
  • C-?: briefly list the available key combinations (with rectangle selection enabled)
  • M-i: if the selection is a column of numbers increase the numbers (by one)
  • M-n: Insert a number in the column (asks for start value and increment value)
  • C-1 C-w: Kill (cut) the contents of the rectangle to register 1 (you can use number 0–9 for different registers). Using C-1 C-y yanks (pastes) the rectangle at the cursor position.

ProbABEL v0.3.0 released

On New Year’s day I released version 0.3.0 of ProbABEL, almost two months after the previous release.

This update contains a few small bug fixes, but the most important feature of this new release is that thanks to the work of Maarten Kooyman we have a four to five-fold speed increase for the types of GWAS we run at work. In his e-mail to the GenABEL developers list he explains what he did to achieve this. The take-home-message of it is that you should always look for a suitable library for important tasks of any program you write. The old ProbABEL was based on a self-written matrix class that handled things like matrix multiplication and matrix subsetting. In the new release we make use of the Eigen C++ template library, maintained and developed by people who know much more about fast implementations of linear algebra than we do.

For those of you running Ubuntu Linux (or one of its derivatives and probably also Debian) I have set up the GenABEL PPA (personal package archive) where you can download and install the ProbABEL .deb package and stay up to date with future updates.
ProbABEL is also available for MS Windows, although we don’t have much experience running it on that platform.

Development of ProbABEL (and other members of the GenABEL suite) takes place on this R-forge page. If you are in search of an open source project to contribute to, feel free to contact us!

User support for the GenABEL suite can be found at our forum.

Tabs and spaces in Emacs

Recently I added the following lines to my ~/.emacs file:

;; Don't insert tabs when indenting regions
(setq-default indent-tabs-mode nil)

The idea behind disabling the indent-tabs-mode was that (especially) while programming I want any tabs to be converted to spaces. Since different people have different settings for a tab width this seemed like a good choice.
However, once I opened a Makefile I ran into trouble. In a Makefile tabs are a requirement, not an option. Since all my tabs were converted to spaces the moment I saved a Makefile compiling became a nightmare. To solve this problem I added the following to my ~/.emacs file, after the aforementioned statement:

(add-hook 'makefile-mode-hook
          (lambda ()
            indent-tabs-mode t))

This enables tabs again for modes that involve Makefiles.

« Older posts Newer posts »

© 2021 Lennart's weblog

Theme by Anders NorenUp ↑