Notes about open source software, computers, other stuff.

Month: June 2013

Speeding up grep when looking in large files

In my line of work it is not uncommon to have to find out whether a given term is present in a long list. Say, for example you need to look up whether a set of, say 10, SNPs is present in a (possibly annotated) list of SNPs present on a genotyping array (having for example 240k SNPs).
My first instinct in such cases is to use grep, and it’s a good instinct that has served me well over the years.

Recently we had a case that involved quite some larger files. We needed to see whether a set of genomic positions was present in a genome-wide list of such positions. Of course we split the files up per chromosome, but still this took ~ 24 hours for a chromosome when using

grep -w -f short_list long_file > results

I was convinced this could be done faster and googled a bit, read the grep man page to find out that the -F option of grep ensures that the search string is not seen as a (regexp) pattern, but as fixed. This meant an enormous speed improvement. Instead of having to wait for 24 hours we got the output in under a minute!

I did a quick performance comparison: looking up ten items in a ~415MB file with 247,871 rows and 136 columns took ~2 minutes, 3 seconds with out -F and less than a second with the -F option:

$ time grep -w -f shortlist.txt largefile.tsv > out_withoutF
 
real    2m3.181s
user    2m0.780s
sys     0m2.196s
$ time grep -wF -f shortlist.txt largefile.tsv > out_withF
 
real    0m0.568s
user    0m0.500s
sys     0m0.060s

Related Images:

Fixing the NFS check plugin in Nagios (in Ubuntu)

For some time (probably after an upgrade, I actually don’t remember anymore) we had problems with the NFS check in Nagios on our Ubuntu 12.04 servers. The check would return UNKNOWN: RPC program nfs udp is not running. When running the actual check from the command line:

/usr/lib/nagios/plugins/check_rpc -H '$HOSTADDRESS$' -C nfs -c2,3

the output would be: Can't fork for rpcinfo.
It turns out that the file /usr/lib/nagios/plugins/utils.pm has the wrong path to the rpcinfo binary. Instead of /usr/sbin/rpcinfo it lists /usr/bin/rpcinfo. So, like most of the times, the fix is easy, but pinpointing the exact problem isn’t.

Don’t forget to restart Nagios after changing the path as utils.pm needs to be reloaded.

As Ubuntu is based on Debian, I expect this fix to work there as well. According to this Launchpad bug report this issue was fixed in January in version 1.4.16-1ubuntu1 of the nagios-plugins package, which is not in Ubuntu 12.04.

Related Images:

Importing a git repo into another one (keeping all history)

Today I wanted to create a new git repository that should contain several subdirectories that each were initially stored as separate git repos. Of course I didn’t want to lose the history. Thanks to user ebneter‘s answer at StackOverflow I was able to do so. These are the steps I took:

mkdir new_combined_repo
git init                           # Make empty new 'container' repo (no need to create a subdir at this point yet)
git remote add oldrepo /path/to/oldrepo
git fetch oldrepo
git checkout -b olddir oldrepo/master
mkdir olddir
git mv stuff olddir/stuff          # as necessary
git commit -m "Moved stuff to olddir"
git checkout master                
git merge olddir                   # should add olddir/ to master
git commit
git remote rm oldrepo
git branch -d olddir               # to get rid of the extra branch before pushing
git push                           # if you have a remote, that is

Related Images:

Converting from bzr to git

I’m in the process of moving several of my projects that used Bazaar (bzr) for revision control to Git. Converting a repository from bzr to git is very easy when using the fastimport package. In a Debian-based distribution run the following command to install the package (don’t be fooled by its name, it also contains the fastexport option):

sudo aptitude install bzr-fastimport

The go into the directory that contains your bzr repo and run:

git init
bzr fast-export `pwd` | git fast-import 

You can now check a few things, e.g. running git log to see whether the change log was imported correctly. This is also the moment to move the content of your .bzrignore file to a .gitignore file.

If all is well, let’s clean up:

rm -r .bzr 
git reset HEAD

Thanks to Ron DuPlain for his post here, from which I got most of this info.

Related Images:

© 2024 Lennart's weblog

Theme by Anders NorĂ©nUp ↑