In my line of work it is not uncommon to have to find out whether a given term is present in a long list. For example, you may need to look up whether a set of, say, 10 SNPs is present in a (possibly annotated) list of the SNPs on a genotyping array (which may hold, for example, 240k SNPs).
My first instinct in such cases is to use grep, and it’s a good instinct that has served me well over the years.
Recently we had a case that involved considerably larger files. We needed to see whether a set of genomic positions was present in a genome-wide list of such positions. Of course we split the files up per chromosome, but even so this took ~24 hours per chromosome when using
grep -w -f short_list long_file > results
I was convinced this could be done faster, so I googled a bit and read the grep man page, where I found that the -F option makes grep treat each search string as a fixed string rather than as a (regexp) pattern. This meant an enormous speed improvement: instead of having to wait 24 hours, we got the output in under a minute!
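To make the difference between pattern and fixed-string matching concrete, here is a minimal sketch on made-up data. The temporary directory, the file contents and the variable names are assumptions for illustration only; they are not the files from the case above. The search term contains a dot, which a regexp treats as "any character" but -F treats as a literal dot.

```shell
# Hypothetical demo data; file names and contents are made up for illustration.
tmpdir=$(mktemp -d)
printf '%s\n' 'rs1.5' > "$tmpdir/short_list"
printf '%s\n' 'rs1.5 A G' 'rs1x5 C T' 'rs2.7 G A' > "$tmpdir/long_file"

# Without -F the dot is a regexp wildcard, so both rs1.5 and rs1x5 match.
regex_hits=$(grep -c -w -f "$tmpdir/short_list" "$tmpdir/long_file")

# With -F the term is taken literally, so only rs1.5 matches.
fixed_hits=$(grep -c -w -F -f "$tmpdir/short_list" "$tmpdir/long_file")

echo "regexp matches: $regex_hits, fixed-string matches: $fixed_hits"
rm -r "$tmpdir"
```

So besides being faster, -F can also change which lines match if the search terms happen to contain regexp metacharacters.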
I did a quick performance comparison: looking up ten items in a ~415MB file with 247,871 rows and 136 columns took ~2 minutes, 3 seconds without -F and less than a second with it:
$ time grep -w -f shortlist.txt largefile.tsv > out_withoutF
real 2m3.181s
user 2m0.780s
sys 0m2.196s

$ time grep -wF -f shortlist.txt largefile.tsv > out_withF
real 0m0.568s
user 0m0.500s
sys 0m0.060s
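Because -F changes how the search terms are interpreted, it seems sensible to check that both runs produce the same output before trusting the faster one. A minimal sketch of such a check, assuming small made-up stand-ins for shortlist.txt and largefile.tsv (the directory, the contents and the verdict variable are hypothetical):

```shell
# Hypothetical stand-ins for shortlist.txt and largefile.tsv.
workdir=$(mktemp -d)
printf '%s\n' 'rs10' 'rs30' > "$workdir/shortlist.txt"
printf '%s\t%s\n' 'rs10' 'A' 'rs20' 'C' 'rs30' 'G' 'rs100' 'T' > "$workdir/largefile.tsv"

# Run the same lookup without and with -F.
grep -w -f "$workdir/shortlist.txt" "$workdir/largefile.tsv" > "$workdir/out_withoutF"
grep -wF -f "$workdir/shortlist.txt" "$workdir/largefile.tsv" > "$workdir/out_withF"

# cmp -s prints nothing and reports equality through its exit status.
if cmp -s "$workdir/out_withoutF" "$workdir/out_withF"; then
    verdict="identical"
else
    verdict="different"
fi
echo "outputs are $verdict"
rm -r "$workdir"
```

On the real files the same cmp -s call can be pointed at out_withoutF and out_withF directly.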