Notes about open source software, computers, other stuff.

Month: January 2013

Using CUA selection mode in Emacs to edit rectangles

Today Planet Emacsen brought me Irreal’s second blog post in a short time on CUA mode in Emacs. So far I’ve always ignored it because, as far as I knew, CUA mode is about bringing the Windows keyboard shortcuts Ctrl-c, Ctrl-x and Ctrl-v for copying, cutting and pasting to Emacs. The thing is, I date back to the DOS era when Shift-Del and Shift-Ins were used for that, so back in my Windows days I never got used to those ‘new’ keyboard shortcuts. Now that I’ve been an Emacs user for more than a decade I’m so used to C-w and C-y that I see no reason for having the Windows shortcuts work in Emacs.

Back to Irreal. In his recent blog posts he writes about a subset of cua-mode: cua-selection-mode. The video by Mark Mansour that he writes about says it all (it’s short, so go and watch it!). What cua-selection-mode is all about is rectangle editing. So far I’ve been using the regular Emacs keys for rectangle selection and editing (basically C-space to set the mark at one corner, then C-x r k to cut the rectangle, C-x r t to insert text and C-x r y to paste a rectangle). By setting

(cua-selection-mode 1)

in your ~/.emacs file you only enable the rectangle features of CUA mode.

So, for those that didn’t watch the video, what does the rectangle editing mean? It means that you can for example simply insert a list of increasing numbers in a text (this may come in handy in an org-mode table for example), or you can insert the same text in front of and/or behind a selected column of text.

Key combos to remember are:

  • C-return: Start selection
  • return: cycle the cursor through the four corners (top-left, top-right, bottom-left and bottom-right) of the selected rectangle
  • C-?: briefly list the available key combinations (with rectangle selection enabled)
  • M-i: if the selection is a column of numbers increase the numbers (by one)
  • M-n: Insert a number in the column (asks for start value and increment value)
  • C-1 C-w: Kill (cut) the contents of the rectangle to register 1 (you can use number 0–9 for different registers). Using C-1 C-y yanks (pastes) the rectangle at the cursor position.


Comparing rsnapshot and obnam for scheduled large backups

Introduction

The home directories of the servers I administer at work total about 6.5TB of data. The home directories are stored on a file server (using ext4 partitions) and served to the other servers over NFSv3 with a bonded 1Gbps LAN link.

As you all know backups are a good idea, but how do you implement a backup strategy for this amount of data? We decided quite early that using tapes as backup medium was out of the question. We simply can’t afford them given the amount of disk space we need. Moreover, tapes usually require operator involvement and neither my colleague nor I feel like going to the data centre every week. Our idea was to back up to another server with enough disk space in a different part of the data centre. For off-site backups we can always make an annual (maybe monthly) backup either on tape at SurfSARA/BigGrid or on a remote server.

Before implementing a given strategy several things need to be known and tested. The major questions we wanted to have an answer to were:

  1. How often do we want to back up the data? Daily snapshots? Weekly? Monthly?
  2. How many of the backups mentioned above do we want to keep? And for how long?
  3. In order to answer these questions (given a roughly fixed amount of backup space) we need to know
    • How much data changes per night/week/etc.
    • How much duplication is there in the data? How many people store the same file (or blocks, if you go for block-level deduplication)?
  4. Is NFS/network speed a limiting factor when running the backups?
  5. Can the tool preserve additional file system attributes like POSIX ACLs?
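
A rough way to get a feeling for the duplication question (point 3) is to hash files and count how many bytes belong to files whose content was already seen. The sketch below is only an illustration and not the method used for the tests in this post: it works at the file level (not the block level), the function names are made up for the example, and on several TB of data you would probably run it on a sample directory rather than on all of /home.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_bytes(root):
    """Return (total_bytes, duplicate_bytes) for regular files under root.

    A file counts as a duplicate when an earlier file had the same content.
    Symbolic links are skipped; hard links to one file count as duplicates.
    """
    seen = set()
    total = 0
    duplicates = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            total += size
            content_hash = file_digest(path)
            if content_hash in seen:
                duplicates += size
            else:
                seen.add(content_hash)
    return total, duplicates
```

Calling duplicate_bytes on a directory returns the total size and the number of bytes that file-level deduplication could save; a low ratio suggests that deduplication will not buy much.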

Candidates

After looking around the web and looking back at my own experiences I came up with four possible candidates. Each of them allows for backup rotation and preserves POSIX ACLs (so points 1 and 5 above have been taken care of).

  1. Bacula: enterprise-level backup application that I’ve used in combination with tapes in the past. Easily supports multiple clients, tape robots, etc. No deduplication. All metadata etc. are stored in a (MySQL) database, so restoring takes some effort (and don’t forget to make a backup of the database as well!).
  2. rsnapshot: based on rsync, makes snapshots using hard links. Easy to restore, because files are simply copied to the backup medium.
  3. rdiff-backup: similar to rsnapshot, but doesn’t allow for removal of intermediate backups after a given time interval. Consequently it was the first candidate to fall off my list.
  4. Obnam: a young tool that promises block level data deduplication. Stores backed up data in its own file format. Tools for browsing those archives are not really well developed yet.
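
To illustrate what "snapshots using hard links" means for rsnapshot: a new snapshot directory can refer to the same data on disk as the previous one, so unchanged files take up no extra space. The sketch below only demonstrates the hard-link idea with the Python standard library. It is not how rsnapshot is implemented (rsnapshot delegates this to rsync), and the function name is invented for the example.

```python
import os
import tempfile


def snapshot_with_hard_links(previous, current):
    """Recreate the tree in `previous` under `current` using hard links."""
    for root, _dirs, files in os.walk(previous):
        rel = os.path.relpath(root, previous)
        target_dir = os.path.normpath(os.path.join(current, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            # os.link adds a second directory entry for the same inode,
            # so no file data is copied.
            os.link(os.path.join(root, name), os.path.join(target_dir, name))


with tempfile.TemporaryDirectory() as tmp:
    daily1 = os.path.join(tmp, "daily.1")
    daily0 = os.path.join(tmp, "daily.0")
    os.makedirs(daily1)
    with open(os.path.join(daily1, "notes.txt"), "w") as fh:
        fh.write("unchanged data\n")

    snapshot_with_hard_links(daily1, daily0)

    old = os.path.join(daily1, "notes.txt")
    new = os.path.join(daily0, "notes.txt")
    print("same file on disk:", os.path.samefile(old, new))
    print("link count:", os.stat(new).st_nlink)
```

Because both directory entries point to the same inode, deleting an old snapshot only frees the data of files that no other snapshot still links to.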

Tests

Because I already had quite some experience with Bacula but none with the other two candidates (although I use rsync a lot) I decided to start a test run with Obnam, followed by a run with rsnapshot. These are the results:

Obnam

After backing up /home completely (which took several days!), a new run several days later took (timing by the Linux time command):

Backed up 3443706 files, uploaded 94.0 GiB in 127h48m49s at 214.2 KiB/s average speed

real    7668m56.628s
user    4767m16.132s
sys     162m48.739s

From the Obnam log file:

2012-11-17 12:41:34 INFO VFS: baseurl=/home read=0 written=0
2012-11-21 23:09:36 INFO VFS: baseurl=/backups/backup_home read=2727031576964 written=150015706142
2012-11-21 23:09:36 INFO Backup performance statistics:
2012-11-21 23:09:36 INFO * files found: 3443706
2012-11-21 23:09:36 INFO * uploaded data: 100915247663 bytes (93.9846482715 GiB)
2012-11-21 23:09:36 INFO * duration: 460128.627629 s
2012-11-21 23:09:36 INFO * average speed: 214.179341663 KiB/s
2012-11-21 23:09:36 INFO Backup finished.
2012-11-21 23:09:36 INFO Obnam ends
2012-11-21 23:09:36 INFO obnam version 1.2 ends normally

So: ~5 days for backing up ~100 GB of changed data… Load was not high on the machines, neither in terms of CPU, nor in terms of RAM. Disk usage in /backups/backup_home was 5.7T, disk usage of /home was 6.6T, so there is some dedup, it seems.
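
As a sanity check, the numbers in the Obnam log can be recomputed by hand. The small sketch below uses only figures quoted above. Treating the ratio of the two disk usage figures as the deduplication gain is my own rough assumption, since those figures are rounded and Obnam itself does not report this number.

```python
# Figures copied from the Obnam log above.
uploaded_bytes = 100915247663
duration_s = 460128.627629

speed_kib_s = uploaded_bytes / duration_s / 1024
uploaded_gib = uploaded_bytes / 2**30
duration_days = duration_s / 86400

# Rounded disk usage figures from the text; using their ratio as the
# deduplication gain is an assumption, not something Obnam reports.
repo_tb, home_tb = 5.7, 6.6
space_saved = 1 - repo_tb / home_tb

print(f"{uploaded_gib:.1f} GiB in {duration_days:.1f} days at {speed_kib_s:.1f} KiB/s")
print(f"repository is about {space_saved:.0%} smaller than /home")
```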

rsnapshot

A full backup of /home took (according to the log file):

[27/Nov/2012:12:55:31] /usr/bin/rsnapshot daily: started
[27/Nov/2012:12:55:31] echo 17632 > /var/run/rsnapshot.pid
[27/Nov/2012:12:55:31] mkdir -m 0700 -p /backups/backup_home_rsnapshot/
[27/Nov/2012:12:55:31] mkdir -m 0755 -p /backups/backup_home_rsnapshot/daily.0/
[27/Nov/2012:12:55:31] /usr/bin/rsync -a --delete --numeric-ids --relative --delete-excluded /home /backups/backup_home_rsnapshot/daily.0/localhost/
[28/Nov/2012:23:16:16] touch /backups/backup_home_rsnapshot/daily.0/
[28/Nov/2012:23:16:16] rm -f /var/run/rsnapshot.pid
[28/Nov/2012:23:16:16] /usr/bin/rsnapshot daily: completed successfully

So: ~1.5 days for a full backup of 6.3TB. An incremental backup a day later took:

[29/Nov/2012:13:10:21] /usr/bin/rsnapshot daily: started
[29/Nov/2012:13:10:21] echo 20359 > /var/run/rsnapshot.pid
[29/Nov/2012:13:10:21] mv /backups/backup_home_rsnapshot/daily.0/ /backups/backup_home_rsnapshot/daily.1/
[29/Nov/2012:13:10:21] mkdir -m 0755 -p /backups/backup_home_rsnapshot/daily.0/
[29/Nov/2012:13:10:21] /usr/bin/rsync -a --delete --numeric-ids --relative --delete-excluded --link-dest=/backups/backup_home_rsnapshot/daily.1/localhost/ /home /backups/backup_home_rsnapshot/daily.0/localhost/
[29/Nov/2012:13:25:09] touch /backups/backup_home_rsnapshot/daily.0/
[29/Nov/2012:13:25:09] rm -f /var/run/rsnapshot.pid
[29/Nov/2012:13:25:09] /usr/bin/rsnapshot daily: completed successfully

So: 15 minutes… and the changed data amounted to 21GB.
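
The durations quoted for the two rsnapshot runs follow directly from the first and last timestamps in the logs. A minimal sketch of that arithmetic, assuming an English/C locale so that the month abbreviation "Nov" parses (the helper name is made up for the example):

```python
from datetime import datetime

# Timestamp layout used in the rsnapshot log lines above.
LOG_TIME_FORMAT = "[%d/%b/%Y:%H:%M:%S]"


def elapsed(start, end):
    """Return the time between two rsnapshot log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    finished = datetime.strptime(end, LOG_TIME_FORMAT)
    return finished - started


full = elapsed("[27/Nov/2012:12:55:31]", "[28/Nov/2012:23:16:16]")
incremental = elapsed("[29/Nov/2012:13:10:21]", "[29/Nov/2012:13:25:09]")

print("full backup:", full)
print("incremental backup:", incremental)
```

The full run comes out at a bit over 34 hours and the incremental run at just under 15 minutes.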

This gave me a clear winner: rsnapshot! Not only is it very fast, but given its simple way of storing data restoring a backup of any file is quickly done.

We now also have answers to our questions: our daily changing volume is of the order of ~100GB, and there isn’t much data that can be deduplicated. We also monitored the network usage and, depending on the server load, it can be limiting, but since a daily differential backup takes only 15-30 minutes that isn’t a problem.
For a remote backup server that was connected with a 100Mbps line we did see that the initial backup took a very long time. We should try to get a faster connection to that machine.
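
A back-of-the-envelope calculation shows why the initial backup over the 100Mbps line is so slow. The sketch below assumes decimal units (1TB = 10^12 bytes) and a link that is used at its full nominal speed with no protocol overhead, so real transfers will take longer; the function name is invented for the example.

```python
def transfer_days(size_bytes, link_bits_per_s):
    """Best-case transfer time in days, ignoring all protocol overhead."""
    seconds = size_bytes * 8 / link_bits_per_s
    return seconds / 86400


home_bytes = 6.5e12  # about 6.5TB of home directories (decimal units assumed)

print(f"100Mbps: {transfer_days(home_bytes, 100e6):.1f} days")
print(f"1Gbps:   {transfer_days(home_bytes, 1e9):.1f} days")
```

Even in this ideal case the full copy needs about six days at 100Mbps, against well under a day at 1Gbps.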

The future

The next challenge we face is how to back up some of the large data sets we have/produce. These include aligned BAM files of next-generation sequencing data, VCF files of the same data, and results from genomic imputations (both as gzip-ed text files and as binary files in DatABEL format). These also total several TB. Luckily these files usually don’t change on a daily basis.


Booting an Ubuntu server with a degraded software RAID array

My home server runs Ubuntu 12.04 with a software RAID 5 array and for a couple of days I’ve been getting e-mails from the SMART daemon warning me of uncorrectable errors on one of the drives. Today I took the time to take the failing drive out and check it with the tools from the manufacturer.

Because I didn’t want to run the risk of unplugging the wrong drive with the system on (and thus losing the whole RAID array) I shut the server down, removed the hard drive and started it again. The idea was that it would boot right back into the OS, but with a degraded RAID array. Unfortunately the server didn’t come up… After I connected a keyboard and monitor, it turned out that the system was waiting at an initramfs prompt. From there I could check that the RAID array was indeed degraded but functioning fine, as I could manually mount all partitions.

Some Googling later I found out that by default Ubuntu doesn’t boot into a degraded software RAID array. This is to make sure you as administrator know something is wrong. A good idea for a laptop or PC, but not for a standalone server. The solution is the following:

  • From the initramfs prompt mount your original filesystems, for example in /mnt.
  • Use chroot /mnt to change root into your server’s hard disks.
  • In the file /etc/initramfs-tools/conf.d/mdadm add or change the line to
    BOOT_DEGRADED=true
    
  • Then run
    update-initramfs -u

    to regenerate the initial ramdisk.

  • Type exit to exit the chroot environment.
  • Unmount your file systems and reboot.

Now your server should continue booting even though it has a degraded RAID array.
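
The "add or change the line" step can also be scripted. The sketch below is only an illustration of that one step: it takes the text of the mdadm configuration file and returns it with BOOT_DEGRADED set, changing an existing line or appending a new one. The function name is made up, and it assumes the setting sits on a line of its own; you would still need to write the result back to the file and run update-initramfs -u afterwards.

```python
import re

# Matches an existing BOOT_DEGRADED assignment on its own line.
_SETTING = re.compile(r"^[ \t]*BOOT_DEGRADED=.*$", re.MULTILINE)


def set_boot_degraded(config_text, value="true"):
    """Return config_text with BOOT_DEGRADED set to value."""
    line = f"BOOT_DEGRADED={value}"
    if _SETTING.search(config_text):
        # Change the first existing assignment in place.
        return _SETTING.sub(line, config_text, count=1)
    # No assignment yet: append one, keeping the file newline-terminated.
    separator = "" if config_text == "" or config_text.endswith("\n") else "\n"
    return f"{config_text}{separator}{line}\n"


print(set_boot_degraded("BOOT_DEGRADED=false\n"), end="")
```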


ProbABEL v0.3.0 released

On New Year’s day I released version 0.3.0 of ProbABEL, almost two months after the previous release.

This update contains a few small bug fixes, but the most important feature of this new release is that thanks to the work of Maarten Kooyman we have a four to five-fold speed increase for the types of GWAS we run at work. In his e-mail to the GenABEL developers list he explains what he did to achieve this. The take-home message is that you should always look for a suitable library for the important tasks of any program you write. The old ProbABEL was based on a self-written matrix class that handled things like matrix multiplication and matrix subsetting. In the new release we make use of the Eigen C++ template library, maintained and developed by people who know much more about fast implementations of linear algebra than we do.

For those of you running Ubuntu Linux (or one of its derivatives and probably also Debian) I have set up the GenABEL PPA (personal package archive) where you can download and install the ProbABEL .deb package and stay up to date with future updates.
ProbABEL is also available for MS Windows, although we don’t have much experience running it on that platform.

Development of ProbABEL (and other members of the GenABEL suite) takes place on this R-forge page. If you are in search of an open source project to contribute to, feel free to contact us!

User support for the GenABEL suite can be found at our forum.


© 2024 Lennart's weblog
