(random unfinished engineering notes)

Monday, December 31, 2007

Linux buffer cache & how to disable it (and why ?)

The Linux buffer cache provides an excellent mechanism for black-box performance optimization at a modest cost of at most 2 memory hits (hit 1 -> page not found -> fetch the block from disk into the buffer cache (otherwise free memory) -> hit 2 -> found in memory -> get block).

However, there are 2 cases when we would want to disable such behaviour :

(1) When the 2-hit cost is too high a price to pay, we might want to think about direct I/O (O_DIRECT / madvise()-style) and do the memory buffer management by hand from userspace - often done for db cache management

(2) When we want to do unbiased benchmarking of heavily IO-dependent software. Usually a single-shot benchmark is ok (the pages are on the disk), but upon the 2nd take most of the pages remain in the buffer cache, which results in better performance, so the runs are no longer comparable - the only way to do a proper benchmark is to disable (or empty) the buffer cache

Currently, we are interested in (2), so here are a couple of ideas how that could be done :

(a) create a non-trivial file of available memory size and write some simple code that mmap()'s it and reads through the mapping (mmap() alone does not load any pages)
(b) set the swappiness kernel knob (/proc/sys/vm/swappiness) to 0 (the kernel then prefers reclaiming buffer cache pages over swapping out process memory), and fork() some ~64k dummy (nontrivial) processes :) - this should do the trick
(c) mount the partition as a raw device (no buffering then) - but this is usually highly impractical
(d) set the O_DIRECT flag for every open() in the source (this is often tedious unless open() is invoked through a wrapper - a nice argument for having one in such applications). We could write a simple wrapper (where that is possible):

/* read-only open that bypasses the buffer cache */
int dopen(const char *file) {
    return open(file, O_RDONLY | O_DIRECT);
}

or in case of fopen() :

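FILE *dfopen(const char *file, const char *perm) {
    int fd = open(file, O_DIRECT | O_RDONLY);
    if (fd < 0)
        return NULL;          /* don't hand an invalid descriptor to fdopen() */
    return fdopen(fd, perm);  /* perm has to be a read mode such as "r" */
}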

(note that O_DIRECT is only defined when _GNU_SOURCE is set, so don't forget to use the -D_GNU_SOURCE flag when building)
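
One more thing to keep in mind with O_DIRECT: it may impose alignment restrictions on the buffer address, the transfer length and the file offset, so a plain malloc()'ed buffer is not guaranteed to work. Below is a minimal sketch of a direct read with an aligned buffer. The name direct_read_all() and the 4096-byte alignment are assumptions made for illustration; the real requirement depends on the device and filesystem, and some filesystems refuse O_DIRECT altogether (the function then just returns -1).

```c
#define _GNU_SOURCE            /* exposes O_DIRECT in <fcntl.h> on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define DIRECT_ALIGN 4096      /* assumed alignment, not queried from the device */
#define DIRECT_CHUNK (DIRECT_ALIGN * 16)

/* Hypothetical helper: read a whole file with O_DIRECT through an
 * aligned buffer. Returns the number of bytes read, or -1 on any error
 * (including filesystems that do not support O_DIRECT). */
long direct_read_all(const char *path)
{
    void *buf = NULL;
    long total = 0;
    ssize_t n;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;
    if (posix_memalign(&buf, DIRECT_ALIGN, DIRECT_CHUNK) != 0) {
        close(fd);
        return -1;
    }
    /* every read starts at a multiple of DIRECT_CHUNK, so the file
     * offset stays aligned as well */
    while ((n = read(fd, buf, DIRECT_CHUNK)) > 0)
        total += n;
    free(buf);
    close(fd);
    return n < 0 ? -1 : total;
}
```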
(e) reboot the machine (if you are really desperate) :)
(f) although one might think that invoking "sync" from the command line would do the trick, it actually just flushes *changed* blocks to the disk - which is not what we need (dropping the read-buffered blocks)
(g) the dirty way of doing (a) would be simply creating a very big file (for example dd if=/dev/urandom of=file_4GB bs=1024 count=4000000) and doing 'cat file_4GB > /dev/null'
(h) writing some simple code that allocates a huge chunk of memory and locks the allocated pages (using mlock() or similar) - thus reducing the available buffer cache to an arbitrary size
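(i) doing fcntl() with F_NOCACHE on all file descriptors in the source code (note that F_NOCACHE is a Mac OS X fcntl command and is not available on Linux, where fcntl(fd, F_SETFL, ...) with O_DIRECT is the closest counterpart) - again quite tedious, especially if there is no wrapper for the open() call in the code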
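(j) using the madvise()/posix_fadvise() mechanism for telling the kernel that the pages won't be used in the future (which should result in the kernel dropping them from the buffer cache). Note that madvise() works on mmap()'ed regions; calling it with MADV_DONTNEED on the fread() destination buffer does not touch the buffer cache (and on Linux it can even throw away the data just read), so for a stdio stream the descriptor-based counterpart posix_fadvise() is used here:

size_t dfread(void *ptr, size_t size, size_t nmemb, FILE *stream) {
    size_t n;
    n = fread(ptr, size, nmemb, stream);
    /* offset 0, len 0 = the whole file; only clean pages are dropped */
    posix_fadvise(fileno(stream), 0, 0, POSIX_FADV_DONTNEED);
    return n;
}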


(k) if we just want to flush the entire buffer cache, on 2.6.16+ kernels we could use the "drop caches" mechanism to free all clean pages from the buffer cache (as root; run sync first, since dirty pages are not dropped) :
echo 1 > /proc/sys/vm/drop_caches