Is that performance I smell? Ext2 vs Ext3 on 50 spindles, testing for PostgreSQL

There are few things I like better than when a customer says to the team, "I want the best machine I can buy for XXX dollars". It inspires a certain sense of joy not unlike the feeling an average Slashdot reader gets when they walk into the local gadget store. It is particularly special because, as much as you could make use of such a machine yourself, you know you could never justify the expense.
In this case, the customer was willing to spend a modest but not excessive amount of money. I applaud this decision because I run into far too many people who feel that the only way to get real performance is to buy some ridiculous SAN at 10 times the cost-to-performance ratio.

Machine Specs:

HP DL585
4 dual-core Opteron 8222 processors
64GB of RAM

Storage:

(2) MSA70 direct attached storage arrays, 25 spindles in each array
Single HP P800 controller

Filesystem layout

Filesystem          1K-blocks      Used  Available Use% Mounted on
/dev/cciss/c1d1p1  1693108576    201228 1606902380   1% /data2
/dev/cciss/c1d0p1  1693104732    201292 1606898664   1% /data1
/dev/cciss/c0d1p1   282181440    195616  267651768   1% /xlogs

Where /data[n] is an MSA70 and /xlogs is a RAID 10 on the embedded controller.

Filesystem options

/dev/cciss/c1d0p1 /data1 ext3 data=writeback 1 2
/dev/cciss/c1d1p1 /data2 ext3 data=writeback 1 2
/dev/cciss/c0d1p1 /xlogs ext2 defaults 1 2
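
A quick sanity check after mounting (easy to forget): make sure the journal mode actually took on the /data[n] partitions. Something along these lines will echo the options back, and the kernel also logs the data mode when the filesystem is mounted:

mount | grep data=
dmesg | grep -i ext3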

Xlog performance

The PostgreSQL WAL is written sequentially, negating the need for a large number of spindles to get reasonable performance; it is random writes that kill performance. Further, when the WAL is used for recovery, PostgreSQL replays up to the last known good transaction and throws away everything after it. This ensures a consistent database regardless of the crash. It is also partly why we are able to forgo a journaling filesystem for the xlog files. Just for kicks, I ran tests for xlog on both ext3 and ext2. The benchmarking software used is IOzone.
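
As an aside, for anyone wondering how the xlog files end up on /xlogs in the first place: the usual approach is to stop the cluster, move pg_xlog to the dedicated partition, and symlink it back. The data directory path below is illustrative, not taken from this setup:

pg_ctl -D /data1/pgdata stop                # stop the cluster so the WAL is quiescent
mv /data1/pgdata/pg_xlog /xlogs/pg_xlog     # relocate the WAL to the ext2 partition
ln -s /xlogs/pg_xlog /data1/pgdata/pg_xlog  # leave a symlink in the data directory
pg_ctl -D /data1/pgdata start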

The command used was:

/opt/iozone/bin/iozone -e -i0 -i1 -i2 -i8 -t1 -s 1000m -r 8k -+u
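
For anyone who does not speak fluent IOzone, the flags break down roughly as follows (per the IOzone documentation):

-e        include flush (fsync/fflush) in the timing
-i0       write/rewrite test
-i1       read/re-read test
-i2       random read/write test
-i8       mixed random workload test
-t1       throughput mode with a single process
-s 1000m  1000MB file per process
-r 8k     8KB record size, matching the PostgreSQL block size
-+u       report CPU utilization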
Here are the results:

xlogs ext3 with defaults (ordered mode for journaling)
Children see throughput for 1 rewriters = 87418.44 KB/sec
Parent sees throughput for 1 rewriters = 87395.65 KB/sec
Min throughput per process = 87418.44 KB/sec
Max throughput per process = 87418.44 KB/sec
Avg throughput per process = 87418.44 KB/sec

xlogs ext3 with data=writeback

Children see throughput for 1 rewriters = 84712.55 KB/sec
Parent sees throughput for 1 rewriters = 83513.39 KB/sec
Min throughput per process = 84712.55 KB/sec
Max throughput per process = 84712.55 KB/sec
Avg throughput per process = 84712.55 KB/sec

xlogs ext2 with defaults

Children see throughput for 1 rewriters = 115378.34 KB/sec
Parent sees throughput for 1 rewriters = 115345.26 KB/sec
Min throughput per process = 115378.34 KB/sec
Max throughput per process = 115378.34 KB/sec
Avg throughput per process = 115378.34 KB/sec

A pretty clear indicator that one should always consider putting /xlogs on a separate channel. The next series of tests I ran was with ext3 on the /data[n] partitions. Remember, each of the partitions is on its own direct attached storage array.

/data1 with data=journal

Children see throughput for 1 random writers = 49444.73 KB/sec
Parent sees throughput for 1 random writers = 48709.89 KB/sec
Min throughput per process = 49444.73 KB/sec
Max throughput per process = 49444.73 KB/sec
Avg throughput per process = 49444.73 KB/sec

/data1 with data defaults (ordered mode)

Children see throughput for 1 random writers = 142926.14 KB/sec
Parent sees throughput for 1 random writers = 142872.21 KB/sec
Min throughput per process = 142926.14 KB/sec
Max throughput per process = 142926.14 KB/sec
Avg throughput per process = 142926.14 KB/sec

/data1 with data=writeback

Children see throughput for 1 random writers = 168948.55 KB/sec
Parent sees throughput for 1 random writers = 168867.03 KB/sec
Min throughput per process = 168948.55 KB/sec
Max throughput per process = 168948.55 KB/sec
Avg throughput per process = 168948.55 KB/sec

The ext3 journal mode of writeback is the obvious winner here. A note of caution, however: it is likely not safe to use writeback unless you have a battery-backed RAID controller. The overall bandwidth is respectable at ~170MB/s. How much of that is journaling?

/data1 with ext2

Children see throughput for 1 random writers = 178404.45 KB/sec
Parent sees throughput for 1 random writers = 178320.32 KB/sec
Min throughput per process = 178404.45 KB/sec
Max throughput per process = 178404.45 KB/sec
Avg throughput per process = 178404.45 KB/sec

Although ext2 is faster, I don't think it is fast enough to offset the downside of running a non-journaled filesystem (long fsck times). What happens when we access both /data1 and /data2 at the same time?

/data1 and /data2 using separate processes

 Children see throughput for 1 random writers = 93932.16 KB/sec
 Parent sees throughput for 1 random writers = 93909.48 KB/sec
 Min throughput per process = 93932.16 KB/sec
 Max throughput per process = 93932.16 KB/sec
 Avg throughput per process = 93932.16 KB/sec

 Children see throughput for 1 random writers = 105375.49 KB/sec
 Parent sees throughput for 1 random writers = 105292.74 KB/sec
 Min throughput per process = 105375.49 KB/sec
 Max throughput per process = 105375.49 KB/sec
 Avg throughput per process = 105375.49 KB/sec

I am not actually buying these numbers. As I monitored how the runs interacted with each processor, whether I was running two processes separately or a single process over multiple threads, processor utilization was never correctly aggregated. I think this is a failure of the benchmark software; in theory I should see almost identical results for a single array and the dual arrays. In an effort to get more accurate results across not only the arrays but also the available processors, I wrote a quick script that fires the benchmark software as four independent processes, each with a single writer. I then executed that script on /data1 and /data2 simultaneously. This allowed much better utilization of all processors and gave a more accurate representation of the performance as a whole.
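
The script itself didn't make it into the post; a minimal sketch of the idea, assuming IOzone lives at /opt/iozone/bin/iozone and using hypothetical per-run directories, might look like this:

#!/bin/sh
# Fire the benchmark as four independent single-writer IOzone processes
# against one mount point, then wait for them all to finish.
# Usage: ./iozone4.sh /data1   (run a second copy against /data2 at the same time)
TARGET=$1
for i in 1 2 3 4
do
    (
        mkdir -p "$TARGET/run$i"
        cd "$TARGET/run$i"
        /opt/iozone/bin/iozone -e -i0 -i1 -i2 -i8 -t1 -s 1000m -r 8k -+u > "$TARGET/run$i.out"
    ) &
done
wait

ext3 data=writeback, /data1 and /data2, four threads each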

    Children see throughput for 1 random writers = 50916.17 KB/sec
    Parent sees throughput for 1 random writers = 50909.04 KB/sec
    Min throughput per process = 50916.17 KB/sec
    Max throughput per process = 50916.17 KB/sec
    Avg throughput per process = 50916.17 KB/sec
    Children see throughput for 1 random writers = 51021.88 KB/sec
    Parent sees throughput for 1 random writers = 51013.58 KB/sec
    Min throughput per process = 51021.88 KB/sec
    Max throughput per process = 51021.88 KB/sec
    Avg throughput per process = 51021.88 KB/sec
    Children see throughput for 1 random writers = 51048.78 KB/sec
    Parent sees throughput for 1 random writers = 51040.33 KB/sec
    Min throughput per process = 51048.78 KB/sec
    Max throughput per process = 51048.78 KB/sec
    Avg throughput per process = 51048.78 KB/sec
    Children see throughput for 1 random writers = 50755.62 KB/sec
    Parent sees throughput for 1 random writers = 50746.71 KB/sec
    Min throughput per process = 50755.62 KB/sec
    Max throughput per process = 50755.62 KB/sec
    Avg throughput per process = 50755.62 KB/sec
/data2:
    Children see throughput for 1 random writers  =   49711.77 KB/sec
    Parent sees throughput for 1 random writers  =   49704.75 KB/sec
    Min throughput per process    =   49711.77 KB/sec
    Max throughput per process    =   49711.77 KB/sec
    Avg throughput per process    =   49711.77 KB/sec
    Children see throughput for 1 random writers  =   49708.98 KB/sec
    Parent sees throughput for 1 random writers  =   49695.55 KB/sec
    Min throughput per process    =   49708.98 KB/sec
    Max throughput per process    =   49708.98 KB/sec
    Avg throughput per process    =   49708.98 KB/sec
    Children see throughput for 1 random writers  =   49713.46 KB/sec
    Parent sees throughput for 1 random writers  =   49691.86 KB/sec
    Min throughput per process    =   49713.46 KB/sec
    Max throughput per process    =   49713.46 KB/sec
    Avg throughput per process    =   49713.46 KB/sec
    Children see throughput for 1 random writers  =   49707.78 KB/sec
    Parent sees throughput for 1 random writers  =   49699.04 KB/sec
    Min throughput per process    =   49707.78 KB/sec
    Max throughput per process    =   49707.78 KB/sec
    Avg throughput per process    =   49707.78 KB/sec

That is more like it... ~200MB/s. Seeing this improvement, I decided to run 8 threads per partition. I am only going to post one output, but all of the threads had similar performance. The per-thread performance went down, but the aggregate performance for each partition was higher at ~280MB/s; between the two arrays that is ~560MB/s (the arithmetic is spelled out after the sample output below).

Children see throughput for 1 random writers  =   35403.08 KB/sec
Parent sees throughput for 1 random writers  =   35398.90 KB/sec
Min throughput per process    =   35403.08 KB/sec
Max throughput per process    =   35403.08 KB/sec
Avg throughput per process    =   35403.08 KB/sec
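
To put that single sample in context: 8 writers × ~35,400 KB/sec ≈ 283,000 KB/sec, or roughly 280MB/s for one array, and with both arrays driven at once that works out to the ~560MB/s aggregate figure above.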