ZFS is not the only transactional filesystem; ext4, with the option "data=journal", can also be transactional. So the question is: is it better to have the InnoDB double write buffer enabled, or to rely on the ext4 transaction log? And, if the latter is better, how does it compare with xfs, the filesystem I used to propose to customers but which does not support transactions?
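For reference, here is a minimal sketch of mounting ext4 in full data-journaling mode; the device and mount point are hypothetical and must be adapted to your setup:

```bash
# data=journal writes data blocks (not just metadata) to the ext4
# journal before they reach their final location, making the file
# system transactional for data as well.
mount -o rw,data=journal /dev/sdb1 /var/lib/mysql

# Or persist the option in /etc/fstab:
# /dev/sdb1  /var/lib/mysql  ext4  rw,data=journal  0  2
```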
The goal is to stress the double write buffer, so the load has to be write intensive. The server has a simple mirror of two 7.2k rpm drives. There is no controller write cache and the drives' write caches are disabled. I decided to use the Percona tpcc-mysql benchmark tool and, with 200 warehouses, the total dataset size was around 18GB, fitting entirely within the InnoDB buffer pool (the server has 24GB). Here's the relevant part of the my.cnf:
```
innodb_read_io_threads=4
innodb_write_io_threads=8          # To stress the double write buffer
innodb_buffer_pool_size=20G
innodb_buffer_pool_load_at_startup=ON
innodb_log_file_size=32M           # Small log files, more page flush
innodb_log_files_in_group=2
innodb_file_per_table=1
innodb_log_buffer_size=8M
innodb_flush_method=O_DIRECT
innodb_flush_log_at_trx_commit=0
skip-innodb_doublewrite            # Commented or not depending on test
```
So, I generated the dataset for 200 warehouses, added the keys but not the foreign key constraints, loaded all that into the buffer pool with a few queries and dumped the buffer pool. Then, with MySQL stopped, I did a file level backup to a different partition. I used the MySQL 5.6.16 version that comes with Ubuntu 14.04; at the time, Percona Server was not available for 14.04. Each benchmark followed this procedure (sketched in shell below):

1. Stop MySQL and restore the file level backup onto a partition freshly mounted with the options under test.
2. Start MySQL and let the buffer pool reload from the dump.
3. Run the benchmark.
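As a rough illustration, one iteration looked something like the following; the device, paths and the mkfs step are assumptions on my part, only the mount options varied per test:

```bash
#!/bin/sh
# Hypothetical sketch of one benchmark iteration; adjust the
# device, paths and OPTS to the test being run.
OPTS="rw,data=journal"                  # e.g. "rw" or "rw,dioread_nolock"

service mysql stop
umount /var/lib/mysql
mkfs.ext4 /dev/sdb1                     # start from a clean file system
mount -o "$OPTS" /dev/sdb1 /var/lib/mysql
cp -a /backup/mysql/. /var/lib/mysql/   # restore the file level backup
chown -R mysql:mysql /var/lib/mysql
service mysql start                     # buffer pool reloads at startup
```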
The tpcc_start command line I used was the following:
```
./tpcc_start -h10.2.2.247 -P3306 -dtpcc -utpcc -ptpcc -w200 -c32 -r300 -l3600 -i60
```
I used 32 connections, let the tool run for 300s of warm up, enough to reach a steady level of dirty pages, and then let the benchmark run for one hour, reporting results every minute.
| Test | Double write buffer | File system options | Average NOTPM over 1h |
|---|---|---|---|
| ext4_dw | Yes | rw | 690 |
| ext4_dionolock_dw | Yes | rw,dioread_nolock | 668 |
| ext4_nodw | No | rw | 1107 |
| ext4trx_nodw | No | rw,data=journal | 1066 |
| xfs_dw | Yes | xfs rw,noatime | 754 |
So, from the above table, the first test I did was the common ext4 with the InnoDB double write buffer enabled, and it yielded 690 new order transactions per minute (NOTPM). Reading the ext4 documentation, I also wanted to try the "dioread_nolock" setting, which is supposed to reduce mutex contention, and this time I got slightly less, 668 NOTPM. The difference is within the measurement error and isn't significant. Removing the InnoDB double write buffer, although unsafe, boosted the throughput to 1107 NOTPM, a 60% increase! Wow, indeed the double write buffer has a huge impact.

But what is the impact of asking the file system to replace the InnoDB double write buffer? Surprisingly, the performance level is only slightly lower, at 1066 NOTPM (1066/690 ≈ 1.55, a 55% gain over the baseline), even though vmstat did report twice the amount of writes. I needed to redo the tests a few times to convince myself: getting a 55% increase in performance on the same hardware is not common, except when some trivial configuration error has been made. Finally, xfs with the InnoDB double write buffer enabled, the setup I used to propose to customers, came in at 754 NOTPM, about 10% higher than ext4 with the double write buffer and close to what I was expecting. The graphic below presents the numbers in a more visual form.
TPCC NOTPM for various configurations
In terms of performance stability, you'll find below a graphic of the per minute NOTPM output for three of the tests: ext4 non-transactional with the double write buffer, ext4 transactional without the double write buffer, and xfs with the double write buffer. The dispersion is qualitatively similar for all three. The values presented above are just the averages of those data sets.
TPCC NOTPM evolution over time
InnoDB data corruption is not fun and removing the InnoDB double write buffer is a bit scary. In order to be sure it is safe, I executed the following procedure ten times (a shell sketch follows the list):

1. Start MySQL on ext4 mounted with "data=journal" and with the double write buffer disabled.
2. Run the write intensive tpcc benchmark.
3. Crash the server hard in the middle of the run.
4. Restart MySQL, let crash recovery complete and check the tables for corruption.
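Purely as an illustration, such a crash iteration can be scripted along these lines; the timings, paths and the SysRq crash mechanism are my assumptions, not necessarily the exact steps used:

```bash
#!/bin/sh
# Hypothetical crash-test sketch. Crashing the kernel via SysRq
# simulates a hard failure, the scenario in which torn pages can
# occur; a simple kill -9 of mysqld is not a sufficient test.
service mysql start
./tpcc_start -h127.0.0.1 -P3306 -dtpcc -utpcc -ptpcc \
             -w200 -c32 -r60 -l600 -i60 &
sleep 300                        # let the write load build up
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger     # nothing below this line runs

# After power cycling and rebooting: restart MySQL, let InnoDB
# crash recovery complete, then look for corruption, e.g. with:
#   mysqlcheck -utpcc -ptpcc --check tpcc
```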
I observed no corruption. I was still a bit preoccupied: what if the test was wrong? I removed the "data=journal" mount option and did a new run; I got corruption the first time. So, given the procedure I followed and the number of crash tests, I think it is reasonable to assume it is safe to replace the InnoDB double write buffer with the ext4 transactional journal.
I also looked at the kernel ext4 sources and changelog. Until recently, before kernel 3.2, O_DIRECT wasn't supported with data=journal, and MySQL would have issued a warning in the error log. Now, with recent kernels, O_DIRECT is mapped to O_DSYNC and O_DIRECT is faked, but only for data=journal, which is exactly what is needed. Indeed, I tried "innodb_flush_method = O_DSYNC" and found the same results. With older kernels, I strongly advise using the "innodb_flush_method = O_DSYNC" setting to make sure files are opened in a way that will cause them to be transactional for ext4. As always, test thoroughly; I only tested on Ubuntu 14.04.
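For those older kernels, the relevant my.cnf fragment is simply:

```
# Open InnoDB files with O_DSYNC so that, combined with the
# data=journal mount option, writes are covered by the ext4
# transactional journal.
innodb_flush_method = O_DSYNC
```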
Since we are no longer really using O_DIRECT, even if it is set in my.cnf, the OS file cache will be used for InnoDB data. If the database is only using InnoDB, that's not a big deal, but if MyISAM is significantly used, that may cause performance issues, since MyISAM relies on the OS file cache. So be warned.
If you have an SSD setup that doesn't offer a transactional file system like the FusionIO directFS, a very interesting setup would be to mix spinning drives and SSDs. For example, let's suppose we have a mirror of spinning drives handled by a raid controller with a write cache (and a BBU) and SSD storage on a PCIe card. To reduce the write load on the SSD, we could send the file system journal to the spinning drives using the "journal_path=path" or "journal_dev=devnum" options of ext4. The raid controller write cache would do an awesome job of merging the write operations for the file system journal, and the amount of write operations going to the SSD would be cut by half. I don't have access to such a setup, but it seems very promising performance-wise. A minimal sketch of what it could look like follows.
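This sketch assumes /dev/md0 is the spinning-drive mirror behind the controller write cache and /dev/fioa is the PCIe SSD; both device names are hypothetical:

```bash
# Create a dedicated external journal on the spinning mirror;
# the journal device must use the same block size as the data
# file system.
mke2fs -O journal_dev -b 4096 /dev/md0

# Create the data file system on the SSD, pointing it at the
# external journal.
mkfs.ext4 -b 4096 -J device=/dev/md0 /dev/fioa

# Mount with full data journaling.
mount -o rw,data=journal /dev/fioa /var/lib/mysql
```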
Like ZFS, ext4 can be transactional, and replacing the InnoDB double write buffer with the file system transaction journal yielded a 55% increase in performance for a write intensive workload. Performance gains are also expected for SSD and mixed spinning/SSD configurations.