To understand the environment's limits, potential bottlenecks, performance characteristics and capacity, I ran some tests on Oracle I/O, storage I/O and CPU limits. Enjoy and leave comments.
In this PART 1, I try to find the limit of a single core driving I/O, using the dd utility. This gives an idea of the per-core I/O limit.
Why is this important? Such knowledge of the environment can be used in deployment planning. RDBMS-related test results will be shared in PART 2, using SLOB to generate the I/O workload.
Why not just trust the OEM datasheet? Because each environment differs in components and configuration, a bottleneck can exist anywhere. btt is a useful tool for tracking where the bottleneck sits between the OS and the storage stack.
Knowing these “limit values” improves the accuracy of decision making, especially when tuning real database workloads, designing applications and planning capacity.
TEST1:
What is the limit of a single Linux core driving I/O? I used the simple dd utility to generate the workload. This should give an upper limit on the I/O a single CPU core can drive.
I can safely assume that a single core on this system will not drive more than about 300MB/s of I/O (at least for datasets larger than 4GB).
I used taskset to pin dd to a particular CPU core, but I did not go nuclear and isolate IRQs or other users; for simplicity I assumed that interference from IRQs and other users would be minimal. You can isolate a CPU using tuna, or with the boot parameters isolcpus (removes the core from the kernel scheduler) and nohz_full (disables the periodic tick on the core).
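Before pinning dd to a core, it can help to check whether that core is really isolated. The snippet below is only a sketch under assumptions: cpu_in_list is a helper name made up for illustration, and it assumes the kernel exposes the isolated set in /sys/devices/system/cpu/isolated using the usual list format (for example 77-78), which is empty when no core is isolated.

```shell
# cpu_in_list CPU LIST
# Succeeds (exit 0) if CPU appears in a kernel-style CPU list such as "2-5,77-78".
# Hypothetical helper, not part of any standard tool.
cpu_in_list() {
    printf '%s\n' "$2" | awk -v cpu="$1" -F',' '
        {
            for (i = 1; i <= NF; i++) {
                if ($i == "") continue
                n  = split($i, r, "-")          # "77-78" -> r[1]=77, r[2]=78
                lo = r[1] + 0
                hi = (n == 2) ? r[2] + 0 : lo   # a single CPU is a range of one
                if (cpu + 0 >= lo && cpu + 0 <= hi) found = 1
            }
        }
        END { exit (found ? 0 : 1) }'
}

# Demo with a literal list; on a live system pass the sysfs content instead:
#   cpu_in_list 77 "$(cat /sys/devices/system/cpu/isolated)"
if cpu_in_list 77 "2-5,77-78"; then
    echo "core 77 is in the list"
fi
```

In my tests the list would have been empty, since I did not isolate any core.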
To bypass the OS buffer cache, I used iflag=direct, which makes dd read the LUN with direct I/O.
Another caution: other processes might be accessing the same LUN path. I accept this noise as negligible in the results posted here. This is an existing LUN used by ASM, so I can only do read I/O against it with dd.
Result of TEST1, using a single core to drive I/O:
[grid@ServerE-54640 ~]$ taskset -c 77 dd if=/dev/dm-71 of=/dev/null iflag=direct bs=1M count=4000
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 14.3439 s, 292 MB/s
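The MB/s figure dd prints is simply bytes divided by seconds, in decimal megabytes. As a quick sanity check, the summary line can be recomputed with a small helper. dd_rate_mb is a name made up for this sketch, and it assumes the GNU dd summary format shown above.

```shell
# dd_rate_mb: read GNU dd summary lines on stdin and print bytes/seconds
# in decimal MB/s (1 MB = 1,000,000 bytes). Hypothetical helper.
dd_rate_mb() {
    awk '/bytes/ && /copied/ {
        bytes = $1
        secs  = 0
        # the elapsed time is the field right before the "s," token
        for (i = 2; i <= NF; i++) if ($i == "s" || $i == "s,") secs = $(i - 1)
        if (secs > 0) printf "%.1f\n", bytes / secs / 1000000
    }'
}

echo "4194304000 bytes (4.2 GB) copied, 14.3439 s, 292 MB/s" | dd_rate_mb   # prints 292.4
```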
The avg-cpu section of the iostat output is omitted here, since the test runs on specific core(s).
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
dm-71 0.00 0.00 1.00 4.00 0.02 0.04 20.80 0.00 0.60 1.00 0.50 0.60 0.30
dm-71 0.00 0.00 571.00 3.00 140.17 0.03 500.24 0.71 1.23 1.23 0.67 0.34 19.50
dm-71 0.00 0.00 1829.00 7.00 456.08 0.09 508.85 3.08 1.68 1.69 0.29 0.53 97.30
dm-71 0.00 0.00 1781.00 7.00 445.02 0.04 509.77 3.45 1.93 1.93 0.43 0.55 98.50
dm-71 0.00 0.00 2201.00 4.00 547.91 0.03 508.92 3.12 1.41 1.42 0.50 0.44 97.90
dm-71 0.00 0.00 821.00 7.00 203.08 0.08 502.51 2.39 2.86 2.88 0.71 1.20 99.20
dm-71 0.00 0.00 829.00 5.00 206.78 0.03 507.86 2.35 2.78 2.80 0.60 1.19 99.60
dm-71 0.00 0.00 1205.00 4.00 298.67 0.03 505.99 3.55 2.97 2.98 0.75 0.81 98.20
dm-71 0.00 0.00 1049.00 9.00 261.08 0.08 505.52 3.86 3.67 3.69 0.78 0.93 98.80
dm-71 0.00 0.00 835.00 4.00 208.52 0.05 509.10 2.50 2.94 2.95 0.50 1.18 99.00
dm-71 0.00 0.00 890.00 5.00 219.69 0.04 502.79 2.86 3.22 3.23 0.40 1.10 98.70
dm-71 0.00 0.00 971.00 6.00 242.05 0.07 507.53 3.80 3.80 3.83 0.33 1.01 98.60
dm-71 0.00 0.00 806.00 4.00 200.80 0.05 507.81 2.98 3.80 3.81 0.50 1.23 99.50
dm-71 0.00 0.00 732.00 4.00 179.48 0.03 499.53 2.43 3.30 3.31 1.00 1.34 98.70
dm-71 0.00 0.00 733.00 8.00 182.08 0.08 503.46 2.16 2.91 2.94 0.38 1.33 98.50
dm-71 0.00 0.00 698.00 7.00 174.03 0.05 505.70 3.26 4.63 4.67 0.57 1.41 99.40
dm-71 0.00 0.00 153.00 3.00 36.14 0.03 474.88 0.43 2.79 2.84 0.00 0.92 14.40
dm-71 0.00 0.00 6.00 6.00 0.09 0.08 29.50 0.01 0.42 0.33 0.50 0.42 0.50
Do not trust the svctm and %util values in this case; they are not reliable. A %util of 99 here does not show that the storage is saturated, because iostat does not account for devices that serve requests in parallel.
The LUNs attached to this server are RAID5 on SSDs with a storage cache, so they scale well compared to single disks.
TEST2:
Here is a simple proof that the storage is not saturated. The next result shows the iostat output when two processes executed dd simultaneously, pinned to cores 77 and 78.
As you can see below, the read rate (rMB/s) almost doubled, yet the %util column still shows 99%.
The r_await values are very similar to those of the single-core run, and r_await is the more reliable metric. Over each sampling interval, r_await = (current total read time - previous total read time) / (current reads completed - previous reads completed).
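To make the r_await calculation concrete, here is a sketch that derives it from two /proc/diskstats snapshots of the same device. It relies on the documented field layout (after major, minor and name come reads completed, reads merged, sectors read and milliseconds spent reading). r_await_ms is a made-up helper name, and the two snapshot lines are synthetic values, not measurements from this system.

```shell
# r_await_ms PREV CUR
# PREV and CUR are two /proc/diskstats lines for the same device.
# Prints the average read latency in ms over the interval between them.
r_await_ms() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { ios = $4; ms = $7 }      # $4 = reads completed, $7 = ms spent reading
        NR == 2 {
            d_ios = $4 - ios
            d_ms  = $7 - ms
            if (d_ios > 0) printf "%.2f\n", d_ms / d_ios
            else           print "0.00"    # no reads completed in the interval
        }'
}

# Synthetic snapshots: 1000 reads completed and 1500 ms of read time
# accumulated between the two samples.
prev="253 71 dm-71 50000 0 25600000 80000 100 0 800 50 0 60000 80050"
cur="253 71 dm-71 51000 0 26112000 81500 105 0 840 53 0 61000 81553"
r_await_ms "$prev" "$cur"   # prints 1.50

# On a live system the snapshots could be taken like this:
#   prev=$(grep -w dm-71 /proc/diskstats); sleep 1; cur=$(grep -w dm-71 /proc/diskstats)
```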
dd output using two cores in parallel:
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 14.396 s, 291 MB/s
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 14.3969 s, 291 MB/s
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
dm-71 0.00 2.00 5.00 8.00 0.08 1.18 198.69 0.01 0.62 0.20 0.88 0.54 0.70
dm-71 0.00 0.00 2500.00 3.00 623.12 0.03 509.88 4.63 1.85 1.85 0.67 0.27 67.60
dm-71 0.00 1.00 2855.00 7.00 713.05 0.36 510.50 7.45 2.51 2.51 0.71 0.35 99.80
dm-71 0.00 1.00 3321.00 14.00 829.08 0.93 509.70 7.34 2.28 2.29 1.21 0.30 99.60
dm-71 0.00 4.00 3719.00 10.00 926.70 2.15 510.13 6.63 1.77 1.77 1.70 0.27 99.80
dm-71 0.00 0.00 1105.00 8.00 275.55 0.11 507.23 4.55 3.99 4.01 0.88 0.90 100.00
dm-71 0.00 0.00 1674.26 3.96 417.87 0.05 510.00 5.08 3.08 3.09 0.25 0.59 98.91
dm-71 0.00 0.00 2111.00 4.00 525.17 0.03 508.57 6.57 3.08 3.09 1.00 0.47 99.90
dm-71 0.00 0.00 2598.00 5.00 649.03 0.06 510.70 7.59 2.95 2.95 0.40 0.38 99.90
dm-71 0.00 0.00 1709.00 4.00 426.08 0.05 509.46 5.55 3.20 3.21 0.50 0.58 99.80
dm-71 0.00 0.00 1602.00 3.00 398.16 0.03 508.09 5.90 3.72 3.72 1.00 0.62 99.80
dm-71 0.00 0.00 1963.00 6.00 490.05 0.06 509.78 7.50 3.81 3.82 0.50 0.51 99.90
dm-71 0.00 0.00 1885.00 8.00 470.08 0.07 508.64 6.93 3.64 3.66 0.38 0.53 99.60
dm-71 0.00 0.00 2135.00 3.00 530.94 0.03 508.62 5.47 2.58 2.58 0.33 0.47 99.80
dm-71 0.00 0.00 1614.00 7.00 403.27 0.08 509.59 5.07 3.12 3.14 0.71 0.62 100.10
dm-71 0.00 0.00 1280.00 5.00 319.06 0.05 508.59 5.52 4.31 4.32 0.20 0.56 71.90
dm-71 0.00 0.00 11.00 3.00 0.17 0.03 29.86 0.01 0.36 0.36 0.33 0.36 0.50
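The exact commands for the parallel run are not shown above. One way to start several pinned readers at once could look like the sketch below. run_parallel_dd and the RUN switch are made up for illustration; by default the function only prints the commands, and it uses the same dd options as the single-core test.

```shell
# run_parallel_dd DEVICE FIRST_CORE NPROCS
# Prints one pinned dd command per consecutive core. With RUN=1 in the
# environment the readers are actually started in the background and waited for.
# Hypothetical helper; read-only against DEVICE because of= is /dev/null.
run_parallel_dd() {
    dev=$1; first=$2; n=$3
    i=0
    while [ "$i" -lt "$n" ]; do
        core=$((first + i))
        if [ "${RUN:-0}" = "1" ]; then
            taskset -c "$core" dd if="$dev" of=/dev/null iflag=direct bs=1M count=4000 &
        else
            echo "taskset -c $core dd if=$dev of=/dev/null iflag=direct bs=1M count=4000"
        fi
        i=$((i + 1))
    done
    wait    # returns immediately when nothing was started
}

# Dry run for two readers on cores 77 and 78:
run_parallel_dd /dev/dm-71 77 2
```

As written, every reader starts at offset 0 of the LUN. If each reader should cover a different region, a different skip= value per reader is one option.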
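Another important result from TEST1:
What is the utilization of core 77 when reading at 292 MB/s? Checking the mpstat output, the core shows 98-100% non-idle time during the read. In the rows below, nearly all of that is %iowait (about 97%), with %usr and %sys together around 3%; the later rows show the core back at 96-100% idle.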
01:30:11 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
...
01:26:12 PM 77 1.02 0.00 2.04 96.94 0.00 0.00 0.00 0.00 0.00
01:26:13 PM 77 1.02 0.00 2.04 96.94 0.00 0.00 0.00 0.00 0.00
01:26:14 PM 77 1.02 0.00 0.00 3.06 0.00 0.00 0.00 0.00 95.92
01:26:15 PM 77 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 98.00
01:26:16 PM 77 1.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.99
01:26:17 PM 77 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
01:26:18 PM 77 1.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.99
01:26:19 PM 77 1.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.99
01:26:20 PM 77 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 98.00
01:26:21 PM 77 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
01:26:22 PM 77 1.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.98
01:26:23 PM 77 1.01 0.00 1.01 0.00 0.00 0.00 0.00 0.00 97.98
01:26:24 PM 77 1.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.98
01:26:25 PM 77 1.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.99
01:26:26 PM 77 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 98.00
01:26:27 PM 77 1.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.98
01:26:28 PM 77 2.04 0.00 1.02 0.00 0.00 0.00 0.00 0.00 96.94
01:26:29 PM 77 1.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.99
01:26:30 PM 77 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 98.00
01:26:31 PM 77 1.01 0.00 1.01 0.00 0.00 0.00 0.00 0.00 97.98
01:26:32 PM 77 1.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.98
01:26:33 PM 77 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 99.00
01:26:34 PM 77 1.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.99
01:26:35 PM 77 1.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.99
01:26:36 PM 77 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
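To summarize per-core samples like these, the rows can be averaged with awk. This is a sketch that assumes the 12-hour mpstat layout shown above (time, AM/PM, CPU, then the nine percentage columns); core_summary is a made-up helper name. The demo feeds it the header and the first two rows from the output above.

```shell
# core_summary CPU
# Reads mpstat lines on stdin and prints the average %usr+%sys, %iowait
# and %idle for the given CPU. Hypothetical helper; column positions
# assume the 12-hour time format (CPU number in the third field).
core_summary() {
    awk -v cpu="$1" '
        $3 == cpu && $NF ~ /^[0-9.]+$/ {
            work += $4 + $6; iowait += $7; idle += $NF; n++
        }
        END {
            if (n) printf "samples=%d usr+sys=%.2f iowait=%.2f idle=%.2f\n",
                          n, work / n, iowait / n, idle / n
        }'
}

printf '%s\n' \
    "01:30:11 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle" \
    "01:26:12 PM 77 1.02 0.00 2.04 96.94 0.00 0.00 0.00 0.00 0.00" \
    "01:26:13 PM 77 1.02 0.00 2.04 96.94 0.00 0.00 0.00 0.00 0.00" |
    core_summary 77
# prints: samples=2 usr+sys=3.06 iowait=96.94 idle=0.00
```

On a live system the input could come from something like mpstat -P 77 1, run while dd is active.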
TEST3:
How many cores can drive I/O to the limit with acceptable latency and efficiency?
Here are the test results. I stopped at four cores because at this level r_await reaches 10ms; see the graph below.
Note that at 3 cores the dd output shows 295MB/s per core, while at 4 cores it drops drastically to 255MB/s per core. This reflects a significant increase in queue time.
Trade-off between latency and throughput:
It’s interesting that aggregate throughput is still higher for 4 cores (1020MB/s, 135MB/s higher than with 3 cores), but latency starts to reach 10ms. I/O should be measured by throughput together with latency, not by throughput alone.
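The aggregate figures come from adding up the per-process rates that dd reports. A small sketch of that arithmetic, using the 3-core and 4-core rates from the results (aggregate_mb is a made-up helper name):

```shell
# aggregate_mb: sum the MB/s figure of every dd summary line on stdin.
# Hypothetical helper; assumes each line ends in "<rate> MB/s".
aggregate_mb() {
    awk '/copied/ && $NF == "MB/s" { total += $(NF - 1) }
         END { printf "%d\n", total }'
}

three=$(printf '%s\n' \
    "4194304000 bytes (4.2 GB) copied, 14.2124 s, 295 MB/s" \
    "4194304000 bytes (4.2 GB) copied, 14.2147 s, 295 MB/s" \
    "4194304000 bytes (4.2 GB) copied, 14.2163 s, 295 MB/s" | aggregate_mb)

four=$(printf '%s\n' \
    "4194304000 bytes (4.2 GB) copied, 16.425 s, 255 MB/s" \
    "4194304000 bytes (4.2 GB) copied, 16.4266 s, 255 MB/s" \
    "4194304000 bytes (4.2 GB) copied, 16.428 s, 255 MB/s" \
    "4194304000 bytes (4.2 GB) copied, 16.429 s, 255 MB/s" | aggregate_mb)

echo "3 cores: $three MB/s, 4 cores: $four MB/s, gain: $((four - three)) MB/s"
# prints: 3 cores: 885 MB/s, 4 cores: 1020 MB/s, gain: 135 MB/s
```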
If your SLA says 30ms is acceptable, then you should be able to increase the level of parallelism. Another versatile utility for measuring I/O latency is fio, the flexible I/O tester (see: https://github.com/axboe/fio).
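For comparison, a roughly equivalent read test with fio might be put together as below. I did not run fio for this post, so treat it as a sketch: fio_read_cmd is a made-up helper that only prints the command line, and the job name, queue depth and runtime are arbitrary choices. The --readonly option is there because the LUN belongs to ASM.

```shell
# fio_read_cmd DEVICE CORE IODEPTH
# Prints (does not run) an fio command for a sequential 1M direct read,
# pinned to one core. Hypothetical helper; values are illustrative.
fio_read_cmd() {
    dev=$1; core=$2; depth=$3
    echo "fio --name=lun-read --filename=$dev --readonly --rw=read --bs=1M" \
         "--direct=1 --ioengine=libaio --iodepth=$depth --cpus_allowed=$core" \
         "--runtime=30 --time_based"
}

fio_read_cmd /dev/dm-71 77 4
```

Unlike dd, fio reports completion latency percentiles alongside bandwidth, which makes it easier to judge throughput and latency together.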
Results:
1. using 1 core
$ taskset -c 77 dd if=/dev/dm-71 of=/dev/null iflag=direct bs=1M count=4000
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 14.3439 s, 292 MB/s
2. using 2 cores
$ 4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 14.396 s, 291 MB/s
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 14.3969 s, 291 MB/s
3. using 3 cores
$ 4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 14.2124 s, 295 MB/s
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 14.2147 s, 295 MB/s
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 14.2163 s, 295 MB/s
4. using 4 cores
$ 4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 16.425 s, 255 MB/s
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 16.4266 s, 255 MB/s
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 16.428 s, 255 MB/s
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 16.429 s, 255 MB/s
See the graph below plotting rMB/s and r_await:
PART 2 will discuss the results of an Oracle workload generated with SLOB.