In database processing, the storage components are critical parts that determine performance and scalability. Various methods have been deployed to speed up database processing: memory buffers that act as an intermediary between the CPU and a device (e.g., tape, hard disk, etc.), and caches that reduce latency and avoid round-tripping of data between the processor and the storage device. Traditionally you have an expensive, small device located close to the CPU (RAM, a.k.a. "volatile" memory) and large-capacity storage located far away from the CPU but comparably much cheaper (hard disks/tapes). The gap has been closing over the years: traditional storage devices have become faster (hard drives to SSDs), and RAM is now comparably cheaper and denser in terms of capacity versus size (data density). Servers nowadays commonly have terabytes of RAM installed.
But one main issue remains: the volatility of RAM. One of the "dirty" ways of improving data processing in a database was to put the datafile on a tmpfs/ramfs filesystem and, after processing completed, move the data back to a non-volatile filesystem.
With the advent of NVDIMM, "we can now eat the cake and still have the cake ;)" if you are willing to pay more. For simplicity's sake, NVDIMM is RAM that is non-volatile.
Oracle has already deployed this Persistent Memory (NVDIMM) technology in their engineered systems (Exadata). With 21c you can deploy it in a standalone instance without Exadata.
A PMEM module is installed in the DIMM slots on the motherboard, the same location as RAM, and delivers RAM-like speed. To access the data you do not need an intermediary, which gave birth to DAX (Direct Access). According to kernel.org, for block devices that are memory-like, the page cache pages would be unnecessary copies of the original storage; the DAX code removes the extra copy by performing reads and writes directly to the storage device. EXT4 and XFS support the dax option. Applications that use PMEM do not "really" need caching and buffering.
How Oracle is taking advantage:
Because the Oracle database does its own caching, creating datafiles directly on a filesystem backed by NVDIMM incurs unnecessary overhead such as "db file sequential read" events. These I/O reads are unnecessary when using the Oracle Directly Mapped Buffer Cache (PMEM Filestore).
The screenshot above is an AWR compare report generated with:
1st - Oracle Directly Mapped Buffer Cache implemented using a PMEM Filestore
2nd - without Oracle Directly Mapped Buffer Cache; the datafile was created directly on a DAX-enabled filesystem, which Oracle still treats as a traditional filesystem
The reports show that both have similar "Logical read" figures, while the 1st (PMEMfs) has far fewer "Physical read" blocks than the 2nd, almost a 1:6 ratio. Less is more.
|                         | PMEM Filestore | XFS on NVDIMM |
| Logical read (blocks)   | 39,777.5       | 36,529.7      |
| Physical read (blocks)  | 5,211.5        | 30,594.4      |
| Transactions            | 215.2          | 196.8         |
Deploying Oracle Directly Mapped Buffer Cache/Persistent Memory.
Of course, you need a server with NVDIMM capabilities. Fortunately, this can be tested using KVM without buying expensive NVDIMM devices.
Deploying VM with NVDIMM Capabilities
You can follow this procedure once you have created a KVM guest.
1. On host:
a. mount -t tmpfs -o size=1124m pmemdisk /pmem/pmemnvme0
I used tmpfs to simulate the RAM-like speed; you could also consider ramfs. Note that everything is lost on unmount, a host restart, or even a guest VM restart.
b. Register it in /etc/fstab for persistence (see the example entry after this list).
c. Create the backing file for the guest nvdimm device:
touch /pmem/pmemnvme0/nvdimm
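For reference, the /etc/fstab entry for step b could look like the following (an illustration assuming the same mount point and size as above):
pmemdisk  /pmem/pmemnvme0  tmpfs  size=1124m  0 0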
2. Configuring Guest KVM config:
a. Enable maxMemory. Add "<maxMemory slots='8' unit='KiB'>16777216</maxMemory>". See below:
root:~# virsh edit ol8.2
<domain type='kvm'>
<name>ol8.2</name>
<uuid>ba60c63b-3571-4d8e-8ef8-932b3c65abd7</uuid>
<metadata>
<libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
<libosinfo:os id="http://oracle.com/ol/8.2"/>
</libosinfo:libosinfo>
</metadata>
<maxMemory slots='8' unit='KiB'>16777216</maxMemory>
<memory unit='KiB'>5345280</memory>
<currentMemory unit='KiB'>4194304</currentMemory>
Note: if this is not added, the following error will be received while attaching the NVDIMM device to the guest VM:
error: Failed to attach device from nvme.xml
error: unsupported configuration: cannot use/hotplug a memory device when domain 'maxMemory' is not defined
b. Enable NUMA on Guest VM:
Edit the VM XML with virsh edit ol8.2. Below is an example with 2 NUMA nodes of 2097152 KiB of memory each:
root:~# virsh dumpxml ol8.2 | sed -n '/<cpu mode/,/<\/cpu>/p'
<cpu mode='host-passthrough' check='none'>
<numa>
<cell id='0' cpus='0' memory='2097152' unit='KiB'/>
<cell id='1' cpus='1' memory='2097152' unit='KiB'/>
</numa>
</cpu>
root:~#
NOTE: if NUMA is not enabled in the KVM XML, you will receive "Error starting domain: unsupported configuration: cannot use/hotplug..."
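Once the guest is up (step 4), you can optionally double-check the NUMA layout from inside the VM; this assumes the numactl package is installed in the guest:
[root@localhost ~]# numactl --hardware    # should report 2 nodes with ~2 GB of memory each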
3. Add the PMEM device to the guest VM config. Details about the options are in the libvirt domain XML documentation.
#cat nvme.xml
<memory model='nvdimm'>
<source>
<path>/pmem/pmemnvme0/nvdimm</path>
</source>
<target>
<size unit='KiB'>1150976</size>
<node>1</node>
<label>
<size unit='KiB'>128</size>
</label>
</target>
<alias name='nvdimm6'/>
<address type='dimm' slot='6'/>
</memory>
# virsh attach-device --domain ol8.2 --file nvme.xml --config --persistent
This will update the guest config file.
NOTE: To avoid the error messages below, ensure you have available slots and maxMemory properly set (see above).
error: Failed to attach device from nvme.xml
error: unsupported configuration: Attaching memory device with size '1048576' would exceed domain's maxMemory config size '4194304'
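Before attaching, a quick sanity check of maxMemory and the DIMM slots already defined in the guest XML (so the slot number in nvme.xml does not clash) could be:
# virsh dumpxml ol8.2 | grep -E 'maxMemory|slot='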
4. Start the VM and confirm the NVDIMM device is attached.
a. #virsh start ol8.2
b. Log in as root to the VM, i.e., ol8.2.
c. Check with a few commands: dmesg | grep pmem, lsblk, ndctl list, ndctl list -D, ndctl list -R
[root@localhost ~]# dmesg | grep pmem
[ 7.977903] pmem0: detected capacity change from 0 to 1178468352
[root@localhost ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sr0 11:0 1 1024M 0 rom
vda 251:0 0 30G 0 disk
├─vda1 251:1 0 1G 0 part /boot
└─vda2 251:2 0 29G 0 part
├─ol-root 252:0 0 27G 0 lvm /
└─ol-swap 252:1 0 2.1G 0 lvm [SWAP]
vdb 251:16 0 40G 0 disk
└─vdb1 251:17 0 40G 0 part /u02
pmem0 259:0 0 1.1G 0 disk /pmemdisk
[root@localhost ~]#
[root@localhost ~]# ndctl list
[root@localhost ~]# ndctl list -RD
{
"dimms":[
{
"dev":"nmem0",
"id":"8680-5c341200",
"handle":7,
"phys_id":0
}
],
"regions":[
{
"dev":"region0",
"size":1178468352,
"available_size":0,
"max_available_extent":0,
"type":"pmem",
"mappings":[
{
"dimm":"nmem0",
"offset":0,
"length":1178468352,
"position":0
}
],
"persistence_domain":"unknown"
}
]
}
[root@localhost ~]#
5. Create FSDAX namespace
# ndctl disable-region region0
# ndctl init-labels nmem0
# ndctl enable-region region0
# ndctl create-namespace --region=region0
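Note: ndctl create-namespace uses fsdax mode by default, which is what we need for a DAX-mountable block device; to make it explicit you could run:
# ndctl create-namespace --region=region0 --mode=fsdax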
Checks
[root@localhost ~]# cat /proc/iomem | grep -i namespace
180000000-1c63dffff : namespace0.0
[root@localhost ~]#
[root@localhost ~]# ndctl list -NDR
{
"dimms":[
{
"dev":"nmem0",
"id":"8680-5c341200",
"handle":7,
"phys_id":0
}
],
"regions":[
{
"dev":"region0",
"size":1178468352,
"available_size":0,
"max_available_extent":0,
"type":"pmem",
"iset_id":10248238646047836,
"mappings":[
{
"dimm":"nmem0",
"offset":0,
"length":1178468352,
"position":0
}
],
"persistence_domain":"unknown",
"namespaces":[
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":1157627904,
"uuid":"ae19823f-63c4-4417-a44e-5f85db8bca33",
"sector_size":512,
"align":2097152,
"blockdev":"pmem0"
}
]
}
]
}
[root@localhost ~]#
6. Create an XFS filesystem and mount it with the dax option
[root@localhost ~]# mkfs.xfs -f -m reflink=0 /dev/pmem0
meta-data=/dev/pmem0 isize=512 agcount=4, agsize=70656 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=0
data = bsize=4096 blocks=282624, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
[root@localhost ~]# mount -o dax /dev/pmem0 /pmemdisk
[root@localhost ~]# df -h
...
/dev/pmem0 1.1G 34M 1.1G 4% /pmemdisk
[root@localhost ~]#
--Checks
[root@localhost ~]# findmnt /pmemdisk
TARGET SOURCE FSTYPE OPTIONS
/pmemdisk /dev/pmem0 xfs rw,relatime,attr2,dax,inode64,logbufs=8,logbsize=32k,noquota
[root@localhost ~]#
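To have the guest remount this automatically after a reboot, an /etc/fstab entry along these lines could be added inside the guest (illustrative, matching the device and mount point above):
/dev/pmem0  /pmemdisk  xfs  dax  0 0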
Deploying Oracle Directly Mapped Buffer Cache/Persistent Memory.
After the Oracle 21c binaries are installed and the database is created, the following steps enable Oracle DMBC.
Tested on Oracle 21.7.
1. As the root user, edit /etc/fuse.conf:
[root@localhost ~]# cat /etc/fuse.conf
# mount_max = 1000
user_allow_other
allow_other
2. Create a mount point for Oracle
#mkdir /pmemdisk/orcldata
#chown oracle:oinstall /pmemdisk/orcldata
3. Start Oracle in nomount mode
sqlplus / as sysdba
startup nomount
4. Create PMEM filestore
SQL> CREATE PMEM FILESTORE orcldata MOUNTPOINT '/u01/app/orcldata' BACKINGFILE '/pmemdisk/orcldata/orcldata';
PMEM filestore created.
SQL>
Note: the pmem_filestore parameter is updated automatically when a PMEM filestore is created through "CREATE PMEM FILESTORE ...".
_db_pmem_direct_map_size is also updated to the size of the PMEM filestore.
_db_pmem_direct_map_size controls the size used for DMBC. If you have created a 512 MB datafile and set _db_pmem_direct_map_size=256M, Oracle will only use 256 MB for DMBC, and the traditional buffer cache will be used to satisfy the remaining I/O requests (db file sequential read, etc.). There might be situations where this is desirable, for example when you have a large available buffer cache: the buffer cache (backed by DRAM) is still faster than the PMEM cache, but consider paging (DRAM<->NVDIMM), latches and other overheads.
But if you have terabytes of datafiles created on the PMEM filestore and only a few GB of RAM/buffer cache, then _db_pmem_direct_map_size should remain at its default. In any case, as with any Oracle hidden parameter, it is recommended not to modify it unless suggested through an Oracle SR or you fully understand the implications.
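For a read-only check of the current value of the hidden parameter (without modifying anything), a commonly used query against the X$ views, run as SYSDBA, is:
SQL> select i.ksppinm name, v.ksppstvl value
       from x$ksppi i, x$ksppcv v
      where i.indx = v.indx
        and i.ksppinm = '_db_pmem_direct_map_size';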
5. Check the filesystem for /u01/app/orcldata and the PMEM filestore.
$ df -h /u01/app/orcldata /pmemdisk
Filesystem Size Used Avail Use% Mounted on
/dev/fuse 1.1G 33M 1.1G 4% /u01/app/orcldata
/dev/pmem0 1.1G 1.1G 23M 98% /pmemdisk
$
SQL> select NAME,AVAILABLE_BYTES,USED_BYTES from v$PMEM_FILESTORE;
NAME AVAILABLE_BYTES USED_BYTES
----------------------------------------------------------------- --------------- ----------
ORCLDATA 1054752768 33669120
SQL>
6. Create a tablespace on the PMEM filestore
SQL> create tablespace pmem_ts datafile '/u01/app/orcldata/pmem_ts_datafile1.dbf' size 900M nologging;
Tablespace created.
SQL>
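A quick check that the new datafile actually lives on the PMEM filestore mount point:
SQL> select file_name, bytes/1024/1024 as mb from dba_data_files where tablespace_name = 'PMEM_TS';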
Testing:
1. Compare I/O performance between a datafile created on the PMEM filestore and a traditional filesystem
2. Compare I/O events between the Directly Mapped Buffer Cache (DMBC) and an ordinary DAX-enabled filesystem
3. Setting _db_pmem_direct_map_size to a non-default value
TEST 1: Compare I/O performance between a datafile created on the PMEM filestore and a traditional filesystem
1st - Oracle Directly Mapped Buffer Cache implemented using the PMEM filestore
2nd - traditional filesystem (without host cache, i.e., <driver name='qemu' type='qcow2' cache='directsync'/>)
There is a massive gain in write I/O: 99.7 MB/s for PMEMfs compared to 3.5 MB/s for the traditional disk.
Also notice that in the 2nd there are no "resmgr:cpu quantum" events, because most of the DB time was spent waiting on the storage subsystem to serve I/O.
Note: a small buffer cache was allocated in all cases, less than 100 MB.
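The exact workload behind these AWR snapshots is not shown here; as a rough illustration, a simple write-heavy load against the PMEM tablespace could be generated like this (t_load is a hypothetical table used only for this sketch):
SQL> create table t_load (id number, pad varchar2(200)) tablespace pmem_ts nologging;
SQL> insert /*+ append */ into t_load select level, rpad('x',200,'x') from dual connect by level <= 1000000;
SQL> commit;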
Test 2: Comparison of I/O events between the Directly Mapped Buffer Cache (DMBC) and a DAX-enabled filesystem
This shows the benefit of an (Oracle) algorithm tailored to use NVDIMM.
See the comments about the AWR screenshot in "How Oracle is taking advantage" above.
"User I/O" events such as "db file sequential read" are not seen in the top events for PMEMfs (1st) because Oracle issues I/O through PMEM library calls. New calls and events are introduced, such as kcbFlushPMEMBufferCbk and "consistent gets from pmem". It would be nice to have Oracle put "Read IO(MB) in PMEM" into the Load Profile.
PMEMfs also shows a significant reduction in Buffer Cache Reads and Direct Reads because most reads are satisfied from PMEM, as highlighted below. This is not yet captured by the AWR "IOStat Function" section but is available in the "Instance Activity Stats" section.
Instance Activity Stats DB/Inst: ORCL/orcl Snaps: 81-82
-> Ordered by statistic name
Statistic Total per Second per Trans
-------------------------------- ------------------ -------------- -------------
consistent gets from cache 960,584 3,178.3 14.8
consistent gets from pmem 6,392,783 21,151.8 98.3
consistent gets pin 808,690 2,675.7 12.4
consistent gets pin (fastpath) 792,582 2,622.4 12.2
consistent gets pmem direct 1,259,392 4,167.0 19.4
consistent gets pmem examination 151,905 502.6 2.3
consistent gets pmem examination 151,903 502.6 2.3
consistent gets pmem pin 4,981,486 16,482.3 76.6
consistent gets pmem pin (fastpa 2,930,435 9,696.0 45.1
..
data blocks consistent reads - u 43,296 143.3 0.7
db block changes 8,542,447 28,264.4 131.3
db block gets 4,668,705 15,447.4 71.8
db block gets direct 25 0.1 0.0
db block gets from cache 4,668,680 15,447.3 71.8
db block gets from cache (fastpa 2,854,690 9,445.3 43.9
db block gets from pmem 0 0.0 0.0
The above is the Instance Activity Stats section of the AWR report for PMEMfs.
These PMEM statistics will not appear in a database without a PMEM filestore, even if you are using an NVDIMM-backed filesystem.
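Outside of AWR, the cumulative PMEM counters can also be pulled directly from v$sysstat, for example:
SQL> select name, value from v$sysstat where lower(name) like '%pmem%' and value > 0 order by value desc;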
Conclusion:
As an entry point, this article shows how you can deploy persistent memory without Exadata, how to test it, and offers some ideas about performance and tuning. Link to Oracle doc
Version Info:
qemu-kvm: 4.2.0
Oracle Database: 21.7 (should work on all 21c)
Guest OS: Red Hat Enterprise Linux release 8.2 (Ootpa)
Guest Kernel Release: 5.4.17-2011.1.2.el8uek.x86_64
Host Kernel Release : 4.18.0-348.20.1.el8_5.x86_64
Host OS: Red Hat Enterprise Linux release 8.5 (Ootpa)