Monday, September 26, 2022

USING PERSISTENT MEMORY WITH ORACLE DATABASE -- NVDIMM SIMULATED WITH KVM

In database processing, storage components are critical parts that determine performance and scalability. Various methods have been deployed to speed up database processing: memory buffers that act as an intermediary between the CPU and a device (e.g., tape, hard disk, etc.), and caches that reduce latency and avoid round-tripping of data between processor and storage device. Traditionally you have a small, expensive device located close to the CPU (RAM, aka "volatile" memory) and large-capacity storage located far from the CPU but comparably much cheaper (hard disks/tapes). The gap has been closing over the years: traditional storage moved from hard drives to faster SSDs, and RAM is now comparably cheaper and denser (capacity versus size). Servers with terabytes of installed RAM are common nowadays. 

But one main issue has remained until now: the volatility of RAM. One of the "dirty" ways of improving data processing in a database was to put datafiles on a tmpfs/ramfs filesystem and, after processing completed, move the data back to a non-volatile filesystem. 

With the advent of NVDIMM, "we can now eat the cake and still have the cake ;)" if you are willing to pay more. For simplicity's sake, NVDIMM is RAM that is non-volatile.  

 

 

Oracle has already deployed this Persistent Memory (NVDIMM) technology in their engineered systems (Exadata). With 21c you can deploy it in a standalone instance without Exadata.  

A PMEM module is installed in the DIMM slots on the motherboard, the same location as RAM, and delivers RAM-like speed. To access data, you do not need an intermediary; this gave birth to DAX (Direct Access). According to kernel.org, for block devices that are memory-like, the page cache pages would be unnecessary copies of the original storage. The DAX code removes the extra copy by performing reads and writes directly to the storage device. EXT4 and XFS support the DAX option. Applications that use PMEM do not "really" need caching and buffering. 

 

 

 

How Oracle is taking advantage: 

Because the Oracle database does its own caching, creating datafiles directly on a filesystem backed by NVDIMM incurs unnecessary overhead such as "db file sequential read" events. These I/O reads are unnecessary when using the Oracle Directly Mapped Buffer Cache (PMEM Filestore).  

 

The screenshot above is an AWR Compare Periods report generated with: 

1st - Oracle Directly Mapped Buffer Cache implemented using a PMEM Filestore 

2nd - without Oracle Directly Mapped Buffer Cache; the datafile is created directly on a DAX-enabled filesystem, which Oracle still treats as a traditional filesystem 

The reports show both have a similar "Logical read" rate, while the 1st (PMEMfs) has far fewer "Physical read" operations than the 2nd, almost a 1:6 ratio. Less is more.

 

                          PMEM Filestore   XFS on NVDIMM
Logical read (blocks)     39,777.5         36,529.7
Physical read (blocks)    5,211.5          30,594.4
Transactions              215.2            196.8


Deploying Oracle Directly Mapped Buffer Cache/Persistent Memory. 

Of course, you need a server with NVDIMM capabilities. Fortunately, this can be tested using KVM without buying expensive NVDIMM devices. 

Deploying VM with NVDIMM Capabilities 

You can follow this procedure once you have created a KVM guest. 

1. On host:  

a. mount -t tmpfs -o size=1124m pmemdisk /pmem/pmemnvme0 

I used tmpfs to simulate RAM-like speed; you can also consider ramfs. Note that everything will be lost on umount, host restart, or even a guest VM restart. 

b. Register the mount in /etc/fstab for persistence across host reboots (see the example entry after this step).  

c. Create the backing file for the guest nvdimm device: 

touch /pmem/pmemnvme0/nvdimm 
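
A minimal /etc/fstab entry for step 1b could look like the following (a sketch matching the mount command above; adjust the size and paths to your setup, and remember tmpfs contents themselves are never preserved across a host reboot, only the empty mount is recreated): 

pmemdisk  /pmem/pmemnvme0  tmpfs  size=1124m  0  0 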

 

2. Configuring Guest KVM config: 

a. Enable maxMemory. Add "<maxMemory slots='8' unit='KiB'>16777216</maxMemory>", see below: 

 

root:~# virsh edit ol8.2 

<domain type='kvm'> 

  <name>ol8.2</name> 

  <uuid>ba60c63b-3571-4d8e-8ef8-932b3c65abd7</uuid> 

  <metadata> 

    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0"> 

      <libosinfo:os id="http://oracle.com/ol/8.2"/> 

    </libosinfo:libosinfo> 

  </metadata> 

  <maxMemory slots='8' unit='KiB'>16777216</maxMemory> 

  <memory unit='KiB'>5345280</memory> 

  <currentMemory unit='KiB'>4194304</currentMemory> 

 

 

Note: if this is not added, the following error will be received while attaching the NVDIMM device (nvme.xml) to the guest VM: 

error: Failed to attach device from nvme.xml 

error: unsupported configuration: cannot use/hotplug a memory device when domain 'maxMemory' is not defined  

 

b. Enable NUMA on Guest VM: 

Edit the VM XML with virsh edit ol8.2. Below is an example with 2 NUMA nodes of 2097152 KiB of memory each: 

root:~# virsh dumpxml ol8.2 | sed -n '/<cpu mode/,/<\/cpu>/p' 

  <cpu mode='host-passthrough' check='none'> 

    <numa> 

      <cell id='0' cpus='0' memory='2097152' unit='KiB'/> 

      <cell id='1' cpus='1' memory='2097152' unit='KiB'/> 

    </numa> 

  </cpu> 

root:~ 

 

NOTE: if NUMA is not enabled in the KVM XML, you will receive "Error starting domain: unsupported configuration: cannot use/hotplug..." 
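
After the guest is started (step 4 below), the NUMA layout can be confirmed from inside the VM, for example (numactl may need to be installed first): 

# lscpu | grep -i numa 
# numactl --hardware 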

3. Add the PMEM device to the guest VM config. Details about the options are in the libvirt domain XML documentation.  

 

#cat nvme.xml  

<memory model='nvdimm'> 

      <source> 

        <path>/pmem/pmemnvme0/nvdimm</path> 

      </source> 

      <target> 

        <size unit='KiB'>1150976</size> 

        <node>1</node> 

        <label> 

          <size unit='KiB'>128</size> 

        </label> 

      </target> 

      <alias name='nvdimm6'/> 

      <address type='dimm' slot='6'/> 

    </memory> 

# virsh attach-device --domain ol8.2 --file nvme.xml  --config  --persistent 

This will update the guest config file. 

NOTE: To avoid the error messages below, ensure you have available slots and maxMemory properly set (see above).  

error: Failed to attach device from nvme.xml 

error: unsupported configuration: Attaching memory device with size '1048576' would exceed domain's maxMemory config size '4194304' 
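
Before starting the VM, you can confirm on the host that the nvdimm device is now part of the persistent guest config, for example with the same sed trick used earlier: 

# virsh dumpxml ol8.2 --inactive | sed -n '/<memory model/,/<\/memory>/p' 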

4. Start the VM and confirm the NVDIMM device is attached. 

a. #virsh start ol8.2  

b. Log in to the VM (ol8.2) as root.  

c. Check with some commands: dmesg | grep pmem, lsblk, ndctl list, ndctl list -D, ndctl list -R 

 

[root@localhost ~]#  dmesg |grep  pmem 

[    7.977903] pmem0: detected capacity change from 0 to 1178468352 

 

[root@localhost ~]# lsblk 

NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT 

sr0          11:0    1 1024M  0 rom   

vda         251:0    0   30G  0 disk  

├─vda1      251:1    0    1G  0 part /boot 

└─vda2      251:2    0   29G  0 part  

  ├─ol-root 252:0    0   27G  0 lvm  / 

  └─ol-swap 252:1    0  2.1G  0 lvm  [SWAP] 

vdb         251:16   0   40G  0 disk  

└─vdb1      251:17   0   40G  0 part /u02 

pmem0       259:0    0  1.1G  0 disk /pmemdisk 

[root@localhost ~]# 

 

 

[root@localhost ~]# ndctl list 

[root@localhost ~]# ndctl list -RD  

{ 

  "dimms":[ 

    { 

      "dev":"nmem0", 

      "id":"8680-5c341200", 

      "handle":7, 

      "phys_id":0 

    } 

  ], 

  "regions":[ 

    { 

      "dev":"region0", 

      "size":1178468352, 

      "available_size":0, 

      "max_available_extent":0, 

      "type":"pmem", 

      "mappings":[ 

        { 

          "dimm":"nmem0", 

          "offset":0, 

          "length":1178468352, 

          "position":0 

        } 

      ], 

      "persistence_domain":"unknown" 

    } 

  ] 

} 

[root@localhost ~]#  

 

5. Create FSDAX namespace 

# ndctl disable-region region0 

# ndctl init-labels nmem0 

# ndctl enable-region region0 

# ndctl create-namespace --region=region0 

Checks 

[root@localhost ~]# cat /proc/iomem | grep -i namespace 

  180000000-1c63dffff : namespace0.0 

[root@localhost ~]#  

[root@localhost ~]# ndctl list -NDR 

{ 

  "dimms":[ 

    { 

      "dev":"nmem0", 

      "id":"8680-5c341200", 

      "handle":7, 

      "phys_id":0 

    } 

  ], 

  "regions":[ 

    { 

      "dev":"region0", 

      "size":1178468352, 

      "available_size":0, 

      "max_available_extent":0, 

      "type":"pmem", 

      "iset_id":10248238646047836, 

      "mappings":[ 

        { 

          "dimm":"nmem0", 

          "offset":0, 

          "length":1178468352, 

          "position":0 

        } 

      ], 

      "persistence_domain":"unknown", 

      "namespaces":[ 

        { 

          "dev":"namespace0.0", 

          "mode":"fsdax", 

          "map":"dev", 

          "size":1157627904, 

          "uuid":"ae19823f-63c4-4417-a44e-5f85db8bca33", 

          "sector_size":512, 

          "align":2097152, 

          "blockdev":"pmem0" 

        } 

      ] 

    } 

  ] 

} 

[root@localhost ~]#  

 

6. Create an XFS filesystem and mount it with the dax option 

[root@localhost ~]# mkfs.xfs -f -m reflink=0 /dev/pmem0 

meta-data=/dev/pmem0             isize=512    agcount=4, agsize=70656 blks 

         =                       sectsz=4096  attr=2, projid32bit=1 

         =                       crc=1        finobt=1, sparse=1, rmapbt=0 

         =                       reflink=0 

data     =                       bsize=4096   blocks=282624, imaxpct=25 

         =                       sunit=0      swidth=0 blks 

naming   =version 2              bsize=4096   ascii-ci=0, ftype=1 

log      =internal log           bsize=4096   blocks=2560, version=2 

         =                       sectsz=4096  sunit=1 blks, lazy-count=1 

realtime =none                   extsz=4096   blocks=0, rtextents=0 

[root@localhost ~]# mount -o dax /dev/pmem0 /pmemdisk 

[root@localhost ~]# df -h 

... 

/dev/pmem0           1.1G   34M  1.1G   4% /pmemdisk 

[root@localhost ~]#  

--Checks 

[root@localhost ~]# findmnt /pmemdisk 

TARGET    SOURCE     FSTYPE OPTIONS 

/pmemdisk /dev/pmem0 xfs    rw,relatime,attr2,dax,inode64,logbufs=8,logbsize=32k,noquota 

[root@localhost ~]#  
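
To remount /pmemdisk automatically after a guest reboot, an /etc/fstab entry inside the VM along these lines can be used (a sketch; nofail keeps the guest booting even if the NVDIMM device is absent): 

/dev/pmem0  /pmemdisk  xfs  dax,nofail  0  0 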

 

 Deploying Oracle Directly Mapped Buffer Cache/Persistent Memory. 

After the Oracle 21c binaries are installed and a database is created, the following steps will enable the Oracle DMBC. 

Tested on Oracle 21.7. 


1. As the root user, edit /etc/fuse.conf 

[root@localhost ~]# cat /etc/fuse.conf 

# mount_max = 1000 

user_allow_other 

allow_other 

2. Create a mount point for Oracle 

#mkdir /pmemdisk/orcldata 

#chown oracle:oinstall /pmemdisk/orcldata 

3. Start Oracle in nomount mode 

sqlplus / as sysdba 

startup nomount  

4. Create PMEM filestore 

SQL> CREATE PMEM FILESTORE orcldata MOUNTPOINT '/u01/app/orcldata' BACKINGFILE '/pmemdisk/orcldata/orcldata'; 

PMEM filestore created. 

SQL>  

 

Note: the pmem_filestore parameter is updated automatically when the PMEM filestore is created through "CREATE PMEM FILESTORE ...". 

Also, _db_pmem_direct_map_size will be updated to the size of the PMEM filestore. 

_db_pmem_direct_map_size controls the size used for the DMBC. If you have created a 512MB datafile and set _db_pmem_direct_map_size=256M, Oracle will only use 256M for the DMBC, and the traditional buffer cache will be used to satisfy the remaining I/O requests ("db file sequential read", etc.). There might be situations where this is desirable, for example when you have a large available buffer cache. The buffer cache (backed by DRAM) is still faster than the PMEM cache, but consider paging (DRAM<->NVDIMM), latches and other overheads. 

But if you have terabytes of datafiles created on the PMEM filestore and only a few GB of RAM/buffer cache, then _db_pmem_direct_map_size should remain at its default. In any case, as with any Oracle hidden parameter, it is recommended not to modify it unless suggested by an Oracle SR or you fully understand the implications. 
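
As a quick sanity check of both settings, something like the following can be run as SYSDBA (a sketch; hidden parameters are not visible in v$parameter, hence the x$ views): 

$ sqlplus -s / as sysdba <<'EOF'
-- visible parameter populated by CREATE PMEM FILESTORE
show parameter pmem_filestore
-- hidden parameter, read from the underlying x$ views
select a.ksppinm name, b.ksppstvl value
  from x$ksppi a, x$ksppcv b
 where a.indx = b.indx
   and a.ksppinm = '_db_pmem_direct_map_size';
EOF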

 

5. Check the filesystems for /u01/app/orcldata and the PMEM filestore. 

$ df -h /u01/app/orcldata /pmemdisk 

Filesystem      Size  Used Avail Use% Mounted on 

/dev/fuse       1.1G   33M  1.1G   4% /u01/app/orcldata 

/dev/pmem0      1.1G  1.1G   23M  98% /pmemdisk 

 

SQL> select NAME,AVAILABLE_BYTES,USED_BYTES from v$PMEM_FILESTORE; 

NAME   AVAILABLE_BYTES USED_BYTES 

----------------------------------------------------------------- --------------- ---------- 

ORCLDATA        1054752768   33669120 

SQL> 

 

 

6. Create a tablespace on the PMEM filestore 

SQL> create tablespace pmem_ts datafile '/u01/app/orcldata/pmem_ts_datafile1.dbf' size 900M nologging; 

Tablespace created. 

SQL>  
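
To confirm the new datafile really sits on the PMEM filestore mount point, a simple dictionary query such as the following can be used (a sketch): 

$ sqlplus -s / as sysdba <<'EOF'
col file_name for a60
select file_name, round(bytes/1024/1024) size_mb
  from dba_data_files
 where tablespace_name = 'PMEM_TS';
EOF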

 

 

 

Testing:  

Using SLOB to perform the following comparison tests: 

1. Compare I/O performance between a datafile created on the PMEM filestore and a traditional filesystem 

2. Compare I/O events between the Directly Mapped Buffer Cache (DMBC) and an ordinary DAX-enabled filesystem 

3. Setting _db_pmem_direct_map_size to a non-default value 

  

TEST 1: Compare I/O performance between a datafile created on the PMEM filestore and a traditional filesystem 

1st - Oracle Directly Mapped Buffer Cache implemented using the PMEM filestore 

2nd - Traditional filesystem (without host cache, i.e., <driver name='qemu' type='qcow2' cache='directsync'/>)  

 

 

There is a massive gain in write I/O: 99.7MB/s for PMEMfs compared to 3.5MB/s for the traditional disk. 

Also notice that the 2nd shows no "resmgr:cpu quantum" events, because most of the DB time was spent waiting on the storage subsystem to serve I/O. 

Note: a small buffer cache (less than 100MB) was allocated in all cases. 
 

 
 
 

Test 2: Comparison of I/O events between the Directly Mapped Buffer Cache (DMBC) and a DAX-enabled filesystem 

This shows the benefit of an (Oracle) algorithm tailored to use NVDIMM. 

 

See the comments about the AWR screenshot above in "How Oracle is taking advantage".

"User I/O" events as "db file sequential read" are not seen in TOP events for PMEMfs (1st) because Oracle issues IO through PMEM library calls. new calls and events are introduced such as kcbFlushPMEMBufferCbk and consistent gets from pmem. Will be nice to have if Oracle put "Read IO(MB) in PMEM" as part of Load profile. 

 
 

PMEMfs also shows a significant reduction in Buffer Cache Reads and Direct Reads because most reads are satisfied by PMEM, as highlighted below. This is not yet captured by the AWR "IOStat Function" section but is available in the "Instance Activity Stats" section. 

Instance Activity Stats                       DB/Inst: ORCL/orcl  Snaps: 81-82 
-> Ordered by statistic name 
 
Statistic                                     Total     per Second     per Trans 
-------------------------------- ------------------ -------------- ------------- 
consistent gets from cache                  960,584        3,178.3          14.8 
consistent gets from pmem                6,392,783       21,151.8          98.3 
consistent gets pin                         808,690        2,675.7          12.4 
consistent gets pin (fastpath)              792,582        2,622.4          12.2 
consistent gets pmem direct               1,259,392        4,167.0          19.4 
consistent gets pmem examination            151,905          502.6           2.3 
consistent gets pmem examination            151,903          502.6           2.3 
consistent gets pmem pin                  4,981,486       16,482.3          76.6 
consistent gets pmem pin (fastpa         2,930,435        9,696.0          45.1 
.. 
data blocks consistent reads - u             43,296          143.3           0.7 
db block changes                          8,542,447       28,264.4         131.3 
db block gets                             4,668,705       15,447.4          71.8 
db block gets direct                             25            0.1           0.0 
db block gets from cache                  4,668,680       15,447.3          71.8 
db block gets from cache (fastpa          2,854,690        9,445.3          43.9 
db block gets from pmem                           0            0.0           0.0 

Above is the AWR report's Instance Activity Stats section for PMEMfs. 

PMEM events will not be found in a database without a PMEM filestore, even if you are using an NVDIMM-backed filesystem. 
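
To check whether an instance is reporting these PMEM statistics at all, they can be pulled from v$sysstat, for example (a sketch; statistic names can vary between releases): 

$ sqlplus -s / as sysdba <<'EOF'
col name for a45
select name, value
  from v$sysstat
 where lower(name) like '%pmem%'
 order by name;
EOF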

 

Conclusion: 

As an entry point, this article shows how you can deploy persistent memory without Exadata, test it, and gather some ideas about performance and tuning. Link to Oracle doc

 

Version Info:

qemu-kvm        :  4.2.0
Oracle Database : 21.7 (should work on all 21c) 
Guest OS: Red Hat Enterprise Linux release 8.2 (Ootpa)
Guest Kernel Release: 5.4.17-2011.1.2.el8uek.x86_64
Host Kernel Release : 4.18.0-348.20.1.el8_5.x86_64
Host OS: Red Hat Enterprise Linux release 8.5 (Ootpa)