Wednesday 16 January 2008

Oracle disk IO subsystem performance

Oracle server disk subsystem
Disk performance for Oracle servers
Clancy Bufton
• Senior DBA
• TMD Computing
• Tasmania, Australia
clancy.bufton@dpac.tas.gov.au


Some existing installations
• Single disk
• Sliced into partitions
• UFS


Solaris UFS
• Block-based inode allocation
• 8 KB block size (checked programmatically in the sketch after this list)
• File buffer cache
• “read ahead” and “write behind”
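As a quick check of the 8 KB figure above, statvfs(2) reports a file system's preferred block size and fragment size. A minimal sketch; the mount point path is an illustrative placeholder, not from the original setup:

#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    /* Query the UFS block and fragment size for a mount point.
     * "/u01/oradata" is a hypothetical Oracle data file mount. */
    struct statvfs vfs;
    if (statvfs("/u01/oradata", &vfs) != 0) {
        perror("statvfs");
        return 1;
    }
    printf("preferred block size: %lu bytes\n", (unsigned long)vfs.f_bsize);
    printf("fragment size       : %lu bytes\n", (unsigned long)vfs.f_frsize);
    return 0;
}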

Solaris LVM
• Logical Volume Manager
• Mirroring
• Striping
• Concatenation
• RAID 1+0
• Software RAID 5: don’t do it (parity is computed on the host CPU, so writes are slow)


RAID levels
• RAID 0
– Stripe data across multiple disks
– No redundancy
– Decreases Mean Time To Data Loss (MTTDL); see the worked comparison after this list
– Loss of any one disk destroys the entire volume
– Higher IO throughput for both reads and writes
• RAID 1
– Mirrors a disk
– 2 way or 3 way
– Increases MTTDL
– Potentially higher read throughput; write throughput stays the same
• RAID 0+1
– A mirrored stripe
– Loss of a single disk in one mirror destroys the entire mirror
• RAID 1+0
– A striped mirror
– Loss of a single disk only destroys mirroring of the partner disk
• RAID 4
– Stripe with parity disk
– Provides single disk failure protection
– Reads are fast; writes are slow because the single parity drive is a bottleneck
• RAID 5
– Stripe with striped parity
– Provides single disk failure protection; writes improve over RAID 4 because parity is also striped; reads are fast
• RAID 6
– Striping with protection against two simultaneous disk failures (dual parity)
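The MTTDL comparisons above can be made concrete with a back-of-envelope calculation. The sketch below uses the standard approximations for independent disk failures; the 500,000 hour per-disk MTTF, 24 hour rebuild time and 6 disk volume are illustrative assumptions, not vendor figures.

#include <stdio.h>

int main(void)
{
    /* Back-of-envelope MTTDL for the RAID levels above, assuming
     * independent failures.  MTTF, MTTR and disk count are illustrative only. */
    double mttf = 500000.0;   /* hours per disk (assumed) */
    double mttr = 24.0;       /* hours to replace and rebuild (assumed) */
    int    n    = 6;          /* disks in the volume (assumed) */

    double single = mttf;                                 /* one disk */
    double raid0  = mttf / n;                             /* any failure loses the volume */
    double raid10 = (mttf * mttf) / (n * mttr);           /* second failure must hit the partner disk */
    double raid5  = (mttf * mttf) / (n * (n - 1) * mttr); /* any second failure during rebuild */

    printf("single disk : %14.0f hours\n", single);
    printf("RAID 0      : %14.0f hours\n", raid0);
    printf("RAID 1+0    : %14.0f hours\n", raid10);
    printf("RAID 5      : %14.0f hours\n", raid5);
    return 0;
}

Even with these rough numbers, striping alone (RAID 0) divides the MTTDL by the number of disks, while adding mirroring multiplies it by several orders of magnitude.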

Oracle ASM
• Integrated volume manager and database file system
• Eliminates need for Solaris LVM and UFS
• Automatically stripes and mirrors database files
• Continually optimises data file extent placement to eliminate hotspots
• Eases storage administration tasks


Disk IO
• What size is the IO ?
• How many IOs required per second ?
• IO Characteristic ?
– Sequential
– Random
• Concurrency ?
– To what degree ?

Storage hardware
• Disks
– How many ?
– Rotational speed ?
– Average seek time ?
– Internal transfer rate ?
– (these figures drive the rough IOPS estimate sketched after this list)
• Disk controllers
– How many ?
– How many channels ?
– Channel bandwidth ?
– Protocol ?
• RAID controllers
– What RAID levels ?
– How many host ports ?
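These drive parameters set a rough ceiling on random IOs per second for a single spindle: the service time of one random IO is approximately the average seek plus half a rotation plus the transfer time. A quick sketch; the 10,000 rpm, 4.5 ms seek and 60 MB/s transfer figures are illustrative assumptions, not measurements.

#include <stdio.h>

int main(void)
{
    /* Rough per-disk random IOPS estimate from the drive parameters above.
     * The drive figures below are illustrative assumptions. */
    double rpm       = 10000.0;  /* rotational speed (assumed) */
    double seek_ms   = 4.5;      /* average seek time in ms (assumed) */
    double xfer_mb_s = 60.0;     /* internal transfer rate in MB/s (assumed) */
    double io_kb     = 8.0;      /* Oracle block-sized IO */
    int    ndisks    = 3;        /* disks in the stripe */

    double rot_ms  = 0.5 * 60000.0 / rpm;                   /* half-rotation latency */
    double xfer_ms = io_kb / 1024.0 / xfer_mb_s * 1000.0;   /* data transfer time */
    double svc_ms  = seek_ms + rot_ms + xfer_ms;

    printf("service time  : %.2f ms per random IO\n", svc_ms);
    printf("per-disk IOPS : %.0f\n", 1000.0 / svc_ms);
    printf("%d-disk stripe : %.0f IOPS (ideal, fully concurrent)\n",
           ndisks, ndisks * 1000.0 / svc_ms);
    return 0;
}

Sequential IO is a different story, since seek and rotational latency are largely amortised away.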

Solaris IO options
• Asynchronous IO
– Don’t block the process to wait for IO completion
– Issue all IO requests immediately
– Available since Solaris 6
• Direct IO
– Bypass the buffer cache
– Disables read ahead and write behind
– Approximates raw device performance (a minimal sketch of direct plus asynchronous IO follows this list)
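A minimal sketch of the two options together, using the Solaris directio(3C) and aiowrite(3AIO)/aiowait(3AIO) calls that the benchmark below is built on. The file name is an illustrative placeholder; link with -laio.

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/asynch.h>

int main(void)
{
    char buf[8192];                 /* one 8 KB, Oracle-block-sized IO */
    aio_result_t res;
    int fd = open("/testfs/scratch.dat", O_RDWR | O_CREAT | O_LARGEFILE, 0600);
    if (fd < 0) { perror("open"); return 1; }

    memset(buf, 0, sizeof(buf));
    directio(fd, DIRECTIO_ON);      /* bypass the UFS buffer cache */

    /* Queue the write at offset 0 and return immediately; the kernel
     * completes it in the background. */
    if (aiowrite(fd, buf, sizeof(buf), 0, SEEK_SET, &res) < 0) {
        perror("aiowrite");
        return 1;
    }
    aiowait(NULL);                  /* block until the request completes */
    printf("wrote %ld bytes\n", (long)res.aio_return);

    directio(fd, DIRECTIO_OFF);
    close(fd);
    return 0;
}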


Disk IO tests
• Sequential
– Synch, buffer cache, 8KB IO size
– Asynch, buffer cache, 8KB IO size
– Asynch, direct, 8KB IO size
– Repeat but vary IO size (16KB, 128KB, etc)
– Test both reads and writes
• Random
– Synch, buffer cache, 8KB IO size
– Asynch, buffer cache, 8KB IO size
– Asynch, direct, 8KB IO size
– Repeat but vary IO size (16KB, 128KB, etc)
– Test both reads and writes

Test cases
• Single disk
• 3 wide stripe 1M interlace
• Reads and writes

Test C code – Asynch Random Writes

/*
* wrbench
*
* SYNOPSIS A simple write-random benchmark application.
* USAGE wrbench [-d device] [-s device-size] [-b io-size]
* EXAMPLE wrbench -d /dev/rdsk/c1t6d0s0 -b 512 -s 9000000000
* (For a Sun 9GB disk, 512 byte IO size)
*
* Written by G. D. Musumeci for O'Reilly &
* Associates. This source code is released under the terms of
* the BSD License.
* Adapted for asynch and direct IO by Clancy J Bufton
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/asynch.h>
int
main(int argc, char **argv)
{
extern char *optarg;
int c, r, FD;
char *device = NULL, *buf;
uint64_t deviceSize = 0;
int ioSize = 512, iter = 512;
hrtime_t st, ft;
long double t;
off64_t s, e;
aio_result_t aiores;
aio_result_t *restwo;
int directIO = 0;
/* Parse command-line options */
while ((c = getopt(argc, argv, "qd:s:b:i:")) != -1) {
switch(c) {
case 'd':
device = (char *)strdup(optarg);
break;
case 's':
deviceSize = atoll(optarg);
break;
case 'b':
ioSize = atoi(optarg);
break;
case 'i':
iter = atoi(optarg);
break;
case 'q':
directIO = 1;
break;
case '?':
if (isprint(optopt)) {
fprintf(stderr, "Unknown option `-%c'.\n", getopt);
} else {
fprintf(stderr, "Unknown option character `\\s%xc'.\n",
optopt);
}
return 1;
}
}
/* Quick sanity check */
if (device == NULL) {
fprintf(stderr, "wrbench: must specify a device to test with -d\n");
exit(1);
}
fprintf(stderr, "wrbench: Testing %s (%lld bytes) with %d-byte writes\n", device, deviceSize, ioSize);
/* Allocate space for *buf */
buf = (char *)malloc(ioSize * sizeof(char));
/* Seed the random number generator */
st = gethrtime();
srand48((long) st);
/* Open the device */
if ((FD = open(device, O_RDWR | O_LARGEFILE)) < 0) {
perror("open");
exit(1);
}
/* Bypass the file system buffer cache if -q was given */
if (directIO) {
directio(FD, DIRECTIO_ON);
}
/* Start timing and issue the asynchronous writes */
st = gethrtime();
for (c = 0; c < iter; c++) {
/* Pick a random offset within the device */
s = (((deviceSize - ioSize) * drand48()));
e = lseek64(FD, s, SEEK_SET);
restwo = (aio_result_t *)malloc(sizeof(aio_result_t));
r = aiowrite(FD, buf, ioSize * sizeof(char), s, SEEK_SET, restwo);
}
/* Reap all completions; each must return the full IO size */
for (c = 0; c < iter; c++) {
if ((restwo = aiowait(NULL))->aio_return == (ioSize * sizeof(char))) {
free(restwo);
} else {
perror("aio");
exit(0);
}
}
/* Stop timing */
ft = gethrtime();
directio(FD,DIRECTIO_OFF);
close(FD);
t = ((long double)ft - (long double)st) / 1000000000;
/* Output the result */
fprintf(stderr, "wrbench: %0.Lf %d-byte IOs/sec\n",
(iter / t), ioSize);
}





Test C code
• Remove the lseek64 call for the sequential test (a sequential read fragment follows this list)
• Use aiowrite, aioread for async
• directio(FD,DIRECTIO_ON) controls buffer cache on / off
• Change write to read for read test
• Original code adapted for direct io and asynch test
Compile with, e.g., gcc -o seqread seqread.c -laio
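For the sequential asynchronous read variant described above, the inner loop of the benchmark reduces to something like the fragment below; FD, buf, ioSize, iter, c, r and restwo are the declarations already present in the benchmark, and a running offset replaces the random lseek64.

/* Sequential asynchronous reads: the offset simply advances by one IO
 * size per request instead of being chosen at random. */
off64_t offset = 0;
for (c = 0; c < iter; c++) {
    restwo = (aio_result_t *)malloc(sizeof(aio_result_t));
    r = aioread(FD, buf, ioSize * sizeof(char), offset, SEEK_SET, restwo);
    offset += ioSize;
}
/* Completions are then reaped with aiowait() exactly as in the write benchmark. */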



Disk Test Results – Direct IO
• Random synchronous writes
– 3 wide 1M stripe IO/sec
• 8k = 39, 128k = 25
– Single disk IO/sec
• 8k = 35, 128k = 23
• Random asynchronous writes (converted to MB/s in the sketch below)
– 3 wide 1M stripe IO/sec
• 8k = 233, 128k = 109
– Single disk IO/sec
• 8k = 39, 128k = 25
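To put these figures in perspective, throughput in MB/s is just IOs/sec multiplied by the IO size. A quick calculation over the asynchronous direct IO numbers above:

#include <stdio.h>

int main(void)
{
    /* Measured asynchronous direct IO write rates from the tests above. */
    struct { const char *label; double iops; double io_kb; } m[] = {
        { "3-wide stripe,   8 KB", 233.0,   8.0 },
        { "3-wide stripe, 128 KB", 109.0, 128.0 },
        { "single disk,     8 KB",  39.0,   8.0 },
        { "single disk,   128 KB",  25.0, 128.0 },
    };
    for (int i = 0; i < 4; i++)
        printf("%-24s %6.1f IO/s = %6.2f MB/s\n",
               m[i].label, m[i].iops, m[i].iops * m[i].io_kb / 1024.0);
    return 0;
}

The larger IOs deliver far more MB/s even though they complete fewer IOs per second, which is why the multiblock read size matters for full table scans.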


Disk test results – async direct







Oracle IO options
• Initialisation parameter
• FILESYSTEMIO_OPTIONS =
– { NONE | SETALL | DIRECTIO | ASYNCH }
– Restart the instance after setting this parameter (it is not dynamically modifiable)
• Enable direct and asynchronous with SETALL
– In some situations direct IO isn’t enabled
– Workaround: force direct IO on the Solaris mount point
– mount -o forcedirectio /dev/md/dsk/d1 /testfs
• Reclaim the freed memory and allocate it to the Oracle buffer cache

Determining the Oracle IO size ?
• The Oracle block size: 8192 bytes (8 KB)
• DB_FILE_MULTIBLOCK_READ_COUNT
– Controls the maximum read IO size Oracle will perform
– Example: 16 x 8k = 128K
– Used for full table scans
– Test results show that fewer, larger IOs sustain higher throughput
• Oracle IO size
– Datafiles: primarily 8k, but up to 8k * DB_FILE_MULTIBLOCK_READ_COUNT for full table scans
– Redo logs must be forced direct
• On buffered UFS: 8k max IO size, slow
• On direct UFS: 60k IO size, much faster

Oracle IO tests
• create table test as select * from dba_objects
• Check redo log file throughput
• Check data file read and write throughput
• Insert into test select * from test where rownum < …
• First alter system checkpoint;

• Next run the insert

• Then rollback the insert so that all buffers are loaded


• Rerun the insert and note the time, commit the transaction


• Alter system checkpoint to flush the freshly dirtied buffers and note the time

Oracle test results



• Test case 1: result determined by the throughput attainable on the redo logs
• Test case 2: result determined by the throughput attainable on the data files





SAN Considerations
• How to estimate the IO size ?
– Empirical evidence required
– Supports selection of an appropriate stripe depth
– Oracle recommendation for SAME is a 1 MByte stripe depth
• Gauge the number of IOs required per second at peak times
• Because the disks are connected to multiple hosts, the IO load will vary
• The RAID controller slices up IOs
– Striping hides the IO size from the disk, because each disk only sees IOs up to the stripe depth
– How many IOs will be split ? (see the sketch below)
• How many host ports are available ?
• Bandwidth of host ports ?
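The "how many IOs will be split" question has a simple first-order answer: an IO of S bytes starting at a random offset within a chunk of stripe depth D crosses a chunk boundary with probability of roughly S / D (for S <= D). A quick calculation for a 1 MB stripe depth; real Oracle IOs are usually block-aligned, which lowers the split rate further.

#include <stdio.h>

int main(void)
{
    /* First-order estimate of split IOs on a RAID stripe: an IO of size S
     * at a random offset within a chunk of depth D crosses a boundary with
     * probability ~ S / D (capped at 1).  Alignment reduces this further. */
    double depth_kb   = 1024.0;                /* 1 MB stripe depth (SAME) */
    double sizes_kb[] = { 8.0, 128.0, 1024.0 };

    for (int i = 0; i < 3; i++) {
        double s = sizes_kb[i];
        double p = s >= depth_kb ? 1.0 : s / depth_kb;
        printf("%5.0f KB IO on %4.0f KB chunks: ~%.1f%% of IOs split\n",
               s, depth_kb, 100.0 * p);
    }
    return 0;
}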

1 comment:

Clancy said...

Direct io is an operating system enhancement for the filesystem.

The purpose of direct io is to provide as close to raw disk performance as possible for files on a filesystem.

It bypasses the Unix Buffer Cache (UBC) and writes data directly to disk, or reads directly into application memory.

On Solaris, direct IO has the advantage of avoiding file locking, which allows concurrent access to a single file: for example, while DBWR is flushing buffers the datafile can still be read by a session's server process.

Direct IO significantly increases throughput to both redo and data files.

In all situations a database will benefit from having redo logs on a direct IO mount point, because of the sequential write nature of redo logs.

In some situations read performance could be lower.

This is likely when a datafile contains tables that are repeatedly full-scanned and is small enough to fit in the UBC, because buffered IO would have served those reads from memory instead of disk.

This performance loss can be regained by allocating those tables to the KEEP buffer pool, so they are held in the oracle buffer cache.

Memory utilisation is also improved with direct IO, because data is no longer cached twice, once in the Oracle buffer cache and once in the UBC.




http://docs.sun.com/app/docs/doc/806-7502/6jgce01te?a=view

http://www.ixora.com.au/notes/direct_io.htm

http://www.ixora.com.au/tips/avoid_buffered_io.htm

http://www.ixora.com.au/notes/inode_locks.htm

http://www.ixora.com.au/notes/paged_buffer_cache.htm

http://blogs.sun.com/glennf/entry/where_do_you_cache_oracle


