Disk performance for Oracle servers
Clancy Bufton
• Senior DBA
• TMD Computing
• Tasmania, Australia
• clancy.bufton@dpac.tas.gov.au
Some existing installations
• Single disk
• Sliced into partitions
• UFS
Solaris UFS
• Block-based inode allocation
• Block size 8 kByte
• File buffer cache
• “Read ahead” and “write behind”
Solaris LVM
• Logical Volume Manager
• Mirroring
• Striping
• Concatenation
• RAID 1+0
• Software RAID 5: don’t do it
RAID levels
• RAID 0
– Stripe data across multiple disks
– No redundancy
– Decreases Mean Time To Data Loss (MTTDL); rough formulas after this list
– Loss of 1 disk destroys entire volume
– Higher IO throughput reads and writes
• RAID 1
– Mirrors a disk
– 2 way or 3 way
– Increases MTTDL
– Potentially higher read throughput; write throughput stays the same
• RAID 0+1
– A mirrored stripe
– Loss of a single disk in one mirror destroys the entire mirror
• RAID 1+0
– A striped mirror
– Loss of a single disk only destroys mirroring of the partner disk
• RAID 4
– Stripe with parity disk
– Provides single disk failure protection
– Reads are fast; writes are slow because the single parity disk is a bottleneck
• RAID 5
– Stripe with striped parity
– Provides single disk failure protection; writes improve over RAID 4 because parity is also striped; reads are fast
• RAID 6
– Stripe with double parity, protecting against two disk failures
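A rough back-of-envelope view of the MTTDL claims above (an added sketch, assuming independent disk failures with per-disk mean time to failure MTTF and mean time to repair MTTR):
• MTTDL(RAID 0, n disks) ≈ MTTF / n
• MTTDL(2-way mirror) ≈ MTTF² / (2 x MTTR)
• A stripe is lost on the first disk failure, so MTTDL shrinks as disks are added; a mirror is lost only if the partner disk fails during the repair window, so MTTDL grows dramatically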
Oracle ASM
• Integrated volume manager and database file system
• Eliminates need for Solaris LVM and UFS
• Automatically stripes and mirrors database files
• Continually optimises data file extent placement to eliminate hotspots
• Eases storage administration tasks
Disk IO
• What size is the IO ?
• How many IOs required per second ?
• IO Characteristic ?
– Sequential
– Random
• Concurrency ?
– To what degree ?
Storage hardware
• Disks
– How many
– Rotational speed ?
– Average seek time ?
– Internal transfer rate ?
• Disk controllers
– How many ?
– How many channels ?
– Channel bandwidth ?
– Protocol ?
• RAID controllers
– What RAID levels ?
– How many host ports ?
Solaris IO options
• Asynchronous IO
– Don’t block the process to wait for IO completion
– Issue all IO requests immediately
– Available since Solaris 6
• Direct IO
– Bypass the buffer cache
– Disables read ahead and write behind
– Simulates raw device performance (minimal sketch after this list)
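A minimal sketch (not from the original slides) of how the two options are used from C on Solaris, with the same directio(3C) and aiowrite(3AIO)/aiowait(3AIO) calls the benchmark below relies on; the file path is a made-up placeholder and the program must be linked with -laio:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/asynch.h>   /* aiowrite(), aiowait(), aio_result_t */

int main(void)
{
    char buf[8192] = { 0 };              /* one 8 KB IO */
    aio_result_t res;
    /* placeholder path, for illustration only */
    int fd = open("/testfs/scratch", O_RDWR | O_CREAT | O_LARGEFILE, 0600);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Direct IO: bypass the UFS buffer cache (and read ahead / write behind) */
    directio(fd, DIRECTIO_ON);

    /* Asynchronous IO: queue the write and return without blocking ... */
    if (aiowrite(fd, buf, sizeof(buf), 0, SEEK_SET, &res) != 0) {
        perror("aiowrite");
        return 1;
    }
    /* ... the process could issue more IOs or do other work here,
     * then reap the completion */
    (void) aiowait(NULL);
    printf("bytes written: %ld\n", (long)res.aio_return);

    close(fd);
    return 0;
}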
Disk IO tests
• Sequential
– Synch, buffer cache, 8KB IO size
– Asynch, buffer cache, 8KB IO size
– Asynch, direct, 8KB IO size
– Repeat but vary IO size (16KB, 128KB, etc)
– Test both reads and writes
• Random
– Synch, buffer cache, 8KB IO size
– Asynch, buffer cache, 8KB IO size
– Asynch, direct, 8KB IO size
– Repeat but vary IO size (16KB, 128KB, etc)
– Test both reads and writes
Test cases
• Single disk
• 3 wide stripe 1M interlace
• Reads and writes
Test C code – Asynch Random Writes
/*
* wrbench
*
* SYNOPSIS A simple write-random benchmark application.
* USAGE wrbench [-d device] [-s device-size] [-b io-size]
* EXAMPLE wrbench -d /dev/rdsk/c1t6d0s0 -b 512 -s 9000000000
* (For a Sun 9GB disk, 512 byte IO size)
*
* Written by G. D. Musumeci
* Associates. This source code is released under the terms of
* the BSD License.
* adapted for asynch and direct io by Clancy J Bufton
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/asynch.h>
int
main(int argc, char **argv)
{
extern char *optarg;
int c, r, FD;
char *device = NULL, *buf;
uint64_t deviceSize = 0;
int ioSize = 512, iter = 512;
hrtime_t st, ft;
long double t;
off64_t s, e;
aio_result_t aiores;
aio_result_t *restwo;
int directIO = 0;
/* Parse command-line options */
while ((c = getopt(argc, argv, "qd:s:b:i:")) != -1) {
switch(c) {
case 'd':
device = (char *)strdup(optarg);
break;
case 's':
deviceSize = atoll(optarg);
break;
case 'b':
ioSize = atoi(optarg);
break;
case 'i':
iter = atoi(optarg);
break;
case 'q':
directIO = 1;
break;
case '?':
if (isprint(optopt)) {
fprintf(stderr, "Unknown option `-%c'.\n", optopt);
} else {
fprintf(stderr, "Unknown option character `\\x%x'.\n",
optopt);
}
return 1;
}
}
/* Quick sanity check */
if (device == NULL) {
fprintf(stderr, "wrbench: must specify a device to test with -d\n");
exit(1);
}
fprintf(stderr, "wrbench: Testing %s (%lld bytes) with %d-byte writes\n", device, deviceSize, ioSize);
/* Allocate space for *buf */
buf = (char *)malloc(ioSize * sizeof(char));
/* Seed the random number generator */
st = gethrtime();
srand48((long) st);
/* Open the device */
if ((FD = open(device, O_RDWR | O_LARGEFILE)) < 0) {
perror("open");
exit(1);
}
/* Optionally bypass the file system buffer cache */
if (directIO) {
directio(FD, DIRECTIO_ON);
}
/* Start timing */
st = gethrtime();
for (c = 0; c < iter; c++) {
/* Seek to a random offset within the device */
s = (off64_t)(((deviceSize - ioSize) * drand48()));
e = lseek64(FD, s, SEEK_SET);
/* Issue the asynchronous write, then wait for its completion */
restwo = (aio_result_t *)malloc(sizeof(aio_result_t));
r = aiowrite(FD, buf, ioSize, 0, SEEK_CUR, restwo);
if ((restwo = aiowait(NULL))->aio_return == (ioSize * sizeof(char))) {
free(restwo);
}
else
{
perror("aio");
exit(0);
}
}
/* Stop timing */
ft = gethrtime();
directio(FD,DIRECTIO_OFF);
close(FD);
t = ((long double)ft - (long double)st) / 1000000000;
/* Output the result */
fprintf(stderr, "wrbench: %.0Lf %d-byte IOs/sec\n",
(iter / t), ioSize);
}
Test C code
• Remove lseek64 call for sequential test
• Use aiowrite, aioread for async
• directio(FD,DIRECTIO_ON) controls buffer cache on / off
• Change write to read for the read test (sketch below)
• Original code adapted for direct io and asynch test
Compile with, e.g., gcc seqread.c -laio
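As an illustration of those modifications (a sketch only, not the exact code behind the published results), the random asynchronous read variant of the inner loop can be factored like this:

#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/asynch.h>

/* One iteration of the random asynchronous read loop, mirroring the
 * write loop above; for the sequential variant, drop the lseek64() call. */
static int random_read_once(int fd, uint64_t deviceSize, int ioSize, char *buf)
{
    aio_result_t *res = malloc(sizeof (aio_result_t));
    off64_t s = (off64_t)((deviceSize - ioSize) * drand48());

    lseek64(fd, s, SEEK_SET);
    if (aioread(fd, buf, ioSize, 0, SEEK_CUR, res) != 0) {
        free(res);
        return -1;
    }
    res = aiowait(NULL);                  /* block until the read completes */
    if (res->aio_return != ioSize) {
        free(res);
        return -1;
    }
    free(res);
    return 0;
}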
Disk Test Results – Direct IO
• Random synchronous writes
– 3 wide 1M stripe IO/sec
• 8k = 39, 128k = 25
– Single disk IO/sec
• 8k = 35, 128k = 23
• Random asynchronous writes
– 3 wide 1M stripe IO/sec
• 8k = 233, 128k = 109
– Single disk IO/sec
• 8k = 39, 128k = 25
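Converting these IO rates to raw throughput (simple arithmetic on the figures above) shows why larger IOs matter for scan-type workloads:
• Single disk, synchronous: 35 x 8k ≈ 0.27 MB/sec versus 23 x 128k ≈ 2.9 MB/sec
• 3 wide stripe, asynchronous: 233 x 8k ≈ 1.8 MB/sec versus 109 x 128k ≈ 13.6 MB/sec
• Fewer, larger IOs move far more data per second even though the IO/sec count falls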
Disk test results – async direct
Oracle IO options
• Initialisation parameter
• FILESYSTEMIO_OPTIONS =
– { none | setall | directIO | asynch }
– Requires an instance restart after changing this parameter
• Enable direct and asynchronous with SETALL
– In some situations directio isn’t enabled
– Workaround: force directio on the Solaris mount point
– mount -o forcedirectio /dev/md/dsk/d1 /testfs
• Reclaim the freed memory and allocate it to the Oracle buffer cache
Determining the Oracle IO size ?
• The Oracle block size: 8192 bytes
• DB_FILE_MULTIBLOCK_READ_COUNT
– Controls the maximum read IO size Oracle will perform
– Example: 16 x 8k = 128K
– Used for full table scans
– Test results show that fewer, larger IOs sustain higher throughput
• Oracle IO size
– Datafiles: primarily 8k, but up to 8k * DB_FILE_MULTIBLOCK_READ_COUNT for full table scans
– Redo logs must be forced direct
• On buffered UFS: 8k max IO size, slow
• On direct UFS: 60k IO size, much faster
Oracle IO tests
• create table test as select * from dba_objects
• Check redo log file throughput
• Check data file read and write throughput
• Insert into test select * from test where rownum < …
• First alter system checkpoint;
• Next run the insert
• Then rollback the insert so that all buffers are loaded
• Rerun the insert and note the time, commit the transaction
• Alter system checkpoint to flush the freshly dirtied buffers and note the time
Oracle test results
• Test case 1 result is due to the throughput attainable on the redo logs
• Test case 2 result is due to the throughput attainable on the data files
SAN Considerations
• How to estimate the IO size ?
– Empirical evidence required
– Supports selection of an appropriate stripe depth
– Oracle recommendation for SAME is 1Mbyte stripe depth
• Gauge the number of IOs required per second at peak times
• Because the disks are connected to multiple hosts IO requests will vary
• The RAID controller slices up IOs
– Striping hides the IO size from the disk because disks only see IO size equal to the stripe depth
– How many IOs will be split ? (see the sketch below)
• How many host ports available ?
• Bandwidth of host ports ?
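A small illustration of the split question (a sketch with made-up numbers, not from the slides): an IO is split whenever it crosses a stripe-unit boundary, so the number of member disks it touches follows from the offset, the IO size and the stripe depth.

#include <stdio.h>

/* How many stripe units (and therefore member disks) does one IO touch?
 * An IO is split whenever it crosses a stripe-unit boundary. */
static long stripe_units_touched(long long offset, long ioSize, long stripeDepth)
{
    long long first = offset / stripeDepth;                 /* first unit hit */
    long long last = (offset + ioSize - 1) / stripeDepth;   /* last unit hit  */
    return (long)(last - first + 1);
}

int main(void)
{
    long depth = 1024 * 1024;   /* 1 Mbyte stripe depth, per the SAME advice */

    /* A 128k multiblock read that sits inside one stripe unit is not split */
    printf("%ld\n", stripe_units_touched(0, 128 * 1024, depth));                  /* 1 */

    /* The same read straddling a stripe boundary is split across two disks */
    printf("%ld\n", stripe_units_touched(depth - 64 * 1024, 128 * 1024, depth));  /* 2 */
    return 0;
}

With a 1 Mbyte stripe depth a 128k Oracle multiblock read usually stays on one disk; with a small stripe depth every large IO is split across several disks.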