Optimizing Oracle RAC

时间：2022-03-13 23:10

Oracle Real Application Clusters (RAC) databases form an increasing proportion of Oracle database systems. RAC was introduced in 2001 as part of Oracle 9i and was based on Oracle’s earlier Oracle Parallel Server architecture. RAC is almost unique as a mature implementation of a shared disk database clustering system: it is the basis for Oracle’s Exadata database machine, and allows for economical scalability and high availability.

All the usual principles of database performance tuning apply to RAC. However, the RAC architecture provides some additional challenges and opportunities.

The key principles in RAC performance management are:

Maintaining high speed communications between the instances in the cluster.
Eliminating unnecessary inter-instance traffic.
Balancing workload between the instances in the cluster.

RAC architecture

RAC is a shared disk clustered databases: every instance in the cluster has equal access to the database’s data on disk. This is in contrast to the shared nothing architecture employed by other RDBMS clusters. In a shared nothing architecture, each instance is responsible for a certain subset of data. Whenever a session needs that data, then the appropriate instance must be involved in serving up the data.

Shared nothing database clusters have their advantages, but transparent and rapid scalability is not one of them. The shared nothing architecture requires that data be distributed across the instances in the cluster. When a new instance is added to the cluster, data needs to be redistributed across the cluster in order to maintain a reasonable balance of load. In contrast, a new instance can be added to a RAC cluster without any data rebalancing being required: the new instance has immediate and independent access to all the data in the database.

It is this ability to rapidly add or remove instances from RAC clusters that justifies the “G for Grid” in the Oracle 10g and 11g naming convention: RAC has the ability to rapidly scale database clusters by adding or removing instances: that capability is required (but perhaps not sufficient) to create a truly grid-enabled database.

The main challenge in the shared disk architecture is to establish a global memory cache across all the instances in the cluster: otherwise the clustered database becomes IO bound. Oracle establishes this shared cache via a high-speed private network referred to as the cluster interconnect.

All the instances in a RAC cluster share access to datafiles on shared disk, though each have private redo logs and undo segments. Each instance has its own SGA and background processes and each session that connects to the cluster database connects to a specific instance in the cluster.

Figure 1 RAC architecture

Global Cache Requests

Understanding how instances in the cluster communicate is critical to understanding RAC performance. As with single-instance Oracle, we want to avoid disk IO whenever possible – primarily by keeping frequently-accessed data in memory. In a RAC configuration, the data we want might be in the memory of one of the other instances. Therefore, RAC will use the interconnect to request the required data from another instance that has it in memory, rather than by reading it from disk. Each request across the interconnect is referred to as a Global Cache (GC) request.

To co-ordinate these inter-instance block transfers, Oracle assigns each block to a “master” instance. This instance is essentially responsible for keeping track of which instance has last accessed a particular block of data.

Whenever an Oracle instance wants a block of data that is not in its buffer cache, it will ask the master instance for the block. If the master instance has the data concerned, it will send it back across the interconnect: this is recorded as a “2-way” wait and is referred to as a 2-way get.

If the master instance does not have the block in memory, but has a record of another instance accessing the block, then it will forward the block request to this third instance. The third instance will then return the block to the requesting instance: this is recorded as a “3-way” wait and is referred to as a 3-way get.

If no instance has the block in memory, then the master will advise the requesting instance to retrieve the block from disk: this is recorded as a “grant”.

Regardless of which instance wants the block, which instance has the block, and which instance is mastering the block, the number of instances involved in the transfer will never be more than three. This means that the performance penalty as additional instances are added is minimized. However, as we increase the number of instances the ratio of 3-way waits to 2-way waits will increase and some reduction in Global Cache performance should be expected.

Figure 2 illustrates the sequence of events in 2-way gets, 3-way gets and grants.

Figure 2 two-way gets, three-way gets and grants

The key background process in these scenarios is the LMS (Global Cache Service - previously known as the Lock Management Service; hence the abbreviation) process. One or more of these are initiated at startup, depending on the parameter gcs_server_processes.

Block requests can be either be made for the “current” copy of the block or for a “consistent read” copy. Consistent read blocks are required when performing query processing so that all blocks are consistent as at the start of the query or a read only transaction. Most query blocks will be consistent read blocks. Current blocks are more often associated with DML operations.

RAC tuning principles

The RAC architecture outlined in the previous section leads directly to the general principles of RAC performance. RAC will perform well, and scale well, if the following are true:

The time taken to request a block across the interconnect (Global Cache requests) is much lower – say ten times less – than the time to retrieve a block from the disk. Global Cache requests are intended to avoid the necessity of a disk read, and sometimes the disk read must occur even after the Global Cache request. If the Global Cache request time is anywhere near the time it takes to read from disk, then the approach back-fires. Luckily, properly optimized Global Cache requests are quick – typically 10 times less than disk read time.
The cluster is well balanced, or at least there are no overloaded instances in the cluster. Since so many RAC operations involve two or three instances, an overloaded instance might cause problems for its neighbors as well as itself. Indeed, an overloaded CPU on a remote instance is one of the most common causes for long global cache wait times on an otherwise idle local instance.
The overhead incurred through cluster activities is a small proportion of the total database time. We want our RAC database to be a database first, and a cluster second. If the proportion of time spent performing Global Cache activities is high in proportion to other activities, then we may need to look at ways of reducing the Global Cache traffic.

Three key principles of RAC performance are:

Global Cache lookups should be much quicker than disk reads
Instances in the cluster should have a well balanced workload
Time spent performing cluster related activities should not dominate active database time.

Single instance contention and RAC

Oracle makes a point of emphasizing that applications that don’t scale well on single instance Oracle will probably not scale well on RAC. While this is generally true, the switch from single instance to RAC tends to magnify some issues while alleviating others.

Performance issues that relate to contention for specific “hot” data blocks tend to be magnified in RAC, since these contentions now also take on a cross-instance overhead. For instance, buffer busy waits can occur for a buffer which is on another instance; because of the interconnect overhead, the average time spent waiting for a buffer busy wait to complete may increase. Specific types of contention thatincrease in a RAC database are:

Buffer busy waits. The sort of operations that cause buffer busy in single-instance will also cause buffer busy in RAC, but the wait will be amplified by the need to transfer the block across the cluster once it is free.
Cache buffer chains latch contention in a single instance database will probably reduce in RAC, since latches are not acquired across instances. However, for every relatively short latch free wait, you will probably experience a much longer global cache wait when transferring the block concerned across instances.
Sequence number generation. If there are SQ enqueue waits then these will likely spread and magnify across instances in a RAC environment.

However, it’s also true that some single-instance contention issues can be reduced under RAC. RAC divides up the SGA and sessions across each instance in the cluster, effectively allowing some operations to experience an increase in concurrency. In particular, shared pool related latches and mutexes waits might reduce, since the activity will be distributed across the multiple shared pools in the cluster.

Contention for data and index blocks in a single instance database will probably magnify in a RAC environment. However, other contention points - library cache mutexes for instance - might be reduced.

Cluster overhead

Asides from anything else, we want to make sure that a RAC cluster is able to perform database activities without being impeded by cluster related overheads. In a healthy cluster, the time spent in cluster related activities is mainly determined by the average time to make a Global Cache request (Global Cache latency) multiplied by the number of Global Cache requests which must be made.

ClusterTime = AvgGCLatency x GCRequests

It therefore follows that reducing cluster overhead is mainly a process of minimizing the Global Cache latency and eliminating any unnecessary Global Cache requests. The importance of those optimizations will depend upon the relative time spent in cluster related activities.

We can see the overall contribution of cluster related waits in comparison to other high level time categories in the following query:

SQL> SELECT      wait_class time_cat ,ROUND ( (time_secs), 2) time_secs,
2           ROUND ( (time_secs) * 100 / SUM (time_secs) OVER (), 2) pct
3      FROM (SELECT wait_class wait_class,
4                   sum(time_waited_micro) / 1000000 time_secs
5              FROM gv$system_event
6             WHERE    wait_class <> ‘Idle‘
7                   AND time_waited > 0
8             GROUP BY wait_class
9            UNION
10            SELECT ‘CPU‘,
11                   ROUND ((SUM(VALUE) / 1000000), 2) time_secs
12              FROM gv$sys_time_model
13             WHERE stat_name IN (‘background cpu time‘, ‘DB CPU‘))
14 ORDER BY time_secs DESC;

Time category         TIME_SECS    PCT
-------------------- ---------- ------
CPU                    21554.33 43.45
Cluster                 7838.82 15.80
Other                   6322.23 12.75
Application             5077.09 10.24
System I/O              3387.06   6.83
User I/O                3302.49   6.66
Commit                     1557   3.14
Concurrency               371.5    .75
Network                  142.06    .29
Configuration             49.59    .10

As a rule of thumb, we might expect that cluster-related waits comprise less than 10% of total database time. Waits above 20% certainly warrant investigation.

Cluster waits times greater than 10-20% of total database time probably warrant investigation.

Although cluster waits will usually be comprised mainly of straightforward Global Cache request waits, it’s not uncommon for more “sinister” Global Cache waits to emerge: lost blocks, congestion, Global Cache buffer busy waits. Drilling down into the low level wait events will often reveal these conditions. The following query breaks out the cluster wait times:

SQL> WITH system_event AS
2     (SELECT CASE
3               WHEN wait_class = ‘Cluster‘ THEN event
4               ELSE wait_class
5             END wait_type, e.*
6       FROM gv$system_event e)
7 SELECT wait_type, ROUND(total_waits/1000,2) waits_1000 ,
8         ROUND(time_waited_micro/1000000/3600,2) time_waited_hours,
9         ROUND(time_waited_micro/1000/total_waits,2) avg_wait_ms ,
10         ROUND(time_waited_micro*100
11            /SUM(time_waited_micro) OVER(),2) pct_time
12 FROM (SELECT wait_type, SUM(total_waits) total_waits,
13               SUM(time_waited_micro) time_waited_micro
14          FROM system_event e
15         GROUP BY wait_type
16         UNION
17        SELECT ‘CPU‘,   NULL, SUM(VALUE)
18          FROM gv$sys_time_model
19         WHERE stat_name IN (‘background cpu time‘, ‘DB CPU‘))
20 WHERE wait_type <> ‘Idle‘
21 ORDER BY time_waited_micro DESC;

                                          Waits       Time Avg Wait Pct of
Wait Type                                 \1000      Hours        Ms   Time
----------------------------------- ----------- ---------- --------- ------
CPU                                                   6.15            43.62
Other                                    38,291       1.76       .17 12.50
Application                                  32       1.41    157.35 10.00
User I/O                                    822        .97      4.25   6.88
System I/O                                  995        .96      3.46   6.78
gc current multi block request            9,709        .87       .32   6.15
gc cr multi block request                16,210        .48       .11   3.37
Commit                                      300        .44      5.31   3.13
gc current block 2-way                    5,046        .37       .26   2.59
gc current block 3-way                    2,294        .28       .43   1.97
gc cr block busy                            984        .16       .58   1.11

Cluster waits summary

Here are descriptions for some of the more important Global Cache wait events:

gc cr/current block 2-way: These are waits for Global Cache block requests involving only 2 instances. As outlined at the beginning of this article, these occur when the block master instance is able to forward a block directly to the requesting instance

gc cr/current block 3-way: These waits occur when the block master does not have the block concerned, and forwards the request to a third instance.

gc cr/current multi block request: A wait that occurs when requesting multiple blocks in a single request. This is typically associated with full table or index scans.

gc cr/current grant 2-way: The block master informs the requesting instance that the requested block is not available from another instance. The requesting instance will then perform a disk IO to retrieve the block.

gc cr/current block busy: The requesting instance must wait for the instance that holds the block to complete some other operation before the block can be forwarded. This can happen in the same circumstances as for single instance buffer busy or because the requesting instance must flush redo records to the redo log before shipping a consistent copy.

gc cr/current block congested: This wait can be reported when CPU or memory pressure prevents the LMS process from keeping up with requests. Prior to Oracle 10.2, you could manually set LMS to run at a higher than default OS priority to alleviate this situation. From 10.2, LMS runs at a higher priority by default. Changing the number of LMS processes might mask the symptom; however preventing instances from overloading is a more effective overall solution.

gc cr/current block lost: Lost block waits occur when a block that has been transmitted is not received. If using of UDP – which is an “unreliable” protocol in the sense that a network operation does not require an acknowledgement – then some small number of lost blocks are to be expected. Moderate rates might suggest that the interconnect is overloaded. High rates probably indicate network hardware issues. We’ll look closer at lost blocks later in this article.

Reducing Global Cache latency

The RAC architecture requires and expects that instances will fetch data blocks across the interconnect as an alternative to reading those blocks from disk. The performance of RAC is therefore going to be very sensitive to the time it takes to retrieve a block from the Global Cache; which we will call Global Cache latency.

Some documents or presentations suggest that Global Cache latency is primarily or exclusively Interconnect latency: the time it takes to send the block across the interconnect network. Interconnect latency is certainly an important part of overall Global Cache latency: but it’s not the only part. Oracle processes such as the Global Cache Service (LMS) have to perform a significant amount of CPU intensive processing each time a block is transferred, and this CPU time is usually as least as significant as any other factor in overall Global Cache latency. In certain circumstances non-CPU operations – such flushing redo entries to disk – will also contribute to Global Cache latency.

Interconnect latency is an important factor in Global Cache latency – however Oracle CPU and IO are also important contributors.

To measure Global Cache latency, we use the wait interface as exposed by GV$SYSTEM_EVENT (The “V$” views report data for the current instance: “GV$” views report across the entire cluster.). The following query reports on average times for each of the Global Cache request types as well as single-block read times (for comparison):

SQL> SELECT event, SUM(total_waits) total_waits,
2         ROUND(SUM(time_waited_micro) / 1000000, 2)
3            time_waited_secs,
4         ROUND(SUM(time_waited_micro)/1000 /
5            SUM(total_waits), 2) avg_ms
6 FROM gv$system_event
7 WHERE wait_class <> ‘Idle‘
8        AND(   event LIKE ‘gc%block%way‘
9            OR event LIKE ‘gc%multi%‘
10            or event like ‘gc%grant%‘
11            OR event = ‘db file sequential read‘)
12 GROUP BY event
13 HAVING SUM(total_waits) > 0
14 ORDER BY event;

                                      Total         Time Avg Wait
Wait event                            Waits       (secs)      (ms)
------------------------------ ------------ ------------ ---------
db file sequential read             283,192        1,978      6.99
gc cr block 2-way                   356,193          396      1.11
gc cr block 3-way                   162,158          214      1.32
gc cr grant 2-way                   141,016           25       .18
gc cr multi block request           503,265          242       .48
gc current block 2-way              325,065          227       .70
gc current block 3-way              117,913           93       .79
gc current grant 2-way               45,580           20       .44
gc current grant busy               168,459          296      1.76
gc current multi block request       91,690           42       .46

This example output provides reason for concern. The average wait for Global Cache consistent read requests (as shown by ‘gc cr block 2-way’ and ‘gc cr block 3-way’) is more than 1 millisecond and more than 1/10th of the time for a db file sequential read. While the Global Cache is still faster than disk, it’s taking longer than we’d expect if the interconnect and RAC were fully optimized.

Examining the interconnect

When Global Cache waits are high, we should first determine if the latency is primarily the result of interconnect network waits.

The best way to determine the interconnect contribution to overall performance is to use the ping utility to measure latency independently of the Oracle stack. Ping packet handling is not identical to RAC packet handling, but if ping latency is high then you can confidently assume that network responsiveness is an issue.

In Oracle 10g the view X$KSXPIA shows the private and public IP addresses being used by the current instance. In Oracle 11g this information is available in the view Gv$cluster_interconnects. The following query shows us the private interconnect IP address plus other identifying information for the current instance (this query must be run as SYS):

SQL> SELECT instance_number, host_name, instance_name,
2         name_ksxpia network_interface, ip_ksxpia private_ip
3 FROM        x$ksxpia
4       CROSS JOIN
5           v$instance
6 WHERE pub_ksxpia = ‘N‘;

Inst Host                                       Net   Private
   # Name                      INSTANCE_NAME    IFace IP
---- ------------------------- ---------------- ----- ------------
   3 melclul32.melquest.dev.me MELRAC3          eth1 192.168.0.12
     l.au.qsft

We can then ping the IP address from another node in the cluster to determine average latency. On a Linux system, we can use the “–s 8192” flag to set an 8K packet size so as to align with the block size of this Oracle database. On Windows the appropriate flag is “-l”:

$ ping -c 5 -s 8192 192.168.0.12
PING 192.168.0.12 (192.168.0.12) 8192(8220) bytes of data.
8200 bytes from 192.168.0.12: icmp_seq=0 ttl=64 time=0.251 ms
8200 bytes from 192.168.0.12: icmp_seq=1 ttl=64 time=0.263 ms
8200 bytes from 192.168.0.12: icmp_seq=2 ttl=64 time=0.260 ms
8200 bytes from 192.168.0.12: icmp_seq=3 ttl=64 time=0.265 ms
8200 bytes from 192.168.0.12: icmp_seq=4 ttl=64 time=0.260 ms

--- 192.168.0.12 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3999ms
rtt min/avg/max/mdev = 0.251/0.259/0.265/0.020 ms, pipe 2

The ping output above indicates low latency – about .25 ms - across the interconnect.

Use the ping utility to measure the interconnect latency independently of the Oracle software stack.

Dell Software’s Spotlight on RAC presents both ping latencies and Global Cache latencies for each instance side by side, as shown in Figure 3.

Figure 3 Global Cache and ping latencies in Quest‘s Spotlight on RAC

Very high network latencies may indicate the need to tune the interconnect as outlined below. However, probably the number one “newbie” error is to inadvertently configure RAC to use the public LAN network rather than the private interconnect network. Before tuning the private interconnect, make absolutely sure that you didn’t accidentally configure RAC to use the public LAN as the interconnect network. Use the X$KSXPIA or V$CLUSTER_INTERCONNECTS view to double check the IP addresses are those associated with the interface that is connected to the switch that implements the private network.

It’s a common mistake to configure RAC to use the public network – rather than the private network – for the interconnect. Before tuning the interconnect make absolutely sure that the interconnect is private and that RAC is configured to use the correct network.

Signs of interconnect problems

Asides from high latencies – as exposed by the ping command - interconnect issues can show up as “lost” or congested blocks.

Lost blocks occur when a block is transmitted, but never received. The following query shows the number of blocks lost compared to the number sent and received:

SQL> SELECT name, SUM(VALUE)
2 FROM gv$sysstat
3 WHERE       name LIKE ‘gc%lost‘
4        OR name LIKE ‘gc%received‘
5        OR name LIKE ‘gc%served‘
6 GROUP BY name
7 ORDER BY name;

NAME                                                             SUM(VALUE)
---------------------------------------------------------------- ----------
gc blocks lost                                                           99
gc claim blocks lost                                                      0
gc cr blocks received                                              14207701
gc cr blocks served                                                14207721
gc current blocks received                                         14471301
gc current blocks served                                           14471393

Time spent waiting for lost block retransmission is recorded in the wait events ‘gc cr request retry’, ‘gc cr block lost’ and ‘gc current block lost’. The times associated with these waits should be low: typically less than 1% of total when compared to the total number of blocks recorded in the ‘gc cr/current blocks received/served’ statistics.

If there are very high lost block counts (relative to blocks received) – or if the time associated with lost blocks becomes significant compared to total database time – then the most likely cause is a network hardware issue. This can be as simple as a poorly mounted network card, crimped networking cables or faulty network components.

Moderate lost block counts – especially if associated with very high levels of activity – might indicate an overloaded interconnect. The network optimizations below might alleviate the problem, or you may need to increase the throughput of the interconnect hardware (upgrading to Gigabit Ethernet, 10 Gigabit Ethernet or Infiniband for instance).

Global Cache “lost blocks” can be indicative of an overloaded or miss-configured interconnect or (at high levels) faulty network hardware.

Optimizing the interconnect

If the interconnect is identified as a problem, or even if we just want to optimize it to squeeze the Global Cache latencies down as far as possible, we have a few networking options we can try.

Network hardware and protocols

It’s possible to use dual Network Interconnect Cards (NICs) to reduce points of failure in the overall RAC architecture. If so, you should use NIC “bonding” (also known as link aggregation) to present the two NICs to Oracle as single logical interface. This will allow for the aggregate network bandwidth of both cards to be fully utilized.

The two most commonly used link and transport protocol combinations for the RAC interconnect are:

Gigabit Ethernet (GBe) or 10 Gigabit Ethernet (10GBe) in combination with UDP.
InfiniBand in combination with either Reliable Datagram Sockets (RDS) or Internet Protocol (IP).

The GBe/UDP option has the advantage of using standards-based commodity hardware, and being supported across a wide set of hardware and operating systems. InfiniBand offers superior throughput and latency, but at greater cost and administration effort. Note that Oracle uses InfiniBand/RDS inside Exadata both to connect the RAC instances and to attach the database nodes to the storage nodes: it’s clearly the highest performance solution.

However Gigabit Ethernet is able to sustain very high bandwidth – somewhere in the vicinity of 5,000-10,000 Global Cache transfers per second. Most RAC databases – especially those with an OLTP style workload - are unlikely to overload a GBe or 10GBe interconnect.

Many RAC databases – especially OLTP style - will be adequately served by a Gigabit Ethernet or 10 Gigabit Ethernet based interconnect. However, InfiniBand offers superior throughput and scalability.

Ethernet Jumbo frames

By default, the maximum sized packet that can be transmitted across an Ethernet network is only 1500 bytes. This is determined by the Maximum Transmission Unit (MTU). Because Oracle’s default block size is 8K, most Global Cache transfers will have to be made in multiple Ethernet packets. This multiple packets will increase network load and possibly overload the interconnect. If any one of the packets that represent a block is lost, then Oracle will have to re-transmit the lot.

Jumbo frames allows you to set a MTU of up to 9000 – large enough for a single default size Oracle block to transmit in a single packet. However, to increase the MTU you may need to apply changes at the switch, the network card and the Operating System level. Setting jumbo frames at the operating system level, but failing to do so at the switch level is a common mistake. Note too, that not all switches support jumbo frames.

To set jumbo frames at the host level, you use a utility such as ifconfig. On Linux, you can use the ifconfig command to check the MTU size:

#ifconfig eth0
eth0      Link encap:Ethernet HWaddr 00:0D:56:18:2C:60
          inet addr:10.20.12.154 Bcast:10.20.12.255 Mask:255.255.255.0
          inet6 addr: fe80::20d:56ff:fe18:2c60/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
          RX packets:951407 errors:0 dropped:0 overruns:0 frame:0
          TX packets:544283 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:119973568 (114.4 MiB) TX bytes:124883921 (119.0 MiB)
          Base address:0xdf40 Memory:feae0000-feb00000

The MTU is at the default of 1500. To set it – at the Operating system level – to 9000 we again use the ifconfig command:

# ifconfig eth0 mtu 9000
# ifconfig eth0
eth0      Link encap:Ethernet HWaddr 00:0D:56:18:2C:60
          inet addr:10.20.12.154 Bcast:10.20.12.255 Mask:255.255.255.0
          inet6 addr: fe80::20d:56ff:fe18:2c60/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000 Metric:1
          RX packets:951649 errors:0 dropped:0 overruns:0 frame:0
          TX packets:544386 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:119996835 (114.4 MiB) TX bytes:124902029 (119.1 MiB)
          Base address:0xdf40 Memory:feae0000-feb00000

Setting jumbo frames at the OS level will achieve nothing if the NICs or switches are not configured to support an MTU of 9000 – you may need to consult with your MIS department or consult the hardware documentation. You also need to make sure that every host in the cluster has the same MTU setting.

When enabling Jumbo Frames, be certain that you enable it on all hosts in the cluster and also at the hardware level (switches and NICs).

Each OS supports a different mechanism for enabling Jumbo Frames. On Solaris, you need to edit the interface configuration file (/kernel/drv/e1000g.conf for instance).

See Oracle support note Recommendation for the Real Application Cluster Interconnect and Jumbo Frames (341788.1) for more information.

If your interconnect is configured using Ethernet (GBe or 10GBe) then enabling Jumbo Frames will reduce the packet rate, increasing reliability and throughput.

UDP buffer size

When a RAC host receives a network packet over the interconnect, the packet is held in a small memory buffer until Operating System gets around to processing the buffer. If the system is particularly busy and/or the interconnect is overloaded, then packets might be lost. Increasing the UDP receive buffer size can help.

The OS default value for the UDP receive buffer is generally in the vicinity of 128-256K which can be inadequate for the bursts of load that can occur across the interconnect. If the buffer is too small then various network errors may be reported by utilities such as netstat or ifconfig. These symptoms will include dropped packages, overflows, fragmentation or reassembly errors.

Oracle installation pre-requisites require that the value be increased, typically to about 4M.

In Linux, the kernel parameter net.core.rmem_max controls the receive buffer size. The sysctl command can be used to obtain the current value:

# sysctl -n net.core.rmem_max
4194304

Ensure that your UDP receive buffer size is set above the default value – probably to the OS maximum.

LMS waits

Interconnect performance is at the heart of Global Cache latency, but high Global Cache latencies are as often the result of delays in the Oracle software layers. The LMS service on the remote instances contributes most of the non-network latency to Global Cache requests; it is responsible for constructing and returning the requested blocks. The following query shows LMS latencies for each instance for current and consistent read requests:

SQL> WITH sysstats AS (
2      SELECT instance_name,
3             SUM(CASE WHEN name LIKE ‘gc cr%time‘
4                      THEN VALUE END) cr_time,
5             SUM(CASE WHEN name LIKE ‘gc current%time‘
6                      THEN VALUE END) current_time,
7             SUM(CASE WHEN name LIKE ‘gc current blocks served‘
8                      THEN VALUE END) current_blocks_served,
9             SUM(CASE WHEN name LIKE ‘gc cr blocks served‘
10                      THEN VALUE END) cr_blocks_served
11        FROM gv$sysstat JOIN gv$instance
12        USING (inst_id)
13      WHERE name IN
14                    (‘gc cr block build time‘,
15                     ‘gc cr block flush time‘,
16                     ‘gc cr block send time‘,
17                     ‘gc current block pin time‘,
18                     ‘gc current block flush time‘,
19                     ‘gc current block send time‘,
20                     ‘gc cr blocks served‘,
21                     ‘gc current blocks served‘)
22      GROUP BY instance_name)
23 SELECT instance_name , current_blocks_served,
24         ROUND(current_time*10/current_blocks_served,2) avg_current_ms,
25         cr_blocks_served,
26         ROUND(cr_time*10/cr_blocks_served,2) avg_cr_ms
27    FROM sysstats;

             Current Blks    Avg      CR Blks    Avg
Instance           Served CU ms       Served Cr ms
------------ ------------ ------ ------------ ------
MELRAC1         7,342,829    .03    7,647,581    .05
MELRAC2         7,330,661    .03    7,418,901    .04
MELRAC3         7,310,866    .03   12,696,127    .08

If the network is responsive and fast, but LMS latency is high, then one of the following might be implicated:

An overloaded instance is unable to respond fast enough to Global Cache requests. In particular, the LMS processes might be overloaded with requests or starved for CPU.
IO bottlenecks – particularly in redo log IO – are slowing down the response to Global Cache requests.

In the first case, the LMS process on the remote instance is simply too busy to process the Global Cache request. This can be due to an excessive volume of requests or because CPU load on the host is making it impossible for the LMS to obtain CPU. The later situation is less common from Oracle 10.2 onwards, because Oracle now runs the LMS processes at an elevated priority. Severe memory pressure may also lead to a lack of LMS responsiveness.

The “too busy” phenomenon is probably a result of an imbalanced cluster: if any instance in the cluster is significantly overloaded then Global Cache response times on the idle instances will suffer. The best solution is to try and achieve a better cluster balance – see the section on Cluster Balance below.

High Global Cache latencies can occur when one or more instances in the cluster become overloaded. Balancing the workload across the cluster is indicated.

The other typical cause of high latencies is when the LMS process must flush uncommitted changes to the redo log prior to sending the block to the requesting instance. If the application design is such that uncommitted blocks are often in demand across instances in the cluster then these redo log flushes might become common. If there are bottlenecks in the redo log devices then the IO waits will be magnified.

We can measure the impact on LMS response time by leveraging the timing information in GV$SYSTAT and the FLUSHES statistic in GV$CR_BLOCK_SERVER. Putting the two together we can calculate the proportion of blocks that required flushing and the proportion of LMS time spent performing the flush:

SQL> WITH sysstat AS (
2      SELECT SUM(CASE WHEN name LIKE ‘%time‘
3                      THEN VALUE END) total_time,
4             SUM(CASE WHEN name LIKE ‘%flush time‘
5                      THEN VALUE END) flush_time,
6             SUM(CASE WHEN name LIKE ‘%served‘
7                      THEN VALUE END) blocks_served
8      FROM gv$sysstat
9      WHERE name IN
10                    (‘gc cr block build time‘,
11                     ‘gc cr block flush time‘,
12                     ‘gc cr block send time‘,
13                     ‘gc current block pin time‘,
14                     ‘gc current block flush time‘,
15                     ‘gc current block send time‘,
16                     ‘gc cr blocks served‘,
17                     ‘gc current blocks served‘)),
18       cr_block_server as (
19      SELECT SUM(flushes) flushes, SUM(data_requests) data_requests
20      FROM gv$cr_block_server     )
21 SELECT ROUND(flushes*100/blocks_served,2) pct_blocks_flushed,
22         ROUND(flush_time*100/total_time,2) pct_lms_flush_time
23    FROM sysstat CROSS JOIN cr_block_server;

PCT_BLOCKS_FLUSHED PCT_LMS_FLUSH_TIME
------------------ ------------------
               .25              36.03

Note how even a very small proportion of block flushes can still account for a large proportion of total LMS time. Tuning the redo log IO layout might be indicated.

The LMS will sometimes need to flush redo entries to disk before returning a block. Redo log IO can therefore be a critical factor in Global Cache latency.

Cluster balance

Achieving balance in a RAC configuration is important for scalability, manageability and performance. While some variation in workload across the cluster is to be expected, in an unbalanced cluster the following undesirable situations can arise:

Sessions on busy instances get poor service time. Even though there may be spare capacity in the cluster as a whole, sessions on busy instances will be unable to utilize that capacity and will experience poor performance.
Sessions on idle instances wait for blocks from busy instances. Because a lot of operations result in requests to remote instances, an over-loaded instance can cause performance problems across the entire cluster. A session on an “idle” instance may experience high Global Cache wait times waiting on blocks from the busy instance.
Benefits of adding new instances may not be realized. If some of the instances in the cluster are subject to a higher workload, then these instances may become bottlenecks to overall database throughput. As instances are added to the cluster, expected performance improvements might be unattainable.
Tuning is harder because each instance has different symptoms. In an unbalanced cluster, sessions on busy instances may experience high CPU waits while sessions on less busy instances will experience high Global Cache waits. Troubleshooting performance problems on an unbalanced cluster can therefore be more challenging because of the inconsistent symptoms.

We can assess cluster balance fairly easily: the following query reports on CPU, DB time and logical reads on each instance within the cluster since startup:

SQL> WITH sys_time AS (
2      SELECT inst_id, SUM(CASE stat_name WHEN ‘DB time‘
3                          THEN VALUE END) db_time,
4              SUM(CASE WHEN stat_name IN (‘DB CPU‘, ‘background cpu time‘)
5                  THEN VALUE END) cpu_time
6        FROM gv$sys_time_model
7       GROUP BY inst_id                 )
8 SELECT instance_name,
9         ROUND(db_time/1000000,2) db_time_secs,
10         ROUND(db_time*100/SUM(db_time) over(),2) db_time_pct,
11         ROUND(cpu_time/1000000,2) cpu_time_secs,
12         ROUND(cpu_time*100/SUM(cpu_time) over(),2) cpu_time_pct
13    FROM     sys_time
14    JOIN gv$instance USING (inst_id);

Instance       DB Time Pct of      CPU Time   Pct of
Name            (secs) DB Time        (secs) CPU Time
-------- ------------- ------- ------------- --------
MELRAC3       3,705.30   24.48      1,119.99    17.03
MELRAC2       6,278.23   41.48      4,010.85    61.00
MELRAC1       5,150.96   34.03      1,444.06    21.96

In this example it is clear that MELRAC2 is being subjected to a disproportionate level of CPU load: if this is not addressed, increasing cluster workload will almost certainly lead to performance degradation as MELRAC2 becomes the bottleneck for the entire cluster.

The above query summarizes performance since the instances in the cluster were started. Of course, instances in a cluster can start and stop independent of the cluster as a whole, which might result in different totals even if the cluster is experiencing a balanced workload.

Quest Software’s Spotlight on RAC probably has the most advanced RAC balance monitoring. Spotlight on RAC displays cluster balance from a number of perspectives and performs a statistical analysis to determine if the imbalance is systematic or due to short term random fluctuations. Figure 4 shows example output from Spotlight on RAC.

Figure 4 RAC balance dis