I am planning to put up a few posts on StatsPack and AWR reports.
Note : Some figures / details may be slightly changed / masked to hide the real source.
Logical I/O and Change rates :
1. From a 9.2 StatsPack Report:
Cache Sizes (end)
~~~~~~~~~~~~~~~~~
               Buffer Cache:                     Std Block Size:         4K
           Shared Pool Size:                         Log Buffer:

Load Profile
~~~~~~~~~~~~                            Per Second
                                   ---------------
                  Redo size:             56,031.63
              Logical reads:             68,286.24
              Block changes:                314.88
             Physical reads:                842.92
            Physical writes:                134.76
With a 4KB Block Size, 68,286.24 Logical Reads per second translates to slightly over 266MB/second. Logical I/O is CPU-bound. Database activity is Read-Intensive, with a high rate of Reads relative to Writes.
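The arithmetic behind that figure can be checked quickly (a minimal sketch; the function name is mine, not from any Oracle tool):

```python
# Convert a Logical Reads/second rate into MB/second for a given block size.
def logical_read_mb_per_sec(reads_per_sec, block_size_bytes):
    return reads_per_sec * block_size_bytes / (1024 * 1024)

# 68,286.24 Logical Reads/second at a 4KB block size
rate = logical_read_mb_per_sec(68286.24, 4 * 1024)
print(round(rate, 1))  # slightly over 266 MB/second
```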
2. From an 11.2 AWR Report :
Cache Sizes

| | Begin | End | | |
|---|---|---|---|---|
| Buffer Cache: | | | Std Block Size: | 16K |
| Shared Pool Size: | | | Log Buffer: | |
Load Profile

| | Per Second | Per Transaction | Per Exec | Per Call |
|---|---|---|---|---|
| DB Time(s): | | | | |
| DB CPU(s): | | | | |
| Redo size: | 1,593,612.1 | | | |
| Logical reads: | 51,872.5 | | | |
| Block changes: | 4,212.4 | | | |
| Physical reads: | 63.8 | | | |
| Physical writes: | 133.1 | | | |
With a 16KB Block Size, 51,872.5 Logical Reads per second translates to slightly over 810MB/second. This consumes CPU cycles. However, the number of Block Changes is noticeably high in this environment. This is also reflected in the high Redo rate -- slightly over 5,471MB/hour (Note : "Redo size" is in Bytes).
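Both rates follow directly from the Load Profile numbers; a quick sketch of the conversions (helper names are mine):

```python
# Logical I/O rate in MB/second, given reads/second and block size in bytes.
def logical_io_mb_per_sec(reads_per_sec, block_size_bytes):
    return reads_per_sec * block_size_bytes / (1024 * 1024)

# Redo generation in MB/hour, given the "Redo size" rate (Bytes/second).
def redo_mb_per_hour(redo_bytes_per_sec):
    return redo_bytes_per_sec * 3600 / (1024 * 1024)

print(round(logical_io_mb_per_sec(51872.5, 16 * 1024), 1))  # ~810.5 MB/second
print(round(redo_mb_per_hour(1593612.1), 1))                # ~5,471.2 MB/hour
```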
CPU Consumption :
1. From a 9.2 StatsPack Report :
Statistic                                      Total     per Second    per Trans
--------------------------------- ------------------ -------------- ------------
CPU used by this session                                        37.5          2.1
CPU used when call started                                      37.6          2.1
This indicates 0.375 seconds of CPU usage per second -- i.e. approximately 37.5% of 1 CPU (let's take this as an older non-multicore architecture). If the server has 4 CPUs, CPU consumption is 9.4%.
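A note on the units: "CPU used by this session" is reported in centiseconds, which is how 37.5 becomes 0.375 seconds of CPU per second. A minimal sketch of the derivation:

```python
# 'CPU used by this session' is in centiseconds (1/100th of a second).
cpu_centisec_per_sec = 37.5
cpu_seconds_per_sec = cpu_centisec_per_sec / 100   # 0.375 CPU-seconds per second
pct_of_one_cpu = cpu_seconds_per_sec * 100         # 37.5% of one CPU
pct_of_server = pct_of_one_cpu / 4                 # assuming a 4-CPU server
print(pct_of_one_cpu, round(pct_of_server, 1))     # 37.5 9.4
```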
2. From an 11.2 AWR Report :
Instance Activity Stats
- Ordered by statistic name

| Statistic | Total | per Second | per Trans |
|---|---|---|---|
| ... deleted rows .... | | | |
| CPU used by this session | | 46.85 | |
| CPU used when call started | | 46.27 | |
This indicates 0.468 seconds of CPU usage per second -- i.e. approximately 46.8% of 1 Core. This is also reflected in the Load Profile section :
Load Profile

| | Per Second | Per Transaction | Per Exec | Per Call |
|---|---|---|---|---|
| DB Time(s): | | | | |
| DB CPU(s): | 0.5 | 0.1 | 0.00 | 0.00 |
How many CPUs does this machine have ? AWR reports this :
| Host Name | Platform | CPUs | Cores | Sockets | Memory (GB) |
|---|---|---|---|---|---|
| | Linux x86 64-bit | 16 | 16 | 2 | |
That means we are using less than half of 1 of 16 cores ! This translates to a CPU consumption of 3.125%. The server has too many CPU cores !
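The 3.125% figure comes straight from the Load Profile and host information (a trivial check):

```python
db_cpu_per_sec = 0.5      # "DB CPU(s)" Per Second, from the Load Profile
cores = 16                # from the host information in the AWR report
utilisation_pct = db_cpu_per_sec / cores * 100
print(utilisation_pct)    # 3.125 -- less than half of one core out of 16
```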
Comparing Reports :
Here are two 9.2 StatsPack extracts from one database:
Extract A : 9.2 StatsPack
Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:             56,031.63              3,084.68
              Logical reads:             68,286.24              3,759.32
              Block changes:                314.88                 17.33
             Physical reads:                842.92                 46.40
            Physical writes:                134.76                  7.42
                 User calls:                271.32                 14.94
                     Parses:                146.46                  8.06
                Hard parses:                  7.37                  0.41
                      Sorts:                 93.83                  5.17
                     Logons:                  0.33                  0.02
                   Executes:                296.70                 16.33
               Transactions:                 18.16
Extract B : 9.2 StatsPack
Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:             89,615.76              2,960.48
              Logical reads:            210,302.81              6,947.42
              Block changes:                541.83                 17.90
             Physical reads:              1,465.04                 48.40
            Physical writes:                161.68                  5.34
                 User calls:                213.82                  7.06
                     Parses:                125.28                  4.14
                Hard parses:                  6.13                  0.20
                      Sorts:                104.31                  3.45
                     Logons:                  0.35                  0.01
                   Executes:                664.81                 21.96
               Transactions:                 30.27
Extract B shows a higher volume of activity -- Transactions per second, Redo Size per second, Logical Reads per second.
Actually, these are both from the same database, but for two different time-windows on different dates. The fact that the time-windows aren't the same makes the comparison difficult. Extract A is for an 11-hour window on day 1 and Extract B is for a 6-hour window on day 6, so the two windows aren't strictly comparable. You have to be very careful when comparing StatsPack or AWR reports, even from the same database, if they don't cover comparable time-windows with similar loads. The first report includes non-peak hours; the second is for peak hours only. The level of activity averaged over 11 hours that include non-peak hours isn't strictly comparable with a 6-hour peak window.
Extract C : 10.2 AWR
Load Profile

| | Per Second | Per Transaction |
|---|---|---|
| Redo size: | 520,776.15 | 50,948.36 |
| Logical reads: | 353,525.71 | 34,585.98 |
| Block changes: | 1,854.93 | 181.47 |
| Physical reads: | 14,285.23 | 1,397.55 |
| Physical writes: | 295.84 | 28.94 |
| User calls: | 1,265.14 | 123.77 |
| Parses: | 822.64 | 80.48 |
| Hard parses: | 15.40 | 1.51 |
| Sorts: | 168.09 | 16.44 |
| Logons: | 0.16 | 0.02 |
| Executes: | 1,040.31 | 101.78 |
| Transactions: | 10.22 | |
Extract D : 10.2 AWR
Load Profile

| | Per Second | Per Transaction |
|---|---|---|
| Redo size: | 517,862.01 | 54,681.39 |
| Logical reads: | 288,341.09 | 30,446.12 |
| Block changes: | 1,879.27 | 198.43 |
| Physical reads: | 12,820.96 | 1,353.77 |
| Physical writes: | 323.90 | 34.20 |
| User calls: | 1,115.78 | 117.82 |
| Parses: | 719.39 | 75.96 |
| Hard parses: | 17.28 | 1.82 |
| Sorts: | 95.74 | 10.11 |
| Logons: | 0.15 | 0.02 |
| Executes: | 935.33 | 98.76 |
| Transactions: | 9.47 | |
Extracts C and D are from the same database for the same time-window on 2 consecutive workdays. They are comparable. A difference of 10% for some statistic may not be significant. Understand that every business, every activity, every system will have some "normal" fluctuations day-to-day.
Identifying Unusual Events / Behaviours / Applications
Here I find a few "unusual" events and mark them out.
Extract A : 11.2 AWR
| | Snap Id | Snap Time | Sessions | Cursors/Session |
|---|---|---|---|---|
| Begin Snap: | | .... 03:00:57 | 107 | .7 |
| End Snap: | | .... 07:00:07 | 114 | .9 |
| Elapsed: | | 239.17 (mins) | | |
| DB Time: | | 22.61 (mins) | | |
Top 5 Timed Foreground Events
| Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
|---|---|---|---|---|---|
| DB CPU | | 1,332 | | 98.16 | |
| SQL*Net more data to client | 49,701 | 20 | 0 | 1.50 | Network |
| SQL*Net more data from client | 213,915 | 5 | 0 | 0.34 | Network |
| db file scattered read | 1,159 | 1 | 1 | 0.08 | User I/O |
| db file sequential read | 7,547 | 1 | 0 | 0.07 | User I/O |
The two "SQL*Net more data" sets of waits are the unusual events.
The Time on SQL*Net more data to/from client is negligible, isn't it ? So, should I be concerned ? Over a 4-hour period, only 20 seconds were spent on "SQL*Net more data to client". Time on "SQL*Net more data from client" is much lower, at a total of only 5 seconds. So "Time based" tuning would ignore these two waits.
Foreground Wait Events
| Event | Waits | %Time -outs | Total Wait Time (s) | Avg wait (ms) | Waits /txn | % DB time |
|---|---|---|---|---|---|---|
| SQL*Net more data to client | 49,701 | 0 | 20 | 0 | 37.78 | 1.50 |
| SQL*Net more data from client | 213,915 | 0 | 5 | 0 | 162.62 | 0.34 |
| db file scattered read | 1,158 | 0 | 1 | 1 | 0.88 | 0.08 |
| db file sequential read | 7,550 | 0 | 1 | 0 | 5.74 | 0.07 |
| SQL*Net message to client | 652,102 | 0 | 0 | 0 | 495.89 | 0.04 |
| SQL*Net message from client | 652,102 | 0 | 183,327 | 281 | 495.89 | |
Note that Oracle treats "SQL*Net message from client" as an idle wait, so the 183,327 seconds of wait time do NOT appear in the Top 5 Timed Foreground Events list.
I would draw attention to the high number of "more data from client" waits and the correlation with the "message from client" waits. Either extremely large SQL statements or PL/SQL blocks are being submitted very frequently, or row inserts with very large array sizes (numbers of rows per insert) are being received. In this case, further investigation reveals an ETL loader that does bulk inserts of a number of rows per array. If we need tuning, tuning the SDU may help.
Similarly the "more data to client" indicates large data sets are being returned. The numbers of columns and rows per every "send" are high.
Instance Activity Stats (from the same AWR)
| Statistic | Total | per Second | per Trans |
|---|---|---|---|
| Requests to/from client | 654,738 | 45.63 | 497.90 |
| SQL*Net roundtrips to/from client | 654,740 | 45.63 | 497.91 |
| bytes received via SQL*Net from client | 1,793,072,463 | 124,950.54 | 1,363,553.20 |
| bytes sent via SQL*Net to client | 552,048,247 | 38,469.57 | 419,808.55 |
| logical read bytes from cache | 762,514,227,200 | 53,135,924.61 | 579,858,727.91 |
| physical read total bytes | 8,772,479,488 | 611,311.62 | 6,671,087.06 |
| physical write total bytes | 25,334,243,328 | 1,765,420.76 | 19,265,584.28 |
| redo size | 6,373,204,848 | 444,117.79 | 4,846,543.61 |
1.793 billion bytes received in 654K SQL*Net roundtrips works out to about 2,739 bytes received per trip, at 45 messages per second. Given that it is still only about 2,739 bytes per trip, possibly the array size could also be tuned along with the SDU and TDU.
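The per-trip arithmetic, taken straight from the Instance Activity figures above:

```python
bytes_received = 1_793_072_463   # bytes received via SQL*Net from client
roundtrips = 654_740             # SQL*Net roundtrips to/from client

bytes_per_trip = bytes_received / roundtrips
print(round(bytes_per_trip))     # ~2,739 bytes received per roundtrip
```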
So, this is an AWR report that doesn't call for tuning but reveals information about how the database is being used : a large number of rows (with a large number of columns) being inserted and retrieved in each call. The Performance Analyst needs to be aware of the nature of the "application" --- here it is not OLTP users but an ETL job that is the "application". Although the database had more than a hundred sessions, a very small number of sessions (possibly 4 ?) were active doing ETL and checking the status of ETL tables during this window.
Would you care to analyze the other statistics I've listed -- bytes read and bytes written ?
Extract B : 10.2 AWR
| | Snap Id | Snap Time | Sessions | Cursors/Session |
|---|---|---|---|---|
| Begin Snap: | | 00:00:07 | 228 | 4.5 |
| End Snap: | | 23:00:06 | 232 | 4.3 |
| Elapsed: | | 1,379.97 (mins) | | |
| DB Time: | | 11,543.24 (mins) | | |
I know. It is really bad, and most likely meaningless, to get an AWR report for a 24-hour range. (I'll not go into the details of why the AWR is for 24 hours -- save that I did NOT ask for a 24-hour AWR report).
Top 5 Timed Events
| Event | Waits | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class |
|---|---|---|---|---|---|
| CPU time | | 258,101 | | 37.3 | |
| db file sequential read | 62,150,655 | 208,148 | 3 | 30.1 | User I/O |
| db file scattered read | 28,242,529 | 141,638 | 5 | 20.4 | User I/O |
| RMAN backup & recovery I/O | 1,597,421 | 37,137 | 23 | 5.4 | System I/O |
| enq: TX - row lock contention | 22,276 | 34,942 | 1,569 | 5.0 | Application |
The RMAN Backup load is expected. (Why ? Given a 24hour report, I expect RMAN to have run at least once during the day).
For performance tuning, I would look at the "db file ... read" events and identify and analyze SQL statements and the schema.
What is the "unusual" event here ? It is the "enq: TX - row lock contention". Over a period of 1,380 minutes, there were 22,276 Row-Lock Contention waits. Actually, this application does not have the same load throughout the 23 hours. Most likely, it had load for 15 hours only. So, we had 22,276 Row-Lock Contention waits over 15 hours. That translates to roughly 1,485 waits per hour, or one Row-Lock Contention wait roughly every 2.5 seconds. Now, that is a very high frequency. Either users are locking each other out for a very short while (1.569 seconds per wait on average), or there is/are one or more jobs that run at a very high frequency and update a common "reference" table. I won't reveal my findings here, but analysis of the SQL statements indicates what the "problem" is.
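The wait-frequency arithmetic, under the stated assumption of a 15-hour active window:

```python
waits = 22_276        # "enq: TX - row lock contention" waits in the report
active_hours = 15     # assumed active load window (not the full 23 hours)

waits_per_hour = waits / active_hours
secs_between_waits = 3600 / waits_per_hour
print(round(waits_per_hour), round(secs_between_waits, 1))  # 1485 2.4
```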
Now, should "Time based performance tuning" be worried about the 5% of time lost on these waits ? Probably not. But they do indicate something peculiar in the design of this system. There are fewer than 250 user sessions in this OLTP database, but there is one or more jobs hitting a row-lock wait every 2.5 seconds -- so there is some point of serialisation occurring. Is that job also accounting for CPU time or 'db file read' time ? That needs further analysis.
Both these cases show how a Performance Analyst needs to know how the database is being used. What sort of jobs are submitted, besides OLTP users ?
Buffer Cache Hit Ratios
Many novice DBAs may use Hit Ratios as indicators of performance. However, these can be misleading or incomplete.
Here are two examples :
Extract A: 9i StatsPack
Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Buffer Hit %: 99.06
It would seem that with only 0.94% of reads being physical reads, the database is performing optimally. So, the DBA doesn't need to look any further.
Or so it seems.
If he spends some time reading the report, he then also comes across this :
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
db file sequential read 837,955 4,107 67.36
CPU time 1,018 16.70
db file scattered read 43,281 549 9.00
Avg
Total Wait wait Waits
Event Waits Timeouts Time (s) (ms) /txn
---------------------------- ------------ ---------- ---------- ------ --------
db file sequential read 837,955 0 4,107 5 403.3
db file scattered read 43,281 0 549 13 20.8
Physical I/O is a significant proportion (76%) of total database time. 88% of the physical I/O is single-block reads ("db file sequential read"). This is where the DBA must identify that tuning *is* required.
Considering the single block access pattern it is likely that a significant proportion are index blocks as well. Increasing the buffer cache might help cache the index blocks.
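The proportions quoted above can be derived from the Top 5 Timed Events figures (a quick check):

```python
# % of total timed database time spent on physical I/O
seq_read_pct = 67.36      # db file sequential read, % Total Ela Time
scat_read_pct = 9.00      # db file scattered read, % Total Ela Time
io_pct = seq_read_pct + scat_read_pct
print(round(io_pct, 2))   # ~76% of database time is physical I/O

# Share of that physical I/O time attributable to single-block reads
seq_read_time = 4_107     # seconds, db file sequential read
scat_read_time = 549      # seconds, db file scattered read
single_block_share = seq_read_time / (seq_read_time + scat_read_time) * 100
print(round(single_block_share))  # ~88% is single-block reads
```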
Extract B : 10.2 AWR
Instance Efficiency Percentages (Target 100%)
| | | | |
|---|---|---|---|
| Buffer Nowait %: | 99.98 | Redo NoWait %: | 100.00 |
| Buffer Hit %: | 96.43 | In-memory Sort %: | 99.99 |
| Library Hit %: | 97.16 | Soft Parse %: | 98.16 |
| Execute to Parse %: | 25.09 | Latch Hit %: | 99.85 |
| Parse CPU to Parse Elapsd %: | 89.96 | % Non-Parse CPU: | 96.00 |
The Buffer Hit Ratio is very good. Does that mean that I/O is not an issue ?
Look again at the same report
Top 5 Timed Events
| Event | Waits | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class |
|---|---|---|---|---|---|
| CPU time | | 147,593 | | 42.3 | |
| db file sequential read | 31,776,678 | 87,659 | 3 | 25.1 | User I/O |
| db file scattered read | 19,568,220 | 79,142 | 4 | 22.7 | User I/O |
| RMAN backup & recovery I/O | 1,579,314 | 37,650 | 24 | 10.8 | System I/O |
| read by other session | 3,076,110 | 14,216 | 5 | 4.1 | User I/O |
User I/O is actually significant. The SQLs with the highest logical I/O need to be reviewed for tuning.