<$BlogRSDUrl$>

Tuesday, April 12, 2005

There have been a number of concurrent threads on a large number of different forums scattered around the net recently, all concerning the merits of taking an evidence based approach versus that of taking a holistic experienced practitioner's view to Oracle performance problems.

I'm not going to comment too much in this entry on the subject, but I am considering a slightly longer article on what to do when experts disagree. Mostly I'm going to remark upon Mike Ault's latest blog entry, which Mark Rittman has pointed out.

Before I do that, I suspect its worth laying out a little of the background.

Back in February Don Burleson posted a 'challenge' asking for a solution (a script or scripts) that would predict when tables would benefit from reorganisation. Tom Kyte was asked about this on http://asktom.oracle.com . His reply was essentially, the factors listed aren't deterministic, in fact of the 4 listed, the first was true for some sorts of activity and false for others, the second was largely irrelevant, the third could be better fixed by a design change than the sticking plaster of periodic re-orgs and the fourth displayed an incomplete understanding of the technical workings of Oracle (freelists vs freelist groups). Tom being Tom demonstrated all this with code we can run ourselves. All things being considered not a contest that it was worth winning. It is of course possibly impertinent to point out that the reward for this challenge was $250 worth of books, and the right for Don to publish the work in an upcoming, for sale, book.

Faced with this reply, Don then posted a discussion topic apparently discussing whether sqlplus scripts could be used to prove anything. Admittedly this soon became whether sqlplus scripts could prove anything (looks like the winner of the challenge will be a full-blown multi-user benchmark). Now it may or may not be the case that this was really the aim of the discussion. Don wrote the following.

I'm NOT going to bash another Oracle author, it's unprofessional. . . . I was interviewing a DBA yesterday (30 years DBA experience, Certified Oracle Master, ex-Oracle University instrucor), and I asked "What do you think about the work of Tom Kyte"?He replied that he suspected that Kyte had very little real-world experience diagnosing and tuning large production databases. (I've wondered about that myself, since I cannot find his experience and qualifications anywhere online.)

Quite frankly the claim to be interviewing someone looks like a lie (Note the first relational database was not available until 1977.), it looks like good old smear tactics.

So my summary of the background is;

The owner of Burleson Consulting published a challenge asking for scripts to prove something.
Tom Kyte, when asked about it suggested that the challenge was rather seriously flawed and provided evidence to show why.
The owner of Burleson Consulting then instituted a discussion, on his own boards but also taken up elsewhere, aimed at discrediting evidence based argument in general and Tom Kyte in particular.

Don't take my word for it though, read the discussions and form your own conclusions.

Its against this background that I come to Mike's blog. Mike is of course also a Burleson Consulting employee. Mike is also taking issue with the idea that scripts can be used to show things that are useful from a tuning perspective. Firstly he describes a database health check. They come in check the database wide settings and stats, security and backup and recovery procedures. This can be very useful, especially to shops with limited dba resource.

Next however he goes on to describe a tuning assignment. Quotation in italics - comments in normal.

We never go onsite and start issuing recommendations without analysis. The last time I saw this technique used was from Oracle support three days ago. Without looking at detailed statistics or analysis they suggested increasing the shared pool (even though all of it wasn't being used), increasing the database cache size (even though 30% of the buffers were free in V$BH), turn off MTS (we had already tried this and it made no difference) and several other Oracle "Silver Bullets", none of which made the slightest difference in the problem.

So the description describes a problem, support had come in, tried various things with no supporting evidence - the classic silver bullet approach. It didn't work.

It turned out the problem was that 10g has a default value of TRUE for the parameter _B_TREE_BITMAP_PLANS which, even though the application had been running for months, suddenly started making a difference. Note there were no bitmaps in the database other than those that Oracle had in their schemas. They offered no proofs, no explanations, just a sigh of relief that this Oracle Silver Bullet fixed the problem. This is a classic example of why "simple", "repeatable" proofs can be hazardous. By all of the indications there was nothing wrong at the database level. No waits or events showed the problem, even plans for the SQL didn't change. Any SQL activity resulted in CPU spiking of up to 70-80%. However, the setting of this undocumented parameter had a profound effect on the optimizer and overall database performance. If an Oracle analyst hadn't "seen this before" and remembered this "Silver Bullet" the client would still be having problems. The troubled queries dropped form taking several minutes to 6 seconds

Well it looks like the silver-bullet troubleshooting process was continued try switching off _b_tree_bitmap_plans see what happens. It also rather looks as if Mike hasn't come across _b_tree_bitmap_plans which allows for bitmap access paths even without bitmap indexes (and the parameter got this default in 9i). The comment about no bitmaps makes no sense otherwise.

There are two interesting observation here, and that is that the elapsed response time for specific queries dropped from several minutes to 6 seconds, and that execution plans didn't change.

I find this interesting for two reasons, first there is a correct change of focus from database wide to problem queries (and measurement and evidence), and second that there is a clear implication - parsing time was taking several minutes per query. (plans and waits were the same - must be the variable cpu which is in the parse phase). The second feels like a bug, (or perhaps a misdiagnosing of the actual plans in use).


.When going from 7 to 8 to 8i to 9i to 10g Oracle has been notorious for resetting their undocumented parameters from true to false, false to true, adding new undocumented and documented parameters and of course varying the various number based parameters to get the right "mix". You have to understand this in order to trouble shoot Oracle performance problems. Even within the same release some of these change based on platform.This is the point I am trying to make about single user, small database proofs, they are great for documenting Oracle behavior on single user, small databases. They would have been completely useless in this case.

But having identified single statements that were taking an excessive length of time to parse, it would seem that single-user sqlplus scripts would be great

alter session set events '10053 trace name context forever, level 1'; -- to get the parsing activity
alter session set events '10046 trace name context forever, level 812'; -- to get timing

turn off the events.

a pound to a penny the impact of excessive consideration (and then rejection) of b-tree-bitmap plans would have fallen out. The alternative scenario is this, suppose you'd got a different analyst who hadn't seen this before. What would have happened when he ran out of all the things he had seen before. What if you hit a new problem. What then?


3 Comments
3 Comments:
Niall,
Excellent post. I enjoyed reading it, and had one question. I have also read Mike's blog entry, and something stood out for me and made me wonder. He stated that these queries had been running fine for months, and then suddenly started exhibiting performance problems. The final solution was changing the setting of this undocumented parm. So, if this parm had already been set wrong for months with no discernible effect , isn't that indicative of something else affecting these queries as well?

Your thoughts as always would be appreciated.

Regards,

Bill S.
 
The _B_TREE_BITMAP_PLANS parameter has caused me problems before and those problems have exhibited the same symptoms as Mike described on his blog. Assuming that what I've seen and what Mike described in fact are the same, then let me help clarify the scenario a little.

As Niall pointed out in his post, the CBO can choose to use bitmap access paths without the existence of bitmap indexes and in order to do so, it uses BITMAP CONVERSION FROM ROWIDS and BITMAP CONVERSION TO ROWIDS operations. Those operations are CPU intensive.

If the query for which those operations are performed selects a small number of rows, then there isn't much of an impact. However, if those queries select a large number of rows, the cost involved gets rediculous. The cost is not incurred during the parse, it's all incurred during the fetching.

Assuming Mike was seeing what I've experienced, then he was correct in that "No waits or events showed the problem" if you assume that means that a level 8/12 even 20046 trace shows no significant waits in the wait summary.

However, I can't agree with Mike's statement that "even plans for the SQL didn't change". When I've seen problems from _B_TREE_BITMAP_PLANS = TRUE, and have seen a performance gain from turning off that parameter, it was precisely because the execution plan did change - the BITMAP CONVERSION operations were no longer performed and his CPU time decreased accordingly.

To help explain the above, here's an example of a case I saw from back in June, 2003 on a 9i database. (I don't have the actual trace files anymore, but I did glean this information from an email of mine that I sent out then.)

Note that the query, table, column and index names have been mangled for privacy reasons, but I hope that it remains illustrative. I also hope that the formatting after I post is readable.

===================================
a) The initial query with _B_TREE_BITMAP_PLANS = TRUE.
===================================

SELECT a.col1, a.col2, a.col3
FROM TABLE1 A
WHERE exists(
select *
from TABLE2 c
where a.col3=c.col3 AND a.col2=c.col2
and c.colLowCard in('some value') )

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.01 0 84 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 76 5128.984154483524.82 99 4410159 0 7549
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 78 5128.984154483524.84 99 4410243 0 7549

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 26 (MY_USER)

Rows Row Source Operation
------- ---------------------------------------------------
7549 NESTED LOOPS SEMI (cr=4410159 r=99 w=0 time=4608600925 us)
11393 TABLE ACCESS FULL TABLE1 (cr=159 r=81 w=0 time=303624 us)
7549 TABLE ACCESS BY INDEX ROWID TABLE2 (cr=4410000 r=18 w=0 time=8902727618 us)
7976 BITMAP CONVERSION TO ROWIDS (cr=4402145 r=18 w=0 time=8902437553 us)
7549 BITMAP AND (cr=4402145 r=18 w=0 time=8902368613 us)
11393 BITMAP CONVERSION FROM ROWIDS (cr=22891 r=18 w=0 time=816318 us)
11936 INDEX RANGE SCAN IDX_TABLE2_COL3 (cr=22891 r=18 w=0 time=678697 us)(object id 169962)
11409 BITMAP CONVERSION FROM ROWIDS (cr=4379254 r=0 w=0 time=8901311475 us)
1562094532 INDEX RANGE SCAN IDX_TABLE2_COLLOWCARD (cr=4379254 r=0 w=0 time=3033955058 us)(object id 169961)

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 76 0.00 0.00
db file sequential read 19 0.01 0.13
db file scattered read 11 0.01 0.14
SQL*Net more data to client 74 0.00 0.00
SQL*Net message from client 76 0.47 2.95

===================================
b) The same query with _B_TREE_BITMAP_PLANS = FALSE.
===================================

SELECT a.col1, a.col2, a.col3
FROM TABLE1 A
WHERE exists(
select *
from TABLE2 c
where a.col3=c.col3 AND a.col2=c.col2
and c.colLowCard in('some value') )

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.04 0.35 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 505 1.98 2.05 35 3817 0 7549
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 507 2.03 2.40 35 3817 0 7549

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 26 (MY_USER)

Rows Row Source Operation
------- ---------------------------------------------------
7549 HASH JOIN SEMI (cr=3817 r=35 w=35 time=2001873 us)
11393 TABLE ACCESS FULL TABLE1 (cr=84 r=0 w=0 time=29272 us)
163053 TABLE ACCESS BY INDEX ROWID TABLE2 (cr=3733 r=0 w=0 time=1177671 us)
166859 INDEX RANGE SCAN IDX_TABLE2_COLLOWCARD (cr=969 r=0 w=0 time=366546 us)(object id 169961)

(Sorry but my email didn't include the waits for this second query.)
 
Lets get the facts out straight away...

I'm a fan of the Tom Kyte approach and regularly use his website
I use Tom's approach daily and advocate it to the many Oracle consultants, engineers and architects I come into contact with each day.
I've bought one book by Mr Burleson and from that book and other things I've read on his websites and the internet I don't much like Mr Burleson or his approach.
I know who I'd trust to help me design, build and implement a system - and I'm sure you do too!


If the article were written by Tom he would have posted the plans before and after to show that they had changed - that's the whole point of Tom's approach - that he makes an assertion and then proves it in front of your eyes so you can then take it away and test it yourself in your environment - and in some cases, as Tom would I'm sure readily admit, find flaws or extensions to the solution. I agree with the other comments that I find it difficult to believe the plans hadn't changed...but obviously without the supporting information we can't draw any firm conclusions.

This argument that has been raging between Research DBA's and Reactive DBA's doesn't really make sense to me. The Research DBAs are trying to ensure that the design and approach is correct by testing their hypotheses and illustrating the evidence - the scientific approach, e.g. is a bitmap index better than a b tree index in a specific scenario that will appear in production. The Reactive DBA's are just going to an ailing system and running a series of scripts to gather real world statisical evidence, analysing the evidence and then taking appropriate steps to resolve the problem(s), e.g. realising that the use of a bitmap index on an OLTP system was causing contention and choosing instead to convert the index to a B Tree.

I must admint, what I do find offensive is Mr Burleson's xenophobic attitude with his comment about not using the term Engineer as it means something specific in the USA - Hello, the world does not revolve around the USA! His comment in the Oracle 911 article "We outsourced our DBA staff to India. They can’t figure it out and suggested that we call you." could also be construed as offensive in suggesting that Indian DBA resources are substandard.

Cheers
Jeff Moss
 
Post a Comment