Friday, April 29, 2005

Niall's corollary to my first law of disaster recovery from yesterday ("You never notice the disaster until 5 minutes after you could easily have recovered from it"):

If you comment publicly on someone else's unfortunate error, you will make one yourself within 24 hours.

So I decided that I was going to rejig the file layout on my laptop, moving Oracle to a different dedicated volume. As it's a laptop, I just decided to trash the existing database, remove the software and re-install (this works really well on 10g, by the way). Naturally I remembered to export the simple profiler schema and to move some custom .java files I'd mis-stored in ORACLE_HOME; the simple matter of keeping my tnsnames.ora escaped me. Only 25 more entries to recreate...

If this was the first time I'd done this it might be more forgivable.


Wednesday, April 27, 2005

Two semi-related things to talk about this morning.

The first is a new policy for Burleson Consulting: "We are under no obligation whatsoever to provide reproducible evidence or test-cases to beginners and it is our readers trust in our judgment and real-world experience that counts. Resist the temptation to fall-in to the “prove it” trap."

I really don't know what to say, other than that this is a hugely different attitude from mine. That is partly because I don't trust my own judgement enough to do stuff without proving it first - I used to, and got burned. And partly because real-world experience teaches you that things change, making previously good judgements poor and previous experience invalid.

I guess my summary would be: if you want to rely on someone who states on the web that they have lots of experience and excellent judgement, then you know where to go. If you want to be sceptical about things and evaluate people's credibility for yourself, there are plenty of other places to go (it's one of the things that my recommended sites have in common, but there are plenty of others).

The second thing can be illustrated by Tom Kyte's blog. It isn't so much the mishap itself, but the fact that however many new and clever means you have of getting out of a hole - flashback query, a standby that is n hours old, or whatever - you won't notice the problem until just after you've replicated it to all your standby sites and lost the quick recovery functionality.


Monday, April 25, 2005

I almost always find it interesting when real people write about real projects with which they are involved. It is one of the things that makes Tales of the Oak Table (ignore the first barking mad review) a must-buy book (Dave Ensor's potted history of Oracle Corp is the other). It can also be enlightening to discover how often accidents of fate, huge problems and smart people combine to add real value to commercial products. For example, both Juan Loaiza's introduction of extended SQL trace in Oracle7 and the DTrace functionality in Solaris 10 appear to have been born out of trying to win benchmarks and not having a clue why the new shiny product was underperforming so badly. You really need the people involved to speak about this - corporate marketing won't generally admit to this sort of thing, for reasons that reflect more on marketing itself, it seems to me.

Anyway, most people familiar with the Oracle scene will be familiar with the work of Tom Kyte - in particular with his asktom column in Oracle Magazine, which is of course now also a very popular website. Tom has just recently started a new blog, and today's article gives the history behind asktom. Well worth a read.


Tuesday, April 19, 2005


What's the difference between a data warehouse and an OLTP system? Well, there are a bunch of differences, but David Aldridge has an interesting take on tuning in his new blog. He describes a tuning exercise that resulted in massively improved execution times. The only problem: the end-users didn't care.

I think David has nailed part of the problem, the attention-span thing, but I think that part of it comes down to the DSS vs OLTP bias that he has (mine is the exact opposite). Most report, or batch, users expect their process to take some time, sometimes really a lot of time. This isn't a problem because you often don't interact with a report: you analyse it, you think about it, you have meetings about it, and then you implement something (or perhaps don't implement something you were going to do). Now the business might well like this process to take a shorter period of time, but it probably isn't the running of the report that takes the time, it's all the clever application of human intelligence to business data (which sadly we don't have instrumented yet).

In an OLTP environment, though, you want your data entry people to enter as much data in as short a time as possible (especially, of course, if you can use the web to persuade your customers to be data entry people for you) - that way you get a large number of transactions at a low cost. Here the time a technical process takes really is a key component, and in this environment small absolute changes can have a really big impact.

By the way, when did we stop doing DSS and start doing data warehouses? I much prefer the first name, as it suggests the human intelligence bit rather than the stockpiling of vast quantities of data at high cost for no discernible benefit that the second name suggests.

There are a number of articles scattered around the internet, for example here, here and here, that suggest that DBAs should routinely set two optimizer parameters, optimizer_index_caching and optimizer_index_cost_adj, to values other than their defaults. There are rather fewer articles that describe what these parameters do, or that note that setting them has an unpredictable (in practice, rather than in theory) effect on the behaviour of your applications. Now, it so happens that I recently got a chance to do some investigation into a performance problem that we were having with one of our applications. This part of the application acts as a document workflow - routing documents for approval, authorisation and so on.
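For background, both parameters can be experimented with safely at session level, which keeps any effect away from other users. A minimal sketch - the values below illustrate the "commonly recommended" style of setting, and are not recommendations:

```sql
-- optimizer_index_cost_adj costs index access paths as a percentage
-- of the default (100); lower values make indexes look cheaper.
alter session set optimizer_index_cost_adj = 10;

-- optimizer_index_caching tells the optimizer to assume this percentage
-- of index blocks is already in the buffer cache (default 0).
alter session set optimizer_index_caching = 90;
```

Setting them at session level first also makes it easy to flip back to the defaults and compare behaviour.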

The tables involved looked like this; I have renamed them to better illustrate their function (and to hide the product). The trans table holds a register of transactions, the docs table holds documents about the transactions, the tasks table holds a list of current and completed workflow tasks, and the workflow table maps the current state of the document within the workflow. The task_demo script that I have uploaded allows you to create this schema extract, with representative sample data in it, yourself.

(The DESCRIBE output for the four tables has not survived here; the task_demo script contains the full definitions.)


The problem that we were having was that users of the system who took part in the workflow were experiencing a significant delay before the page rendered when they logged on to the web interface. Tracking this down in third-party code in a connection-pooled environment was a challenge on its own, but after a while we determined that the delay was attributable to the statement below (sql2.sql holds this code).

SELECT ...                              -- select list not preserved here
FROM   docs d,
       tasks tk,
       workflow wf,
       trans t
WHERE  tk.tk_status IN (2, 16)
AND    tk.tk_activity = 'APPROVE'
AND    tk.tk_wfid = wf.wf_id
AND    wf.wf_status IN (2, 4)
AND    wf.wf_docid = d.docid
AND    TO_CHAR(t.transno) = d.doc_transno
AND    t.company = 'C1'
AND    t.line# = d.doc_line#
AND    t.treatment_code = '7'
AND    d.doc_extref = 'C1;7;' || d.doc_transno || ';' || TO_CHAR(d.doc_line#)

Unfortunately, on our test system this statement ran fine - even against the same data. After some investigation we determined that the optimizer_index parameters were in fact different: they were still set to their default values on the test system, but to "realistic" values on the live system. I ran the silver_bullet script with various values of these parameters and obtained the following results. As you can see, in the test I ran only the optimizer_index_cost_adj parameter was important; setting it very low flipped the execution plan to a significantly worse one. In the real-life situation there was in fact an interaction between the two parameters, but the same general point held true: setting the parameters to commonly accepted values can cause otherwise perfectly performing SQL to behave poorly.
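You can repeat this kind of comparison yourself along these lines (a sketch only - the statement shown is a placeholder for the real query; dbms_xplan is available from 9i onwards):

```sql
-- plan under the default setting
alter session set optimizer_index_cost_adj = 100;
explain plan for
  select ... ;                         -- the problem statement goes here
select * from table(dbms_xplan.display);

-- plan under the "recommended" low setting
alter session set optimizer_index_cost_adj = 10;
explain plan for
  select ... ;                         -- same statement again
select * from table(dbms_xplan.display);
```

Comparing actual elapsed times (set timing on) as well as plans is the important part; a different plan is not, by itself, a better plan.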

Incidentally, you might be wondering why the script is called sql2. Well, there is also a sql1, which illustrates the plan switching from a hash join towards nested loops (this is the classic example found, for example, in this presentation by Jonathan Lewis). The problem was that although the plans did switch, the runtimes didn't differ. I did some further research and can find no timed examples showing an improvement from setting these parameters, just examples showing that the same statement can change execution plans - and the fact that two different plans can be used is not evidence of which setting is better.


Tuesday, April 12, 2005

There have been a number of concurrent threads on a large number of different forums scattered around the net recently, all concerning the merits of an evidence-based approach versus a holistic, experienced practitioner's view of Oracle performance problems.

I'm not going to comment too much on the subject in this entry, but I am considering a slightly longer article on what to do when experts disagree. Mostly I'm going to remark upon Mike Ault's latest blog entry, which Mark Rittman has pointed out.

Before I do that, I suspect it's worth laying out a little of the background.

Back in February Don Burleson posted a 'challenge' asking for a solution (a script or scripts) that would predict when tables would benefit from reorganisation. Tom Kyte was asked about this on http://asktom.oracle.com . His reply was, essentially, that the factors listed aren't deterministic; in fact, of the four listed, the first was true for some sorts of activity and false for others, the second was largely irrelevant, the third could be better fixed by a design change than by the sticking plaster of periodic re-orgs, and the fourth displayed an incomplete understanding of the technical workings of Oracle (freelists vs freelist groups). Tom being Tom, he demonstrated all this with code we can run ourselves. All things considered, not a contest worth winning. It is of course possibly impertinent to point out that the reward for this challenge was $250 worth of books, and the right for Don to publish the work in an upcoming, for-sale, book.

Faced with this reply, Don then posted a discussion topic, apparently discussing whether sqlplus scripts could be used to prove anything. Admittedly, this soon became whether sqlplus scripts could prove anything (it looks like the winner of the challenge will be a full-blown multi-user benchmark). Now, it may or may not be the case that this was really the aim of the discussion. Don wrote the following:

"I'm NOT going to bash another Oracle author, it's unprofessional. . . . I was interviewing a DBA yesterday (30 years DBA experience, Certified Oracle Master, ex-Oracle University instrucor), and I asked "What do you think about the work of Tom Kyte?" He replied that he suspected that Kyte had very little real-world experience diagnosing and tuning large production databases. (I've wondered about that myself, since I cannot find his experience and qualifications anywhere online.)"

Quite frankly, the claim to be interviewing someone looks like a lie (note that the first relational database was not available until 1977, which makes 30 years of DBA experience rather hard to achieve); it looks like good old smear tactics.

So my summary of the background is:

The owner of Burleson Consulting published a challenge asking for scripts to prove something.
Tom Kyte, when asked about it suggested that the challenge was rather seriously flawed and provided evidence to show why.
The owner of Burleson Consulting then instituted a discussion, on his own boards but also taken up elsewhere, aimed at discrediting evidence based argument in general and Tom Kyte in particular.

Don't take my word for it though, read the discussions and form your own conclusions.

It's against this background that I come to Mike's blog. Mike is of course also a Burleson Consulting employee, and is also taking issue with the idea that scripts can be used to show things that are useful from a tuning perspective. Firstly, he describes a database health check: they come in and check the database-wide settings and stats, security, and backup and recovery procedures. This can be very useful, especially to shops with limited DBA resource.

Next, however, he goes on to describe a tuning assignment. Mike's words are quoted below, with my comments following each quotation.

"We never go onsite and start issuing recommendations without analysis. The last time I saw this technique used was from Oracle support three days ago. Without looking at detailed statistics or analysis they suggested increasing the shared pool (even though all of it wasn't being used), increasing the database cache size (even though 30% of the buffers were free in V$BH), turn off MTS (we had already tried this and it made no difference) and several other Oracle "Silver Bullets", none of which made the slightest difference in the problem."

So the description describes a problem: support had come in and tried various things with no supporting evidence - the classic silver bullet approach. It didn't work.

"It turned out the problem was that 10g has a default value of TRUE for the parameter _B_TREE_BITMAP_PLANS which, even though the application had been running for months, suddenly started making a difference. Note there were no bitmaps in the database other than those that Oracle had in their schemas. They offered no proofs, no explanations, just a sigh of relief that this Oracle Silver Bullet fixed the problem. This is a classic example of why "simple", "repeatable" proofs can be hazardous. By all of the indications there was nothing wrong at the database level. No waits or events showed the problem, even plans for the SQL didn't change. Any SQL activity resulted in CPU spiking of up to 70-80%. However, the setting of this undocumented parameter had a profound effect on the optimizer and overall database performance. If an Oracle analyst hadn't "seen this before" and remembered this "Silver Bullet" the client would still be having problems. The troubled queries dropped form taking several minutes to 6 seconds."

Well, it looks like the silver-bullet troubleshooting process was simply continued: try switching off _b_tree_bitmap_plans and see what happens. It also rather looks as if Mike hasn't come across _b_tree_bitmap_plans, which allows bitmap access paths (bitmap conversions of b-tree indexes) even without bitmap indexes - and the parameter got this default in 9i. The comment about there being no bitmaps makes no sense otherwise.
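For reference, this particular switch can be tried at session level first (underscore parameters have to be quoted, and are normally only set on the advice of Oracle Support):

```sql
-- disable bitmap conversion plans for b-tree indexes in this session only
alter session set "_b_tree_bitmap_plans" = false;
```

Trying it per session first means you can observe the effect on the problem statements before touching the instance-wide setting.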

There are two interesting observations here: the elapsed response time for specific queries dropped from several minutes to 6 seconds, and the execution plans didn't change.

I find this interesting for two reasons. First, there is a correct change of focus from database-wide to problem queries (and to measurement and evidence). Second, there is a clear implication that parsing was taking several minutes per query (the plans and waits were the same, so it must be the variable CPU, which is in the parse phase). The second feels like a bug (or perhaps a misdiagnosis of the actual plans in use).

"When going from 7 to 8 to 8i to 9i to 10g Oracle has been notorious for resetting their undocumented parameters from true to false, false to true, adding new undocumented and documented parameters and of course varying the various number based parameters to get the right "mix". You have to understand this in order to trouble shoot Oracle performance problems. Even within the same release some of these change based on platform. This is the point I am trying to make about single user, small database proofs, they are great for documenting Oracle behavior on single user, small databases. They would have been completely useless in this case."

But having identified single statements that were taking an excessive length of time to parse, it would seem that single-user sqlplus scripts would be exactly what was called for:

alter session set events '10053 trace name context forever, level 1'; -- to trace the parsing activity
alter session set events '10046 trace name context forever, level 12'; -- to get timing

Run the statement, turn off the events, and - a pound to a penny - the impact of excessive consideration (and then rejection) of b-tree bitmap plans would have fallen out of the trace. The alternative scenario is this: suppose you'd got a different analyst, one who hadn't seen this before. What would have happened when he ran out of all the things he had seen before? What if you hit a new problem? What then?
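Put together, the single-user diagnostic session sketched above might look like this (the suspect statement itself is a placeholder; the trace file appears in user_dump_dest, and tkprof will summarise the 10046 output):

```sql
alter session set tracefile_identifier = 'parse_test';  -- makes the trace file easy to find

alter session set events '10053 trace name context forever, level 1';
alter session set events '10046 trace name context forever, level 12';

-- run the suspect statement here

alter session set events '10053 trace name context off';
alter session set events '10046 trace name context off';
```

The 10053 trace shows which access paths the optimizer considered and rejected during the parse; the 10046 trace shows where the time actually went.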

Very short one.

When I logged on to Blogger to post the next entry, there was a short note saying that there was a problem with the page, which was being fixed.

Don't you hate it when error handlers cause errors?



Wednesday, April 06, 2005

A recent discussion on the merits of investigating Oracle behaviour with single-user scripts, rather than relying on the accumulated experience of experts (I won't link to it, since it rather degenerated), led me to upload a script that I use for running an arbitrary number of concurrent sqlplus sessions. This can be an effective way of simulating load on a server.

The script needs to be run from a command line - it will echo usage back to you if you get it wrong - and is available from the OraLoader link at the right, or here.

For those of you who don't like downloading and running unseen .vbs scripts, the code is shown below.

'Name: OraLoader.vbs
'Purpose: Run multiple sqlplus sessions concurrently
'Description: Run n instances of a named script simultaneously.

dim oShell
dim sqlScript
dim cmdLine
dim strUsage
dim iInstances
dim dbUser, dbPwd

strUsage = "Usage: cscript OraLoader.vbs n scriptname username password"
strUsage = strUsage & vbCRLF & vbTab & "n = No of instances"
strUsage = strUsage & vbCRLF & vbTab & "scriptname = path to sql script"
strUsage = strUsage & vbCRLF & vbTab & "username = database username"
strUsage = strUsage & vbCRLF & vbTab & "password = database password"

if (Wscript.Arguments.count <> 4) then
    Wscript.stdout.write strUsage
    Wscript.Quit 1
end if

' assign variables
iInstances = Wscript.Arguments(0)
sqlScript = Wscript.Arguments(1)
dbUser = Wscript.Arguments(2)
dbPwd = Wscript.Arguments(3)

cmdLine = "sqlplus " & dbUser & "/" & dbPwd & " @" & sqlScript
set oShell = CreateObject("wscript.shell")
wscript.stdout.write cmdLine
runCommand iInstances, cmdLine
set oShell = nothing

sub runCommand(iload, strCommand)
    dim i
    for i = 1 to iload
        ' launch each sqlplus session without waiting for it to finish
        oShell.Run strCommand, 1, false
    next
end sub

My advice is to be very sure that the sql scripts you call exit cleanly at the end; otherwise you can be left with a bunch of unnecessary sqlplus sessions running.
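For example, a called script might end like this (test_load.sql is a hypothetical name; the point is the error handling and the explicit exit):

```sql
-- test_load.sql: a trivial load script that always terminates its session
whenever sqlerror exit sql.sqlcode   -- leave sqlplus immediately on any error
select count(*) from user_tables;
exit                                 -- and exit cleanly on success
```

Without the explicit exit, each launched sqlplus session just sits at the prompt after the script completes.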

I have a number of similar scripts for loading remote servers, but the example above should give you the idea.


Tuesday, April 05, 2005

Raymond Allo points out that there is a typo in my instructions for relinking Oracle for asynchronous I/O on Linux. This should read make -f ins_rdbms.mk async_on, rather than asynch_on.


Friday, April 01, 2005

A brief note on the layout here: I am aware that the site displays oddly in Internet Explorer - it does display correctly in Firefox. I hope to be able to resolve this shortly.