<$BlogRSDUrl$>

Friday, September 24, 2004

A throwaway comment on Oracle-L to the effect that the CBO will make better choices if you use an artificial value to indicate an unknown value (in this case a current date) led me to knock together this demonstration that this is in general probably not true.The test goes as followscreate a table with a pk and 4 columns - 2 of type date and one each of numeric and character.Then populate these columns with 20000 values that represent date data for (approx) the last 5 years. In the case of the character and numeric columns we use 'dates' of the format YYYYMMDD so that they sort properly.

then I have updated the last 30 days worth of history for records that have not yet had a date assigned (think payment date for invoices).the first column is updated to 31st December 9999 (obviously artificial), the second to NULL the last two to ' ' and 0 respectively. I also attempted to simulate data error/disputed payment or whatever by incorrectly updating 20% of the remaining data to the same value. Finally like a good dba I collect stats.I then explain two queries against the data. The first just tries to select those records with 'unassigned' dates - current records. This should return 1279 records out of 20000. The second query covers (and admittedly somewhat artificial) range and should return 104 records. This gives me the following results for the expected cardinalities of the queries

Query Actual Cardinality Artificial Date Date Artificial Char Artificial Number
Current 1279 11 1179 11 11
Access Path IDX FTS IDX IDX
Range Scan 104 21 124 11 20
IDX IDX IDX IDX
Current - Histograms 1279 1404 1331 1248 1248
FTS FTS FTS FTS
Range Scan - Histograms 104 103 42 195 195
IDX IDX IDX IDX

There are also two things to consider. The first is how does the query under test perform, but the second is how would the figures computed above affect the optimizer if the results from this query were to feed another operation in a more complex query plan. There are a number of interesting things that one can draw from this, but I'll limit myself to the following suggestions.

  • In the absence of histograms only using the correct combination of datatype and null gives the correct access path for query one or anything like the correct cardinality. Using the other 'strategies' would likely result in inappropriate hash (or less likely sort-merge) joins due to the artifically low cardinalities.
  • Histograms make life a lot easier (with the curious exception of range scans when the correct datatype and null value was used). but of course maintaining histograms is not a free operation.
  • the full results from my 10g system are here. If you were wondering about the last 2 columns, yes I have seen - indeed use - a third party system that uses characters and numbers to represent dates.


    1 Comments
    1 Comments:
    For a demo of why the "store a date as a string" approach affects performance see Effective Oracle by Design, Ch. 7, p375. As always Tom Kyte makes the case for using the correct datatype very clear.

    cheers.
     
    Post a Comment