Saturday, January 31, 2009

Edition-based redefinition (a speculation)

As you can see from this link, there is a feature called Edition-based redefinition mentioned in the 11GR1 documentation. Unfortunately, all it says is that the feature is not yet available, so there is really not much you can tell about it.

A speculation

I'm usually not a great fan of this sort of guesswork. In this case, however, it might be an interesting exercise to observe some bits and pieces present in the current release of 11GR1 in order to see whether they give us some clues about what to expect from this new feature and what the underpinnings might look like, as well as to prepare ourselves for some of the implications...

SYS.OBJ$

Perhaps this is the first thing you'll notice as far as the 11G data dictionary is concerned. For example, this is how the dba_synonyms view is defined in 10GR2:
create or replace view dba_synonyms
(owner, synonym_name, table_owner, table_name, db_link)
as
select u.name, o.name, s.owner, s.name, s.node
from sys.user$ u, sys.syn$ s, sys.obj$ o
where o.obj# = s.obj#
and o.type# = 5
and o.owner# = u.user#;
Now, take a look at the same view's definition in 11GR1:
create or replace view dba_synonyms
(owner, synonym_name, table_owner, table_name, db_link)
as
select u.name, o.name, s.owner, s.name, s.node
from sys.user$ u, sys.syn$ s, sys."_CURRENT_EDITION_OBJ" o
where o.obj# = s.obj#
and o.type# = 5
and o.owner# = u.user#;
As you can see, the reference to SYS.OBJ$ was replaced with SYS."_CURRENT_EDITION_OBJ". This replacement occurs all over the data dictionary, which alone makes it interesting enough to attract a bit of attention.

SYS._CURRENT_EDITION_OBJ

What is _CURRENT_EDITION_OBJ? It's a view:
SQL> select object_type
2 from dba_objects
3 where object_name='_CURRENT_EDITION_OBJ';

OBJECT_TYPE
-------------------
VIEW
Let's take a look at this view's definition (I've omitted the column list for the sake of clarity):
select ...
from obj$ o, user$ u
where o.owner# = u.user#
and ( /* non-versionable object */
( o.type# not in (4,5,7,8,9,10,11,12,13,14,22,87,88)
or bitand(u.spare1, 16) = 0)
/* versionable object visible in current edition */
or ( o.type# in (4,5,7,8,9,10,11,12,13,14,22,87)
and ( (u.type# <> 2 and
sys_context('userenv', 'current_edition_name') = 'ORA$BASE')
or (u.type# = 2 and
u.spare2 = sys_context('userenv', 'current_edition_id'))
or exists (select 1 from obj$ o2, user$ u2
where o2.type# = 88
and o2.dataobj# = o.obj#
and o2.owner# = u2.user#
and u2.type# = 2
and u2.spare2 =
sys_context('userenv', 'current_edition_id'))
)
)
);
First of all, it references (A) versionable objects and (B) the current edition, which is exposed through the session context sys_context('userenv', 'current_edition_id').

Versionable objects

Only objects of the following types seem to be versionable:
  • VIEW
  • SYNONYM
  • PROCEDURE
  • FUNCTION
  • PACKAGE
  • PACKAGE BODY
  • TRIGGER
  • TYPE
  • TYPE BODY
  • LIBRARY
  • ASSEMBLY
Object type# = 88 seems to be something special: I couldn't find what it is from sql.bsq (nor dba_objects), and it is referenced in a somewhat special way: if obj# matches the dataobj# of some entry with type# = 88 and that entry is part of the current edition, the object is considered to be part of the current edition as well. Note that the join itself permits this special object to exist in some other schema, though I don't know whether this has any practical meaning.
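To make the view's predicate easier to digest, here is a small model of the visibility rule in Python. This is a sketch only: the type# constants and the column roles come from the view text above, while the function and argument names are my own.

```python
# Object types the view treats as versionable (the type# list from the view text).
VERSIONABLE_TYPES = {4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 22, 87}

def visible(obj_type, obj_id, owner_type, owner_spare1, owner_spare2,
            edition_name, edition_id, edition_links):
    """Model of the _CURRENT_EDITION_OBJ predicate.

    edition_links -- set of (dataobj#, edition_id) pairs modelling the
                     type# = 88 entries owned by type# = 2 users.
    """
    # Branch 1: non-versionable object, or owner not edition-enabled
    # (bitand(u.spare1, 16) = 0).
    if obj_type not in VERSIONABLE_TYPES | {88} or owner_spare1 & 16 == 0:
        return True
    # Branch 2: versionable object visible in the current edition.
    return obj_type in VERSIONABLE_TYPES and (
        # regular owner: object only visible in the base edition
        (owner_type != 2 and edition_name == 'ORA$BASE')
        # owner of the new type# = 2 matching the current edition
        or (owner_type == 2 and owner_spare2 == edition_id)
        # or some type# = 88 entry in the current edition points at it via dataobj#
        or (obj_id, edition_id) in edition_links
    )
```

Under this model an edition-enabled owner's versionable objects simply disappear from editions they don't belong to, which is consistent with the view replacing obj$ throughout the dictionary.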

New userenv attributes

There are two (at least) new userenv context attributes:
SQL> select sys_context('userenv', 'current_edition_name')
2 current_edition_name from dual;

CURRENT_EDITION_NAME
--------------------------------------------------------------------------------
ORA$BASE
SQL> select sys_context('userenv', 'current_edition_id')
2 current_edition_id from dual;

CURRENT_EDITION_ID
--------------------------------------------------------------------------------
100
Changing these will change the data returned by the _CURRENT_EDITION_OBJ view as well and, given that this view is referenced instead of obj$, it looks like these context attributes will be the primary mechanism for referencing one version (edition) of an object or another.

New user type

Again, by looking at the view definition, we may spot that versionable objects are only considered when user$.type# = 2. To remind you, 0 is a role and 1 is a regular user, so this is something new: perhaps this feature could be enabled on a per-user level and/or versioned objects will be owned by some other "ghost" schema (my comment about the type# = 88 object seems to allow this, at least technically).

Some interesting clues might be gleaned from one of _CURRENT_EDITION_OBJ's column definitions:
when (u.type# = 2) then
(select eo.name from obj$ eo where eo.obj# = u.spare2)
else
'ORA$BASE'
end
Objects owned by this new user type will have the edition_id as their object_id and will be named after the edition name. That also means that object_id is no longer unique:
create unique index i_obj1 on obj$(obj#, owner#, type#) (11GR1)
create unique index i_obj1 on obj$(obj#) (10GR2)
(from sql.bsq)
I wonder whether this has a chance to break legacy code which references sys.obj$ by obj# and expects at most one row to be returned...

Does all that, again, point to object versions being kept under this special user and linked back using data_object_id? Is this also the reason why objects which do have a "proper" data_object_id are not versionable?

A speculation (again)

Please note that everything said in this post is based on some (random) observations of preliminary data available in the current release (11.1.0.7). There is a chance that some (if not all) of this data may change or even become completely invalid once this feature is officially announced.

However, one thing seems to be clear -- in order to support this new feature, some serious alterations of the data dictionary are taking place, and you should stay alert...

Create database or who wants some DMT?

Simpler than ever

Starting from Oracle 10G, creating the database can be as simple as this:
SQL> create database;

Database created.
It's not a surprise that this feature was advertised here and there. However, what these advertisements usually fail to mention is this:
SQL> select name, decode(bitmapped, 0, 'DMT', 'LMT')
  2  from ts$
  3  order by name;

NAME DEC
------------------------------ ---
SYSAUX LMT
SYSTEM DMT
SYS_UNDOTS LMT
As you can see, create database makes your SYSTEM tablespace dictionary managed by default. I don't really know if there are any reasons for this and, since everything else defaults to LMT, it should not be a big deal for most of you anyway (you can also avoid it by specifying extent management local directly in the create database statement). Just don't forget that SYSTEM hosts objects like AUD$ (audit log), FGA_LOG$ (fine-grained audit log) or NCOMP_DLL$ (natively compiled objects), which can grow to a fairly large number of extents.

I was a bit surprised to watch this relic appear even when you do this in 11.1.0.7...

Friday, January 30, 2009

Moving a datafile

Sometimes you need to move a datafile to a different mount point or ASM diskgroup. This could leave you wondering what technique you can use to minimize downtime. I'll show you one of my favorite methods, which works well under certain circumstances.

Let's say you want to move the following datafile...
SQL> select file_id, file_name
2 from dba_data_files
3 where tablespace_name='USERS';

FILE_ID FILE_NAME
------- --------------------------------------------------------
4 /u01/oradata/ORA11GR1/datafile/o1_mf_users_4q759m64_.dbf
...into mount point /u02.

Backup as copy

The first thing we need to do is backup this datafile as copy using RMAN:
[oracle@ora11gr1a ~]$ rman

Recovery Manager: Release 11.1.0.7.0 - Production on Thu Jan 29 17:29:59 2009

Copyright (c) 1982, 2007, Oracle. All rights reserved.

RMAN> connect target;

connected to target database: ORA11GR1 (DBID=3707369966)

RMAN> backup as copy datafile 4
2> format '/u02/oradata/ORA11GR1/datafile/users01.dbf';

Starting backup at 29-JAN-09
using channel ORA_DISK_1
channel ORA_DISK_1: starting datafile copy
input datafile file number=00004 name=/u01/oradata/ORA11GR1/datafile/o1_mf_users
_4q759m64_.dbf
output file name=/u02/oradata/ORA11GR1/datafile/users01.dbf tag=TAG20090129T1825
32 RECID=14 STAMP=677442334
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:03
Finished backup at 29-JAN-09
Rollforward image copy

Since switching to the datafile copy will require datafile recovery, it might be a good idea to roll this image copy forward first, in order to bring it up to date...
RMAN> list copy of datafile 4;

List of Datafile Copies
=======================

Key File S Completion Time Ckp SCN Ckp Time
------- ---- - --------------- ---------- ---------------
14 4 A 29-JAN-09 2033489 29-JAN-09
Name: /u02/oradata/ORA11GR1/datafile/users01.dbf
Tag: TAG20090129T182532

RMAN> backup incremental from scn 2033489 datafile 4 format '/u02/oradata/ORA11G
R1/datafile/%U';

Starting backup at 29-JAN-09

using channel ORA_DISK_1
backup will be obsolete on date 05-FEB-09
archived logs will not be kept or backed up
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00004 name=/u01/oradata/ORA11GR1/datafile/o1_mf_users
_4q759m64_.dbf
channel ORA_DISK_1: starting piece 1 at 29-JAN-09
channel ORA_DISK_1: finished piece 1 at 29-JAN-09
piece handle=/u02/oradata/ORA11GR1/datafile/1nk61t1d_1_1 tag=TAG20090129T183004
comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01

using channel ORA_DISK_1
backup will be obsolete on date 05-FEB-09
archived logs will not be kept or backed up
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
including current control file in backup set
channel ORA_DISK_1: starting piece 1 at 29-JAN-09
channel ORA_DISK_1: finished piece 1 at 29-JAN-09
piece handle=/u02/oradata/ORA11GR1/datafile/1ok61t1e_1_1 tag=TAG20090129T183004
comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 29-JAN-09

RMAN> recover copy of datafile 4;

Starting recover at 29-JAN-09
using channel ORA_DISK_1
channel ORA_DISK_1: starting incremental datafile backup set restore
channel ORA_DISK_1: specifying datafile copies to recover
recovering datafile copy file number=00004 name=/u02/oradata/ORA11GR1/datafile/u
sers01.dbf
channel ORA_DISK_1: reading from backup piece /u02/oradata/ORA11GR1/datafile/1nk
61t1d_1_1
channel ORA_DISK_1: piece handle=/u02/oradata/ORA11GR1/datafile/1nk61t1d_1_1 tag
=TAG20090129T183004
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:01
Finished recover at 29-JAN-09

RMAN> list copy of datafile 4;

List of Datafile Copies
=======================

Key File S Completion Time Ckp SCN Ckp Time
------- ---- - --------------- ---------- ---------------
15 4 A 29-JAN-09 2033633 29-JAN-09
Name: /u02/oradata/ORA11GR1/datafile/users01.dbf
Tag: TAG20090129T182532
Note that the image copy's checkpoint SCN has moved forward. Keep in mind that this step generally makes sense only if you have block change tracking enabled and/or there is a huge amount of archivelogs to apply, as it is a trade-off between creating and applying the incremental backup versus directly applying all the necessary archivelogs. This step can be done starting from Oracle 10G.

Switch datafile

All we have to do now is execute a small RMAN block...
RMAN> run
2> {
3> sql 'alter database datafile 4 offline';
4> switch datafile 4 to datafilecopy '/u02/oradata/ORA11GR1/datafile/users0
1.dbf';
5> recover datafile 4;
6> sql 'alter database datafile 4 online';
7> }

sql statement: alter database datafile 4 offline

datafile 4 switched to datafile copy
input datafile copy RECID=15 STAMP=677442659 file name=/u02/oradata/ORA11GR1/dat
afile/users01.dbf

Starting recover at 29-JAN-09
using channel ORA_DISK_1

starting media recovery
media recovery complete, elapsed time: 00:00:00

Finished recover at 29-JAN-09

sql statement: alter database datafile 4 online
This is where you'll have some downtime. The amount of downtime depends on how long it takes to recover the datafile, which will generally be a function of how many archivelogs need to be applied and which, in turn, can be reduced by using the incremental backup. The point is that this step can be really fast.

Don't forget to watch out for nologging operations!

Sunday, January 25, 2009

Update and rownum oddity

Take a look at the following table:
SQL> create table codes
2 (
3 code varchar2(10),
4 used number,
5 constraint pk_codes primary key (used, code)
6 ) organization index;

Table created.

SQL> insert into codes
2 select dbms_random.string('x', 10),
3 case when level <= 5000 then 1 else 0 end
4 from dual
5 connect by level <= 100000;

100000 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(user, 'codes');

PL/SQL procedure successfully completed.
This table contains a set of codes, with the used column representing whether a code was already used (1) or not (0). We need to return one (random) unused code from the above table and mark it as used. This is very easy to achieve using the following update statement:
SQL> variable code varchar2(10);
SQL> set autot traceonly
SQL> update codes set used=1
2 where used=0 and rownum=1
3 returning code into :code;

1 row updated.


Execution Plan
----------------------------------------------------------
Plan hash value: 1169687698

-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | UPDATE STATEMENT | | 1 | 13 | 117 (1)| 00:00:02 |
| 1 | UPDATE | CODES | | | | |
|* 2 | COUNT STOPKEY | | | | | |
|* 3 | INDEX FAST FULL SCAN| PK_CODES | 50000 | 634K| 117 (1)| 00:00:02 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(ROWNUM=1)
3 - filter("USED"=0)


Statistics
----------------------------------------------------------
1 recursive calls
5 db block gets
12 consistent gets
0 physical reads
124 redo size
913 bytes sent via SQL*Net to client
874 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
However, the plan is not exactly what I would expect (I'm running this on 11.1.0.7)... Why do an index fast full scan (IFFS) when we can do an index range scan (IRS) to get just one row? This is exactly what a regular select does, after all:
SQL> select *
2 from codes
3 where used=0 and rownum=1;


Execution Plan
----------------------------------------------------------
Plan hash value: 802332609

------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 2 (0)| 00:00:01 |
|* 1 | COUNT STOPKEY | | | | | |
|* 2 | INDEX RANGE SCAN| PK_CODES | 1 | 13 | 2 (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(ROWNUM=1)
2 - access("USED"=0)


Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
2 consistent gets
0 physical reads
0 redo size
590 bytes sent via SQL*Net to client
520 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

10053 trace

Let's take a look at the 10053 trace output for both the select and update statements. I'll pick out the relevant sections.

UPDATE:
***************************************
BASE STATISTICAL INFORMATION
***********************
Table Stats::
Table: CODES Alias: CODES
#Rows: 100000 #Blks: 423 AvgRowLen: 13.00
Index Stats::
Index: PK_CODES Col#: 2 1
LVLS: 1 #LB: 423 #DK: 100000 LB/K: 1.00 DB/K: 1.00 CLUF: 0.00
Access path analysis for CODES
***************************************
SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for CODES[CODES]
Table: CODES Alias: CODES
Card: Original: 100000.000000 Rounded: 50000 Computed: 50000.00 Non Adjusted: 50000.00
Access Path: index (index (FFS))
Index: PK_CODES
resc_io: 116.00 resc_cpu: 20012369
ix_sel: 0.000000 ix_sel_with_filters: 1.000000
Access Path: index (FFS)
Cost: 116.81 Resp: 116.81 Degree: 1
Cost_io: 116.00 Cost_cpu: 20012369
Resp_io: 116.00 Resp_cpu: 20012369
OPTIMIZER PERCENT INDEX CACHING = 0


Access Path: index (IndexOnly)
Index: PK_CODES
resc_io: 213.00 resc_cpu: 11516867
ix_sel: 0.500000 ix_sel_with_filters: 0.500000
Cost: 213.47 Resp: 213.47 Degree: 1
Best:: AccessPath: IndexFFS
Index: PK_CODES
Cost: 116.81 Degree: 1 Resp: 116.81 Card: 50000.00 Bytes: 0

***************************************
Nothing else is tried, and this is what the optimizer selects as the best execution plan. Note how the cardinalities are reported (Rounded: 50000).

SELECT:

In addition to the above, has one more section:
***************************************
SINGLE TABLE ACCESS PATH (First K Rows)
Single Table Cardinality Estimation for CODES[CODES]
Table: CODES Alias: CODES
Card: Original: 2.000000 Rounded: 1 Computed: 1.00 Non Adjusted: 1.00
Access Path: index (index (FFS))
Index: PK_CODES
resc_io: 2.00 resc_cpu: 7461
ix_sel: 0.000000 ix_sel_with_filters: 1.000000
Access Path: index (FFS)
Cost: 2.00 Resp: 2.00 Degree: 1
Cost_io: 2.00 Cost_cpu: 7461
Resp_io: 2.00 Resp_cpu: 7461
OPTIMIZER PERCENT INDEX CACHING = 0


Access Path: index (IndexOnly)
Index: PK_CODES
resc_io: 2.00 resc_cpu: 14443
ix_sel: 0.500000 ix_sel_with_filters: 0.500000
Cost: 2.00 Resp: 2.00 Degree: 1
Best:: AccessPath: IndexRange
Index: PK_CODES
Cost: 2.00 Degree: 1 Resp: 2.00 Card: 1.00 Bytes: 13

First K Rows: unchanged join prefix len = 1
Join order[1]: CODES[CODES]#0
***********************
Note how the cardinalities have changed this time (Rounded: 1). What happened is that the rownum = 1 predicate resulted in first_rows(1) mode (the "First K Rows" section), affecting how the cardinalities were calculated.

We know that the first_rows(n) hint is ignored in update and delete statements, thus our update statement always runs in all_rows mode.
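A quick back-of-the-envelope comparison makes the difference visible. This is only a sketch: the cost figures are copied from the 10053 traces above, and the first-k cardinality capping is my simplification of what the optimizer does, not its actual formula.

```python
# all_rows mode (the update): the optimizer costs the plan for producing
# every row matching used = 0.
total_rows, selectivity = 100000, 0.5
all_rows_card = int(total_rows * selectivity)    # Card: Rounded: 50000

# resc figures reported in the update's trace:
iffs_cost, index_only_cost = 116.81, 213.47
best_all_rows = 'IndexFFS' if iffs_cost < index_only_cost else 'IndexRange'

# first_rows(1) mode (the select): the rownum = 1 stopkey caps the
# cardinality at k, so only a couple of I/Os get costed (Cost: 2.00 for
# the range scan in the select's trace) and the range scan wins.
k = 1
first_k_card = min(k, all_rows_card)             # Card: Rounded: 1
```

With a cardinality of 50000 the fast full scan is simply the cheaper way to produce the whole result set; only the first-k cardinality makes the range scan attractive.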

You can confirm that select behaves exactly the same way when in all_rows mode:
SQL> select /*+ all_rows */ *
2 from codes
3 where used=0 and rownum=1;


Execution Plan
----------------------------------------------------------
Plan hash value: 2682988822

----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 117 (1)| 00:00:02 |
|* 1 | COUNT STOPKEY | | | | | |
|* 2 | INDEX FAST FULL SCAN| PK_CODES | 50000 | 634K| 117 (1)| 00:00:02 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(ROWNUM=1)
2 - filter("USED"=0)


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
12 consistent gets
0 physical reads
0 redo size
590 bytes sent via SQL*Net to client
520 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
Well, I guess here goes my next wish for CBO improvement: how update and delete statements are handled when predicates involve rownum...

Friday, January 23, 2009

11G Managed Recovery Process

The MRP process is commonly referenced throughout the web as the process which performs redo apply on your managed standby database. MRP may work in a team with PQ slave (or prnn in 11G) processes in case you start managed recovery in parallel.

Unfortunately, the term "performs redo apply" seems to be causing some confusion along the way. I've found it very common for people to believe that it's the MRP process which both reads from the redo stream and writes the changes into the datafiles. For example, they sometimes try to battle slow log apply by increasing managed recovery parallelism, without realizing that there is in fact a bit more to the puzzle.

Some details

I'm going to use my 11G Data Guard setup to demonstrate a couple of key points. My setup is operating using real-time apply (no parallelism), which makes the entire example a bit simpler to demonstrate.

Let's update a row on the source DB:
SQL> update t set n=n;

1 row updated.

SQL> commit;

Commit complete.
Now, take a look at MRP strace output which was produced as a result of the above change:
[oracle@ora11gr1b fd]$ ps -fp 6364
UID PID PPID C STIME TTY TIME CMD
oracle 6364 1 0 19:59 ? 00:00:00 ora_mrp0_ora11gr1
[oracle@ora11gr1b fd]$ strace -e pread,pwrite -p 6364
Process 6364 attached - interrupt to quit
pread(32,..., 512, 45568) = 512
pread(32,..., 1024, 46080) = 1024
pread(30,..., 8192, 2228224) = 8192
pread(30,..., 8192, 259858432) = 8192
pread(31,..., 8192, 9461760) = 8192
I've set filesystemio_options=none so we can observe pread/pwrite syscalls, which are easier to follow than the asynchronous io_submit/io_getevents syscalls (and we don't have to care about the O_DIRECT flag either).

Let's check what these file descriptors are:
[oracle@ora11gr1b fd]$ cd /proc/6364/fd
[oracle@ora11gr1b fd]$ file 30
30: symbolic link to `/u01/oradata/ORA11GR1B/datafile/o1_mf_undotbs1_0fk5fp2c_.dbf'
[oracle@ora11gr1b fd]$ file 31
31: symbolic link to `/u01/oradata/ORA11GR1B/datafile/o1_mf_users_0ik5fp4u_.dbf'
[oracle@ora11gr1b fd]$ file 32
32: symbolic link to `/u01/oradata/ORA11GR1B/onlinelog/o1_mf_9_4qn2rkhk_.log'
In other words, the process read from the standby logfile and from the undo and users (this is where our table is) tablespaces. However, as you might notice, all of these calls were reads; we didn't write anything.

From time to time MRP gets a bit more interesting, for example during logfile switches:
...
pwrite(27,..., 16384, 16384) = 16384
pread(27,..., 16384, 16384) = 16384
pread(28,..., 8192, 8192) = 8192
pread(29,..., 8192, 8192) = 8192
pread(30,..., 8192, 8192) = 8192
pread(31,..., 8192, 8192) = 8192
pread(27,..., 16384, 393216) = 16384
pwrite(28,..., 8192, 8192) = 8192
pwrite(29,..., 8192, 8192) = 8192
pwrite(30,..., 8192, 8192) = 8192
pwrite(31,..., 8192, 8192) = 8192
...
Here we actually wrote something (27 is the controlfile, 28 and 29 are the system and sysaux tablespaces respectively). However, from the offset (the fourth parameter) you can see that we are writing to the second block of these datafiles. There is no (and can't be) any user data there.

Who is writing the data then?

The first thing you might want to check is, of course, the database writer process:
[oracle@ora11gr1b ~]$ ps -fp 6303
UID PID PPID C STIME TTY TIME CMD
oracle 6303 1 0 19:55 ? 00:00:00 ora_dbw0_ora11gr1
[oracle@ora11gr1b ~]$ strace -e pread,pwrite -p 6303
Process 6303 attached - interrupt to quit
pwrite(23,..., 8192, 2097152) = 8192
pwrite(23,..., 8192, 35987456) = 8192
pwrite(24,..., 8192, 9461760) = 8192
This is the output produced by the standby's dbwr right after we updated our table on the source. We wrote two undo blocks (23) and one block in the users tablespace (24). By looking at the offset for file descriptor 24 we can confirm that we wrote to the table itself:
SQL> select segment_name
2 from dba_extents
3 where tablespace_name='USERS'
4 and 9461760/8192 between block_id and block_id + blocks-1;

SEGMENT_NAME
--------------------------
T
From the above you can confirm that it is the DBWR process which wrote the changes for us and that it plays a crucial role during your standby database operations.
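The offset arithmetic behind that dba_extents predicate is worth spelling out. A sketch mirroring the query's own 9461760/8192 expression; it assumes the database uses 8K blocks:

```python
# Convert a pwrite file offset into a block number, the same way the
# dba_extents predicate above does (offset divided by the 8K block size).
BLOCK_SIZE = 8192

def offset_to_block(offset, block_size=BLOCK_SIZE):
    return offset // block_size

# The dbwr write at offset 9461760 on fd 24 lands in block 1155 of the
# users datafile -- the block dba_extents maps back to table T.
block = offset_to_block(9461760)
```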

MRP's workload consists mostly of reading the redo stream, datafiles and controlfile, plus occasional writes into the controlfile and datafile headers.

If your standby is suffering from slow redo apply, you may want to pay attention to both the MRP and DBWR processes.

Tuesday, January 20, 2009

Latch waits in 11G: queuing

Following up on my previous post about 11G latch waits using semaphore post-wait with a latch holder...

What I was curious to confirm is whether the latch holder posts only the first waiter in the list or all of them.

The results

Here are the results I've got using three sessions:
  1. Session 1: gets RC latch
  2. Session 2: misses the latch, begins to sleep
    semop(753664, 0x7fbfffa348, 1)
  3. Session 3: misses the latch, begins to sleep
    semop(753664, 0x7fbfff5f58, 1)
  4. Session 1: completes, posts session 2
    semctl(753664, 28, SETVAL, 0x1)
  5. Session 2: completes, posts session 3
    semctl(753664, 31, SETVAL, 0x1)
Note that though we used the same semaphore set (753664), sessions 1 and 2 posted different semaphores within that set (28 and 31 respectively), and the waiters were woken up by their respective holders in the order they began to wait.
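The handoff discipline observed above can be modelled with a small FIFO post-wait queue. This is a sketch in Python threading terms rather than SysV semaphores, and all names are mine; the point is only that a releasing holder wakes exactly the first waiter, in arrival order:

```python
import threading
from collections import deque

class PostWaitLatch:
    """Each waiter sleeps on its own event (its 'semaphore number' within
    the set) and the releasing holder posts only the oldest waiter."""
    def __init__(self):
        self._lock = threading.Lock()    # protects the holder flag and queue
        self._held = False
        self._waiters = deque()

    def acquire(self):
        with self._lock:
            if not self._held:
                self._held = True
                return
            me = threading.Event()       # analogue of semop() on a private semaphore
            self._waiters.append(me)
        me.wait()                        # sleep until the holder posts us

    def release(self):
        with self._lock:
            if self._waiters:
                # analogue of semctl(..., SETVAL, 1): wake the first waiter only;
                # the latch passes directly to it, so _held stays True
                self._waiters.popleft().set()
            else:
                self._held = False
```

Each woken waiter, when it releases in turn, posts the next one, which reproduces the session 1 -> session 2 -> session 3 chain from the experiment.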

sleeps + spin_gets = misses ?

That's an ideal case, of course, as a just-awakened process might still have to compete for the same latch with the previous holder if that holder decides to get the same latch again. If the awakened process doesn't manage to get the latch during the spin, it will have to sleep again (thus breaking the above formula). However, proximity to the above equation might be useful as one of the indicators of whether a latch is potentially subject to the "new" algorithm (this new wait model should tend to produce fewer sleeps, and hence spins, per miss, otherwise there would not be much point to it).
SQL> select misses, sleeps, spin_gets, sleeps+spin_gets
2 from v$latch
3 where name='Result Cache: Latch';

MISSES SLEEPS SPIN_GETS SLEEPS+SPIN_GETS
---------- ---------- ---------- ----------------
8208 275 7946 8221
The numbers are pretty close indeed. All this made me a bit more curious, and I decided to take a closer look at what happens in strace when a waiter has to sleep more than once in a row.

I ran two parallel sessions executing a loop which selects from the result cache...
begin
for i in 1 .. 1000000
loop
for cur in (select /*+ result_cache */ * from t where n=1)
loop
null;
end loop;
end loop;
end;
...made sure the above formula became further distorted, and took a look at the strace output for one of the sessions, which led me to an interesting discovery:
[oracle@ora11gr1a ~]$ strace -e semop,semctl,semtimedop -p 7591
Process 7591 attached - interrupt to quit
...
semctl(1015808, 30, SETVAL, 0x1) = 0
semctl(1015808, 30, SETVAL, 0x1) = 0
semtimedop(1015808, 0x7fbfff42b8, 1, {0, 10000000}) = 0
semtimedop(1015808, 0x7fbfff42b8, 1, {0, 10000000}) = -1 EAGAIN (Resource temporarily unavailable)
semop(1015808, 0x7fbfff5f58, 1) = 0
...
What we see here is the traditional time-based sleep using the semtimedop syscall. Does that mean that Oracle can still choose between the algorithms dynamically? The answer, according to extended SQL trace, seems to be no, as I wasn't able to match these semtimedop syscalls with latch waits in the SQL trace (so these syscalls were used for something else).

P.S. It looks like 11G has invalidated all the pseudo code you may find around describing the latch acquisition algorithm Oracle uses.

Monday, January 19, 2009

"LONGHOLD" latch waits on Linux

Take a look at the figure from my previous blog post:
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ---------- ----------
Parse        1      0.00       0.00          0          0          0          0
Execute      1      0.00       0.00          0          0          0          0
Fetch     5001      0.53      53.71          0          0          0     500000
------- ------  -------- ---------- ---------- ---------- ---------- ----------
total     5003      0.53      53.71          0          0          0     500000

...

Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
latch free 1 53.20 53.20
What's unusual about it is a single latch wait of 53 seconds without any CPU time being consumed. This is something different from the usual
get -> miss -> spin -> sleep
latch acquisition algorithm we are generally used to.

Sequence of events

To remind you, the sequence of events goes like this:
  1. First session: do select /*+ result_cache */ * from t, which places a single big result into the Result Cache memory.
  2. First session: do select count(*) from v$result_cache_memory, which grabs the Result Cache latch for a long time.
  3. Second session: do select /*+ result_cache */ * from t, which forces us to wait until the Result Cache latch is freed by the first session.
Latch statistics

Let's take a look at v$latch right after we execute the above three steps:
SQL> select gets, misses, sleeps
2 from v$latch
3 where name='Result Cache: Latch';

GETS MISSES SLEEPS
---------- ---------- ----------
53494 1 1
As we can see, the session indeed went to sleep after a single miss. The first thing that may come to your mind is that I'm running this on a single-CPU box, where we don't spin if we miss a latch (due to _spin_count=1 being the default on single-CPU boxes). This is not the case:
SQL> show parameter cpu_count

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
cpu_count integer 2
And my DB's _spin_count has a default value of 2000 as well.

Let's look a bit further:
SQL> select location, sleep_count, wtr_slp_count, longhold_count
2 from v$latch_misses
3 where parent_name='Result Cache: Latch'
4 and sleep_count+wtr_slp_count+longhold_count > 0;

LOCATION SLEEP_COUNT WTR_SLP_COUNT LONGHOLD_COUNT
------------------------------ ----------- ------------- --------------
Result Cache: Serialization12 0 1 0
Result Cache: Serialization16 1 0 1
Note that the LONGHOLD_COUNT column equals 1 for Result Cache: Serialization16. That means we've held the latch for the entire duration of someone else's sleep which, again, conforms to our observation. But what is the underlying mechanism for this?

Under the hood

Here is an excerpt from our second session's strace output:
...
getrusage(RUSAGE_SELF, {ru_utime={0, 728889}, ru_stime={1, 317799}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 728889}, ru_stime={1, 317799}, ...}) = 0
semop(98304, 0x7fbfff5f58, 1) = 0 -- enters the sleep here
times({tms_utime=72, tms_stime=131, tms_cutime=0, tms_cstime=0}) = 429496789
getrusage(RUSAGE_SELF, {ru_utime={0, 729889}, ru_stime={1, 317799}, ...}) = 0
...
And the first session:
...
mmap(0x2a972e7000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 9, 0) = 0x2a972e7000
mmap(0x2a972f7000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 9, 0) = 0x2a972f7000
semctl(98304, 31, SETVAL, 0x7f00000001) = 0 -- issues upon completion
getrusage(RUSAGE_SELF, {ru_utime={206, 80670}, ru_stime={2, 686591}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={206, 81670}, ru_stime={2, 686591}, ...}) = 0
...
In other words:
  1. Second session: upon discovering that the latch is busy, it begins to sleep by issuing
    semop(98304, 0x7fbfff5f58, 1) command.
  2. First session: upon completion issues
    semctl(98304, 31, SETVAL, 0x7f00000001) which wakes our second session up.
The first argument is the semid, which means that both sessions are operating on the same semaphore set. Neither of these semaphore syscalls shows up if we execute each of the above sessions separately.

Though we do sleep under the normal latch acquisition algorithm as well, that sleep is time-based rather than driven by the latch holder posting us.
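The two sleep styles can be contrasted with a small sketch (Python threading in place of SysV semaphores; the 10 ms timeout mimics the semtimedop interval seen in the traces, and all names are mine):

```python
import threading

holder_done = threading.Event()   # stands in for the semaphore the holder posts

def timed_sleep_acquire():
    """Classic algorithm: nap for a fixed interval, wake up and recheck,
    like semtimedop(..., {0, 10000000}) timing out with EAGAIN."""
    sleeps = 0
    while not holder_done.is_set():
        holder_done.wait(timeout=0.01)   # 10 ms nap, then recheck the latch
        sleeps += 1
    return sleeps

def post_wait_acquire():
    """Longhold algorithm: block with no timeout until the holder posts us,
    like semop() satisfied by the holder's semctl(..., SETVAL, 1)."""
    holder_done.wait()
    return True
```

In the timed variant a long hold burns one wakeup per interval; in the post-wait variant the waiter consumes nothing until the holder is done, which is exactly the 53-second zero-CPU wait we observed.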

Which algorithm?

The interesting question is how the waiter determines which algorithm to use. It could be (A) a special "longhold" mode get for the latch, or (B) the waiter might check how long the latch has already been held. So far I tend to think it is closer to (A) (a special get mode or a separate flag somewhere), because if we execute the above two sessions concurrently we always get an unusually enormous number of sleeps, even if we place only one single-row result into the Result Cache memory (thus making sure there are no significant delays when selecting from v$result_cache_memory).

Maybe some of you could share other ideas as well.

Sunday, January 18, 2009

v$result_cache_memory or what is the newest method to hang stuff up?

Sometimes I get quite amused by the "side effects" you can discover while playing with features.

Prerequisites

First, we need to place something relatively big into the result cache memory. The easiest way to achieve this is to start like this:
SQL> alter system set result_cache_max_size=64m;

System altered.

SQL> alter system set result_cache_max_result=100;

System altered.
The first parameter controls how much memory out of your shared pool Oracle can dedicate to the result cache. The second parameter tells how much of that memory, in percent, can be dedicated to a single result set. The default value is 5; I've set it to 100 so we don't have to dedicate a lot of memory to demonstrate my point.

All we have to do now is populate the result cache memory:
SQL> create table t as
2 select level n, rpad('*', 100, '*') v
3 from dual
4 connect by level <= 500000;

Table created

SQL> begin
2 for cur in (select /*+ result_cache*/ * from t)
3 loop
4 null;
5 end loop;
6 end;
7 /

PL/SQL procedure successfully completed
What's next?

Let's check how much result cache memory is allocated:
SQL> set timing on
SQL> select count(*) from v$result_cache_memory;

COUNT(*)
----------
65536

Executed in 55.984 seconds
That's almost a minute to get the results, yes.

So what?

There are (nasty) consequences.

It's time to buy Core i7

Let's start by taking a look at the trace file of our session:
SQL ID: 0zudkuq64dkwx
Plan Hash: 600544352
select count(*)
from
v$result_cache_memory


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 1 54.65 56.96 0 0 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 3 54.65 56.96 0 0 0 1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 38

Rows Row Source Operation
------- ---------------------------------------------------
1 SORT AGGREGATE (cr=0 pr=0 pw=0 time=0 us)
65536 FIXED TABLE FULL X$QESRCMEM (cr=0 pr=0 pw=0 time=0 us cost=0 size=13 card=1)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 2 0.00 0.00
SQL*Net message from client 2 0.00 0.01
First of all, all the time was spent actively consuming CPU. Selecting from X$QESRCMEM became a pretty expensive thing (in fact, at first I mistook such a huge delay for a session simply hanging while spinning on the CPU).

Can't use result cache

Any session which attempts to use the result cache will be forced to wait until our first session completes:
SQL ID: bzf4nvdav1m2j
Plan Hash: 1601196873
SELECT /*+ result_cache*/ *
FROM
T


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 5001 0.53 53.71 0 0 0 500000
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 5003 0.53 53.71 0 0 0 500000

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 38 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
500000 RESULT CACHE 49yjv3h37cjhz39gkd90p4413p (cr=0 pr=0 pw=0 time=0 us)
0 TABLE ACCESS FULL T (cr=0 pr=0 pw=0 time=0 us cost=2127 size=30436770 card=468258)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
latch free 1 53.20 53.20
And the latch name is...
WAIT #2: nam='latch free' ela= 53205337 address=1610838648 number=377 tries=0 obj#=-1 tim=1232316789354155

SQL> select name from v$latch where latch#=377;

NAME
----------------------------------------------------------------
Result Cache: Latch
Surprised? I'm not. Don't expect me to start complaining about the absence of shared mode gets again.

The Real Deal or result_cache_mode = force

Things get really nasty if you set result_cache_mode = force, which has the potential outcome of rendering your entire system unusable for an extended period of time. Even the simplest queries like select * from dual will hang, escalating the situation to the point where your normal users are unable to even connect to the database, let alone execute queries. For example, trying to connect with SQL*Plus results in the following:
SQL ID: d6vwqbw6r2ffk
Plan Hash: 1388734953
SELECT USER
FROM
DUAL


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.01 0.01 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.00 43.40 0 0 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 0.01 43.41 0 0 0 1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 38

Rows Row Source Operation
------- ---------------------------------------------------
1 RESULT CACHE 4c48ztv5ztq25dkd78360h2uxb (cr=0 pr=0 pw=0 time=0 us)
1 FAST DUAL (cr=0 pr=0 pw=0 time=0 us cost=2 size=0 card=1)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 2 0.00 0.00
latch free 1 43.39 43.39
SQL*Net message from client 2 0.00 0.00
I don't have to mention the latch name again.

Real World?

What are the chances of meeting the above situation on a real system? First of all, systems with more than a 100GB SGA are not uncommon these days. The problem gets worse the more memory is occupied by a single result; having a lot of small results does not seem to be a problem. If we take 5% as the default threshold for a result to be eligible for the result cache, facing the above situation requires 1280M of result cache memory. That looks like quite a big number; however, on the scale of a hundred-gig SGA it no longer seems improbable. Aggregating multiple gigabytes of data into a 64MB result doesn't sound insane either (at least not to me, you may have a different opinion here).
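The 1280M figure follows directly from the parameter arithmetic; a quick back-of-the-envelope check (the helper function is hypothetical, it just restates the percentage rule):

```python
# With result_cache_max_result at its default of 5%, a single result may
# occupy at most max_size * 5/100 of the result cache. To fit one result
# of a given size, the cache must therefore be at least size / 0.05.
def min_cache_for_result(result_mb, max_result_pct=5):
    return result_mb * 100 / max_result_pct

print(min_cache_for_result(64))  # 1280.0 MB of result cache needed
```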

Plus it takes someone to query v$result_cache_memory. So just don't do it. There are other views (like v$result_cache_objects) which you can use to at least get a part of that info, alas at lower granularity.

Environment

The above tests were verified on Oracle 11GR1 running on Windows XP x64 (native) and Linux x86-64 under VMware (with XP x64 acting as a host OS).

Friday, January 16, 2009

insert /*+ append */ into ... values (...)

Alex Egorov left a very fascinating comment on my post about 11G adaptive direct path reads which I thought definitely deserved mentioning.

What we are talking about here is a serious behavior change in how statements like...
insert /*+ append */ into t values (1)
...are handled in 11G.

Saving yourself from yourself... not anymore

In previous Oracle versions, the append hint in cases like the above was simply ignored, for some good reasons. In 11G the hint is obeyed.

To elaborate on what that means, let's take the following example:
begin
for i in 1 .. 1000
loop
insert /*+ append */ into t values (i);
commit;
end loop;
end;
...and compare how the behavior changes in 11G.

First of all, I've seen legacy code like the above in many places, where developers put the append hint everywhere they could because they had heard it is some kind of fast=true switch.

After you upgrade, you might expect the following:

  • If you don't commit after each row -- the code will break with ORA-12838: cannot read/modify an object after modifying it in parallel.
  • If you do commit (as in my example) -- you have every chance of growing the table beyond the skies, as each direct path insert writes above the HWM (in 11G, the above code will grow the table by 1000 blocks each time it is executed).
  • Serious locking issues, as each direct path insert has to obtain an exclusive lock on the table first.

The above might produce quite unpleasant surprises for some of you...
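The table growth in the second point is easy to model. This is a toy simulation under assumed numbers (ROWS_PER_BLOCK is a hypothetical block capacity), not Oracle's space management:

```python
# A conventional insert reuses free space below the HWM, so rows pack
# into existing blocks; a direct path insert formats new blocks above
# the HWM, so 1000 single-row insert+commit cycles consume ~1000 blocks
# no matter how empty each block is.
ROWS_PER_BLOCK = 100  # assumed capacity of one block

def conventional_blocks(n_rows):
    # ceiling division: rows packed densely into blocks
    return -(-n_rows // ROWS_PER_BLOCK)

def direct_path_blocks(n_inserts):
    # at least one freshly formatted block per insert+commit cycle
    return n_inserts

print(conventional_blocks(1000), direct_path_blocks(1000))  # 10 1000
```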

Thursday, January 15, 2009

Hot backup mode and a little known fact

Most people are aware of the implications hot backup mode has for your redo stream. They'll tell you that on the first modification Oracle dumps the entire block image into the redo stream, and subsequent writes of the block go as usual.

This is only half of the story.

The example

Let's start by creating a simple table I'll use as an example:
SQL> create table t pctfree 99 pctused 1 as
2 select level n
3 from dual
4 connect by level <= 100;

Table created.
Now, let's measure the amount of redo generated when I update all rows in the table:
SQL> set autot traceonly stat
SQL> update t set n=n;

100 rows updated.


Statistics
----------------------------------------------------------
30 recursive calls
20 db block gets
44 consistent gets
18 physical reads
12152 redo size
826 bytes sent via SQL*Net to client
769 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
2 sorts (memory)
0 sorts (disk)
100 rows processed
That's almost 12K of redo. Let's see what happens if I put the tablespace into hot backup mode:
SQL> alter tablespace users begin backup;

Tablespace altered.

SQL> update t set n=n;

100 rows updated.


Statistics
----------------------------------------------------------
0 recursive calls
20 db block gets
20 consistent gets
0 physical reads
152708 redo size
830 bytes sent via SQL*Net to client
769 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
100 rows processed
The amount of redo grew to almost 150K. Oracle has to dump the entire block image into the redo stream due to the fractured blocks problem. If we do the same update a second time:
SQL> update t set n=n;

100 rows updated.


Statistics
----------------------------------------------------------
0 recursive calls
17 db block gets
20 consistent gets
0 physical reads
11980 redo size
830 bytes sent via SQL*Net to client
769 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
100 rows processed
It goes as usual. In fact, and this is kind of amazing, this is what most of the sources available on the internet will tell you about the behavior: the full image of the block gets written only on the first modification, in order to prevent the fractured blocks problem, and subsequent modifications go as usual.
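The redo numbers above are consistent with simple block arithmetic (assuming the default 8KB block size; the row-level figure is a rough estimate from the 12K baseline):

```python
# In hot backup mode the first change to each block logs the whole block
# image, so redo is roughly blocks_touched * block_size. A normal update
# logs only the change vectors (~120 bytes per row in this test).
BLOCK_SIZE = 8192        # assumed 8KB database block size
blocks_touched = 20      # db block gets reported by autotrace

hot_backup_redo = blocks_touched * BLOCK_SIZE
normal_redo = 100 * 120  # 100 rows, ~120 bytes of change vectors each

print(hot_backup_redo, normal_redo)  # ~160K vs ~12K
```

163840 bytes is in the same ballpark as the 152708 bytes observed, while 12000 matches the ~12K baseline.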

Only half of the story

Let me continue with my example:
SQL> alter system flush buffer_cache;

System altered.

SQL> update t set n=n;

100 rows updated.


Statistics
----------------------------------------------------------
0 recursive calls
20 db block gets
20 consistent gets
20 physical reads
151916 redo size
831 bytes sent via SQL*Net to client
769 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
100 rows processed
What you see here is the entire image of the blocks written into the redo stream again.

The true story

Corrected 27-Jan-2009: initially I thought that it is the write to disk that matters; however, thanks to Jonathan Lewis for correcting me -- it's the read from disk into the buffer cache. Please see his first comment.

Don't confuse correlation with causation.

Wednesday, January 14, 2009

Consistent gets from cache (fastpath)

There is an interesting optimization introduced in 11G regarding the way blocks are accessed in the buffer cache.

Let's start with a simple example to demonstrate my point.

10GR2 example
SQL> create table t (
2 n number,
3 v varchar2(100),
4 constraint pk_n primary key (n)
5 );

Table created

SQL>
SQL> insert into t
2 select level, rpad('*', 100, '*')
3 from dual
4 connect by level <= 1000;

1000 rows inserted

SQL> commit;

Commit complete
I'm going to create a procedure to return the number of consistent gets in my session:
SQL> create or replace procedure get_cg(
2 p_cg out number
3 ) is
4 begin
5 select ms.value into p_cg
6 from v$mystat ms, v$statname sn
7 where ms.STATISTIC#=sn.STATISTIC#
8 and sn.NAME='consistent gets';
9 end get_cg;
10 /

Procedure created
Let's execute a simple select and measure the number of consistent gets:
SQL> declare
2 l_cg_b number;
3 l_cg_a number;
4 begin
5 get_cg(l_cg_b);
6 for cur in (select n from (select mod(level, 1000)+1 l from dual connect by
level <= 100000) l, t where t.n=l.l)
7 loop
8 null;
9 end loop;
10 get_cg(l_cg_a);
11 dbms_output.put_line('consistent gets: '||to_char(l_cg_a-l_cg_b));
12 end;
13 /

consistent gets: 100012

PL/SQL procedure successfully completed
This is a somewhat expected number: we did an index lookup (a unique scan) for each of the 100K rows returned from our connect by query.

11GR1 example

Let's do the same example in 11G. I've slightly modified the procedure to return two statistics -- consistent gets and consistent gets from cache (fastpath):
SQL> create or replace procedure get_cg(
2 p_cg out number,
3 p_cgfp out number
4 ) is
5 begin
6 select max(case sn.NAME when 'consistent gets' then ms.value end),
7 max(case sn.NAME when 'consistent gets from cache (fastpath)'
then ms.value end)
8 into p_cg, p_cgfp
9 from v$mystat ms, v$statname sn
10 where ms.STATISTIC#=sn.STATISTIC#
11 and sn.NAME in ('consistent gets', 'consistent gets from
cache (fastpath)');
12 end get_cg;
13 /

Procedure created

SQL> declare
2 l_cg_b number;
3 l_cgfp_b number;
4 l_cg_a number;
5 l_cgfp_a number;
6 begin
7 get_cg(l_cg_b, l_cgfp_b);
8 for cur in (select n from (select mod(level, 1000)+1 l from dual connect by
level <= 100000) l, t where t.n=l.l)
9 loop
10 null;
11 end loop;
12 get_cg(l_cg_a, l_cgfp_a);
13 dbms_output.put_line('consistent gets: '||to_char(l_cg_a-l_cg_b));
14 dbms_output.put_line('consistent gets (fastpath): '||to_char(l_cgfp_a-l_cgfp_b));
15 end;
16
/
consistent gets: 2602
consistent gets (fastpath): 1400

PL/SQL procedure successfully completed
That's quite a difference! 2602 consistent gets in 11GR1 versus ~100K in 10GR2. Note the new consistent gets from cache (fastpath) statistic appearing in this test case.

A slightly different example

To make you a bit more curious, here is a slightly different example executed on 11GR1:
SQL> declare
2 l_cg_b number;
3 l_cgfp_b number;
4 l_cg_a number;
5 l_cgfp_a number;
6 begin
7 get_cg(l_cg_b, l_cgfp_b);
8 for cur in (select n from (select trunc(dbms_random.value(1, 1000)) l from
dual connect by level <= 100000) l, t where t.n=l.l)
9 loop
10 null;
11 end loop;
12 get_cg(l_cg_a, l_cgfp_a);
13 dbms_output.put_line('consistent gets: '||to_char(l_cg_a-l_cg_b));
14 dbms_output.put_line('consistent gets (fastpath): '||to_char(l_cgfp_a-l_cgfp_b));
15 end;
16 /
consistent gets: 101998
consistent gets (fastpath): 2934

PL/SQL procedure successfully completed
As soon as we start reading index blocks in a random fashion, the effect, well, disappears. Thus far the optimization seems to kick in only when we repeatedly access the same block over and over again.

Another slightly different example

10GR2:
SQL> declare
2 l_cg_b number;
3 l_cg_a number;
4 begin
5 get_cg(l_cg_b);
6 for cur in (select v from (select mod(level, 1000)+1 l from dual connect by
level <= 100000) l, t where t.n=l.l)
7 loop
8 null;
9 end loop;
10 get_cg(l_cg_a);
11 dbms_output.put_line('consistent gets: '||to_char(l_cg_a-l_cg_b));
12 end;
13 /
consistent gets: 201001

PL/SQL procedure successfully completed
What we are doing here is reading data from the table itself (not only from the index). Again, that's an expected number: we had to do an index unique scan followed by a table access by rowid, resulting in two LIOs per returned row. Now, take a look at what happens in 11GR1:
SQL> declare
2 l_cg_b number;
3 l_cgfp_b number;
4 l_cg_a number;
5 l_cgfp_a number;
6 begin
7 get_cg(l_cg_b, l_cgfp_b);
8 for cur in (select v from (select mod(level, 1000)+1 l from dual connect by
level <= 100000) l, t where t.n=l.l)
9 loop
10 null;
11 end loop;
12 get_cg(l_cg_a, l_cgfp_a);
13 dbms_output.put_line('consistent gets: '||to_char(l_cg_a-l_cg_b));
14 dbms_output.put_line('consistent gets (fastpath): '||to_char(l_cgfp_a-
l_cgfp_b));
15 end;
16 /
consistent gets: 102602
consistent gets (fastpath): 1400

PL/SQL procedure successfully completed
In other words, index blocks were "neutralized", but table blocks were not. Another point is that the optimization still works for index blocks even though we access table blocks in between.

Works only for index blocks?

That's really simple to verify using the following example:
SQL> declare
2 l_cg_b number;
3 l_cgfp_b number;
4 l_cg_a number;
5 l_cgfp_a number;
6 begin
7 get_cg(l_cg_b, l_cgfp_b);
8 for cur in (select v from (select case mod(level, 2) when 0 then
'AAADq1AAEAAAAJHAAA' else 'AAADq1AAEAAAAJHAAB' end l from dual connect by level <=
100000) l, t where t.rowid=l.l)
9 loop
10 null;
11 end loop;
12 get_cg(l_cg_a, l_cgfp_a);
13 dbms_output.put_line('consistent gets: '||to_char(l_cg_a-l_cg_b));
14 dbms_output.put_line('consistent gets (fastpath): '||to_char(l_cgfp_a-
l_cgfp_b));
15 end;
16 /
consistent gets: 1000
consistent gets (fastpath): 1000

PL/SQL procedure successfully completed
The same example in 10GR2, of course, produces ~100K LIOs, so the optimization seems to kick in for table blocks as well. Furthermore, the behavior in the previous example cannot be explained by table blocks being randomly distributed, since the clustering factor is perfect:
SQL> select sum(bn_diff)
2 from (
3 select case when bn!=lag(bn, 1, bn) over (order by n)then 1 else 0
end bn_diff
4 from (select n, dbms_rowid.rowid_block_number(rowid) bn from
t)
5 );

SUM(BN_DIFF)
------------
15

SQL> select count (distinct dbms_rowid.rowid_block_number(rowid)) from t;

COUNT(DISTINCTDBMS_ROWID.ROWID
------------------------------
16
Perhaps there is some overhead associated with the fastpath, and the logic assumes a bad clustering factor by default (which works for the majority of cases), thus not even trying the fastpath for table blocks in our previous example.
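For reference, the block-transition count that the SQL above computes can be expressed in a few lines of plain Python (the block layout below is made up to mirror this test's 100-rows-per-block table):

```python
# Walk the rows in key order and count how often the block number
# changes; a count close to the number of blocks means the table rows
# are stored in the same order as the index keys (perfect clustering).
def block_changes(block_numbers_in_key_order):
    changes = 0
    for prev, cur in zip(block_numbers_in_key_order,
                         block_numbers_in_key_order[1:]):
        if cur != prev:
            changes += 1
    return changes

# 1600 rows laid out sequentially, 100 rows per block, 16 blocks:
blocks = [n // 100 for n in range(1600)]
print(block_changes(blocks))  # 15 transitions across 16 blocks
```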

Moving further

Take a look at this:
SQL> declare
2 l_cg_b number;
3 l_cgfp_b number;
4 l_cg_a number;
5 l_cgfp_a number;
6 begin
7 get_cg(l_cg_b, l_cgfp_b);
8 for cur in (select t1.n n1, t2.n n2 from (select mod(level, 1000)+1 l from
dual connect by level <= 100000) l, t t1, t t2 where t1.n=l.l and t2.n=l.l)
9 loop
10 null;
11 end loop;
12 get_cg(l_cg_a, l_cgfp_a);
13 dbms_output.put_line('consistent gets: '||to_char(l_cg_a-l_cg_b));
14 dbms_output.put_line('consistent gets (fastpath): '||to_char(l_cgfp_a-
l_cgfp_b));
15 end;
16 /
consistent gets: 5204
consistent gets (fastpath): 2800

PL/SQL procedure successfully completed
I've simply added the same table to the join twice. Though we were accessing exactly the same blocks of the same table, the optimization didn't span both instances of it.

Doesn't work across fetches
SQL> declare
2 l_cg_b number;
3 l_cgfp_b number;
4 l_cg_a number;
5 l_cgfp_a number;
6 begin
7 get_cg(l_cg_b, l_cgfp_b);
8 for cur in (select n from (select 1 l from dual connect by level <= 100000)
l, t where t.n=l.l)
9 loop
10 null;
11 end loop;
12 get_cg(l_cg_a, l_cgfp_a);
13 dbms_output.put_line('consistent gets: '||to_char(l_cg_a-l_cg_b));
14 dbms_output.put_line('consistent gets (fastpath): '||to_char(l_cgfp_a-
l_cgfp_b));
15 end;
16 /
consistent gets: 1008
consistent gets (fastpath): 1001

PL/SQL procedure successfully completed
While we were accessing the same row (hence, the same block), we got 1K LIOs. This is explained by 11G doing implicit array fetches of 100 rows at a time, and also points out that consistent gets from cache (fastpath) cannot operate across fetches (though the same rule applies to "regular" consistent gets as well).
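The fetch arithmetic behind that 1K number, as a back-of-the-envelope check:

```python
# 11G's implicit array fetch pulls 100 rows per fetch call; the fastpath
# pin cannot survive across fetches, so each fetch costs at least one
# consistent get on the (single) block being hit.
import math

rows, array_size = 100000, 100
fetches = math.ceil(rows / array_size)
print(fetches)  # 1000 fetches -> ~1000 consistent gets
```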

In a nutshell

Overall, this looks like a nice feature. The downside, of course, is less predictability. 11G seems to have introduced these little tricks and features all over the place, which can really help you under certain circumstances but can also make your head spin while trying to account for "magical" performance differences...

Added 16-Jan-2009: Take a look at Concurrency post by Jonathan Lewis for some more details.

Monday, January 12, 2009

Airlines...

Reliability is considered table stakes in the airline industry -- something you have to have if you want to play the planes game at all.

I had a flight to Toronto a couple of days ago which I can describe as very fascinating and enlightening.

My flight was scheduled for 8AM and I arrived at Ottawa International airport around 7:10AM. After checking in and passing the required security checks, I was waiting to board my plane. Almost immediately after I began to wait, the airport intercom said:
- If there is anyone willing to board an earlier flight to Toronto, you can do so.
Well, fine. It is better to be at Pearson early than waiting in Ottawa. So I boarded a flight that was supposed to take off 25 minutes earlier.

As soon as everyone had boarded, they closed the doors and started to move the plane. When, suddenly, boom! I observe a lot of smoke coming from the back of the plane. The plane stops very hard, almost instantly. I can smell partially burned jet fuel. A minute later a fire engine arrives and stops a few meters from us, doing nothing. I watch people becoming more and more curious (is the plane on fire? is it time to jump off the plane to avoid being cooked?). The pilot says:
- It seems that we've got some electronics snag here, let me try to reboot the on-board computer to see if that fixes the problem.
Now, that's cool. Ummm... Windows embedded or what? A couple of minutes later the pilot comes on again:
- It looks like our problem still persists, so I'm going to power off the entire plane and then put the power back on. Let's hope it fixes our problem, because usually it does.
If there is a moment when you start thinking whether you should be sitting on the plane at all, this seems to be it. Some people have already stood up, clearly wondering how they can get off the plane. This is really the time when a stewardess can help everyone:
- We were trying to figure out if there is a possibility to board the other plane but, unfortunately, all morning flights are extremely busy.
What a perfect argument to keep the plane's doors closed. The plane comes back to life as the pilots recycle the power...
- It looks like our problem was solved, so we are going to proceed for takeoff...
Why do I have the strange feeling that it would be better if it were not solved, so we could get off? Anyway, they start moving us again... When, suddenly, boom! I observe a lot of smoke coming from the back of the plane. Here we go again; the fire engine arrives. That's enough, who can show me the exit? The pilot:
- It appears that our problem reoccurred again, the brakes applied without anything telling them to apply.
Let me think about what would have happened had the brakes applied on the runway itself... not rocket science.
- We are going to call the maintenance to see if they can fix the problem, it should take around 25-30 minutes...
Right, and while they are calling maintenance, I'm watching my original flight take off... Some guy comes in and disappears under the plane.
- It looks like maintenance fixed the problem and after they'll be done with a paperwork, we can proceed for takeoff...
Read the book, as Gordon says: if you don't do scheduled maintenance, the plane will schedule one for you. Trying to save costs? I have no doubt you can make a service so cheap that no one wants to use it. Well, at least they are going to miss one of their KPIs, on-time arrival, so there is hope someone is going to do something about it...

Saturday, January 10, 2009

Using DBMS_RULE_ADM to evaluate dynamic predicates

The best comments are the ones which inspire you to try and learn something new. I received one of these on my recent blog post about Scaling dynamic SQL. Stas, one of my good friends, asked whether I had tried the DBMS_RULE_ADM package for this task. Well, now I have. Stas, you should be careful what you ask for next time, you might get it :)

The first thing we need to do is create an evaluation context:
SQL> declare
2 l_vtl sys.re$variable_type_list;
3 begin
4 l_vtl:=sys.re$variable_type_list(
5 sys.re$variable_type('age', 'number', null, null),
6 sys.re$variable_type('balance', 'number', null, null),
7 sys.re$variable_type('birthdate', 'date', null, null),
8 sys.re$variable_type('today', 'date', null, null)
9 );
10
11 dbms_rule_adm.create_evaluation_context(
12 evaluation_context_name => 'adverts_ctx',
13 variable_types => l_vtl
14 );
15 end;
16 /

PL/SQL procedure successfully completed
We can create the rules themselves now:
SQL> declare
2 l_rules dbms_sql.Varchar2_Table;
3 begin
4 dbms_rule_adm.create_rule_set(
5 rule_set_name => 'adverts'
6 );
7
8 l_rules(1):=':age between 16 and 18';
9 l_rules(2):=':birthdate = :today or :balance > 1000';
10 l_rules(3):=':balance between 100 and 200 and :age > 18';
11
12 for i in 1 .. 3
13 loop
14 dbms_rule_adm.create_rule(
15 rule_name => 'rule_'||to_char(i),
16 condition => l_rules(i),
17 evaluation_context => 'adverts_ctx'
18 );
19
20 dbms_rule_adm.add_rule(
21 rule_name => 'rule_'||to_char(i),
22 rule_set_name => 'adverts'
23 );
24 end loop;
25 end;
26 /

PL/SQL procedure successfully completed
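To make the rule-set semantics concrete, here is a minimal in-process analogue in plain Python. This illustrates only what evaluation means (each rule is a predicate over the declared variables), not how DBMS_RULE is actually implemented:

```python
import datetime

# The same three conditions as the rules created above.
rules = {
    'rule_1': lambda v: 16 <= v['age'] <= 18,
    'rule_2': lambda v: v['birthdate'] == v['today'] or v['balance'] > 1000,
    'rule_3': lambda v: 100 <= v['balance'] <= 200 and v['age'] > 18,
}

def evaluate(rule_set, variable_values):
    # Return the names of the rules that hit for these variable values,
    # the analogue of the true_rules list from DBMS_RULE.EVALUATE.
    return sorted(name for name, cond in rule_set.items()
                  if cond(variable_values))

# The same variable values the test procedure binds: age=16,
# balance=1500, birthdate of 2009-01-01.
values = {'age': 16, 'balance': 1500,
          'birthdate': datetime.date(2009, 1, 1),
          'today': datetime.date.today()}
print(evaluate(rules, values))  # ['rule_1', 'rule_2']
```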
Since there will be no SQL executed, I've decided to simply measure wall clock time, given that we can obtain latch wait information from the extended SQL trace:
SQL> create table results
2 (
3 sid number,
4 cs number
5 );

Table created
Here is the test procedure itself:
SQL> create or replace procedure test_dra(
2 p_i in number
3 ) is
4 l_age sys.re$variable_value;
5 l_balance sys.re$variable_value;
6 l_birthdate sys.re$variable_value;
7 l_today sys.re$variable_value;
8 l_vvl sys.re$variable_value_list;
9 l_true sys.re$rule_hit_list;
10 l_maybe sys.re$rule_hit_list;
11 l_time number;
12 begin
13 l_age:=sys.re$variable_value('age', anydata.convertnumber(16));
14 l_balance:=sys.re$variable_value('balance', anydata.convertnumber(1500));
15 l_birthdate:=sys.re$variable_value('birthdate', anydata.convertdate(to_date('20090101', 'yyyymmdd')));
16 l_today:=sys.re$variable_value('today', anydata.convertdate(trunc(sysdate)));
17
18 l_vvl:=sys.re$variable_value_list(l_age, l_balance, l_birthdate, l_today);
19
20 l_true:=sys.re$rule_hit_list();
21 l_maybe:=sys.re$rule_hit_list();
22
23 l_time:=dbms_utility.get_time;
24 dbms_monitor.session_trace_enable(waits => true, binds => false);
25
26 for i in 1 .. p_i
27 loop
28 dbms_rule.evaluate(
29 rule_set_name => 'adverts',
30 evaluation_context => 'adverts_ctx',
31 variable_values => l_vvl,
32 stop_on_first_hit => false,
33 true_rules => l_true,
34 maybe_rules => l_maybe
35 );
36 end loop;
37
38 dbms_monitor.session_trace_disable;
39 insert into results values (sys_context('userenv', 'sid'), dbms_utility.get_time-l_time);
40 end test_dra;
41 /

Procedure created
As before, I've used four parallel jobs with 100000 iterations each. Here are the results I've got:
SQL> select * from results;

SID CS
---------- ----------
134 3472
131 3412
133 3205
132 3358
And tkprof of one of the jobs:
OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 0 0.00 0.00 0 0 0 0
Execute 0 0.00 0.00 0 0 0 0
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 0 0.00 0.00 0 0 0 0

Misses in library cache during parse: 0

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
library cache: mutex X 769 0.00 0.04
latch free 39 0.00 0.00
Which is, well, about 6.5 times slower in wall clock time compared to the cached results using the package from my previous blog post:
OVERALL TOTALS FOR ALL RECURSIVE STATEMENTS

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 3 0.00 0.00 0 0 0 0
Execute 300000 4.85 4.97 0 0 0 300000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 300003 4.85 4.97 0 0 0 300000

Misses in library cache during parse: 0

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
library cache: mutex X 209 0.00 0.00
cursor: pin S wait on X 98 0.01 1.04
cursor: pin S 360 0.00 0.00
The interesting thing to note here is that DBMS_RULE_ADM seems to scale better; however, I can't really say at what point it could be justified (if at all), given the pretty hefty wall clock time difference.

Result Cache and simple queries

I don't really want to beat a dead horse here; however, I came across an interesting observation today.

It turned out that the latest version of APEX (3.1.2, or maybe some older versions as well; I don't have these to check) does use the result_cache hint for queries like this:
SELECT /*+ result_cache */ MESSAGE_TEXT
FROM WWV_FLOW_MESSAGES$
WHERE SECURITY_GROUP_ID = 10
AND MESSAGE_LANGUAGE = :B2
AND NAME = UPPER(:B1 )

SELECT /*+ result_cache */ NLS_LANGUAGE, NLS_TERRITORY, NLS_SORT, NLS_WINDOWS_CHARSET
FROM WWV_FLOW_LANGUAGES
WHERE LANG_ID_UPPER = UPPER(:B1 )
Both of these queries are simple lookups...

Wednesday, January 07, 2009

Searching over structured and unstructured data in 10G

In my previous blog post I explained a nice little feature introduced to Oracle Text context indexes in 11G which allows you to do efficient searches involving both structured and unstructured data. It is time to talk about what can be done by those of you using previous Oracle versions.

I'll start with the same table and data which I've used in my previous blog post.

Tag it

Literally. Because the best way to search over structured and unstructured data is, well, not to mix them. I'll show you how to use your own datastore procedure to achieve this goal.

First of all, we need to declare an author tag; we will use it for our searches later:

SQL> begin
2 ctx_ddl.create_section_group('books_section_group', 'basic_section_group');
3 ctx_ddl.add_field_section('books_section_group', 'author', 'author', false);
4 end;
5 /

PL/SQL procedure successfully completed
All we need after that is a simple custom datastore procedure to merge it all together (what this procedure produces is what Oracle Text will index) and tag the author:
SQL> create or replace procedure books_ds(
2 p_rid in rowid,
3 p_clob in out clob
4 ) is
5 begin
6 for cur in (select '<author>'||author||'</author>' author, text from books where rowid=p_rid)
7 loop
8 dbms_lob.copy(p_clob, cur.author, dbms_lob.getlength(cur.author));
9 dbms_lob.append(p_clob, cur.text);
10 end loop;
11 end;
12 /

Procedure created
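Outside the database, the tagging idea boils down to this. A plain-Python sketch of the document the datastore procedure hands to the indexer (the sample author and text are made up):

```python
# Prepend the structured column wrapped in its own tag, so "author"
# terms can later be searched with WITHIN without polluting the
# free-text part of the index.
def books_ds(author, text):
    return '<author>{}</author>{}'.format(author, text)

doc = books_ds('XDB', 'A book that mentions procedure creation.')
print(doc)  # <author>XDB</author>A book that mentions procedure creation.
```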
Let's create an index now:
SQL> alter table books add (books_info varchar2(1));

Table altered

SQL> begin
2 ctx_ddl.create_preference('books_ds', 'user_datastore');
3 ctx_ddl.set_attribute('books_ds', 'procedure', 'books_ds');
4 end;
5 /

PL/SQL procedure successfully completed

SQL> CREATE INDEX ctx_books_info
2 ON books (books_info)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 PARAMETERS('filter ctxsys.null_filter section group books_section_group lexer books_lexer datastore books_ds memory 64m');

Index created
Continuing with the example from my previous post, searching for books which mention procedures and are written by the author XDB looks like this:
SQL> select author, title, published
2 from books
3 where contains(books_info, 'xdb within author and procedure', 1) > 0
4 order by score(1) desc;

37 rows selected.


Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
40 consistent gets
0 physical reads
0 redo size
1379 bytes sent via SQL*Net to client
372 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
37 rows processed

Saturday, January 03, 2009

Maintaining summaries in a highly concurrent fashion

I was involved in designing a highly concurrent OLTP system last year (more than 300K users during peak hours) which allowed users to play online and win prizes. As part of the prize-winning logic we had to maintain a count of prizes won in order to be able to display it in real time.

The table containing won prizes looked basically like this:
SQL> create table winners
2 (
3 prize_id number primary key,
4 user_id number
5 );

Table created
That is -- we were storing prize_id along with the user_id who won it. There were a couple of million prizes available for winning. You can accomplish won-prize counting in a number of ways:

  • Count it each time.
    Do this only if you partner up with the same folks who sell you hardware and Oracle CPU licenses.

  • Create on-commit materialized view with count.
    Though much better than the previous KIWI'ish approach, it brings a bit of overhead for maintaining a materialized view log and, if you don't take special care, may block users during commit. Nevertheless, this is usually a valid approach, especially if you want to leverage features like query rewrite. We didn't really need it and, to make things more interesting, someone was winning a prize every few milliseconds. Do simple things (insert into winners, mview log, update summaries table, delete from mview log) a lot of times and wonder how quickly it starts to add up.

  • Maintain summaries yourself.
    This is relatively easy to do, especially when the application does nothing but call a set of PL/SQL APIs (which means you can do whatever you want to achieve the required output).
However, doing a simple update winners_cnt set cnt=cnt+1 is going to serialize all the winners, so we need an easy way to spread the updates out. The simplest, yet very efficient, way to achieve this is to do something like this:
SQL> create table winners_cnt_t
2 (
3 slot_id number primary key,
4 cnt number
5 ) organization index;

Table created

SQL> insert into winners_cnt_t
2 select level-1, 0
3 from dual
4 connect by level <= 11;

11 rows inserted

SQL> commit;

Commit complete
Make sure your slot count is a prime number. Then you can hide it behind a view to make things transparent:
create view winners_cnt as
select sum(cnt) cnt
from winners_cnt_t;
After that, all you have to do is maintain the number of winners the following way...
update winners_cnt_t
set cnt=cnt+1
where slot_id=mod(p_prize_id, 11);
...which will distribute updates evenly among all rows in the table. You can vary the number of slots to match your degree of concurrency.
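To see why this spreads contention while still giving the right total, here is the same slot arithmetic sketched in Python (illustration only -- the real thing is, of course, the UPDATE and the view above):

```python
SLOTS = 11  # a prime number of slots, as in winners_cnt_t

# one counter row per slot, all starting at zero
slots = {slot_id: 0 for slot_id in range(SLOTS)}

def record_win(prize_id):
    # same spreading rule as the UPDATE: slot_id = mod(prize_id, 11)
    slots[prize_id % SLOTS] += 1

def winners_cnt():
    # the winners_cnt view: sum(cnt) over all slots
    return sum(slots.values())

for prize_id in range(1000):
    record_win(prize_id)

print(winners_cnt())                             # 1000
print(min(slots.values()), max(slots.values()))  # 90 91
```

Sequential prize ids land on different rows, so concurrent sessions rarely collide on the same slot, yet summing the slots always yields the exact total.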

11G adaptive direct path reads -- what is the cached/dirty blocks threshold?

11G's ability to do direct path reads during full table scans without utilizing PQ was covered in a number of places already (see this post by Doug Burns for example).

When do direct path reads start to happen?

A somewhat reliable figure is your _small_table_threshold multiplied by 5 (mentioned by Tanel Poder on oracle-l recently). You can discover it using a quick and dirty test case similar to this:
SQL> create tablespace adr_test datafile size 64m segment space management manual;

Tablespace created

SQL> create table t (v varchar2(100)) pctused 1 pctfree 99 tablespace adr_test;

Table created

SQL> create or replace function get_adr_trsh(
2 p_step in number,
3 p_start in number default 0,
4 p_stop in number default null
5 ) return number is
6 l_prd number;
7 l_cnt number;
8 l_blocks number:=0;
9 l_start number:=p_start;
10 begin
11 execute immediate 'truncate table t';
12
13 loop
14 insert /*+ append */ into t
15 select rpad('*', 100, '*')
16 from dual
17 connect by level <= p_step + l_start;
18 commit;
19
20 l_blocks:=l_blocks + p_step + l_start;
21 l_start:=0;
22
23 execute immediate 'alter system flush buffer_cache';
24
25 select /*+ full(t) */ count(*) into l_cnt from t;
26
27 select value into l_prd
28 from v$segment_statistics
29 where owner=user
30 and object_name='T'
31 and statistic_name='physical reads direct';
32
33 exit when (l_prd > 0 or l_blocks > nvl(p_stop, l_blocks));
34
35 end loop;
36
37 return l_blocks - p_step;
38 end;
39 /

Function created
My _small_table_threshold is:
SQL> select ksppstvl
2 from x$ksppi x, x$ksppcv y
3 where (x.indx = y.indx)
4 and ksppinm='_small_table_threshold';

KSPPSTVL
--------------------------------------------------------------------------------
314
which is about 2% of my buffer cache (128MB), so you may expect 11G to switch to direct path reads once the table goes beyond 1570 blocks. Let's check it:
SQL> declare
2 l_trsh number;
3 begin
4 l_trsh:=get_adr_trsh(10, 1500, 2000);
5
6 dbms_output.put_line(l_trsh);
7 end;
8 /

1570

PL/SQL procedure successfully completed
Note that this number is approximate and you can get different results depending on things like using ASSM/MSSM.
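A quick sanity check of the arithmetic (assuming an 8KB block size, which is what I'm using here):

```python
# assumed figures from this test: 8KB blocks, 128MB buffer cache,
# _small_table_threshold = 314
block_size = 8 * 1024
cache_blocks = (128 * 1024 * 1024) // block_size   # 16384 blocks
small_table_threshold = 314

# _small_table_threshold is roughly 2% of the buffer cache...
print(round(small_table_threshold / cache_blocks * 100, 1))  # 1.9

# ...and the observed direct path read threshold is 5x that
print(5 * small_table_threshold)  # 1570
```

Both numbers line up with what the test function measured above.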

What is the cached blocks threshold?

Direct path reads stop happening after a certain number of your table's blocks are already in the buffer cache. Discovering it is fairly easy as well:
SQL> --need this so we can do index range scans and cache table blocks
SQL> create index i_t on t (1);

Index created

SQL> create or replace function get_cached_trsh(
2 p_start in number default 0,
3 p_step in number default 1
4 ) return number is
5 cursor l_cur is select /*+ index(t i_t) */ * from t;
6 l_v varchar2(100);
7 l_trsh number:=0;
8 l_prd number:=0;
9 l_cnt number:=0;
10 l_start number:=p_start;
11 begin
12 execute immediate 'alter system flush buffer_cache';
13 open l_cur;
14
15 loop
16 for i in 1 .. p_step+l_start
17 loop
18 fetch l_cur into l_v;
19 end loop;
20 l_trsh:=l_trsh+p_step+l_start;
21 l_start:=0;
22
23 select /*+ full(t) */ count(*) into l_cnt from t;
24
25 select value into l_cnt
26 from v$segment_statistics
27 where owner=user
28 and object_name='T'
29 and statistic_name='physical reads direct';
30
31 exit when l_cnt=l_prd or l_cur%notfound;
32
33 l_prd:=l_cnt;
34
35 end loop;
36
37 close l_cur;
38 return l_trsh;
39 end;
40 /

Function created
Now, we can see after how many blocks 11G will stop doing direct path reads:
SQL> declare
2 l_trsh number;
3 begin
4 l_trsh:=get_cached_trsh(500, 1);
5
6 dbms_output.put_line(l_trsh);
7 end;
8 /

789

PL/SQL procedure successfully completed
Which happens to be half of the table's blocks. I've repeated the above test with a 256MB buffer cache and got 3140 blocks (the number of blocks for direct reads to start happening) and 1568 (the number of cached blocks) respectively. Please note that the cached blocks threshold seems not to depend on your buffer cache size (to the degree that it can find space, of course).

What is the dirty blocks threshold?

Doing a direct path read requires a segment-level checkpoint, which may not be something you would like to do for the sake of the direct read alone if you have a lot of them.

Something we can start with:
SQL> create or replace function get_dirty_trsh(
2 p_step in number,
3 p_start in number default 0,
4 p_stop in number default null
5 ) return number is
6 l_trsh number:=0;
7 l_prd number:=0;
8 l_cnt number:=0;
9 l_start number:=p_start;
10 begin
11 execute immediate 'alter system flush buffer_cache';
12
13 loop
14 l_trsh:=l_trsh+p_step+l_start;
15 update t set v=v where rownum <= l_trsh;
16 commit;
17 l_start:=0;
18
19 select /*+ full(t) */ count(*) into l_cnt from t;
20
21 select value into l_cnt
22 from v$segment_statistics
23 where owner=user
24 and object_name='T'
25 and statistic_name='physical reads direct';
26
27 exit when l_cnt=l_prd or l_trsh > nvl(p_stop, l_trsh);
28
29 l_prd:=l_cnt;
30
31 end loop;
32
33 return l_trsh;
34 end;
35 /

Function created

SQL> declare
2 l_trsh number;
3 begin
4 l_trsh:=get_dirty_trsh(1, 350, 400);
5
6 dbms_output.put_line(l_trsh);
7 end;
8 /
384

PL/SQL procedure successfully completed.
Which turns out to be a quarter of the table's size.
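Putting the three observations together, the heuristics can be sketched as a simple decision function. This is a rough model of what these quick and dirty tests observed, not Oracle's actual logic:

```python
def serial_direct_read(table_blocks, small_table_threshold,
                       cached_blocks, dirty_blocks):
    """Rough model of the observed 11G behavior: direct path reads happen
    when the table is bigger than 5 * _small_table_threshold, fewer than
    half of its blocks are cached, and fewer than a quarter are dirty."""
    return (table_blocks > 5 * small_table_threshold
            and cached_blocks < table_blocks / 2
            and dirty_blocks < table_blocks / 4)

# using the figures observed above (_small_table_threshold = 314)
print(serial_direct_read(1571, 314, 0, 0))    # True: big enough, cold, clean
print(serial_direct_read(1571, 314, 800, 0))  # False: over half is cached
print(serial_direct_read(1571, 314, 0, 400))  # False: over a quarter is dirty
```

As the next section stresses, the real decision almost certainly involves more variables than these three.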

Quick and dirty

Please note that adaptive direct path reads could (and most probably do) depend on many more variables when making the decision. The above tests were done using an ad-hoc approach just to get an idea of what the potential factors could be. Things like system statistics, tablespace block sizes, delayed block cleanouts, etc. have the potential to interfere here.

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
PL/SQL Release 11.1.0.7.0 - Production
CORE 11.1.0.7.0 Production
TNS for 64-bit Windows: Version 11.1.0.7.0 - Production
NLSRTL Version 11.1.0.7.0 - Production

Thursday, January 01, 2009

Scaling dynamic SQL

The story

I was working for one of the biggest Russian telcos back then, and the marketing guys came out with their next idea. One of the most frequent things your customers do is check their balances. The idea was to attach an advert to each balance response. However, you usually don't want to attach some static message that you show to everyone, because your customers won't care about it. Think of Google, which became so popular because it prioritizes information by relevance. Showing something useless to your customers not only annoys them but wastes your bandwidth as well.

What you have to do is targeted adverts. That means you can substantially increase the value by figuring out what your customers would like to see based on the data about them which you have in your billing system. And you know literally hundreds of things about any single customer, starting from age, marital status, birthday and active services, and ending up with approximate location based on triangulation data. Think of Google's sponsored links.

The system

Upon receiving a balance request, the system had to find out various things about the requester and then figure out what to show based on the rules from the marketing department. The rules had to offer virtually unlimited flexibility, and the system itself had to be flexible enough to introduce and remove targeting conditions on the fly.

The implementation

What we had is a simple table with PL/SQL predicates, like this:
SQL> create table p
2 (
3 p_id number primary key,
4 p varchar2(4000)
5 );

Table created

SQL> insert into p values (1, 'to_number(:age) between 16 and 18');

1 row inserted
SQL> insert into p values (2, 'to_date(:birthday, ''yyyymmdd'')=trunc(sysdate) or :balance > 1000');

1 row inserted
SQL> insert into p values (3, 'to_number(:balance) between 100 and 200 and :age > 18');

1 row inserted

SQL> commit;

Commit complete
Of course, in reality it was a bit more complex than that, allowing for prioritizing, schedules, etc. As you've probably already guessed, upon receiving a balance request we were binding data about the customer into the above predicates to figure out which ones evaluate to true, which in turn led to the relevant advert. Given that you may be getting around 5000 balance requests each second, and you have to multiply this by the number of predicates you have to check, you are already talking about tens of thousands of evaluations per second. In other words, you have to make the entire thing pretty darn efficient.

Soft parsing

Soft parsing is deemed by many to be inevitable as soon as you have to deal with stuff like the above -- storing dynamically generated PL/SQL predicates in a table and executing them at runtime. And this is something we had to avoid because, even with stuff like session_cached_cursors, a parse is still a parse, a scalability inhibitor and a CPU waster. Think of not using KIWI.

What to do?

I'll show you a little trick you can use to break the dreaded marriage of dynamic SQL with parsing. Take a look at the following package:
create or replace package eval is

type t_cached_cursors is table of number index by p.p%type;
g_cached_cursors t_cached_cursors;

function evaluate(
p in p.p%type,
p_n in dbms_sql.Varchar2_Table,
p_v in dbms_sql.Varchar2_Table
) return boolean;

function evaluate_nc(
p in p.p%type,
p_n in dbms_sql.Varchar2_Table,
p_v in dbms_sql.Varchar2_Table
) return boolean;
end eval;

create or replace package body eval is

function evaluate(
p in p.p%type,
p_n in dbms_sql.Varchar2_Table,
p_v in dbms_sql.Varchar2_Table
) return boolean is
l_cursor number;
l_res number;
begin
begin
l_cursor:=g_cached_cursors(p);
exception when no_data_found then
l_cursor:=dbms_sql.open_cursor;
dbms_sql.parse(l_cursor, p, dbms_sql.native);
g_cached_cursors(p):=l_cursor;
end;
dbms_sql.bind_variable(l_cursor, 'l_res', l_res);

for i in 1 .. p_n.count
loop
if (instr(p, ':'||p_n(i)) > 0)
then
dbms_sql.bind_variable(l_cursor, p_n(i), p_v(i));
end if;
end loop;

l_res:=dbms_sql.execute(l_cursor);
dbms_sql.variable_value(l_cursor, 'l_res', l_res);

return (l_res=1);
end;

function evaluate_nc(
p in p.p%type,
p_n in dbms_sql.Varchar2_Table,
p_v in dbms_sql.Varchar2_Table
) return boolean is
l_cursor number;
l_res number;
begin
l_cursor:=dbms_sql.open_cursor;
dbms_sql.parse(l_cursor, p, dbms_sql.native);
dbms_sql.bind_variable(l_cursor, 'l_res', l_res);

for i in 1 .. p_n.count
loop
if (instr(p, ':'||p_n(i)) > 0)
then
dbms_sql.bind_variable(l_cursor, p_n(i), p_v(i));
end if;
end loop;

l_res:=dbms_sql.execute(l_cursor);
dbms_sql.variable_value(l_cursor, 'l_res', l_res);
dbms_sql.close_cursor(l_cursor);

return (l_res=1);
end;

end eval;
The package has two functions -- evaluate and evaluate_nc; the first one uses a simple caching trick. The idea behind the evaluate function is really that simple -- upon opening and parsing a cursor, place it into an in-memory table indexed by the cursor text for further reuse instead of closing it. Each execution peeks at that in-memory table to see if there is a cursor we can reuse instead of going through the whole parsing exercise.

The usage is simple as well:
SQL> declare
2 l_n dbms_sql.varchar2_table;
3 l_v dbms_sql.varchar2_table;
4 l_res boolean;
5 begin
6 l_n(1):='age';
7 l_n(2):='birthday';
8 l_n(3):='balance';
9 l_v(1):='16';
10 l_v(2):='20090101';
11 l_v(3):='1500';
12
13 for cur in (select p_id, 'declare l_res number; begin :l_res:=case when ('||p||') then 1 else 0 end; end;' p from p)
14 loop
15 l_res:=eval.evaluate(cur.p, l_n, l_v);
16 dbms_output.put_line('eval p'||to_char(cur.p_id)||': '||case when l_res then 'true' else 'false' end);
17 end loop;
18 end;
19 /
eval p1: true
eval p2: true
eval p3: false
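The caching trick itself is language-agnostic. Here is a minimal Python sketch of the same idea, with a dict playing the role of g_cached_cursors and a counter standing in for dbms_sql.open_cursor/parse (purely illustrative, the names are made up):

```python
parse_count = 0
cached_cursors = {}  # plays the role of g_cached_cursors, keyed by statement text

def parse(stmt):
    # stands in for dbms_sql.open_cursor + dbms_sql.parse
    global parse_count
    parse_count += 1
    return ("cursor for", stmt)

def evaluate(stmt):
    # reuse the cached cursor if we've seen this statement before,
    # parse it only on the first encounter
    if stmt not in cached_cursors:
        cached_cursors[stmt] = parse(stmt)
    return cached_cursors[stmt]

for _ in range(100000):
    evaluate("begin :l_res := case when (:age > 18) then 1 else 0 end; end;")

print(parse_count)  # 1 -- parsed once, executed 100000 times
```

That one-parse-many-executions shape is exactly what the tkprof reports below demonstrate.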
Let's do some tests now. Here are two test procedures:
create or replace procedure test_cached(
p_i in number
) is
l_p dbms_sql.Varchar2_Table;
l_n dbms_sql.varchar2_table;
l_v dbms_sql.varchar2_table;
l_res boolean;
begin
l_n(1):='age';
l_n(2):='birthday';
l_n(3):='balance';

l_v(1):='100';
l_v(2):='20090101';
l_v(3):='1000';

select 'declare /* cached */ l_res number; begin :l_res:=case when ('||p||') then 1 else 0 end; end;'
bulk collect into l_p
from p;

dbms_monitor.session_trace_enable(waits => true, binds => false);
for i in 1 .. p_i
loop
for j in 1 .. l_p.count
loop
l_res:=eval.evaluate(l_p(j), l_n, l_v);
end loop;
end loop;
dbms_monitor.session_trace_disable;
end;

create or replace procedure test_not_cached(
p_i in number
) is
l_p dbms_sql.Varchar2_Table;
l_n dbms_sql.varchar2_table;
l_v dbms_sql.varchar2_table;
l_res boolean;
begin
l_n(1):='age';
l_n(2):='birthday';
l_n(3):='balance';

l_v(1):='100';
l_v(2):='20090101';
l_v(3):='1000';

select 'declare /* not_cached */ l_res number; begin :l_res:=case when ('||p||') then 1 else 0 end; end;'
bulk collect into l_p
from p;

dbms_monitor.session_trace_enable(waits => true, binds => false);
for i in 1 .. p_i
loop
for j in 1 .. l_p.count
loop
l_res:=eval.evaluate_nc(l_p(j), l_n, l_v);
end loop;
end loop;
dbms_monitor.session_trace_disable;
end;
I tested each of these using four parallel jobs with 100000 iterations (the p_i parameter) each. Here is what I got (using one of the predicates as an example):
declare /* cached */ l_res number; begin :l_res:=case when (to_number(:age) 
between 16 and 18) then 1 else 0 end; end;


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 100000 1.40 1.29 0 0 0 100000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 100001 1.40 1.29 0 0 0 100000

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 40 (recursive depth: 2)

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
cursor: pin S wait on X 35 0.01 0.37
cursor: pin S 91 0.00 0.00

declare /* not_cached */ l_res number; begin :l_res:=case when
(to_number(:age) between 16 and 18) then 1 else 0 end; end;


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 100000 1.00 1.21 0 0 0 0
Execute 100000 2.73 2.74 0 0 0 100000
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 200000 3.73 3.95 0 0 0 100000

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 40 (recursive depth: 2)

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
library cache: mutex X 116 0.00 0.00
cursor: mutex S 24 0.00 0.00
cursor: pin S 83 0.00 0.00
cursor: pin S wait on X 30 0.01 0.28
That's a three-times improvement. Note that in the first tkprof report we got only one parse, while in the second the number of parses equals the number of executions, as expected.