Wednesday, December 31, 2008

Searching over structured and unstructured data in 11G

For those of you who have to deal with searches involving both structured and unstructured data, there is a small neat feature introduced to Oracle Text Context indexes in Oracle 11G which you may find to be very useful.

The example

Let's say you have a books table like this:
SQL> create table books
2 (
3 author varchar2(30),
4 title varchar2(30),
5 published date,
6 text clob
7 ) lob (text) store as securefile
8 (
9 enable storage in row
10 cache
11 );

Table created.
I'm going to populate it using some test data:
SQL> declare
2 l_clob clob;
3 begin
4 for obj_cur in (
5 select owner, object_name, created
6 from dba_objects
7 where object_type in ('PROCEDURE','PACKAGE','PACKAGE BOD
Y','FUNCTION','TRIGGER')
8 ) loop
9 insert into books values (obj_cur.owner, obj_cur.object_name, ob
j_cur.created, empty_clob())
10 returning text into l_clob;
11
12 dbms_lob.open(l_clob, dbms_lob.lob_readwrite);
13
14 for text_cur in (
15 select text
16 from dba_source
17 where owner=obj_cur.owner
18 and name=obj_cur.object_name
19 order by type, line
20 ) loop
21 dbms_lob.writeappend(l_clob, dbms_lob.getlength(text_cur
.text), text_cur.text);
22 end loop;
23
24 dbms_lob.close(l_clob);
25 end loop;
26 end;
27 /

PL/SQL procedure successfully completed.

SQL> commit;

Commit complete.
What I have basically done there is dumped the source text of some of my database objects into books table. Object owner goes as an author, object_name represents a title, object creation date goes as published and, of course, the source itself goes as book's text.

The search

Let's say I'm interested in all the books which write about procedures. To do that kind of search using Oracle Text is fairly straightforward.

First of all, we want to make sure that underscore symbol goes as part of a token. It is not really required for our example, however, it is generally a good idea to do since we want stuff like DBMS_OUTPUT or DBMS_SQL appear as a single tokens instead of being split in half due to underscore symbol treated as a special character:
SQL> begin
2 ctx_ddl.create_preference('books_lexer', 'BASIC_LEXER');
3 ctx_ddl.set_attribute('books_lexer', 'printjoins', '_');
4 end;
5 /

PL/SQL procedure successfully completed.
All I have to do now is to create my index:
SQL> CREATE INDEX ctx_books_text
2 ON books (text)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 PARAMETERS('filter ctxsys.null_filter lexer books_lexer memory 64m');

Index created.
And do the search:
SSQL> select author, title, published
2 from books
3 where contains(text, 'procedure', 1) > 0
4 order by score(1) desc;

622 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2979142934

-------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 785 | 1576K| | 352 (1)| 00:00:05 |
| 1 | SORT ORDER BY | | 785 | 1576K| 2104K| 352 (1)| 00:00:05 |
| 2 | TABLE ACCESS BY INDEX ROWID| BOOKS | 785 | 1576K| | 12 (0)| 00:00:01 |
|* 3 | DOMAIN INDEX | CTX_BOOKS_TEXT | | | | 12 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("CTXSYS"."CONTAINS"("TEXT",'procedure',1)>0)

Note
-----
- dynamic sampling used for this statement


Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
486 consistent gets
0 physical reads
0 redo size
19748 bytes sent via SQL*Net to client
801 bytes received via SQL*Net from client
43 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
622 rows processed
We got 622 books which took us 486 LIOs. So far so good.

Let's add something else

Things will get a bit interesting (or complicated, depending on the perspective) once you'll try to mix both structured and unstructured data in one search. If I will take the above search, but this time I'm interested only in books written by author XDB, the search will look like this:
SQL> select author, title, published
2 from books
3 where contains(text, 'procedure', 1) > 0
4 and author='XDB'
5 order by score(1) desc;

37 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2979142934

-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 18 | 37026 | 13 (8)| 00:00:01 |
| 1 | SORT ORDER BY | | 18 | 37026 | 13 (8)| 00:00:01 |
|* 2 | TABLE ACCESS BY INDEX ROWID| BOOKS | 18 | 37026 | 12 (0)| 00:00:01 |
|* 3 | DOMAIN INDEX | CTX_BOOKS_TEXT | | | 12 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("AUTHOR"='XDB')
3 - access("CTXSYS"."CONTAINS"("TEXT",'procedure',1)>0)

Note
-----
- dynamic sampling used for this statement


Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
486 consistent gets
0 physical reads
0 redo size
1448 bytes sent via SQL*Net to client
372 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
37 rows processed
Note that I've got only 37 books this time, however, it took me the same amount of resources. It is not a surprise, after all, because condition on an author column went as a filter predicate. In case your table has some indexes on structured columns already:
create index i_books_author on books (author);

Index created.
You can try to search using B*Tree index:
SQL> select /*+ index(books i_books_author */ author, title, published
2 from books
3 where contains(text, 'procedure', 1) > 0
4 and author='XDB'
5 order by score(1) desc;

37 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 806035686

-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 47 | 96679 | 288 (1)| 00:00:04 |
| 1 | SORT ORDER BY | | 47 | 96679 | 288 (1)| 00:00:04 |
|* 2 | TABLE ACCESS BY INDEX ROWID| BOOKS | 47 | 96679 | 287 (0)| 00:00:04 |
|* 3 | INDEX RANGE SCAN | I_BOOKS_AUTHOR | 88 | | 1 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("CTXSYS"."CONTAINS"("TEXT",'procedure',1)>0)
3 - access("AUTHOR"='XDB')

Note
-----
- dynamic sampling used for this statement


Statistics
----------------------------------------------------------
9 recursive calls
0 db block gets
223 consistent gets
0 physical reads
0 redo size
1454 bytes sent via SQL*Net to client
372 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
37 rows processed
Though a bit better, it just really went the other way around this time. We are now fetching all books written by XDB and then passing each book to Oracle Text for filtering.

The problem

As long as one of your structured/ustructured predicates have a good selectivity, the search using only either of indexes will produce good results, there is not much to post-filter, after all. However, if both predicates are selective only when applied together, you need something else to speed the things up. Indexes combined through BITMAP AND can produce some improvements:
SQL> select author, title, published
2 from books
3 where contains(text, 'procedure', 1) > 0
4 and author='XDB'
5 order by score(1) desc;

37 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3589384316

----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 47 | 96679 | 17 (6)| 00:00:01 |
| 1 | SORT ORDER BY | | 47 | 96679 | 17 (6)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID | BOOKS | 47 | 96679 | 16 (0)| 00:00:01 |
| 3 | BITMAP CONVERSION TO ROWIDS | | | | | |
| 4 | BITMAP AND | | | | | |
| 5 | BITMAP CONVERSION FROM ROWIDS| | | | | |
|* 6 | INDEX RANGE SCAN | I_BOOKS_AUTHOR | | | 1 (0)| 00:00:01 |
| 7 | BITMAP CONVERSION FROM ROWIDS| | | | | |
| 8 | SORT ORDER BY | | | | | |
|* 9 | DOMAIN INDEX | CTX_BOOKS_TEXT | | | 12 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

6 - access("AUTHOR"='XDB')
9 - access("CTXSYS"."CONTAINS"("TEXT",'procedure',1)>0)

Note
-----
- dynamic sampling used for this statement


Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
57 consistent gets
0 physical reads
0 redo size
1454 bytes sent via SQL*Net to client
372 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
2 sorts (memory)
0 sorts (disk)
37 rows processed
We are down to only 57 LIOs. However, the entire thing still requires us to fetch all relevant rowids from both indexes followed by a merge. On huge document sets each step may output quite a bit of data and require substantial processing resources.

The new filter option in 11G

11G's Oracle Text allows you to specify structured data columns in a filter section while creating Oracle Text indexes:
SQL> CREATE INDEX ctx_books_text
2 ON books (text)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 FILTER BY author
5 PARAMETERS('filter ctxsys.null_filter lexer books_lexer memory 64m');

Index created.
Let's see how the search will look like now:
SQL> select author, title, published
2 from books
3 where contains(text, 'procedure') > 0
4 and author='XDB';

37 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2443340341

----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 47 | 96679 | 11 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| BOOKS | 47 | 96679 | 11 (0)| 00:00:01 |
|* 2 | DOMAIN INDEX | CTX_BOOKS_TEXT | | | 8 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CTXSYS"."CONTAINS"("TEXT",'procedure')>0)
filter("AUTHOR"='XDB')

Note
-----
- dynamic sampling used for this statement


Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
48 consistent gets
0 physical reads
0 redo size
1357 bytes sent via SQL*Net to client
372 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
37 rows processed
Note only 48 LIOs. While it may not be a solid improvement over merged indexes example, the important point here is that Oracle was able to get all required data from a context index alone.

What about 10G?

There is no filter option for context indexes in 10G, however, that doesn't mean that there is nothing you could do to help you facilitate your mixed structured and unstructured data searches. I'll blog about what could be done in 10G next time which, by the way, can be applied to 11G as well.

Tuesday, December 30, 2008

Alter database datafile offline drop

There was a topic on oracle-l recently and that remind me that offline drop provides one useful feature which people usually not leveraging.

This feature not only allows you to open your database without datafile foo...

SQL> startup
ORACLE instance started.

Total System Global Area 700448768 bytes
Fixed Size 1260844 bytes
Variable Size 310379220 bytes
Database Buffers 385875968 bytes
Redo Buffers 2932736 bytes
Database mounted.
ORA-01157: cannot identify/lock data file 4 - see DBWR trace file
ORA-01110: data file 4: '/u01/oradata/XE/datafile/foo.dbf'


SQL> alter database datafile 4 offline drop;

Database altered.

SQL> alter database open;

Database altered.
But what's really important about this is that offline drop doesn't really drops anything (it just updates the controlfile to say that file isn't there) and what you can do later is:
[oracle@srm oradata]$ rman

Recovery Manager: Release 10.2.0.1.0 - Production on Tue Dec 30 19:12:22 2008

Copyright (c) 1982, 2005, Oracle. All rights reserved.

RMAN> connect target;

connected to target database: XE (DBID=2555430687)

RMAN> restore tablespace foo;

Starting restore at 30-DEC-08
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=73 devtype=DISK

channel ORA_DISK_1: starting datafile backupset restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
restoring datafile 00004 to /u01/oradata/XE/datafile/foo.dbf
channel ORA_DISK_1: reading from backup piece /u01/oradata/fra/XE/backupset/2008_12_30/o1_mf_nnnd0_TAG20081230T190854_4oog0pt0_.bkp
channel ORA_DISK_1: restored backup piece 1
piece handle=/u01/oradata/fra/XE/backupset/2008_12_30/o1_mf_nnnd0_TAG20081230T190854_4oog0pt0_.bkp tag=TAG20081230T190854
channel ORA_DISK_1: restore complete, elapsed time: 00:00:02
Finished restore at 30-DEC-08

RMAN> recover tablespace foo;

Starting recover at 30-DEC-08
using channel ORA_DISK_1

starting media recovery
media recovery complete, elapsed time: 00:00:01

Finished recover at 30-DEC-08

RMAN> sql 'alter tablespace foo online';

sql statement: alter tablespace foo online
In other words, it can be used to facilitate your database restore in certain cases. Imagine that your database has transactional and reporting data and that that data is spread across different tablespaces. In case your transactional data volume is small compared to reporting stuff (which is usually the case), you can plan your restore like this:

  • Restore everything but your analytical tablespaces, then offline drop missing datafiles and open your database => you are back into transactional business and immediate world stops screaming around you.
  • Restore your analytical tablespaces => they don't have to steer the company blindly anymore.

    Depending on the data volumes, the first step might take only a fraction of time compared to restoring everything in one shot.
  •