Oracle8 ConText Cartridge Application Developer's Guide
Release 2.4
A63821-01


4
Theme Queries

This chapter describes how to perform theme queries. The following topics are covered:

Understanding Theme Queries

Theme queries enable you to search for documents by their major concepts. The following sections describe the theme indexing and querying processes and how they use the knowledge base:

Theme Indexing Concepts

Figure 4-1

 

Before you can issue a theme query, your set of documents must be indexed by theme. During theme indexing, ConText extracts up to fifty main concepts, or themes, from a document and stores them in the theme index. A weight is also associated with every theme that is indexed. A theme can be a concrete concept, such as insects, or an abstract concept, such as success, provided that it is sufficiently developed in the document.

Figure 4-1 illustrates how ConText uses the knowledge base to extract document themes from an example document "The Reproductive Cycle of Insects" that contains information about insects. This example shows that ConText recognizes the following types of themes:

Known Themes

Known themes are document themes that can attach to a branch of the knowledge base.

In the example in Figure 4-1, document A, entitled "The Reproductive Cycle of Insects", contains information about insects. The known document theme insects has four parent themes in its branch of the knowledge base: science and technology, hard sciences, biology, and zoology. Each theme in the branch, including insects itself, is entered as a searchable row in the theme index along with a weight.

When themes are indexed as such, a theme query on insects or any of its parents returns the document A.
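As an illustrative sketch, the following one-step queries would each be expected to return document A. The table TEXTAB and its theme-indexed column text are borrowed from the one-step examples at the end of this chapter; storing document A in that table is an assumption made only for illustration.

```sql
-- Assumption: document A is stored in TEXTAB and the column text
-- has a theme index, as in the one-step examples in this chapter.

-- Query on the known document theme itself.
SELECT * FROM TEXTAB
WHERE CONTAINS (text, 'insects') > 0;

-- Query on a parent theme from the same knowledge base branch.
SELECT * FROM TEXTAB
WHERE CONTAINS (text, 'zoology') > 0;
```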

Unknown Themes

Unknown themes are document themes that cannot be found in the knowledge base, because they are either unknown to the knowledge base or inherently ambiguous.

Figure 4-1 shows how an unknown theme of Dr. Mack is extracted without having a representation in the knowledge base. Unknown themes such as this are indexed as a single row.

Ambiguous document themes such as the term cricket or the term table also have no attachments to the knowledge base and hence are indexed as a single row. To query on ambiguous document themes, you would rely on other supporting themes such as sports or insects being indexed with an ambiguous theme like cricket.
 

See Also: 

For more information about querying ambiguous themes, see "Refining Theme Queries" in this chapter. 

 
 

Theme Weight

The theme weight is a measure of the strength of a theme relative to the other themes in a document. Weights are indexed with every theme and the related parent themes extracted from a document. ConText uses theme weights to help score theme queries.

Theme Querying

Figure 4-2

 

To execute a theme query, you specify a query string, which can be a sentence or a phrase with or without operators. ConText uses the knowledge base to normalize the word or phrase you enter into a standard form. It then looks up the normalized theme in the index and returns the documents that were indexed with the given theme. See Figure 4-2. Scores for theme queries are calculated based on the weights associated with each theme in the index.

For example, a theme query on insect retrieves the document indexed in Figure 4-1 entitled "The Reproductive Cycle of Insects". Likewise, a theme query on any of the indexed parents, such as science and technology, hard sciences, biology, or zoology, also retrieves the same document.
 


Note: 

When you issue a theme query, you are asking ConText to return to you all the documents that ConText indexed with that theme. For ConText to attach a theme to a document, the idea or concept must be developed sufficiently in the document. If a concept is not developed sufficiently in a document, ConText does not index it as a document theme, and consequently the document is not returned in a query for that theme. 


 
 

Scoring

ConText returns a relevance score for each document it returns in a theme query; the higher the score, the more relevant the returned document. This relevance score is out of 100 and is based on the weight of the indexed theme.

Generally, specifying broader themes or concepts in a theme query will return higher scoring documents.

When using operators in theme queries, the scoring behavior is the same as for regular text queries. For example, the OR operator returns the higher score of its operands, and the AND operator returns the lower score of its operands.
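The operator scoring rules can be observed by selecting the score alongside each hit. The sketch below assumes the table TEXTAB with the columns ID and text used in the examples at the end of this chapter, and passes 0 as an arbitrary label so that SCORE(0) refers to the CONTAINS clause.

```sql
-- OR: each row's score is the higher of its scores
-- for the themes soccer and basketball.
SELECT ID, SCORE(0) FROM TEXTAB
WHERE CONTAINS (text, 'soccer | basketball', 0) > 0;

-- AND: each row's score is the lower of the two scores, so a row
-- is returned only when both themes were indexed for the document.
SELECT ID, SCORE(0) FROM TEXTAB
WHERE CONTAINS (text, 'soccer & basketball', 0) > 0;
```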

Case-Sensitivity

Theme queries are case-sensitive. For example, doing a query on the common noun turkey produces a hit on turkey the bird. Such a query does not produce a hit on the proper noun Turkey, which describes a country. To query on the proper noun, you must enter the query as Turkey.
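A minimal sketch of the two spellings, assuming the theme-indexed table TEXTAB from the examples at the end of this chapter:

```sql
-- Common noun: matches documents with the theme of turkey the bird.
SELECT * FROM TEXTAB
WHERE CONTAINS (text, 'turkey') > 0;

-- Proper noun: matches documents with the theme of Turkey the country.
SELECT * FROM TEXTAB
WHERE CONTAINS (text, 'Turkey') > 0;
```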

Recognition of Known Themes

Even though ConText theme queries are case-sensitive, ConText tolerates poorly formatted input for known themes.

For example, entering microsoft or microSoft returns documents that include the theme of Microsoft, a known company. Likewise, entering Currency Rates returns documents that include a theme of currency rates, a standard classification in business and economics.
 


Note: 

ConText always attempts to match the entered theme with themes in the index. For example, if you enter microsoft, ConText looks up microsoft and Microsoft in the index. Likewise, if you enter Currency Rates as your theme, ConText looks up Currency Rates and currency rates in the index. 


 
 

Constructing Theme Queries

The following section describes how to construct theme queries:

Using Operators

With theme queries, the following operators have the same semantics as with regular text queries:

Operator      Symbol
Accumulate    ,
Or            |
And           &
Minus         -
Not           ~
Weight        *
Threshold     >
Max           :

 

Examples

Some valid theme query strings using operators are as follows:

contains(text, 'cricket ~ insects') > 0;
contains(text, 'cricket & sports') > 0;
contains(text, 'music, reggae*5') > 0;
contains(text, 'chemistry > 30') > 0;
contains(text, 'soccer | basketball') > 0;
contains(text, 'computer software - Microsoft') > 0;
contains(text, 'music:20') > 0;
 
See Also: 

For more information about how to use operators in theme queries, see "Refining Theme Queries" in this chapter. 

For more information about the semantics of query operators, see Chapter 3, "Understanding Query Expressions".

 
 

Thesaurus Operators

In a theme query, the thesaurus operators (synonym, broader term, narrower term, and so on) work the same way as in a regular text query, provided that a thesaurus has been created or loaded.
 

See Also: 

For more information about thesaurus operators, see "Thesaurus Operators" in Chapter 3.

 
 

Grouping Characters

In theme query expressions, the grouping characters ( ) [ ] have the same semantics as with a regular text query.
 

See Also: 

For more information about grouping characters, see "Grouping Characters" in Chapter 3.

 
 

Wildcard Characters

In theme query expressions, the wildcard characters % and _ work the same way as in regular text queries.
 


Note: 

There is a risk of ambiguity when using a wildcard character. For example, a theme query on %court% might return documents that have a theme of court of law or tennis court.


 
  
See Also: 

For more information about wildcard characters, see "Wildcard Characters" in Chapter 3.

 
 

Unsupported Operators

ConText does not support the following query expression operators with theme queries:

Operator      Symbol
Near          ;
Fuzzy         ?
Soundex       !
Stem          $

 

Phrasing Theme Queries

The following issues affect the phrasing of theme queries.

Use Noun Forms

When you enter your theme query, ConText normalizes the word or phrase representing your theme into a form that it can compare with document themes in the index. This normal form consists of nouns and noun phrases, such as chemistry or personal computer. It is therefore better to use nouns and noun phrases when constructing theme queries. Avoid using sentences or long phrases.

For example, to search for documents about computer programming, use the noun form computer programming, not programming my computer.

Avoid Splitting Phrases

Avoid splitting phrases that describe your idea as a whole. For example, use the phrase physical chemistry, not physical and chemistry.
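The difference can be sketched with two one-step queries, assuming the theme-indexed table TEXTAB from the examples at the end of this chapter:

```sql
-- Preferred: the noun phrase is submitted whole, so ConText can
-- normalize it and look it up as a single theme.
SELECT * FROM TEXTAB
WHERE CONTAINS (text, 'physical chemistry') > 0;

-- Discouraged: the idea is split into two separate themes, physical
-- and chemistry, combined with the AND operator.
SELECT * FROM TEXTAB
WHERE CONTAINS (text, 'physical & chemistry') > 0;
```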

Understand Case-Sensitivity

Theme queries are case-sensitive. For example, doing a query on the common noun turkey, which describes a type of bird, will not produce a hit on the proper noun Turkey, which describes a country.
 


See Also: 

For more information about case-sensitivity and theme queries, see the "Theme Querying" section in this chapter. 


 
 

Refining Theme Queries

Depending on how you write your theme query, ConText usually returns documents that are relevant to your query as well as documents that might be irrelevant to your query. Before you issue the query, you do not know what combination of document themes your query will return.

For example, a query on cricket might return documents on sports and insects depending on your document set. The best way to know the possible outcome is to run the query and examine the set of returned documents. Then you run the query again, using logical operators to eliminate unwanted documents.

You can approach the trial and error method in one of two ways: by restricting a broad query or by expanding a specific one.

Restricting a Query

Starting with broad theme queries might generate noise or unwanted documents. This is because of the following:

You can use the AND or NOT operator to eliminate unwanted documents. However, use these operators with caution, because in both cases you run the risk of eliminating documents that you might be interested in. For this reason, it is always better to have some noise than none at all.

Using AND

You can use the AND operator with a qualifying theme to restrict your theme query and hence eliminate noise.

For example, if a theme query on cricket returns documents about both cricket the sport and cricket the insect, and you are interested only in the documents about the sport, you can restrict your query by qualifying cricket with the more general category sports as follows:

'cricket and sports'

The disadvantage of using AND with a restricting theme is that a successful query depends on both themes being developed sufficiently in the document for ConText to index them as such. For example, a hypothetical news article about the personal affairs of a cricket player might not develop the theme of sports sufficiently for ConText to index sports as a theme, and therefore such a document would not be returned by the above query.
 


Suggestion: 

When choosing the restricting condition to use with the AND operator, we recommend choosing a broad category; choosing a very specific category as the restricting condition might inadvertently eliminate relevant documents. 


 
 

Using NOT

You can use the NOT operator to exclude unwanted themes. For example, suppose you have a collection of news articles. You find that a theme query on cricket returns documents about cricket the sport as well as cricket the insect.

In such a scenario, you can use the NOT operator to exclude the unwanted theme. Thus if you are interested in those documents only about the sport cricket, you exclude documents about insects as follows:

'cricket not insects'

One disadvantage of using the NOT operator is that you run the risk of excluding documents that are coincidentally about the desired theme and the unwanted theme. For example, the above query does not return a hypothetical document about a cricket game that was swarmed by locusts, assuming that the theme of insects is developed sufficiently for ConText to index insects as a document theme.

Another disadvantage of using NOT is that you usually have a better idea of the themes you want, not of the themes you don't want. Predicting unwanted themes depends on knowing your document corpus. For this reason, using NOT is best suited for eliminating irrelevant high-ranking documents you specifically know about.

Expanding a Query

Sometimes it is better to start with specific categories and then expand these queries into more general ones, especially when your query covers a topic that belongs to a narrow, well-defined category. For example, if you are searching for documents about bees, you issue a query on bees, which is a specific category of insects. If you find that the result set does not contain the documents you need, you can expand the query by issuing a theme query on insects, which is slightly broader.

After expanding a query, you can use the NOT or AND operators to scale back the query.
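The expand-then-scale-back cycle might look like the following sequence of one-step queries. This is a sketch only: it assumes the table TEXTAB with the columns ID and text used in the examples at the end of this chapter, and the excluded theme cricket is a hypothetical unwanted theme chosen for illustration.

```sql
-- Step 1: start with the specific theme.
SELECT ID, SCORE(0) FROM TEXTAB
WHERE CONTAINS (text, 'bees', 0) > 0;

-- Step 2: the result set is missing documents you need,
-- so expand to the broader theme.
SELECT ID, SCORE(0) FROM TEXTAB
WHERE CONTAINS (text, 'insects', 0) > 0;

-- Step 3: the broader query returns noise, so scale it back by
-- excluding a theme you know you do not want (hypothetical here).
SELECT ID, SCORE(0) FROM TEXTAB
WHERE CONTAINS (text, 'insects ~ cricket', 0) > 0;
```

Examine the returned documents after each step before deciding whether to expand or restrict further.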

Theme Query Examples

To execute a theme query, you specify a query string, which can be a sentence or a phrase with or without operators. ConText interprets your query, creating a normalized form of your query that it can use to match against document themes in the index. ConText returns a list of documents that satisfy the query, based on certain rules, along with a score of how relevant each document is to the query.

You can issue theme queries using either the two-step or one-step method. The way in which ConText matches themes and scores hits is the same for both methods.
 


Note: 

To issue theme queries, you must have a theme index. 

For more information about how to create a theme index on a text column, see Oracle8 ConText Cartridge Administrator's Guide.


 
 

Two-Step Query

To execute a theme query with the CTX_QUERY.CONTAINS procedure against a theme index, you must specify a policy that has a theme lexer associated with it.

For example, you specify a theme query on computer software as follows:

execute ctx_query.contains('THEME_POL', 'computer software', 'CTX_TEMP');

In the above example, ConText normalizes computer software, and then attempts to match the normal form with document themes in the index.

When a match is found, ConText uses the weight of the matched theme to compute a score that reflects how relevant the match is to the query; the higher the score, the more relevant the hit. ConText returns the matched document as part of the hitlist.
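The second step of a two-step query reads the hitlist from the result table. The sketch below assumes that the result table CTX_TEMP exposes the columns TEXTKEY and SCORE and that TEXTAB.ID is the textkey of the text table; check both assumptions against your own schema before using it.

```sql
-- Step 1: run the theme query; ConText writes the hits to CTX_TEMP.
execute ctx_query.contains('THEME_POL', 'computer software', 'CTX_TEMP');

-- Step 2: join the hitlist back to the text table.
-- Assumption: CTX_TEMP has the columns TEXTKEY and SCORE, and
-- TEXTAB.ID is the textkey column for the policy THEME_POL.
SELECT t.ID, h.SCORE
FROM CTX_TEMP h, TEXTAB t
WHERE h.TEXTKEY = t.ID
ORDER BY h.SCORE DESC;
```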

One-Step Query

You can execute theme queries in SQL*Plus using the one-step method. To do so, the text column must be indexed by theme. The way in which ConText matches themes and scores hits is the same as in a two-step query.

For example, to execute a theme query on computer software:

SELECT * FROM TEXTAB
WHERE CONTAINS (text, 'computer software') > 0;

Multiple Policies

For a text column that has more than one policy associated with it, you must specify which policy to use in the CONTAINS clause using the pol_hint parameter. You might create two policies for a column when you want to perform both theme and text queries on the column.

For example, if the column text had a regular text policy and a theme policy THEME_POL associated with it, you issue a theme query as follows:

SELECT ID, SCORE(0) FROM TEXTAB
WHERE CONTAINS (text, 'computer software', 0, 'THEME_POL') > 0;

When you specify pol_hint, you must also specify a placeholder (in this example 0) for the LABEL parameter.
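With two policies on the same column, the same query string can be run as a theme query or as a regular text query by changing only pol_hint. In the sketch below, TEXT_POL is a hypothetical name for the regular text policy; THEME_POL is the theme policy from the example above.

```sql
-- Theme query: matches documents indexed with the theme computer software.
SELECT ID, SCORE(0) FROM TEXTAB
WHERE CONTAINS (text, 'computer software', 0, 'THEME_POL') > 0;

-- Text query: matches documents through the regular text policy instead.
-- Assumption: TEXT_POL is the name given to that policy.
SELECT ID, SCORE(0) FROM TEXTAB
WHERE CONTAINS (text, 'computer software', 0, 'TEXT_POL') > 0;
```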
 

See Also: 

For more information about using the pol_hint parameter in the CONTAINS function, see the specification for CONTAINS in Chapter 9.

 
 



Copyright © 1998 Oracle Corporation. 
All Rights Reserved. 
