7
ConText Linguistics

This chapter describes the approach used by ConText to provide thematic analysis of English-language text.

Overview of ConText Linguistics

Figure 7-1

ConText linguistics is a system that extracts the main ideas from English-language text and uses the main ideas to produce different forms of output. These main ideas are referred to as themes.

As shown in Figure 7-1, ConText's theme extraction system extracts themes from documents to produce CTX_LING output, theme highlighting, and theme indexes.

CTX_LING output is created on a per-document basis and gives you different views of documents for presentation. Theme highlighting is also available on a per-document basis. CTX_LING output and theme highlighting are known as ConText document services.

Theme indexes are created from a document set, against which you issue theme queries.

You can optionally use linguistic settings to control case conversion of text before it is processed as well as to control the size of Gists and theme summaries.

The theme extraction system illustrated in Figure 7-1 consists of a parsing engine and a knowledge base, which work together to extract themes from text. You can obtain thematic output in different forms, depending on how you invoke the system. The following table describes how to obtain each type of output:

Output	Text Input	Invocation
Theme Summaries, List of Themes, Gists	Single Document	Use the CTX_LING package with a ConText 'L' server.
Theme Highlighting	Single Document	Use CTX_QUERY.HIGHLIGHT with a ConText 'Q' server. A theme index is required.
Theme Index	Document Set	Use a theme lexer in the policy with CTX_DDL.CREATE_INDEX to index documents.

See Also:

For more information about how the theme extraction system works, refer to the "Theme Extraction System" section in this chapter.

For more information about theme summaries, list of themes, and Gists, see Chapter 8, "Using CTX_LING".

For more information about theme highlighting, see Chapter 6, "Document Presentation: Highlighting".

For information about creating theme indexes, see the Oracle8 ConText Cartridge Administrator's Guide.

For more information about issuing theme queries, see "Understanding Theme Queries" in Chapter 4.

What is a Theme?

Themes are the main ideas in a document. Themes can be concrete concepts such as Oracle Corporation, jazz music, football, England, or Nelson Mandela; themes can be abstract concepts such as success, happiness, motivation, or unification. Themes can also be groupings commonly defined in the world, such as chemistry, botany, or fruit.

When processing text to extract themes, ConText extracts up to fifty themes per document.

To derive document themes, ConText uses the information stored in the knowledge catalog. Most themes are concepts in the knowledge catalog. However, ConText can still infer themes that are not known concepts in the knowledge catalog.

See Also:

For more information about the knowledge catalog and how ConText extracts themes, see "Theme Extraction System" in this chapter.

Theme Weight

ConText assigns a weight to every theme it extracts from a document. Theme weight is a measure of how well that idea is developed in the document with respect to other themes in the document.

ConText returns a theme weight with each theme returned in a list of themes. During theme indexing, ConText also indexes document theme weights with themes and uses the weights to score theme queries issued against the index.
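The role of theme weights in query scoring can be illustrated with a minimal Python sketch. This is not ConText's implementation: the index layout, the function names build_theme_index and theme_query, and the weight values are all assumptions made for illustration. It only shows the idea of storing a weight with each theme and ranking query hits by that weight.

```python
from collections import defaultdict

def build_theme_index(doc_themes):
    """Map each theme to the documents that contain it, keeping the weight."""
    index = defaultdict(dict)
    for doc_id, themes in doc_themes.items():
        for theme, weight in themes.items():
            index[theme][doc_id] = weight
    return index

def theme_query(index, theme):
    """Return (doc_id, weight) pairs for a theme, highest weight first."""
    hits = index.get(theme, {})
    return sorted(hits.items(), key=lambda item: item[1], reverse=True)

# Hypothetical weights; real values are assigned by ConText.
doc_themes = {
    "doc1": {"jazz music": 42, "England": 10},
    "doc2": {"jazz music": 15, "football": 30},
}
index = build_theme_index(doc_themes)
results = theme_query(index, "jazz music")
```

In this sketch, a document in which a theme is better developed (higher weight) sorts ahead of one that mentions the theme only in passing.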

Text Input

Text input to the theme extraction system in Figure 7-1 can be one of the following:

single documents to create CTX_LING output
single documents to create theme highlighting
a set of documents stored in a text column to create theme indexes
a theme query expression, which ConText normalizes for index look-up

The best results are obtained when the text input to the theme extraction system is in mixed case. However, if your text is all-uppercase or all-lowercase, you can convert it to mixed case by changing linguistic settings.

See Also:

For more information about linguistic settings, see "Linguistic Settings" in this chapter.

In addition, having good paragraph and sentence structure improves results for generating CTX_LING output, theme highlighting, and theme indexes.

Theme Extraction System

Figure 7-2

The theme extraction system extracts themes from English-language text. It is made up of the following components:

knowledge base
parsing engine

Knowledge Base

The knowledge base is a collective term referring to the lexicon and the knowledge catalog. The parsing engine uses the knowledge base to help extract themes from text.

Lexicon

The lexicon is a static information store that provides word and phrase information for the parsing engine. The lexicon recognizes over five hundred thousand English words and phrases and defines hundreds of lexical characteristics for each word.

Note:

The lexicon is specific to the English language, handling both American and British usage and spelling.

Linguistic information about words in the lexicon is divided into the following types:

Information Type	Description
Syntax	Syntax flags indicate the part-of-speech of a word or phrase.
Theme	Theme flags identify the thematic qualities of a word (e.g. weak noun/needs support, strong verb). The parser uses these flags to determine how a word contributes to the thematic construction of the document as a whole.

In the theme extraction process, ConText uses the information in the lexicon to identify potential themes, and to help rank themes in a document.

Knowledge Catalog

Figure 7-3

The knowledge catalog is a tree-like structure whose branches break down various realms of discourse. The knowledge catalog is divided into the following six main categories as shown in Figure 7-3:

Business and Economics
Government and Military
Science and Technology
Social Environment
Geography
Abstract Ideas and Concepts

See Also:

For a complete breakdown of the categories in the knowledge catalog, see Appendix E, "Knowledge Catalog - Category Hierarchy".

Concepts

Concepts are leaf nodes in the knowledge catalog and can be associated with any level in the category tree. Concepts are related to parent categories by an "is-associated-with" relationship that covers specific relationships such as "is-a".

The category of trigonometry, whose branch appears in Figure 7-3, contains over 30 associated concepts including sines, cosines, radians and polar axes.

The category of success, located in the abstract ideas and concepts branch, contains over 30 associated concepts including award winners, conquerors, prosperity, and winning streaks.

For example, the category of mathematics, which is in the middle of the branch shown in Figure 7-3, has over 130 associated concepts. Some of these concepts include Isaac Newton, Fibonacci sequences, arithmetic progressions, and complex integers.

Other categories such as flowering plants contain over 1000 associated concepts.

The average number of concepts associated with a category in the knowledge catalog is approximately 94.

In the theme extraction process, all concepts in the knowledge catalog are potential document themes.

Note:

All categories are also concepts. This means that categories can also be potential document themes in the theme extraction process. For example, the categories of trigonometry and success can appear as document themes.
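The relationship between categories and concepts can be pictured as a set of child-to-parent links. The following Python sketch is illustrative only: the PARENT table is a simplified, partly hypothetical branch (intermediate levels of the real hierarchy are omitted), and the function names are assumptions, not part of ConText.

```python
# Child -> parent links for one simplified, partly hypothetical branch.
# A parent of None marks a top-level category.
PARENT = {
    "science and technology": None,
    "mathematics": "science and technology",
    "trigonometry": "mathematics",
    "sines": "trigonometry",
    "cosines": "trigonometry",
    "Fibonacci sequences": "mathematics",
}

def ancestors(node):
    """Return the chain of parents from node up to the top-level category."""
    chain = []
    parent = PARENT.get(node)
    while parent is not None:
        chain.append(parent)
        parent = PARENT.get(parent)
    return chain

def is_category(node):
    """A node acts as a category if at least one other node names it as parent."""
    return node in PARENT.values()

def potential_themes():
    """Every node, category or leaf concept, is a potential document theme."""
    return set(PARENT)
```

Note that mathematics is both a category (it has children) and a potential theme in its own right, which mirrors the statement that all categories are also concepts.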

Unknown and Ambiguous Concepts

ConText's knowledge catalog is not an exhaustive repository of all possible themes (concepts) that can be extracted from a document. Some concepts that ConText might extract from a document are not known to the knowledge catalog.

In addition, concepts such as bank, cricket, or tangent can have more than one meaning in English and hence are ambiguous. Because they are ambiguous, these concepts cannot be placed in the knowledge catalog and are treated as if they are unknown.

See Also:

For more information about how ConText handles unknown and ambiguous themes in the theme extraction process, see the following sections:

"Parsing Engine" in this chapter

"Theme Indexing Concepts" in Chapter 4

Normal Forms

In the theme extraction process, ConText must convert words and phrases in text to their normal forms so they can attach into the knowledge hierarchy. To make this conversion, the knowledge catalog keeps the following lists:

Type of List	Description
Standard Noun Forms	A list of mappings from inflected variations of words to their standard noun forms as stored in the knowledge catalog's hierarchy of concepts. For example, the words notify and notifies are mapped to the normal form notification; likewise, the words summarize and summarizes are mapped to the normal form summaries.
Alternate Forms	A list of mappings from acronyms, abbreviations, and alternate spellings to their standard forms. For example, IBM is an acronym for the standard form IBM - International Business Machines Corporation.

Parsing Engine

ConText uses the parsing engine to produce all types of thematic output, including CTX_LING output and theme indexes.

The parsing engine syntactically analyzes text, identifying phrase, sentence and paragraph boundaries. It then interprets meaning, selecting the high-information content to produce themes. The lexicon and knowledge catalog provide the reference information necessary to do this processing.

If case-conversion is enabled, the parsing engine converts all the text to lowercase and processes the text through the case-sensitivity routines to determine capitalization.

Note:

Case conversion does not affect the original text of the documents being processed; only the output of the parsing engine is stored in mixed-case.
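As a rough illustration of case conversion, the Python sketch below lowercases the text and then restores capitalization for sentence starts and for a small list of proper nouns. The actual case-sensitivity routines are not described in this chapter, so the rules used here, the PROPER_NOUNS table, and the function name to_mixed_case are all assumptions.

```python
import re

# Hypothetical proper-noun list; ConText would rely on its knowledge base.
PROPER_NOUNS = {"oracle": "Oracle", "england": "England"}

def to_mixed_case(text):
    """Lowercase the text, then re-capitalize sentence starts and proper nouns."""
    lowered = text.lower()
    # Capitalize the first letter of the text and of each new sentence.
    sentence_cased = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
    )
    # Restore the capitalization of known proper nouns.
    return re.sub(
        r"[A-Za-z]+",
        lambda m: PROPER_NOUNS.get(m.group(0).lower(), m.group(0)),
        sentence_cased,
    )

converted = to_mixed_case("THE TEAM FLEW TO ENGLAND. ORACLE SPONSORED THE TRIP.")
```

The function returns a new string and leaves its input untouched, in the same way that case conversion leaves the original document text unchanged.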

The following sections describe how the parsing engine analyzes text to extract themes.

Token Recognition

ConText breaks up text into paragraphs and then breaks paragraphs into tokens. Tokens can consist of either single words or phrases. Words are groups of characters separated by blank space or punctuation marks; phrases are sequences of two or more words.

Information about English words and phrases is derived from ConText's knowledge base. Sequences of words that match known phrases are collapsed and treated as single tokens for further processing. For example, the phrases stock market and relational database are treated as tokens.
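The collapsing of known phrases into single tokens can be sketched in Python as follows. This is a simplified illustration under stated assumptions: paragraphs are taken to be separated by blank lines, the KNOWN_PHRASES set is a hypothetical stand-in for the knowledge base, and a longest-match-first strategy is assumed.

```python
import re

# Hypothetical phrase list; ConText derives phrases from its knowledge base.
KNOWN_PHRASES = {("stock", "market"), ("relational", "database")}
MAX_PHRASE_LEN = max(len(p) for p in KNOWN_PHRASES)

def tokenize(paragraph):
    """Split a paragraph into words, then collapse known phrases into single tokens."""
    words = re.findall(r"[A-Za-z0-9']+", paragraph)
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 1, -1):
            candidate = tuple(w.lower() for w in words[i:i + n])
            if candidate in KNOWN_PHRASES:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so the word is a token by itself.
            tokens.append(words[i])
            i += 1
    return tokens

def recognize(text):
    """Break text into blank-line-separated paragraphs, then into tokens."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [tokenize(p) for p in paragraphs]

sample = "The stock market fell.\n\nA relational database stores rows."
token_lists = recognize(sample)
```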

Token Normalization

ConText converts each token to a normal form using information stored in the knowledge base. The normal form is the preferred form among all alternative forms of the token. A token that ConText finds in the knowledge base is a known token.

Specifically, token normalization includes the following transformations of alternative forms to preferred forms: Verbs are converted to their noun forms; most nouns are converted to their plural forms; and acronyms and abbreviations are converted to their full forms. For example, the acronyms IBM and I.B.M are converted to IBM - International Business Machines.

Words that mean the same thing for the purposes of text indexing and retrieval are also converted to normal forms. For example, the words loving and amorousness are normalized to love.

When a token cannot be found in the knowledge base, ConText guesses its part-of-speech and then normalizes it according to one of the standard transformations. However, since the token cannot be placed in the knowledge base, it is unknown, and is treated as its own normal form isolated from the knowledge base.
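A minimal Python sketch of the known/unknown distinction follows. The two lookup tables are small hypothetical excerpts built from the examples in this chapter, and the function name normalize is an assumption. For brevity the sketch skips the part-of-speech guess that ConText applies to unknown tokens and simply returns the token unchanged.

```python
# Hypothetical excerpts of the normal-form lists, using examples from the text.
STANDARD_NOUN_FORMS = {
    "notify": "notification",
    "notifies": "notification",
    "loving": "love",
    "amorousness": "love",
}
ALTERNATE_FORMS = {
    "ibm": "IBM - International Business Machines Corporation",
    "i.b.m": "IBM - International Business Machines Corporation",
}

def normalize(token):
    """Return (normal_form, is_known) for a token."""
    key = token.lower().rstrip(".")
    if key in ALTERNATE_FORMS:
        return ALTERNATE_FORMS[key], True
    if key in STANDARD_NOUN_FORMS:
        return STANDARD_NOUN_FORMS[key], True
    # Unknown token: it becomes its own normal form, outside the hierarchy.
    return token, False
```

For example, normalize("notifies") yields the known normal form notification, while a token that appears in neither table is returned as its own normal form and flagged as unknown.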

Theme Ranking

In this step, ConText scores the normalized tokens, known and unknown, then sorts the tokens, which are potential document themes, into a ranked list. The scoring and ranking of tokens is based on the information associated with each token in the knowledge base, such as what words and parts-of-speech are good candidates for themes. The highest ranking tokens are called themes.
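The scoring and sorting step might look like the following Python sketch. The scoring formula is an assumption chosen for illustration (occurrence count multiplied by a strength factor); the STRENGTH values are hypothetical stand-ins for the theme flags in the lexicon, and ConText's actual scoring is not documented here.

```python
from collections import Counter

# Hypothetical strength factors standing in for the lexicon's theme flags.
STRENGTH = {"strong": 2.0, "weak": 0.5}
DEFAULT_STRENGTH = 1.0  # used for tokens with no flag, such as unknown tokens

def rank_tokens(tokens, flags):
    """Score each normalized token and return (token, score) pairs, best first."""
    counts = Counter(tokens)
    scored = {
        token: count * STRENGTH.get(flags.get(token), DEFAULT_STRENGTH)
        for token, count in counts.items()
    }
    # Sort by score descending, then alphabetically for a stable order.
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))

tokens = ["love", "love", "notification", "things", "things", "things", "zorblax"]
flags = {"love": "strong", "notification": "strong", "things": "weak"}
ranked = rank_tokens(tokens, flags)
```

In this example the weak noun things occurs most often but still ranks below love, because its flag marks it as a poor theme candidate.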

Theme Accumulation

ConText combines duplicated and closely related themes into single themes. This is done by generalizing related themes to common parents using the hierarchical structure of the knowledge catalog. The goal of this process is to find the top-ranking themes, up to fifty, for a document.
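Generalizing related themes to a common parent can be sketched in Python as shown below. The merge rule (themes that share a parent are replaced by that parent, with their scores summed) and the names accumulate and parent_of are assumptions for illustration. Only the limit of fifty themes comes from the text.

```python
from collections import defaultdict

MAX_THEMES = 50  # upper limit stated in the text

def accumulate(ranked, parent_of, max_themes=MAX_THEMES):
    """Merge themes that share a parent into that parent and keep the top themes."""
    by_parent = defaultdict(list)
    for theme, score in ranked:
        by_parent[parent_of.get(theme)].append((theme, score))

    merged = defaultdict(float)
    for parent, group in by_parent.items():
        if parent is not None and len(group) > 1:
            # Several themes share a parent: generalize them to that parent.
            merged[parent] += sum(score for _, score in group)
        else:
            # A lone theme, or one with no known parent, is kept as it is.
            for theme, score in group:
                merged[theme] += score

    ordered = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:max_themes]

ranked = [("sines", 3.0), ("cosines", 2.0), ("prosperity", 4.0), ("zorblax", 1.0)]
parent_of = {"sines": "trigonometry", "cosines": "trigonometry", "prosperity": "success"}
themes = accumulate(ranked, parent_of)
```

Here sines and cosines are combined into the single theme trigonometry, which then outranks prosperity.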

Theme Proving

In the final step, ConText looks back at the known themes it generated and evaluates the evidence for each theme in the surrounding text.

Because words can be ambiguous or can be used with new meaning, ConText attempts to find support for the parent concept of each theme. Parent concepts are derived from the knowledge catalog.

If no support exists for the parent concept, ConText indexes the theme as a single row without the parent concept (theme).

Themes that are indexed as single rows have no parents in the hierarchical list-of-themes you obtain with CTX_LING.REQUEST_THEMES.
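The effect of theme proving on the parent link can be illustrated with a short Python sketch. The support test used here is an assumption: a parent counts as supported if the parent itself or another concept under the same parent also appears among the document's tokens. The function name prove_themes and the sample data are hypothetical.

```python
def prove_themes(themes, parent_of, document_tokens):
    """Pair each theme with its parent if the text supports it, otherwise with None."""
    present = set(document_tokens)
    proven = []
    for theme in themes:
        parent = parent_of.get(theme)
        # Other concepts that hang off the same parent in the catalog.
        siblings = {t for t, p in parent_of.items() if p == parent and t != theme}
        supported = parent is not None and (
            parent in present or bool(siblings & present)
        )
        proven.append((theme, parent if supported else None))
    return proven

parent_of = {
    "sines": "trigonometry",
    "cosines": "trigonometry",
    "prosperity": "success",
    "award winners": "success",
}
document_tokens = ["sines", "cosines", "prosperity", "weather"]
proven = prove_themes(["sines", "prosperity", "weather"], parent_of, document_tokens)
```

In this sketch sines keeps its parent because cosines also occurs in the text, while prosperity and the unknown theme weather are returned without a parent, which corresponds to themes indexed as single rows.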

See Also:

For more information about how ConText indexes themes, see "Theme Indexing Concepts" in Chapter 4.

Linguistic Settings

Linguistic settings are settings you can enable to control how ConText processes text to extract themes.

There are two types of linguistic settings that affect the output of the theme extraction system:

case-conversion settings
Gist and theme summary settings

Case-Conversion Settings

ConText provides two pre-defined linguistic setting labels for case-conversion. These settings affect the processing of all text input to the theme extraction system:

Setting	Description
GENERIC	Default configuration. Parses mixed-case English text. Produces theme output.
SA (Case Sensitive)	Same as GENERIC except that ConText converts text that is all-uppercase or all-lowercase to mixed-case text before performing theme analysis.

You can set linguistic settings labels with the CTX_LING.SET_SETTINGS_LABEL procedure.

Gist and Theme Summary Settings

You can use the administration tool to create settings labels to control the following options:

size of Gist
size of theme summary
Gist generation method

When you use the administration tool to create your own settings, you must use one of the ConText predefined settings as a starting point, depending on whether your text is mixed-case, all-uppercase, or all-lowercase.

See Also:

For more information about using the administration tool to create your own labels, see the help file for the administration tool.

For more information about Gists and theme summaries, see Chapter 8, "Using CTX_LING".

Enabling Linguistic Settings

To switch to a case-sensitive setting (SA) or to enable settings labels you create with the administration tool, you must use the CTX_LING.SET_SETTINGS_LABEL procedure.

Note:

When you enable a setting other than the default, it affects the way ConText processes text for only that session. To obtain the same type of processing in a new session, you must re-enable the settings with CTX_LING.SET_SETTINGS_LABEL.

See Also:

For more information on how to specify linguistic settings, see "Enabling Linguistic Settings" in Chapter 8, "Using CTX_LING".
