THE LEXIKON
An extensive lexicon
is essential to advanced text-analysis software. The lexicon acts like a data base on the
linguistic features and meanings of input words. It must describe them accurately, because in
most analysis systems, almost all processing decisions will be based
directly on the lexicon's data.
Until now, developers needing this kind of software "expertise" had to get it
from human linguists, who would painstakingly assemble the data at great
cost. Now, Lexikos has pre-packaged
this unique type of knowledge into a portable, modular software tool that is
easily incorporated into new applications.
Our
"Lexikon" is a unique commercial software product: a standalone
virtual data base package that models tens of thousands of common English
words. When it is given a line or
paragraph of text, the Lexikon looks up each word, then computes and returns a
"map" of all the roots, syntactic features and semantic markers which
could apply to that word in that specific context. Application code then uses the combined data
returned for all the words to do its intended task.
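
The lookup step described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the Lexikon's actual interface or data: the names TOY_LEXICON and lookup_text, the three sample entries, and the feature and marker labels are all hypothetical.

```python
import re

# Hypothetical toy entries. Each word maps to a list of possible readings,
# and each reading carries a root, syntactic features, and semantic markers.
TOY_LEXICON = {
    "the": [{"root": "the", "syntax": ["determiner"], "semantics": []}],
    "dog": [{"root": "dog", "syntax": ["noun", "count"], "semantics": ["animal"]}],
    "bark": [
        {"root": "bark", "syntax": ["verb", "intransitive"], "semantics": ["sound"]},
        {"root": "bark", "syntax": ["noun", "mass"], "semantics": ["plant-part"]},
    ],
}


def lookup_text(text):
    """Return one (word, readings) pair for each word of the input text.

    Unknown words get an empty list of readings.
    """
    words = re.findall(r"[A-Za-z']+", text)
    return [(word, TOY_LEXICON.get(word.lower(), [])) for word in words]


# Application code would consume the combined result for the whole line.
for word, word_readings in lookup_text("The dog bark"):
    print(word, [reading["syntax"] for reading in word_readings])
```

In this sketch an ambiguous word such as "bark" simply returns both readings; narrowing them by context is a separate step.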
THE BENEFITS TO ITS USERS
The Lexikon lets
advanced software development proceed at maximum efficiency. By providing
its users with accurate lexical data in adequate bulk, it can quickly advance
any software project involving the content, structure, or meaning of English
text. Developers focus on their own
goals, not on assembling a lexicon.
The Lexikon lets
developers confront the true issues of real-world text analysis. The availability
of good lexical data early on can prevent the misperceptions and wasted
man-years that may result when work is based on a toy system lexicon.
The Lexikon's data
is of very high quality. The system of features and markers which the Lexikon
uses to describe words and meanings was designed by Lexikos linguists
specifically for use by other software.
This avoids many problems that come from reusing data from dictionaries
meant for people, or (even worse) from using lexical data which was input by
programmers without linguistic training.
The Lexikon costs
far less than building a lexicon internally.
Lexikos is an industrial supplier of
tools for natural language software development, so we can afford the
specialized people, tools, and testing demanded in lexicon creation. Lexikos clients take advantage of this,
reduce their overhead in these areas, and save considerable time and money by
exploiting the natural division of labor.
The Lexikon can be
put to work quickly. We can offer it ready to run on a large-RAM
80x86 PC, a PC-resident 80386 co-processor board, a Mac II, an engineering
workstation, or any other host computer supporting Common Lisp.
Overall, if our
Lexikon is used as the "front end" to your parser, expert system,
text indexer, or other English-analysis application, the total benefits will be
considerable: your application system
will be deployed sooner, with better results, at much lower net R&D costs.
TECHNICAL SUMMARY OF THE LEXIKON
In operation, the
Lexikon automatically turns ASCII characters (keyboard inputs or the lines of a
text file) into a detailed lexical model of the English phrase or paragraph
they represent. The output separately
depicts each word, modeled in context.
These word models are easy to use, linguistically correct, and
surprisingly unambiguous. Each
representation of a word includes:
*The syntactic features for the word, including
all its expected complement patterns, at a level of detail enabling a
sophisticated syntax analysis.
*The semantic class markers for the word, which
describe its possible denotations and provide for application-specific data
inheritance.
*The morphological changes for inflected, derived
and irregular forms. The Lexikon
automatically finds each root and adjusts its data as required.
*Links to multi-token words, idioms, and names of
which the input word may be a part, using methods that aid the on-line parsing
of these constructs.
*Reduced lexical ambiguity, due to partial
parsing rules that work invisibly to exploit constraints on the capitalization
and context of each word.
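
Two of the items above, root-finding for inflected forms and the use of capitalization to reduce ambiguity, can be sketched roughly as follows. This is a minimal illustration under assumptions, not a description of the Lexikon's rules: the tables ROOTS, IRREGULAR and SUFFIX_RULES, the functions readings and filter_by_capitalization, and all feature names are hypothetical.

```python
# Hypothetical root table: each root lists the categories it allows.
ROOTS = {
    "walk": ("noun", "verb"),
    "mark": ("noun", "verb"),
    "go": ("verb",),
    "Mark": ("proper-noun",),
}

# Irregular forms map directly to a root and the adjusted features.
IRREGULAR = {"went": ("go", ("verb", "past"))}

# (suffix, category the root must allow, features of the inflected form)
SUFFIX_RULES = [
    ("ing", "verb", ("verb", "present-participle")),
    ("ed", "verb", ("verb", "past")),
    ("s", "verb", ("verb", "third-singular")),
    ("s", "noun", ("noun", "plural")),
]


def readings(token):
    """Return all (root, features) readings for a token, ignoring context."""
    lower = token.lower()
    forms = [token] if token == lower else [token, lower]
    found = []
    for form in forms:
        for category in ROOTS.get(form, ()):
            found.append((form, frozenset({category})))
    if lower in IRREGULAR:
        root, features = IRREGULAR[lower]
        found.append((root, frozenset(features)))
    for suffix, category, features in SUFFIX_RULES:
        root = lower[: -len(suffix)]
        if lower.endswith(suffix) and category in ROOTS.get(root, ()):
            found.append((root, frozenset(features)))
    return found


def filter_by_capitalization(token, sentence_initial, candidates):
    """Drop readings that conflict with how the token is capitalized."""
    if token[:1].isupper() and not sentence_initial:
        # A capital in mid-sentence favors the proper-noun reading.
        proper = [r for r in candidates if "proper-noun" in r[1]]
        return proper or candidates
    if token[:1].islower():
        common = [r for r in candidates if "proper-noun" not in r[1]]
        return common or candidates
    # Sentence-initial capitals tell us nothing, so keep every reading.
    return candidates
```

Under these toy rules, "Mark" in mid-sentence keeps only its proper-noun reading, while the same token at the start of a sentence keeps all three readings, since the capital letter there carries no information.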
The Lexikon provides
excellent coverage of common English vocabulary. Because no vocabulary can cover all possible
inputs, however, the Lexikon also includes an interactive module called the
"Scanner," which extends its fully automatic logic.
The Scanner is a
control and debugging shell which lets users flexibly query the Lexikon's word
models, either in isolation or as constrained by their context in a paragraph. It also provides a set of controls on the
processing strategies and output formats of the combined system, some of which
may cause prompts for on-line operator inputs:
*Logic for on-line word-learning can be enabled
to help a clerical operator expand the Lexikon's vocabulary as needed during
each job, so every word and name is properly modeled in the output. As a bonus, it also helps the operator find
and fix misspellings or other flaws in the input text.
*Other controls cause the Scanner to consult with the
operator via menus to interactively remove lexical ambiguity that may
remain in the output of the Lexikon.
These extra human inputs can let the logic of a follow-on parser be
simplified, or in some applications even substitute for parsing code.
*The Scanner's interactive logic may itself be augmented
by a full on-line copy of Roget's Thesaurus. This option can make production-scale use of
the complete Lexikon system easier still, by expanding its vocabulary and making operator inputs to the Scanner
simpler and less frequent.
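
The on-line word-learning control described above might work along the following lines. This is a hedged sketch only: the function scan_with_learning, the ask_operator callback, and the "add" / "fix" / "skip" replies are assumptions made for illustration, and a real operator dialogue would use prompts or menus rather than a scripted function.

```python
def scan_with_learning(tokens, lexicon, ask_operator):
    """Look up each token, consulting the operator about unknown words.

    ask_operator(word) is assumed to return one of:
      ("add", entry)           teach the lexicon a new word
      ("fix", corrected_word)  repair a misspelling in the input
      ("skip", None)           leave the word unmodeled
    """
    output = []
    for token in tokens:
        word = token.lower()
        if word not in lexicon:
            action, value = ask_operator(word)
            if action == "add":
                lexicon[word] = value  # vocabulary grows during the job
            elif action == "fix" and value in lexicon:
                word = value  # continue with the corrected spelling
        output.append((word, lexicon.get(word)))
    return output


# Usage with a scripted stand-in for the clerical operator.
lexicon = {"parser": {"syntax": ["noun"]}}


def scripted_operator(word):
    answers = {
        "lexikon": ("add", {"syntax": ["proper-noun"]}),
        "parsr": ("fix", "parser"),
    }
    return answers.get(word, ("skip", None))


result = scan_with_learning(["Lexikon", "parsr", "parser"], lexicon, scripted_operator)
```

Passing the operator in as a callback keeps the lookup loop the same whether the answers come from a person at a terminal or, as here, from a script.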
When richly detailed
data from the Lexikon is combined in the Scanner with the on-line guidance of
an operator, the net output stream can become very complete and precise,
exhibiting no gaps, no lexical ambiguity, and a human-like sophistication in
the contextual interpretation of words.
This kind of
accuracy is unique among today's text-analysis software. We think it will greatly aid current
development work in natural language processing and help produce a new spurt of
growth in practical text-processing applications.