Parallel Translations as Sense Discriminators
means to distinguish word senses, at least to the
Abstract
degree that they are useful for natural language
document retrieval, and machine translation.
equivalents in four languages from different
exploited to automatically determine the sense
language families, extracted from an on-line
of a word in context (see Ide and Véronis, 1998),
parallel corpus of George Orwell’s Nineteen
including syntactic behavior, semantic and
Eighty-Four. The goal of the study is to
pragmatic knowledge, and especially in more
determine the degree to which translation
recent empirical studies, word co-occurrence
within syntactic relations (e.g., Hearst, 1991;
polysemous word in English are lexicalized
Yarowsky, 1993), words co-occurring in global
differently across a variety of languages, and
context (e.g., Gale et al., 1993; Yarowsky, 1992;
to determine whether this information can be
Schütze, 1992, 1993), etc. No clear criteria have
used to structure or create a set of sense
emerged, however, and the problem continues to
processing applications. A coherence index
The notion that cross-lingual comparison can be
is computed that measures the tendency for
useful for sense disambiguation has served as a
different senses of the same English word to
be lexicalized differently, and from this data
example, Brown et al. (1991) and Gale et al.
a clustering algorithm is used to create sense
(1992a, 1993) used the parallel, aligned Hansard Corpus of Canadian Parliamentary debates for
Introduction
WSD, and Dagan et al. (1991) and Dagan and
Itai (1994) used monolingual corpora of Hebrew
It is well known that the most nagging issue for
and German and a bilingual dictionary. These
studies rely on the assumption that the mapping
definition of just what a word sense is. At its
significantly among languages. For example, the
linguistic one that is far from being resolved.
word duty in English translates into French as
devoir in its obligation sense, and impôt in its
processing has led to efforts to find practical
tax sense. By determining the translation
equivalent of duty in a parallel French text, the
correct sense of the English word is identified.
generally, which languages should be considered
These studies exploit this information in order to
for this exercise? All languages? Closely related
gather co-occurrence data for the different
languages? Languages from different language
senses, which is then used to disambiguate new
texts. In related work, Dyvik (1998) used
patterns of translational relations in an English-
"enough" to provide adequate information for
University) to define semantic properties such as
There is also the question of the criteria that
synonymy, ambiguity, vagueness, and semantic
fields and suggested a derivation of semantic
distinction is "lexicalized cross-linguistically".
representations for signs (e.g., lexemes),
How consistent must the distinction be? Does it
hyponymy etc., from such translational relations.
mutually non-interchangeable lexical items in
some significant number of other languages, or
suggested that for the purposes of WSD, the
need it only be the case that the option of a
different senses of a word could be determined
different lexicalization exists in a certain
by considering only sense distinctions that are
lexicalized cross-linguistically. In particular,
Another consideration is where the cross-lingual
they propose that some set of target languages
information to answer these questions would
be identified, and that the sense distinctions to
come from. Using bilingual dictionaries would
be extremely tedious and error-prone, given the
applications and evaluation be restricted to those
substantial divergence among dictionaries in
that are realized lexically in some minimum
subset of those languages. This idea would seem
distinctions they make. Resnik and Yarowsky
to provide an answer, at least in part, to the
(1997) suggest EuroWordNet (Vossen, 1998) as
problem of determining different senses of a
a possible source of information, but, given that
word: intuitively, one assumes that if another
EuroWordNet is primarily a lexicon and not a
language lexicalizes a word in two or more
corpus, it is subject to many of the same
ways, there must be a conceptual motivation. If
objections as for bi-lingual dictionaries.
likely to find the significant lexical differences
information from parallel, aligned corpora.
that delimit different senses of a word.
Unlike bilingual and multi-lingual dictionaries,
translation equivalents in parallel texts are
questions. For instance, it is well known that
determined by experienced translators, who
evaluate each instance of a word’s use in context
languages (for example, the French intérêt and
rather than as a part of the meta-linguistic
the English interest), especially languages that
activity of classifying senses for inclusion in a
are relatively closely related. Assuming this
dictionary. However, at present very few parallel
problem can be overcome, should differences
aligned corpora exist. The vast majority of these
found in closely related languages be given
are bi-texts, involving only two languages, one
lesser (or greater) weight than those found in
of which is very often English. Ideally, a serious
evaluation of Resnik and Yarowsky’s proposal
(Germanic, Slavic, Finno-Ugric, and Romance),
would include parallel texts in languages from
two languages from the same family (Czech and
several different language families, and, to
Slovene), as well as one non-Indo-European
maximally ensure that the word in question is
used in the exact same sense across languages, it
Nineteen Eighty-Four is a text of about 100,000
would be preferable that the same text were used
words, translated directly from the original
over all languages in the study. The only
English in each of the other languages. The
currently available parallel corpora for more
parallel versions of the text are sentence-aligned
than two languages are Orwell's Nineteen
to the English and tagged for part of speech.
Eighty-Four (Erjavec and Ide, 1998), Plato's
Although Nineteen Eighty-Four is a work of
Republic (Erjavec, et al., 1998), the MULTEXT
fiction, Orwell's prose is not highly stylized and,
Journal of the Commission corpus (Ide and
as such, it provides a reasonable sample of
Véronis, 1994), and the Bible (Resnik, et al., in
modern, ordinary language that is not tied to a
press). It is likely that these corpora do not
given topic or sub-domain (such as newspapers,
provide enough appropriate data to reliably
technical reports, etc.). Furthermore, the
determine sense distinctions. Also, it is not clear
translations of the text seem to be relatively
how the lexicalization of sense distinctions
faithful to the original: for instance, over 95% of
across languages is affected by genre, domain,
the sentence alignments in the full parallel
This paper attempts to provide some preliminary
(Priest-Dorman, et al., 1997).
answers to the questions outlined above, in order
Nine ambiguous English words were considered:
to eventually determine the degree to which the
hard, head, country, line, promise, slight, seize,
use of parallel data is viable to determine sense
scrap, float. The first four were chosen because
distinctions, and, if so, the ways in which this
they have been used in other disambiguation
information might be used. Given the lack of
studies; the latter five were chosen from among
large parallel texts across multiple languages,
the words used in the Senseval disambiguation
the study is necessarily limited; however, close
exercise (Kilgarriff and Palmer, forthcoming). In
examination of a small sample of parallel data
all cases, the study was necessarily limited to
can, as a first step, provide the basis and
words that occurred frequently enough in the
direction for more extensive studies.
Orwell text to warrant consideration.
Five hundred forty-two sentences containing an
Methodology
I have conducted a small study using parallel,
morphological variants) of each of the nine
aligned versions of George Orwell's Nineteen
words were extracted from the English text,
Eighty-Four (Erjavec and Ide, 1998) in five
together with the parallel sentences in which
they occur in the texts of the four comparison
Romanian, and Czech.1 The study therefore
involves languages from four language families
Slovene). As Wilks and Stevenson (1998) have
pointed out, part-of-speech tagging
accomplishes a good portion of the work of
1 The Orwell parallel corpus also includes versions of
semantic disambiguation; therefore occurrences
Nineteen Eighty-Four in Hungarian, Bulgarian,
of words that appeared in the data in more than
Latvian, Lithuanian, Serbian, and Russian.
one part of speech were grouped separately.2
CIs do not determine whether or not a sense
The English occurrences were then grouped
distinction can be lexicalized in the target
language, but only the degree to which they are
(version 1.6) [Miller et al., 1990; Fellbaum,
lexicalized differently in the translated text.
1998]). The sense categorization was performed
However, it can be assumed that the CIs provide
by the author and two student assistants; results
a measure of the tendency to lexicalize different
from the three were compared and a final,
WordNet senses differently, which can in turn
mutually agreeable set of sense assignments
be seen as an indication of the degree to which
For each of the four comparison languages, the
For each ambiguous word, the CI is computed
corpus of sense-grouped parallel sentences was
sent to a linguist and native speaker of the
comparison language. The linguists were asked
to provide the lexical item in each parallel
sentence that corresponds to the ambiguous
English word. If inflected, they were asked to
provide both the inflected form and the root
• n is the number of comparison languages
form. In addition, the linguists were asked to
indicate the type of translation, according to the
• m_q and m_r are the number of occurrences of
sense s_q and sense s_r in the English corpus,
For over 85% of the English word occurrences
(corresponding to types 1 and 2 in Table 1), a
specific lexical item or items could be identified
is the number of times that senses q
and r are translated by the same lexical item
corresponding English word. For comparison
purposes, each translation equivalent was
represented by its lemma (or the lemma of the
root form in the case of derivatives) and
The CI is a value between 0 and 1, computed by
associated with the WordNet sense to which it
examining clusters of occurrences translated by
the same word in the other languages. If sense i
In order to determine the degree to which the
and sense j are consistently translated with the
assigned sense distinctions correspond to
same word in each comparison language, then
translation equivalents, a coherence index (CI)
CI(s_i, s_j) = 1; if they are translated with a
was computed that measures how often each pair
different word in every occurrence, CI(s_i, s_j) = 0.
of senses is translated using the same word as
In general, the CI for pairs of different senses
well as the consistency with which a given sense
provides an index of their relatedness; i.e., the
is translated with the same word.3 Note that the
greater the value of CI(s_i, s_j), the more frequently
occurrences of sense i and sense j are translated
2 The adjective and adverb senses of hard are
with the same lexical item. When i = j, we
considered together because the distinction is not
consistent across the translations used in the study.
3 Note that the CI is similar to semantic entropy
(Melamed, 1997). However, Melamed computes
entropy for word types, rather than word senses.
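The displayed formula for the CI did not survive in this copy, so the following is only a minimal sketch of one normalization that is consistent with the properties stated in the text: the value lies between 0 and 1, equals 1 when two senses are always translated by the same word in every comparison language, and equals 0 when they never share a translation. The function name, the data layout (one list of lemmas per language), and the choice to divide by the product of the occurrence counts and the number of languages are assumptions made for illustration, not the author's published definition.

```python
from collections import Counter


def coherence_index(trans_q, trans_r):
    """Hypothetical coherence index for two English senses q and r.

    trans_q and trans_r each hold one list per comparison language; each
    inner list gives the lemma chosen for every English occurrence of the
    sense, with None where no translation could be identified.
    Assumed normalization: matching occurrence pairs / (m_q * m_r * n).
    """
    n = len(trans_q)
    if n == 0 or n != len(trans_r):
        raise ValueError("both senses need the same, non-zero number of languages")
    total = 0.0
    for lemmas_q, lemmas_r in zip(trans_q, trans_r):
        m_q, m_r = len(lemmas_q), len(lemmas_r)
        if m_q == 0 or m_r == 0:
            raise ValueError("each sense needs at least one occurrence")
        # Occurrences without an identifiable translation stay in m_q and m_r
        # but can never contribute a match.
        count_q = Counter(l for l in lemmas_q if l is not None)
        count_r = Counter(l for l in lemmas_r if l is not None)
        # Number of (q occurrence, r occurrence) pairs sharing a lemma.
        same = sum(c * count_r[lemma] for lemma, c in count_q.items())
        total += same / (m_q * m_r)
    return total / n


# Placeholder lemmas (not taken from the corpus), two comparison languages.
sense_a = [["w1", "w1"], ["x1", "x1"]]
sense_b = [["w2"], ["x2"]]
print(coherence_index(sense_a, sense_a))  # internal consistency of sense_a
print(coherence_index(sense_a, sense_b))  # affinity between the two senses
```

Calling the function with the same sense on both sides gives the internal-consistency reading of the index (the i = j case described in the text).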
obtain a measure of the coherence of a given
A single lexical item is used to translate the English equivalent (possibly a different part of speech)
The English word is translated by a phrase of two or more words or a compound, which has the same
The English word is not lexicalized in the translation
A pronoun is substituted for the English word in the translation
An English phrase containing the ambiguous word is translated by a single word in the comparison
language which has a broader or more specific meaning, or by a phrase in which the specific concept corresponding to the English word is not explicitly lexicalized
Table 1 : Translation types and their frequencies
For example, Table 2 gives the senses of hard
and head that occurred in the data.5 The CI data
not yielding to pressure ; vs. “soft”
for hard and head are given in Tables 3 and 4.
CIs measuring the affinity of a sense with
itself—that is, the tendency for all occurrences
of that sense to be translated with the same
word--show that all of the six senses of hard
have greater internal consistency than affinity
Table 2 : WordNet senses of hard and head
with other senses, with senses 1.1 ("difficult" -
CI = .56) and 1.3 ("not soft" – CI = .63)
registering the highest internal consistency.6 The
individually as well as for different language
same holds true for three of the four senses of
head, while the CI for senses 1.3 (“intellect”)
(three different language families): Czech and
and 1.1 (“part of the body”) is higher than the CI
Slovene (Indo-European; and Estonian (non-
To better visualize the relationship between
senses, a hierarchical clustering algorithm was
applied to the CI data to generate trees reflecting
sense proximity.4 Finally, in order to determine
the degree to which the linguistic relation
between languages may affect coherence, a
correlation was run among CIs for all pairs of
5 Results for all words in the study are available at
http://www.cs.vassar.edu/~ide/wsd/cross-ling.html.
Although the data sample is small, it gives some
6 Senses 2.3 and 1.4 have CIs of 1 because each of
insight into ways in which a larger sample might
these senses exists in a single occurrence in the
corpus, and have therefore been discarded from
consideration of CIs for individual senses. We are
currently investigating the use of the Kappa statistic
(Carletta, 1996) to normalize these sparse data.
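The clustering step can be illustrated with a small agglomerative procedure. The text does not name the algorithm variant or the distance measure, so this sketch assumes single-link (minimum-distance) merging and a distance of 1 - CI between senses; both choices, along with the function name and the input layout, are assumptions for illustration only.

```python
def single_link_merges(senses, ci):
    """Agglomerative clustering over senses, assuming distance = 1 - CI.

    senses: list of sense labels.
    ci: dict mapping frozenset({a, b}) to the CI of each pair of distinct senses.
    Returns the merge history as (distance, cluster_1, cluster_2) tuples.
    """
    clusters = [(s,) for s in senses]
    merges = []

    def distance(c1, c2):
        # Single link: the closest pair of members defines the cluster distance.
        return min(1.0 - ci[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        d, i, j = min(
            (distance(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merges.append((d, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy CI values (hypothetical, not from the study).
toy_ci = {
    frozenset(("A", "B")): 0.8,
    frozenset(("A", "C")): 0.1,
    frozenset(("B", "C")): 0.3,
}
for d, left, right in single_link_merges(["A", "B", "C"], toy_ci):
    print(f"minimum distance = {d:.6f} {left} {right}")
```

With the toy values, the pair with the highest CI is merged first, and the remaining sense joins the resulting cluster last.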
remaining WordNet senses are scattered at
various places within the entries or, in some
hierarchical relations apparent in the clusters are
not reflected in the dictionary entries, since the
senses are for the most part presented in flat,
Figure 2 shows the sense clusters for hard
linear lists. However, it is interesting to note that
generated from the CI data.7 The senses fall into
the first five senses of hard in the COBUILD
two main clusters, with the two most internally
dictionary, which is the only dictionary in the
consistent senses (1.1 and 1.3) at the deepest
level of each of the respective groups. The two
examples9 and presents senses in order of
adverbial forms8 are placed in separate groups,
reflecting their semantic proximity to the
WordNet senses in this study. WordNet’s
different adjectival meanings of hard. The
“metaphorically hard” is spread over multiple
clusters for head (Figure 2) similarly show two
senses in the COBUILD, as it is in the other
distinct groupings, each anchored in the two
senses with the highest internal consistency and
the lowest mutual CI (“part of the body” (1.1)
The hierarchies apparent in the cluster graphs
make intuitive sense. Structured like dictionary
entries, the clusters for hard and head might
appear as in Figure 1. This is not dissimilar to
actual dictionary entries for hard and head; for
example, the entries for hard in four differently
constructed dictionaries (Collins English (CED),
Figure 1 : Clusters for hard and head structured as
Longman’s (LDOCE), Oxford Advanced Learner’s (OALD), and COBUILD) all list the
The results for different language groupings
“difficult” and “not soft” senses first and second,
show that the tendency to lexicalize senses
which, since most dictionaries list the most
differently is not affected by language distance
common or frequently used senses first, reflects
(Table 5). In fact, the mean CI for Estonian, the
the gross division apparent in the clusters.
only non-Indo-European language in the study,
Beyond this, it is difficult to assess the
is lower than that for any other group, indicating
that WordNet sense distinctions are slightly less
likely to be lexicalized differently in Estonian.
7 For the purposes of the cluster analysis, CIs of 1.00
resulting from a single occurrence were normalized
9 Editions of the LDOCE (1987 version) and OALD
8 Because root forms were used in the analysis, no
(1985 version) dictionaries consulted in this study
distinction in translation equivalents was made for
pre-date editions of those same dictionaries based on
Correlations of CIs for each language pair
(Table 5) also show no relationship between the
lexicalized differently and language distance.
This is contrary to results obtained by Resnik
and Yarowsky (submitted), who, using a metric
similar to the one used in this study, found that
non-Indo-European languages tended to
lexicalize English sense distinctions more than
Langs. Hard Country Line Head Ave.
Indo-European languages, especially at finer-
grained levels. However, their translation data
was generated by native speakers presented with
isolated sentences in English, who were asked to
provide the translation for a given word in the
sentence. It is not clear how this data compares
Table 6 : CI correlation for the four target languages
to translations generated by trained translators
[Cluster tree diagram for hard; merge steps and distances:]
minimum distance = 0.249399 ( 1.3 ) ( 2.3 )
minimum distance = 0.434856 ( 1.2 ) ( 1.4 )
minimum distance = 0.555158 ( 1.1 ) ( 2.1 )
minimum distance = 0.602972 ( 1.4 1.2 ) ( 2.3 1.3 )
minimum distance = 0.761327 ( 2.3 1.3 1.4 1.2 ) ( 2.1 1.1 )
Figure 2 : Cluster tree and distance measures for the six senses of hard
[Cluster tree diagram for head; merge steps and distances:]
minimum distance = 0.441022 ( 1.3 ) ( 1.1 )
minimum distance = 0.619052 ( 1.7 ) ( 1.1 1.3 )
minimum distance = 0.723157 ( 1.1 1.3 1.7 ) ( 1.4 )
Figure 3 : Cluster tree and distance measures for the four senses of head
Conclusion
translations for words representing the various
WordNet senses, which provide word groups
The small sample in this study suggests that
similar to WordNet synsets. Interestingly, there
cross-lingual lexicalization can be used to define
is virtually no overlap between the WordNet
and structure sense distinctions. The cluster
synsets and word groups generated from back
translations. The results show, however, that
relations among WordNet senses that could be
sense distinctions useful for natural language
used, for example, to determine the granularity
processing tasks such as machine translation
of sense differences, which in turn could be used
could potentially be determined, or at least
in tasks such as machine translation, information
influenced, by considering this information. The
retrieval, etc. For example, it is likely that as
automatically generated synsets themselves may
sense distinctions become finer, the degree of
also be useful in the same applications where
error is less severe. Resnik and Yarowsky
WordNet synsets (and ontologies) have been
(1997) suggest that confusing finer-grained
sense distinctions should be penalized less
More work needs to be done on the topic of
severely than confusing grosser distinctions
cross-lingual sense determination, utilizing
substantially larger parallel corpora that include
disambiguation systems. The clusters also
a variety of language types as well as texts from
provide insight into the lexicalization of sense
several genres. This small study explores a
distinctions related by various semantic relations
(metonymy, meronymy, etc.) across languages;
for instance, the “part of the body” and
“intellect” senses of head are lexicalized with
Acknowledgements
the same item a significant portion of the time
across all languages, information that could be
The author would like to gratefully acknowledge
used in machine translation. In addition, cluster
the contribution of those who provided the
data such as that presented here could be used in
lexicography, to determine a more detailed
(Romanian); as well as Dana Fleur and Daniel
It is less clear how cross-lingual information can
Kline, who helped to transcribe and evaluate the
be used to determine sense distinctions
independent of a pre-defined set, such as the
Hinrich Schütze for their helpful comments.
WordNet senses used here. In an effort to
References
explore how this might be done, I have used the
small sample from this study to create word
Carletta, Jean (1996). Assessing Agreement on
groupings from “back translations” (i.e.,
Classification Tasks: The Kappa Statistic.
additional translations in the original language
Computational Linguistics, 22(2), 249-254.
of the translations in the target language) and
Dagan, Ido and Itai, Alon (1994). Word sense
developed a metric that uses this information to
determine relatedness between occurrences,
monolingual corpus. Computational Linguistics,
which is in turn used to cluster occurrences into
sense groups. I have also compared sets of back
Dagan, Ido; Itai, Alon; and Schwall, Ulrike (1991).
Language Technology Workshop, San Francisco,
Two languages are more informative than one. Proceedings of the 29th Annual Meeting of the
Melamed, I. Dan. (1997). Measuring Semantic
Association for Computational Linguistics, 18-21
Entropy. ACL-SIGLEX Workshop Tagging Text
June 1991, Berkeley, California, 130-137.
with Lexical Semantics: Why, What, and How?
Dyvik, Helge (1998). Translations as Semantic
April 4-5, 1997, Washington, D.C., 41-46.
Mirrors. Proceedings of Workshop W13:
Miller, George A.; Beckwith, Richard T.; Fellbaum,
Multilinguality in the Lexicon II, The 13th Biennial
Christiane D.; Gross, Derek and Miller, Katherine
European Conference on Artificial Intelligence
J. (1990). WordNet: An on-line lexical database.
(ECAI 98), Brighton, UK, 24-44.
International Journal of Lexicography, 3(4), 235-
Erjavec, Tomaz and Ide, Nancy (1998). The
MULTEXT-EAST Corpus. Proceedings of the
Priest-Dorman, Greg; Erjavec, Tomaz; Ide, Nancy
First International Conference on Language
and Petkevic, Vladimír (1997). Corpus Markup.
Resources and Evaluation, 27-30 May 1998,
Erjavec, Tomaz, Lawson, Ann, and Romary, Laurent
http://nl.ijs.si/ME/CD/docs/mte-d23f/mte-D23F.html.
(1998). East meets West: Producing Multilingual
Resnik, Philip; Broman Olsen, Mari and Diab, Mona
Resources in a European Context. Proceedings of
(1999). Creating a Parallel Corpus from the Book
the First International Conference on Language
of 2000 Tongues. Computers and the Humanities,
Resources and Evaluation, 27-30 May 1998,
Resnik, Philip and Yarowsky, David (submitted).
Fellbaum, Christiane (ed.) (1998). WordNet: An
Distinguishing systems and distinguishing senses:
Electronic Lexical Database. MIT Press,
disambiguation. Submitted to Natural Language
Resnik, Philip and Yarowsky, David (1997). A
disambiguating word senses in a large corpus.
perspective on word sense disambiguation methods
Computers and the Humanities, 26, 415-439.
and their evaluation. ACL-SIGLEX Workshop Tagging Text with Lexical Semantics: Why, What,
disambiguation using local context in large
and How? April 4-5, 1997, Washington, D.C., 79-
corpora. Proceedings of the 7th Annual Conference of the University of Waterloo Centre for the New
Schütze, Hinrich (1992). Dimensions of meaning.
OED and Text Research, Oxford, United Kingdom,
Proceedings of Supercomputing’92. IEEE
Computer Society Press, Los Alamitos, California,
Ide, Nancy and Véronis, Jean (1998). Word sense
disambiguation: The state of the art. Computational
Schütze, Hinrich (1993). Word space. In Hanson,
Stephen J.; Cowan, Jack D.; and Giles, C. Lee
Kilgarriff, Adam and Palmer, Martha, Eds.
(Eds.) Advances in Neural Information Processing
(forthcoming). Proceedings of the Senseval Word
Systems 5, Morgan Kaufmann, San Mateo,
Sense Disambiguation Workshop, Special double
issue of Computers and the Humanities, 33:4-5.
Vossen, Piek (ed.) (1998). EuroWordNet: A
Leacock, Claudia; Towell, Geoffrey and Voorhees,
Multilingual Database with Lexical Semantic
Ellen (1993). Corpus-based statistical sense
Networks. Kluwer Academic Press, Dordrecht.
resolution. Proceedings of the ARPA Human
Reprinted from Computers and the Humanities,
Wilks, Yorick and Stevenson, Mark (1998). Word
Combinations of Knowledge Sources. Proceedings of COLING/ACL-98, Montreal, August, 1998.
Yarowsky, David (1992). Word sense disambiguation
using statistical models of Roget's categories
trained on large corpora. Proceedings of the 14th International Conference on Computational Linguistics, COLING'92, 23-28 August, Nantes,
Yarowsky, David (1993). One sense per collocation. Proceedings of the ARPA Human Language Technology Workshop, Princeton, New Jersey,