cs.vassar.edu

Parallel Translations as Sense Discriminators

Abstract

[...] equivalents in four languages from different language families, extracted from an on-line parallel corpus of George Orwell's Nineteen Eighty-Four. The goal of the study is to determine the degree to which translation [...] polysemous word in English are lexicalized differently across a variety of languages, and to determine whether this information can be used to structure or create a set of sense [...] processing applications. A coherence index is computed that measures the tendency for different senses of the same English word to be lexicalized differently, and from this data a clustering algorithm is used to create sense [...].

Introduction

It is well known that the most nagging issue for [...] definition of just what a word sense is. At its [...] linguistic one that is far from being resolved. [...] processing has led to efforts to find practical means to distinguish word senses, at least to the degree that they are useful for natural language [...] document retrieval, and machine translation. [...] exploited to automatically determine the sense of a word in context (see Ide and Véronis, 1998), including syntactic behavior, semantic and pragmatic knowledge, and especially in more recent empirical studies, word co-occurrence within syntactic relations (e.g., Hearst, 1991; Yarowsky, 1993), words co-occurring in global context (e.g., Gale et al., 1993; Yarowsky, 1992; Schütze, 1992, 1993), etc. No clear criteria have emerged, however, and the problem continues to [...].

The notion that cross-lingual comparison can be useful for sense disambiguation has served as a [...] example, Brown et al. (1991) and Gale et al. (1992a, 1993) used the parallel, aligned Hansard Corpus of Canadian Parliamentary debates for WSD, and Dagan et al. (1991) and Dagan and Itai (1994) used monolingual corpora of Hebrew and German and a bilingual dictionary. These studies rely on the assumption that the mapping [...] significantly among languages. For example, the word duty in English translates into French as devoir in its obligation sense, and impôt in its tax sense. By determining the translation equivalent of duty in a parallel French text, the correct sense of the English word is identified. These studies exploit this information in order to gather co-occurrence data for the different senses, which is then used to disambiguate new texts. In related work, Dyvik (1998) used patterns of translational relations in an English-[...] University) to define semantic properties such as synonymy, ambiguity, vagueness, and semantic fields and suggested a derivation of semantic representations for signs (e.g., lexemes), [...] hyponymy etc., from such translational relations.

Resnik and Yarowsky (1997) suggested that for the purposes of WSD, the different senses of a word could be determined by considering only sense distinctions that are lexicalized cross-linguistically. In particular, they propose that some set of target languages be identified, and that the sense distinctions to [...] applications and evaluation be restricted to those that are realized lexically in some minimum subset of those languages. This idea would seem to provide an answer, at least in part, to the problem of determining different senses of a word: intuitively, one assumes that if another language lexicalizes a word in two or more ways, there must be a conceptual motivation. If [...] likely to find the significant lexical differences that delimit different senses of a word.

[...] questions. For instance, it is well known that [...] languages (for example, the French intérêt and the English interest), especially languages that are relatively closely related. Assuming this problem can be overcome, should differences found in closely related languages be given lesser (or greater) weight than those found in [...] generally, which languages should be considered for this exercise? All languages? Closely related languages? Languages from different language [...] "enough" to provide adequate information for [...]

There is also the question of the criteria that [...] distinction is "lexicalized cross-linguistically". How consistent must the distinction be? Does it [...] mutually non-interchangeable lexical items in some significant number of other languages, or need it only be the case that the option of a different lexicalization exists in a certain [...]

Another consideration is where the cross-lingual information to answer these questions would come from. Using bilingual dictionaries would be extremely tedious and error-prone, given the substantial divergence among dictionaries in [...] distinctions they make. Resnik and Yarowsky (1997) suggest EuroWordNet (Vossen, 1998) as a possible source of information, but, given that EuroWordNet is primarily a lexicon and not a corpus, it is subject to many of the same objections as bilingual dictionaries.

[...] information from parallel, aligned corpora. Unlike bilingual and multi-lingual dictionaries, translation equivalents in parallel texts are determined by experienced translators, who evaluate each instance of a word's use in context rather than as a part of the meta-linguistic activity of classifying senses for inclusion in a dictionary. However, at present very few parallel aligned corpora exist. The vast majority of these are bi-texts, involving only two languages, one of which is very often English. Ideally, a serious evaluation of Resnik and Yarowsky's proposal would include parallel texts in languages from several different language families, and, to maximally ensure that the word in question is used in the exact same sense across languages, it would be preferable that the same text were used over all languages in the study. The only currently available parallel corpora for more than two languages are Orwell's Nineteen Eighty-Four (Erjavec and Ide, 1998), Plato's Republic (Erjavec, et al., 1998), the MULTEXT Journal of the Commission corpus (Ide and Véronis, 1994), and the Bible (Resnik, et al., in press). It is likely that these corpora do not provide enough appropriate data to reliably determine sense distinctions. Also, it is not clear how the lexicalization of sense distinctions across languages is affected by genre, domain, [...]

This paper attempts to provide some preliminary answers to the questions outlined above, in order to eventually determine the degree to which the use of parallel data is viable to determine sense distinctions, and, if so, the ways in which this information might be used. Given the lack of large parallel texts across multiple languages, the study is necessarily limited; however, close examination of a small sample of parallel data can, as a first step, provide the basis and direction for more extensive studies.

Methodology

I have conducted a small study using parallel, aligned versions of George Orwell's Nineteen Eighty-Four (Erjavec and Ide, 1998) in five languages: English, Estonian, Slovene, Romanian, and Czech.1 The study therefore involves languages from four language families (Germanic, Slavic, Finno-Ugric, and Romance), two languages from the same family (Czech and Slovene), as well as one non-Indo-European language (Estonian).

1 The Orwell parallel corpus also includes versions of Nineteen Eighty-Four in Hungarian, Bulgarian, Latvian, Lithuanian, Serbian, and Russian.

Nineteen Eighty-Four is a text of about 100,000 words, translated directly from the original English in each of the other languages. The parallel versions of the text are sentence-aligned to the English and tagged for part of speech. Although Nineteen Eighty-Four is a work of fiction, Orwell's prose is not highly stylized and, as such, it provides a reasonable sample of modern, ordinary language that is not tied to a given topic or sub-domain (such as newspapers, technical reports, etc.). Furthermore, the translations of the text seem to be relatively faithful to the original: for instance, over 95% of the sentence alignments in the full parallel [...] (Priest-Dorman, et al., 1997).

Nine ambiguous English words were considered: hard, head, country, line, promise, slight, seize, scrap, float. The first four were chosen because they have been used in other disambiguation studies; the latter five were chosen from among the words used in the Senseval disambiguation exercise (Kilgarriff and Palmer, forthcoming). In all cases, the study was necessarily limited to words that occurred frequently enough in the Orwell text to warrant consideration.

Five hundred forty-two sentences containing an occurrence (including morphological variants) of each of the nine words were extracted from the English text, together with the parallel sentences in which they occur in the texts of the four comparison languages (Czech, Estonian, Romanian, Slovene). As Wilks and Stevenson (1998) have pointed out, part-of-speech tagging accomplishes a good portion of the work of semantic disambiguation; therefore occurrences of words that appeared in the data in more than one part of speech were grouped separately.2 The English occurrences were then grouped according to the sense distinctions in WordNet (version 1.6; Miller et al., 1990; Fellbaum, 1998). The sense categorization was performed by the author and two student assistants; results from the three were compared and a final, mutually agreeable set of sense assignments [...]

2 The adjective and adverb senses of hard are considered together because the distinction is not consistent across the translations used in the study.

For each of the four comparison languages, the corpus of sense-grouped parallel sentences was sent to a linguist and native speaker of the comparison language. The linguists were asked to provide the lexical item in each parallel sentence that corresponds to the ambiguous English word. If inflected, they were asked to provide both the inflected form and the root form. In addition, the linguists were asked to indicate the type of translation, according to the [...]

Table 1 : Translation types and their frequencies (frequencies not preserved)
1. A single lexical item is used to translate the English equivalent (possibly a different part of speech)
2. The English word is translated by a phrase of two or more words or a compound, which has the same [...]
3. The English word is not lexicalized in the translation
4. A pronoun is substituted for the English word in the translation
5. An English phrase containing the ambiguous word is translated by a single word in the comparison language which has a broader or more specific meaning, or by a phrase in which the specific concept corresponding to the English word is not explicitly lexicalized

For over 85% of the English word occurrences (corresponding to types 1 and 2 in Table 1), a specific lexical item or items could be identified [...] corresponding English word. For comparison purposes, each translation equivalent was represented by its lemma (or the lemma of the root form in the case of derivatives) and associated with the WordNet sense to which it [...]

In order to determine the degree to which the assigned sense distinctions correspond to translation equivalents, a coherence index (CI) was computed that measures how often each pair of senses is translated using the same word as well as the consistency with which a given sense is translated with the same word.3 Note that the CIs do not determine whether or not a sense distinction can be lexicalized in the target language, but only the degree to which they are lexicalized differently in the translated text. However, it can be assumed that the CIs provide a measure of the tendency to lexicalize different WordNet senses differently, which can in turn be seen as an indication of the degree to which [...]

3 Note that the CI is similar to semantic entropy (Melamed, 1997). However, Melamed computes entropy for word types, rather than word senses.

For each ambiguous word, the CI is computed [...]

• n is the number of comparison languages [...]
• m_q and m_r are the number of occurrences of sense s_q and sense s_r in the English corpus, [...]
• [...] is the number of times that senses q and r are translated by the same lexical item [...]

The CI is a value between 0 and 1, computed by examining clusters of occurrences translated by the same word in the other languages. If sense i and sense j are consistently translated with the same word in each comparison language, then CI(s_i, s_j) = 1; if they are translated with a different word in every occurrence, CI(s_i, s_j) = 0. In general, the CI for pairs of different senses provides an index of their relatedness; i.e., the greater the value of CI(s_i, s_j), the more frequently occurrences of sense i and sense j are translated with the same lexical item. When i = j, we obtain a measure of the coherence of a given [...]

[...] individually as well as for different language [...] (three different language families): Czech and Slovene [...] (Indo-European); and Estonian (non-[...]

To better visualize the relationship between senses, a hierarchical clustering algorithm was applied to the CI data to generate trees reflecting sense proximity.4 Finally, in order to determine the degree to which the linguistic relation between languages may affect coherence, a correlation was run among CIs for all pairs of [...]

Results

Although the data sample is small, it gives some insight into ways in which a larger sample might [...]

For example, Table 2 gives the senses of hard and head that occurred in the data.5 The CI data for hard and head are given in Tables 3 and 4.

5 Results for all words in the study are available at http://www.cs.vassar.edu/~ide/wsd/cross-ling.html.

Table 2 : WordNet senses of hard and head (only one gloss survives: "not yielding to pressure; vs. 'soft'")

CIs measuring the affinity of a sense with itself (that is, the tendency for all occurrences of that sense to be translated with the same word) show that all of the six senses of hard have greater internal consistency than affinity with other senses, with senses 1.1 ("difficult", CI = .56) and 1.3 ("not soft", CI = .63) registering the highest internal consistency.6 The same holds true for three of the four senses of head, while the CI for senses 1.3 ("intellect") and 1.1 ("part of the body") is higher than the CI [...]

6 Senses 2.3 and 1.4 have CIs of 1 because each of these senses exists in a single occurrence in the corpus, and have therefore been discarded from consideration of CIs for individual senses. We are currently investigating the use of the Kappa statistic (Carletta, 1996) to normalize these sparse data.

Figure 2 shows the sense clusters for hard generated from the CI data.7 The senses fall into two main clusters, with the two most internally consistent senses (1.1 and 1.3) at the deepest level of each of the respective groups. The two adverbial forms8 are placed in separate groups, reflecting their semantic proximity to the different adjectival meanings of hard. The clusters for head (Figure 3) similarly show two distinct groupings, each anchored in the two senses with the highest internal consistency and the lowest mutual CI ("part of the body" (1.1) [...]

7 For the purposes of the cluster analysis, CIs of 1.00 resulting from a single occurrence were normalized [...]

8 Because root forms were used in the analysis, no distinction in translation equivalents was made for [...]

The hierarchies apparent in the cluster graphs make intuitive sense. Structured like dictionary entries, the clusters for hard and head might appear as in Figure 1. This is not dissimilar to actual dictionary entries for hard and head; for example, the entries for hard in four differently constructed dictionaries (Collins English (CED), Longman's (LDOCE), Oxford Advanced Learner's (OALD), and COBUILD) all list the "difficult" and "not soft" senses first and second, which, since most dictionaries list the most common or frequently used senses first, reflects the gross division apparent in the clusters. Beyond this, it is difficult to assess the [...] remaining WordNet senses are scattered at various places within the entries or, in some [...] hierarchical relations apparent in the clusters are not reflected in the dictionary entries, since the senses are for the most part presented in flat, linear lists. However, it is interesting to note that the first five senses of hard in the COBUILD dictionary, which is the only dictionary in the [...] examples9 and presents senses in order of [...] WordNet senses in this study. WordNet's "metaphorically hard" is spread over multiple senses in the COBUILD, as it is in the other [...]

9 Editions of the LDOCE (1987 version) and OALD (1985 version) dictionaries consulted in this study pre-date editions of those same dictionaries based on [...]

Figure 1 : Clusters for hard and head structured as [...]

The results for different language groupings show that the tendency to lexicalize senses differently is not affected by language distance (Table 5). In fact, the mean CI for Estonian, the only non-Indo-European language in the study, is lower than that for any other group, indicating that WordNet sense distinctions are slightly less likely to be lexicalized differently in Estonian. Correlations of CIs for each language pair (Table 5) also show no relationship between the [...] lexicalized differently and language distance.
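As an illustration of the coherence index used in these results, the following is a minimal sketch, not the author's implementation. The exact formula did not survive in this copy, so the sketch assumes one reading that is consistent with the surviving definitions (n, m_q, m_r, and a count of same-item translations) and with the stated boundary cases (CI = 1 for consistent translation, CI = 0 when every occurrence differs, CI = 1 for a sense with a single occurrence): count, in each comparison language, the pairs of occurrences of the two senses that share a lemma, and divide by m_q * m_r * n. The function name, the data layout, and the second language code "xx" with its lemma are hypothetical; the duty/devoir/impôt example is taken from the Introduction.

```python
from itertools import product


def coherence_index(occurrences, sense_q, sense_r, languages):
    """Assumed form of CI(s_q, s_r): share of occurrence pairs with the same lemma.

    occurrences: list of (sense_label, {language: lemma}) tuples, one per
    English occurrence of the ambiguous word.
    """
    lemmas_q = [lemmas for sense, lemmas in occurrences if sense == sense_q]
    lemmas_r = [lemmas for sense, lemmas in occurrences if sense == sense_r]
    m_q, m_r, n = len(lemmas_q), len(lemmas_r), len(languages)
    if m_q == 0 or m_r == 0 or n == 0:
        raise ValueError("both senses need occurrences and at least one language")
    # Count every (occurrence of q, occurrence of r) pair, in every language,
    # whose translations share a lemma. For q == r this includes self-pairs,
    # so a sense with a single occurrence gets CI = 1.
    same = sum(
        1
        for lang in languages
        for t_q, t_r in product(lemmas_q, lemmas_r)
        if t_q[lang] == t_r[lang]
    )
    return same / (m_q * m_r * n)


# Toy data (hypothetical): English "duty" with French lemmas from the text
# and an invented second language "xx" that does not separate the senses.
LANGUAGES = ["fr", "xx"]
OCCURRENCES = [
    ("obligation", {"fr": "devoir", "xx": "w1"}),
    ("obligation", {"fr": "devoir", "xx": "w1"}),
    ("tax", {"fr": "impôt", "xx": "w1"}),
]

if __name__ == "__main__":
    print(coherence_index(OCCURRENCES, "obligation", "tax", LANGUAGES))         # 0.5
    print(coherence_index(OCCURRENCES, "obligation", "obligation", LANGUAGES))  # 1.0
```

In the toy data the two senses share a lemma in one of the two languages, so the cross-sense CI is 0.5, while each sense on its own is translated consistently and has CI = 1.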
The absence of a language-distance effect is contrary to results obtained by Resnik and Yarowsky (submitted), who, using a metric similar to the one used in this study, found that non-Indo-European languages tended to lexicalize English sense distinctions more than Indo-European languages, especially at finer-grained levels. However, their translation data was generated by native speakers presented with isolated sentences in English, who were asked to provide the translation for a given word in the sentence. It is not clear how this data compares to translations generated by trained translators [...]

Table 5 (caption and values not preserved; column headings: Langs., Hard, Country, Line, Head, Ave.)

Table 6 : CI correlation for the four target languages (values not preserved)

  ___________________________|--------------------> 2.1
 |                           |--------------------> 1.1
-|                            _____________________|---------> 2.3
 |---------------------------|                     |---------> 1.3
                             |_____________________|---------------> 1.4
                                                   |---------------> 1.2

minimum distance = 0.249399 ( 1.3 ) ( 2.3 )
minimum distance = 0.434856 ( 1.2 ) ( 1.4 )
minimum distance = 0.555158 ( 1.1 ) ( 2.1 )
minimum distance = 0.602972 ( 1.4 1.2 ) ( 2.3 1.3 )
minimum distance = 0.761327 ( 2.3 1.3 1.4 1.2 ) ( 2.1 1.1 )

Figure 2 : Cluster tree and distance measures for the six senses of hard

 |-------------------------> 1.4
-|                          ______________________|----------------> 1.1
 |-------------------------|                      |----------------> 1.3
                           |----------------------> 1.7

minimum distance = 0.441022 ( 1.3 ) ( 1.1 )
minimum distance = 0.619052 ( 1.7 ) ( 1.1 1.3 )
minimum distance = 0.723157 ( 1.1 1.3 1.7 ) ( 1.4 )

Figure 3 : Cluster tree and distance measures for the four senses of head

Conclusion
The small sample in this study suggests that cross-lingual lexicalization can be used to define and structure sense distinctions. The cluster [...] relations among WordNet senses that could be used, for example, to determine the granularity of sense differences, which in turn could be used in tasks such as machine translation, information retrieval, etc. For example, it is likely that as sense distinctions become finer, the degree of error is less severe. Resnik and Yarowsky (1997) suggest that confusing finer-grained sense distinctions should be penalized less severely than confusing grosser distinctions [...] disambiguation systems. The clusters also provide insight into the lexicalization of sense distinctions related by various semantic relations (metonymy, meronymy, etc.) across languages; for instance, the "part of the body" and "intellect" senses of head are lexicalized with the same item a significant portion of the time across all languages, information that could be used in machine translation. In addition, cluster data such as that presented here could be used in lexicography, to determine a more detailed [...]

It is less clear how cross-lingual information can be used to determine sense distinctions independent of a pre-defined set, such as the WordNet senses used here. In an effort to explore how this might be done, I have used the small sample from this study to create word groupings from "back translations" (i.e., additional translations in the original language of the translations in the target language) and developed a metric that uses this information to determine relatedness between occurrences, which is in turn used to cluster occurrences into sense groups. I have also compared sets of back translations for words representing the various WordNet senses, which provide word groups similar to WordNet synsets. Interestingly, there is virtually no overlap between the WordNet synsets and word groups generated from back translations. The results show, however, that sense distinctions useful for natural language processing tasks such as machine translation could potentially be determined, or at least influenced, by considering this information. The automatically generated synsets themselves may also be useful in the same applications where WordNet synsets (and ontologies) have been [...]

More work needs to be done on the topic of cross-lingual sense determination, utilizing substantially larger parallel corpora that include a variety of language types as well as texts from several genres. This small study explores a [...]

Acknowledgements

The author would like to gratefully acknowledge the contribution of those who provided the [...] (Romanian); as well as Dana Fleur and Daniel Kline, who helped to transcribe and evaluate the [...] Hinrich Schütze for their helpful comments.

References

Carletta, Jean (1996). Assessing Agreement on Classification Tasks: The Kappa Statistic. Computational Linguistics, 22(2), 249-254.

Dagan, Ido and Itai, Alon (1994). Word sense [...] monolingual corpus. Computational Linguistics, [...]

Dagan, Ido; Itai, Alon; and Schwall, Ulrike (1991). Two languages are more informative than one. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, 18-21 June 1991, Berkeley, California, 130-137.

Dyvik, Helge (1998). Translations as Semantic Mirrors. Proceedings of Workshop W13: Multilinguality in the Lexicon II, The 13th Biennial European Conference on Artificial Intelligence (ECAI 98), Brighton, UK, 24-44.

Erjavec, Tomaz and Ide, Nancy (1998). The MULTEXT-EAST Corpus. Proceedings of the First International Conference on Language Resources and Evaluation, 27-30 May 1998, [...]

Erjavec, Tomaz; Lawson, Ann; and Romary, Laurent (1998). East meets West: Producing Multilingual Resources in a European Context. Proceedings of the First International Conference on Language Resources and Evaluation, 27-30 May 1998, [...]

Fellbaum, Christiane (ed.) (1998). WordNet: An Electronic Lexical Database. MIT Press, [...]

Gale et al. (1993). [...] disambiguating word senses in a large corpus. Computers and the Humanities, 26, 415-439.

Hearst (1991). [...] disambiguation using local context in large corpora. Proceedings of the 7th Annual Conference of the University of Waterloo Centre for the New OED and Text Research, Oxford, United Kingdom, [...]

Ide, Nancy and Véronis, Jean (1998). Word sense disambiguation: The state of the art. Computational [...]

Kilgarriff, Adam and Palmer, Martha, Eds. (forthcoming). Proceedings of the Senseval Word Sense Disambiguation Workshop, Special double issue of Computers and the Humanities, 33:4-5.

Leacock, Claudia; Towell, Geoffrey and Voorhees, Ellen (1993). Corpus-based statistical sense resolution. Proceedings of the ARPA Human Language Technology Workshop, San Francisco, [...]

Melamed, I. Dan (1997). Measuring Semantic Entropy. ACL-SIGLEX Workshop Tagging Text with Lexical Semantics: Why, What, and How? April 4-5, 1997, Washington, D.C., 41-46.

Miller, George A.; Beckwith, Richard T.; Fellbaum, Christiane D.; Gross, Derek and Miller, Katherine J. (1990). WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 235-[...]

Priest-Dorman, Greg; Erjavec, Tomaz; Ide, Nancy and Petkevic, Vladimír (1997). Corpus Markup. [...] http://nl.ijs.si/ME/CD/docs/mte-d23f/mte-D23F.html.

Resnik, Philip; Broman Olsen, Mari and Diab, Mona (1999). Creating a Parallel Corpus from the Book of 2000 Tongues. Computers and the Humanities, [...]

Resnik, Philip and Yarowsky, David (1997). A perspective on word sense disambiguation methods and their evaluation. ACL-SIGLEX Workshop Tagging Text with Lexical Semantics: Why, What, and How? April 4-5, 1997, Washington, D.C., 79-[...]

Resnik, Philip and Yarowsky, David (submitted). Distinguishing systems and distinguishing senses: [...] disambiguation. Submitted to Natural Language [...]

Schütze, Hinrich (1992). Dimensions of meaning. Proceedings of Supercomputing'92. IEEE Computer Society Press, Los Alamitos, California, [...]

Schütze, Hinrich (1993). Word space. In Hanson, Stephen J.; Cowan, Jack D.; and Giles, C. Lee (Eds.) Advances in Neural Information Processing Systems 5, Morgan Kaufmann, San Mateo, [...]

Vossen, Piek (ed.) (1998). EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Press, Dordrecht. Reprinted from Computers and the Humanities, [...]

Wilks, Yorick and Stevenson, Mark (1998). Word [...] Combinations of Knowledge Sources. Proceedings of COLING/ACL-98, Montreal, August, 1998.

Yarowsky, David (1992). Word sense disambiguation using statistical models of Roget's categories trained on large corpora. Proceedings of the 14th International Conference on Computational Linguistics, COLING'92, 23-28 August, Nantes, [...]

Yarowsky, David (1993). One sense per collocation. Proceedings of the ARPA Human Language Technology Workshop, Princeton, New Jersey, [...]
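Illustrative sketch (not part of the original paper): the Methodology states that a hierarchical clustering algorithm was applied to the CI data, and Figures 2 and 3 report a "minimum distance" at each merge. The paper does not say which distance or linkage was used, so the following is only one plausible reading: single-linkage agglomerative clustering over a distance assumed to be 1 - CI. The sense labels and CI values are invented for the example, and the function name is hypothetical.

```python
def single_linkage(labels, distances):
    """Agglomerative clustering that always merges the two closest clusters.

    distances: dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the merges in order as (distance, cluster_a, cluster_b) tuples.
    """
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance is the closest member pair.
                d = min(
                    distances[frozenset((a, b))]
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical CI values for three senses; NOT data from the study.
LABELS = ["s1", "s2", "s3"]
HYPOTHETICAL_CI = {("s1", "s2"): 0.6, ("s1", "s3"): 0.2, ("s2", "s3"): 0.3}
# Assumption: distance = 1 - CI, so frequently co-translated senses are close.
DISTANCES = {frozenset(pair): 1.0 - ci for pair, ci in HYPOTHETICAL_CI.items()}

if __name__ == "__main__":
    for distance, left, right in single_linkage(LABELS, DISTANCES):
        print(f"minimum distance = {distance:.3f} {left} {right}")
```

With these invented values, s1 and s2 merge first and s3 joins last, which mirrors the shape of the merge listings printed under the figures.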

Source: http://www.cs.vassar.edu/~ide/papers/siglex99.pdf
