1. WP5 Tasks & Deliverables
2. Overview of parallel technology tools
3. Parallel corpora requirements
4. Survey of language resources
5. Work plan for t4-t14
6. Questions
Input: parallel corpora produced in WP4
Output: language resources for MT in WP7/WP8
WP5.1 Sub-sentential alignment (DCU, ELDA, ILSP)
WP5.2 Bilingual dictionary extraction (DCU, ILSP)
• D5.1 (t06): Report describing the inventory of parallel technology
tools to be developed and integrated in PANACEA and the
characteristics of the resources to be produced.
• D5.2 (t14) Aligners integrated into the platform, and documentation
• D5.3 (t22) Parallel, sententially aligned texts, cleaned and prepared
for training/building translational models (20—50 million words)
• D5.4 (t30) Final version of the Bilingual Dictionary Extractor
• D5.5 (t30) Sample of bilingual dictionaries produced: EN—FR and
• D5.6 (t30) Final version of the integrated Transfer Rules module, and
• D5.7 (t30) Sample of transfer rules produced for EN—DE.
• Bilingual dictionary extraction (WP5.2)
Align bilingual corpus (existing or output from WP4)
– Sentence
– Word
– Chunk / Syntactic
• GIZA++, berkeleyaligner
• word packing (“compound rich” languages,
• Marker hypothesis: Marclator
• Syntactic: TreeAligner
– Integrate models: generative, syntactic,
– Extend range of language pairs
– Tune to text type, domain and genre
– Check/filter corpora acquired (comparability
– Baseline: phrase alignment in Moses
– Extrinsic evaluation (SMT in WP7)
Task: to derive bilingual dictionaries from aligned parallel corpus
Methodology
– Expectation-Maximisation algorithm
– Additional techniques on top of word correspondences → precision, fine-cleaning → reduce human intervention
– Go beyond word level: MW translations (NPs, MWEs)
– Baseline: word alignment in Moses
– Evaluation?
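The Expectation-Maximisation step named in the methodology can be illustrated with a toy example. The slides do not say which alignment model is used, so the sketch below assumes IBM Model 1 without a NULL word; the function names `train_ibm1` and `extract_dictionary`, the probability threshold and the three sentence pairs are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate lexical translation probabilities prob[(tgt_word, src_word)] with EM.

    Assumption: IBM Model 1 without a NULL word; `pairs` is a list of
    (source_tokens, target_tokens) tuples from a sentence-aligned corpus.
    """
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    prob = defaultdict(lambda: uniform)  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word
        # E-step: distribute each target word over the source words of its sentence
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    frac = prob[(t, s)] / norm
                    count[(t, s)] += frac
                    total[s] += frac
        # M-step: renormalise the expected counts into probabilities
        prob = defaultdict(
            float, {(t, s): c / total[s] for (t, s), c in count.items()}
        )
    return prob


def extract_dictionary(prob, threshold=0.5):
    """Keep, for each source word, its most probable translation above a threshold."""
    best = {}
    for (t, s), p in prob.items():
        if p >= threshold and p > best.get(s, ("", 0.0))[1]:
            best[s] = (t, p)
    return {s: t for s, (t, _) in best.items()}


# Toy sentence-aligned corpus (illustrative data, not from the project)
pairs = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
dictionary = extract_dictionary(train_ibm1(pairs))
```

The threshold plays the role of a crude precision filter on top of the raw word correspondences; the fine-cleaning and multi-word steps listed above would need additional techniques beyond this sketch.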
• Find criteria for lexical transfer selection
• structural transfer (Probst, Sánchez-Martínez, et al.)
– (matching of POS-sequences, independent of lexical material)
• bilingual term extraction (Cabré 2001, Gamallo 2007)
– structural transfer
– lexical transfer
• simple lexical
• contextual lexical <- this is the task!
conditions for transfer selection
– with domain / subject area information („MEDICAL“)
– with locale / variant („EN_UK“ „DE_CH“)
– use information on local nodes (gender, number)
– use structural contexts (arguments, prepositions, subcategorisation frames & fillers) (main means of RMT)
– use conceptual environment for disambiguation
• using word sense disambiguation, statistical word alignment
• supervised learning of most important disambiguation
1. domain tag assignment
2. morphosyntactic tests
• local features on gender / number
• subcategorisation: Prepositions (for nouns and verbs)
• presence / absence of verb arguments (trans./intrans.)
• (relational Adj <-> compound specifier)
• source language concept clusters (SMT uses target
– Selection of disambiguation candidates (N, V, A)
– Creation of parallel corpora
– Creation of subcorpora for each translation
1. domain tags: do subcorpora differ in domain?
2. morphosyntactic:
• gender: do they differ in gender? in number?
• arguments: do they differ in transitivity? in subcategorised prepositions?
3. conceptual: Can different SL concept clusters be built to
• Verification with additional candidates or data
– Sentence Segmentiser, Tokeniser, Dictionary Lookup
• Parser to extract annotated subtrees
• Tree matching component
• target-sensitive word sense disambiguation
– similar for the target side …) (if time permits)
Quality:
– truly parallel (not comparable) corpora aligned at sentence level
– translation quality of aligned sentence pairs is essential for MT output
Linguistic pre-processing:
– tokenized plain text (plain PB-SMT)
– POS tagging, lemmatization (factored PB-SMT, EBMT)
– constituency and dependency parsing (syntax motivated PB-SMT)
– for a baseline system: at least 1M sentence pairs (~20M words)
– for domain adaptation: 20K-200K sentence pairs (~400K-4M words)
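The size figures above imply roughly 20 words per sentence pair (1M pairs for about 20M words). A minimal sketch of that arithmetic follows; the names `estimated_words` and `corpus_role` are hypothetical, and treating everything between 200K and 1M pairs as adaptation data is an assumption of this example, not something the slides state.

```python
# Implied by the figures above: 1M sentence pairs ~ 20M words
WORDS_PER_PAIR = 20

BASELINE_MIN_PAIRS = 1_000_000   # "at least 1M sentence pairs"
ADAPTATION_MIN_PAIRS = 20_000    # lower end of "20K-200K sentence pairs"


def estimated_words(sentence_pairs, words_per_pair=WORDS_PER_PAIR):
    """Rough word count for a corpus of the given number of sentence pairs."""
    return sentence_pairs * words_per_pair


def corpus_role(sentence_pairs):
    """Classify a corpus by size against the thresholds on the slide.

    Assumption: corpora between the two thresholds count as adaptation data.
    """
    if sentence_pairs >= BASELINE_MIN_PAIRS:
        return "baseline"
    if sentence_pairs >= ADAPTATION_MIN_PAIRS:
        return "domain adaptation"
    return "too small"
```

For example, `estimated_words(200_000)` gives 4,000,000, matching the upper end of the domain adaptation range.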
EuroParl * JRC Acquis * News Commentary United Nations English-French OpenSubtitles
- numbers in millions of words from English to the target language
- in corpora denoted by * all language pairs available
News (WMT) Gigaword ILSP EL corpus
- numbers in millions of words
- monolingual parts of the parallel corpora also available
• A number of standard monolingual and parallel corpora available
for all language pairs, of sufficient size & quality
• Parliamentary proceedings and debates can be considered
• Monolingual web-crawled corpora available for English, French,
German, Italian (WaCky) – unspecified domain
• No web-crawled parallel data available at all (Resnik's Strand is
only a list of URLs, but quite outdated) – no fallback strategy
• EuroParl for baseline systems
– parliamentary proceedings and debates
– quite general domain suitable for adaptation
• Evaluation data to be selected as a subset from
webcrawled in-domain data (including 500-2000 sentence pairs for test set and dev test set)
• Focus on translation from English to other languages
Official deadlines:
– t6 Report on parallel technology tools (D5.1)
– t14 Aligners integrated in the platform (D5.2)
Internal deadlines:
– t6 decision on MT language pairs and domains
– t12 resources to be included in the first evaluation produced (D4.3)
Assumption: general and in-domain monolingual and
Possible approaches:
– one system built from a mixture of the data
– two systems and a domain classifier (for sentences)
– two systems and system combination based on their n-
• Distribution of webservices across partners?
• Software requirements for webservices?
• Hardware specifications (no HW budget)?
• Example webservice wrapper?
• Rich text format support?
• Duplicate document/sentence detection?
• Distribution of webservices?
– TPC tools for one language on one site?
• MT tools integrated into the platform?
– alignment OK
– language modelling?
– phrase table extraction?
– Decoding?
– tuning?
• Only extrinsic automatic evaluation feasible
• Only extrinsic (MT) evaluation feasible