<HTML>
<HEAD>
<title>Biomedical Translation Task - ACL 2016 First Conference on Machine Translation</title>
<style> h3 { margin-top: 2em; } </style>
</HEAD>
<body>
<center>
<script src="title.js"></script>
<p><h2>Shared Task: Biomedical Translation Task</h2></p>
<script src="menu.js"></script>
</center>
<h3>Task description</h3>
<p>This is a new task that aims to evaluate systems on the translation of scientific publications in the biological and health domains.
The documents were retrieved from the <a href="http://www.scielo.org">Scielo</a> database of scientific publications.
The biomedical translation task will address the following language pairs:</p>
<ul>
<li> English-French and French-English</li>
<li> English-Spanish and Spanish-English</li>
<li> English-Portuguese and Portuguese-English </li>
</ul>
<h3>Data</h3>
<p>We will make available parallel corpora for the above three language pairs, as well as monolingual corpora for each of the four languages.
The documents for both the parallel and the monolingual corpora were retrieved from the Scielo database.
Each document consists of a title, an abstract, or both, depending on their availability in the database.
Additionally, we will also make available a parallel corpus of <a href="http://www.ncbi.nlm.nih.gov/pubmed">Medline</a> titles.</p>
<p>All files are available in the <a href="https://drive.google.com/open?id=0B3UxRWA52hBja0t2azlkN3d2elk">WMT'16 biomedical task Google Drive account</a>.</p>
<h4>Parallel corpora from Scielo</h4>
<p>The parallel documents from the Scielo database are located in the "scielo" folder. There is no parallel dataset for the biological domain for the FR/EN language pair. For this pair, please use out-of-domain corpora or the health and Medline datasets as training data.</p>
<table border="1">
<tr><td>Dataset</td><td>ES/EN</td><td>FR/EN</td><td>PT/EN</td></tr>
<tr><td>Biological</td><td>es-en-training-biological.xml.gz</td><td>-</td><td>pt-en-training-biological.xml.gz</td></tr>
<tr><td>Health</td><td>es-en-training-health.xml.gz</td><td>fr-en-training-health.xml.gz</td><td>pt-en-training-health.xml.gz</td></tr>
</table>
<p>The Scielo corpus is available in the <a href="http://bioc.sourceforge.net/">BioC XML format</a>, for which readers and writers are available for
many programming languages, as well as various natural language processing tools for biomedicine.
The "key" attribute of the XML tag "infon" takes specific values that identify the language of each document ("language"), the section, title or abstract ("section"), and the number of the sentence ("sentnum"), as illustrated in the example below:</p>
<xmp>
<document>
<id>S0034-77441998000200003</id>
<passage>
<infon key="language">EN</infon>
<infon key="section">abstract</infon>
<sentence>
<infon key="sentnum">0</infon>
<text>The gastrointestinal activity of an aqueous extract of the dry wood of Quassia amara was investigated using animal models. </text>
</sentence>
<sentence>
<infon key="sentnum">1</infon><offset>-1</offset><text> Oral administration of the extract to mice produces an increase of gastrointestinal transit at doses of
500 and 1000 mg/kg. The antiulcerogenic activity was measured inducing ulcers on Sprague-Dowly rats with indomethacin or ethanol and by the induction of stress.</text>
</sentence>
<sentence>
<infon key="sentnum">2</infon>
<text> The experimental group was treated orally with the extract, using doses of 250, 500 and 1000 mg/kg before inducing the ulcers.</text>
</sentence>
...
</passage>
</document>
</xmp>
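<p>The structure above can be read with any XML parser. The following is a minimal Python sketch, not an official reader: it uses only the standard library, and the helper names (infon, iter_sentences, build_sample) are hypothetical. It assumes the "language" and "section" infons sit directly under each passage and "sentnum" directly under each sentence, as in the example.</p>

```python
import xml.etree.ElementTree as ET


def infon(element, key):
    """Return the text of the direct child infon with the given key, or None."""
    for child in element.findall("infon"):
        if child.get("key") == key:
            return child.text
    return None


def iter_sentences(root):
    """Yield (doc_id, language, section, sentnum, text) for every sentence."""
    # Element.iter() also matches the root itself, so a single document
    # element and a larger tree containing many documents both work.
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.findall("passage"):
            language = infon(passage, "language")
            section = infon(passage, "section")
            for sentence in passage.findall("sentence"):
                sentnum = int(infon(sentence, "sentnum"))
                text = (sentence.findtext("text") or "").strip()
                yield doc_id, language, section, sentnum, text


def build_sample():
    """Build a tiny in-memory document shaped like the example above."""
    document = ET.Element("document")
    ET.SubElement(document, "id").text = "S0034-77441998000200003"
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "infon", key="language").text = "EN"
    ET.SubElement(passage, "infon", key="section").text = "abstract"
    sentence = ET.SubElement(passage, "sentence")
    ET.SubElement(sentence, "infon", key="sentnum").text = "0"
    ET.SubElement(sentence, "text").text = (
        "The gastrointestinal activity of an aqueous extract of the dry wood "
        "of Quassia amara was investigated using animal models. "
    )
    return document


rows = list(iter_sentences(build_sample()))
for row in rows:
    print(row)
```

<p>For a downloaded corpus file, the root element could instead be obtained with ET.parse(gzip.open("es-en-training-health.xml.gz")).getroot(), after importing gzip.</p>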
<h4>Aligned parallel corpora from Scielo</h4>
<p>We have aligned the documents from the Scielo database with the <a href="http://nlp.cs.nyu.edu/GMA/">GMA tool</a>.
The files derived from this alignment are located in the "scielo-gma" folder and include the following files for each section of the document
(title and abstract/text):</p>
<ul>
<li>*.crp: aligned sentences</li>
<li>*.simr: GMA's word alignment file</li>
<li>*.align: GMA's sentence alignment file</li>
<li>*.txt.axis: GMA's axis file, one per language</li>
<li>*.txt: plain text file, one per language</li>
</ul>
<table border="1">
<tr><td>Dataset</td><td>ES/EN</td><td>FR/EN</td><td>PT/EN</td></tr>
<tr><td>Biological</td><td>es-en-gma-biological.tar.gz</td><td>-</td><td>pt-en-gma-biological.tar.gz</td></tr>
<tr><td>Health</td><td>es-en-gma-health.tar.gz</td><td>fr-en-gma-health.tar.gz</td><td>pt-en-gma-health.tar.gz</td></tr>
</table>
<h4>Parallel corpora from Medline</h4>
<p>The Medline documents are located in the "medline" folder.</p>
<table border="1">
<tr><td>Dataset</td><td>ES/EN</td><td>FR/EN</td><td>PT/EN</td></tr>
<tr><td>Medline</td><td>pubmed_en_es.txt.zip</td><td>pubmed_en_fr.txt.zip</td><td>pubmed_en_pt.txt.zip</td></tr>
</table>
<h4>Monolingual corpora from Scielo</h4>
<p>The monolingual documents will be located in the "scielo-monolingual" folder.</p>
<h4>Out-of-domain corpora</h4>
<p>For out-of-domain corpora, please check other machine translation tasks in the WMT'16 challenge, such as <a href="translation-task.html">news</a> and
<a href="it-translation-task.html">IT</a>.</p>
<h3>Evaluation</h3>
<p>Evaluation will be carried out both automatically and manually.
Automatic evaluation will make use of standard machine translation metrics, such as BLEU and/or METEOR.
Native speakers of each language will manually check the quality of the translation for a small sample of the submissions.
The <a href="http://www.appraise.cf/">Appraise system</a> will be used for this purpose.</p>
<h3>Submission format</h3>
<p>The training data and the test data are available in the <a href="http://bioc.sourceforge.net/">BioC</a> format.
More information about BioC, as well as readers and writers for many programming languages, can be found on the <a href="http://bioc.sourceforge.net/">BioC web site</a>.</p>
<p>An example of the test set format is shown below for the English to Spanish (en2es) language pair:</p>
<xmp>
<document>
<id>S123456789</id>
<passage>
<infon key="language">EN</infon>
<infon key="section">title</infon>
<offset>-1</offset>
<sentence>
<infon key="sentnum">0</infon>
<offset>-1</offset>
<text>title sentence</text>
</sentence>
</passage>
<passage>
<infon key="language">EN</infon>
<infon key="section">abstract</infon>
<offset>-1</offset>
<sentence>
<infon key="sentnum">0</infon>
<offset>-1</offset>
<text>sentence 0</text>
</sentence>
<sentence>
<infon key="sentnum">1</infon>
<offset>-1</offset>
<text>sentence 1</text>
</sentence>
...
</passage>
</document>
</xmp>
<p>An example of the submission format is shown below for the above en2es language pair:</p>
<xmp>
<document>
<id>S123456789</id>
<passage>
<infon key="language">ES</infon>
<infon key="section">title</infon>
<offset>-1</offset>
<sentence>
<infon key="sentnum">0</infon>
<offset>-1</offset>
<text>translation of title sentence</text>
</sentence>
</passage>
<passage>
<infon key="language">ES</infon>
<infon key="section">abstract</infon>
<offset>-1</offset>
<sentence>
<infon key="sentnum">0</infon>
<offset>-1</offset>
<text>translation of sentence 0</text>
</sentence>
<sentence>
<infon key="sentnum">1</infon>
<offset>-1</offset>
<text>translation of sentence 1</text>
</sentence>
...
</passage>
</document>
</xmp>
<p>
Please identify each sentence with the corresponding "sentnum" specified in the test file.
The submission file has the same format as the test file, except for the "language" infon, which should contain the target language instead of the source language,
and the "text" tag, which should contain the translation of the text into the target language.
</p>
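<p>The conversion from a test document to a submission document can be sketched in Python as follows. This is an illustration under stated assumptions rather than an official tool: to_submission and make_passage are hypothetical helper names, and the translate argument stands in for whatever translation system a team uses.</p>

```python
import copy
import xml.etree.ElementTree as ET


def to_submission(test_document, target_language, translate):
    """Return a translated copy of a test document.

    `translate` is a hypothetical callable that maps a source sentence
    to its translation. The "sentnum" infons are left untouched.
    """
    submission = copy.deepcopy(test_document)
    for passage in submission.findall("passage"):
        # Only the passage-level "language" infon is changed.
        for info in passage.findall("infon"):
            if info.get("key") == "language":
                info.text = target_language
        for sentence in passage.findall("sentence"):
            text = sentence.find("text")
            if text is not None:
                text.text = translate(text.text or "")
    return submission


def make_passage(document, section, sentences):
    """Append a passage shaped like the test set example (illustration only)."""
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "infon", key="language").text = "EN"
    ET.SubElement(passage, "infon", key="section").text = section
    ET.SubElement(passage, "offset").text = "-1"
    for number, content in enumerate(sentences):
        sentence = ET.SubElement(passage, "sentence")
        ET.SubElement(sentence, "infon", key="sentnum").text = str(number)
        ET.SubElement(sentence, "offset").text = "-1"
        ET.SubElement(sentence, "text").text = content


# Rebuild the en2es test example in memory, then convert it.
test_document = ET.Element("document")
ET.SubElement(test_document, "id").text = "S123456789"
make_passage(test_document, "title", ["title sentence"])
make_passage(test_document, "abstract", ["sentence 0", "sentence 1"])

submission = to_submission(test_document, "ES", lambda s: "translation of " + s)
translated = [node.text for node in submission.iter("text")]
languages = [
    node.text for node in submission.iter("infon") if node.get("key") == "language"
]
print(ET.tostring(submission, encoding="unicode"))
```

<p>Working on a deep copy keeps the original test document unchanged, so the same source can be reused for several runs.</p>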
<h3>Submission Requirements</h3>
<p>Please register your team using this <a href="https://script.google.com/macros/s/AKfycbwXS8N3aS3m7kc4O4xGfCkR1d0zqN-5Eq0rLS9j-5JCQsoCkNn_/exec">form</a>.
You will receive an email confirming your registration. This email also contains the link for submission.
</p>
<p>The test files are available in the "testset" folder in the <a href="https://drive.google.com/open?id=0B3UxRWA52hBja0t2azlkN3d2elk">WMT'16 biomedical task Google Drive account</a>
and are named after the dataset (biological or health) and the language pair (e.g., en2es or es2en).
For instance, the test file for the biological dataset for English to Spanish is called "biological_en2es.xml".</p>
<p>The name of each submission file should consist of the original test file name preceded by the team identifier (as registered in the form above) and the run number,
following this example:
the submission file for run 1 of the "HPI" team for the biological dataset for English to Spanish should be called "HPI_run1_biological_en2es.xml".</p>
<p>Each team is allowed to submit up to 3 runs per test file, i.e., 3 runs for the "biological_en2es.xml" test file, 3 runs for the "biological_es2en.xml" test file, etc.
<b>There is no biological test set for either the "fr2en" or the "en2fr" language pair.</b></p>
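<p>The naming rules above can be checked before uploading. The sketch below is a hypothetical helper, not part of the official submission tooling. It assumes that runs are numbered from 1 to 3 and that language codes are the two-letter lowercase codes used in the test file names.</p>

```python
DATASETS = ("biological", "health")
NON_ENGLISH = ("es", "fr", "pt")
MAX_RUNS = 3  # up to 3 runs per test file; numbering from 1 is an assumption


def submission_name(team, run, dataset, source, target):
    """Build a submission file name such as "HPI_run1_biological_en2es.xml"."""
    if dataset not in DATASETS:
        raise ValueError("unknown dataset: " + dataset)
    if run not in range(1, MAX_RUNS + 1):
        raise ValueError("run must be between 1 and %d" % MAX_RUNS)
    pair = {source, target}
    # Every language pair in the task has English on exactly one side.
    if len(pair) != 2 or "en" not in pair:
        raise ValueError("exactly one side of the pair must be 'en'")
    if not (pair - {"en"}).issubset(NON_ENGLISH):
        raise ValueError("unsupported language pair")
    # There is no biological test set for fr2en or en2fr.
    if dataset == "biological" and "fr" in pair:
        raise ValueError("no biological test set for French")
    return "%s_run%d_%s_%s2%s.xml" % (team, run, dataset, source, target)


name = submission_name("HPI", 1, "biological", "en", "es")
print(name)
```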
<h3>Important dates</h3>
<table>
<tr><td>Release of training data </td><td>end of January 2016</td></tr>
<tr><td>Release of test data </td><td>April 15, 2016</td></tr>
<tr><td>Results submission deadline </td><td>April 22, 2016</td></tr>
<tr><td>Paper submission deadline</td><td>May 8, 2016</td></tr> <!-- fixed?-->
<tr><td>Notification of acceptance</td><td>June 5, 2016</td></tr> <!-- fixed?-->
<tr><td>Camera-ready deadline</td><td>June 22, 2016</td></tr> <!-- fixed?-->
</table>
<h3>Organisers</h3>
Antonio Jimeno Yepes (IBM Research Australia)<br>
Aurélie Névéol (LIMSI, CNRS, France)<br>
Mariana Neves (Hasso-Plattner Institute, Germany)<br>
Karin Verspoor (University of Melbourne, Australia)<br>
<br/><br/>
Please contact us by email at <a href="mailto:wmtbiomedical@gmail.com">wmtbiomedical@gmail.com</a>.
Please also join our <a href="https://groups.google.com/forum/?hl=en#!forum/wmt-biomedical-task">discussion forum</a>.
<br/>
</body>
</HTML>