Package pysimt

`pysimt` is a `PyTorch`-based sequence-to-sequence (S2S) framework that facilitates
research in unimodal and multimodal machine translation. The framework
is especially geared towards a set of recent simultaneous MT approaches, including
heuristics-based decoding and prefix-to-prefix training/decoding. Common metrics
such as average proportion (AP), average lag (AL), and consecutive wait (CW)
are provided through well-defined APIs as well.
Features

`pysimt` includes two state-of-the-art S2S approaches to neural machine
translation (NMT):
- RNN-based attentive NMT (Bahdanau et al. 2014)
- Self-attention based Transformers NMT (Vaswani et al. 2017)
The toolkit places particular emphasis on multimodal machine translation (MMT): the above models easily accommodate multiple source modalities through encoder-side (Caglayan et al. 2020) and decoder-side (Caglayan et al. 2016) multimodal attention approaches.
Simultaneous NMT
The following notable approaches in the simultaneous NMT field are implemented:
- Heuristics-based decoding approaches wait-if-diff and wait-if-worse (Cho and Esipova, 2016)
- Prefix-to-prefix training and decoding approach wait-k (Ma et al., 2019)
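As a toy illustration of the wait-k policy, the read/write schedule can be sketched in a few lines of plain Python. This is a hypothetical sketch, not the toolkit's actual implementation; the function name and signature are illustrative. The idea is that the decoder emits its t-th target token only after the first k + t source tokens have been read.

```python
def wait_k_schedule(k, src_len, tgt_len):
    """For each target step t (0-based), return how many source tokens
    must have been read before emitting target token t.

    Illustrative sketch of the wait-k policy (Ma et al., 2019):
    read k tokens first, then alternate READ/WRITE until the
    source sentence is exhausted.
    """
    return [min(k + t, src_len) for t in range(tgt_len)]

# With k=2 and a 5-token source, the decoder waits for 2 tokens,
# then reads one more source token per emitted target token.
print(wait_k_schedule(2, src_len=5, tgt_len=6))  # [2, 3, 4, 5, 5, 5]
```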
Simultaneous MMT
The toolkit includes the reference implementations of the following conference papers that initiated research in simultaneous MMT:
- Simultaneous Machine Translation with Visual Context (Caglayan et al. 2020)
- Towards Multimodal Simultaneous Neural Machine Translation (Imankulova et al. 2020)
Other features
- CPU / (Single) GPU training of sequence-to-sequence frameworks
- Reproducible experimentation through well-defined configuration files
- Easy multimodal training with parallel corpora
- Logging training progress and validation performance to Tensorboard
- Text, Image Features and Speech encoders
- Early-stopping and model checkpointing using various criteria such as MultiBLEU, SacreBLEU, METEOR, word error rate (WER), character error rate (CER), etc.
- Ready-to-use latency metrics for simultaneous MT, including average proportion (AP), average lag (AL), and consecutive wait (CW)
- Beam search translation for consecutive, greedy search translation for simultaneous MT
- Utilities to produce reports for simultaneous MT performance
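Among the latency metrics above, average lag (Ma et al., 2019) measures how many source tokens, on average, the decoder lags behind an ideal simultaneous policy. A minimal sketch of the computation follows; the function name and signature are illustrative, not the `pysimt` API.

```python
def average_lag(g, src_len, tgt_len):
    """Sketch of Average Lag (AL) from Ma et al. (2019).

    g[t-1] is the number of source tokens that had been read when
    target token t was emitted. The sum runs up to tau, the first
    target step whose emission waited for the full source sentence.
    """
    gamma = tgt_len / src_len  # length ratio |y| / |x|
    tau = next((t for t, gt in enumerate(g, start=1) if gt >= src_len),
               tgt_len)
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

# A wait-1 policy on equal-length sentences lags by one token:
print(average_lag([1, 2, 3, 4, 5], src_len=5, tgt_len=5))  # 1.0
# A consecutive (full-sentence) policy lags by the whole source:
print(average_lag([5, 5, 5, 5, 5], src_len=5, tgt_len=5))  # 5.0
```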
Installation
Essentially, `pysimt` requires Python>=3.7 and torch>=1.7.0. The remaining
dependencies are listed in the provided environment.yml file.
The following command will create an appropriate Anaconda environment with `pysimt`
installed in editable mode. This allows you to modify the code in the Git
checkout folder and then run the experiments directly.
$ conda env create -f environment.yml
Note
If you want to use the METEOR metric for early-stopping or the pysimt-coco-metrics
script to evaluate your models' performance, you need to run the pysimt-install-extra
script within the pysimt Anaconda environment. This will download and install
the METEOR paraphrase files under the ~/.pysimt
folder.
Command-line tools
Once installed, you will have access to three command line utilities:
pysimt-build-vocab

- Since `pysimt` does not pre-process, tokenize, or segment the given text files automagically, all these steps should be done by the user prior to training, and the relevant vocabulary files should be constructed using `pysimt-build-vocab`.
- Different vocabularies should be constructed for the source and target language representations (unless `-s` is given).
- The resulting files are in `.json` format.
Arguments:

- `-o, --output-dir OUTPUT_DIR`: Output directory where the resulting vocabularies will be stored.
- `-s, --single`: If given, a single vocabulary file covering all the given training corpora will be constructed. Useful for weight tying in embedding layers.
- `-m, --min-freq`: If given an integer `M`, tokens occurring fewer than `M` times will be filtered out.
- `-M, --max-items`: If given an integer `M`, the final vocabulary will be limited to the `M` most frequent tokens.
- `-x, --exclude-symbols`: If given, the vocabulary will not include special markers such as `<bos>` and `<eos>`. This should be used cautiously, and only for ad-hoc model implementations, as it may break the default models.
- `files`: A variable number of training set corpora can be provided. If `-s` is not given, one vocabulary will be created for each.
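The effect of `-m` and `-M` can be illustrated with a short sketch of frequency-filtered vocabulary construction. This is not the actual `pysimt-build-vocab` implementation, and the exact `.json` layout and the set of special markers used by `pysimt` are assumptions here, shown only to clarify the filtering semantics.

```python
import json
from collections import Counter


def build_vocab(lines, min_freq=1, max_items=None,
                specials=('<pad>', '<bos>', '<eos>', '<unk>')):
    """Sketch of frequency-filtered vocabulary construction.

    min_freq drops tokens occurring fewer than min_freq times (-m),
    max_items keeps only the most frequent tokens (-M). The special
    markers and the token-to-index .json layout are assumptions.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    items = [tok for tok, c in counts.most_common() if c >= min_freq]
    if max_items is not None:
        items = items[:max_items]  # keep the max_items most frequent tokens
    return {tok: idx for idx, tok in enumerate((*specials, *items))}


corpus = ["a b a c", "a b d"]
vocab = build_vocab(corpus, min_freq=2)  # keeps only 'a' and 'b'
print(json.dumps(vocab))  # serialisable straight to a .json file
```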
pysimt-coco-metrics

This is a simple utility that computes BLEU, METEOR, CIDEr, and ROUGE-L
using the well-known coco-caption library. The library is shipped within
`pysimt` so that you do not have to install it separately.
Arguments:

- `-l, --language`: If given a language code `L`, METEOR will be informed with that information. For languages not supported by METEOR, English will be assumed.
- `-w, --write`: For every hypothesis file given as an argument, a `<hypothesis file>.score` file will be created with the computed metrics inside, for convenience.
- `-r, --refs`: List of reference files for evaluation. The number of lines across multiple references should be equal.
- `systems`: A variable number of hypothesis files that represent system outputs.
Example:
$ pysimt-coco-metrics -l de system1.hyps system2.hyps -r ref1
$ pysimt-coco-metrics -l de system1.hyps system2.hyps -r ref1 ref2 ref3
Note
This utility requires tokenized hypotheses and references, as further
tokenization is not applied by the internal metrics. Specifically for BLEU,
if you are not evaluating your models for MMT or image captioning,
you may want to use sacreBLEU
for detokenized hypotheses and references.
Tip

The `Bleu_4` score produced by this utility is equivalent to the output of
`multi-bleu.perl`, and to sacreBLEU when `--tokenize none` is given to the latter.
pysimt

This is the main entry point to the software. It supports two modes, namely `pysimt train` and `pysimt translate`.
Training a model
Translating with a pre-trained model
Configuring An Experiment
Models

- A `pysimt` model derives from `torch.nn.Module` and implements specific API methods.
Contributing

`pysimt` is developed on GitHub. Bug reports and pull requests are welcome.
Citing The Toolkit
As of now, you can cite the following work if you use this toolkit. We will update this section if the software paper is published elsewhere.
@inproceedings{caglayan-etal-2020-simultaneous,
title = "Simultaneous Machine Translation with Visual Context",
author = {Caglayan, Ozan and
Ive, Julia and
Haralampieva, Veneta and
Madhyastha, Pranava and
Barrault, Lo{\"\i}c and
Specia, Lucia},
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.184",
pages = "2350--2361",
}
License

`pysimt` is released under the MIT License.
Copyright (c) 2020 NLP@Imperial
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Sub-modules

- `pysimt.datasets`: A dataset in `pysimt` inherits from `torch.utils.data.Dataset` and is designed to read and expose a specific type of corpus …
- `pysimt.evaluator`
- `pysimt.layers`: Different layer types that may be used in sequence-to-sequence models.
- `pysimt.lr_scheduler`: Learning rate scheduler wrappers.
- `pysimt.mainloop`: Training main loop.
- `pysimt.metrics`
- `pysimt.models`
- `pysimt.monitor`: Training progress monitor.
- `pysimt.optimizer`: Stochastic optimizer wrappers.
- `pysimt.samplers`
- `pysimt.stranslator`
- `pysimt.translators`
- `pysimt.utils`
- `pysimt.vocabulary`: Vocabulary class for integer-token mapping.