Package pysimt

pysimt is a PyTorch-based sequence-to-sequence (S2S) framework that facilitates research in unimodal and multi-modal machine translation. The framework is especially geared towards a set of recent simultaneous MT approaches, including heuristics-based decoding and prefix-to-prefix training/decoding. Common metrics such as average proportion (AP), average lag (AL), and consecutive wait (CW) are provided through well-defined APIs as well.

Features

pysimt includes two state-of-the-art S2S approaches to neural machine translation (NMT) to begin with:

The toolkit places a particular emphasis on multimodal machine translation (MMT); the above models therefore easily accommodate multiple source modalities through encoder-side (Caglayan et al. 2020) and decoder-side (Caglayan et al. 2016) multimodal attention approaches.

Simultaneous NMT

The following notable approaches in the simultaneous NMT field are implemented:

Simultaneous MMT

The toolkit includes the reference implementation for the following conference papers that initiated research in Simultaneous MMT:

Other features

  • CPU / (single) GPU training of sequence-to-sequence models
  • Reproducible experimentation through well-defined configuration files
  • Easy multimodal training with parallel corpora
  • Logging training progress and validation performance to Tensorboard
  • Text, Image Features and Speech encoders
  • Early-stopping and model checkpointing using various criteria such as MultiBLEU, SacreBLEU, METEOR, word error rate (WER), character error rate (CER), etc.
  • Ready-to-use latency metrics for simultaneous MT, including average proportion (AP), average lag (AL), and consecutive wait (CW)
  • Beam search translation for consecutive MT, greedy search translation for simultaneous MT
  • Utilities to produce reports for simultaneous MT performance
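For intuition, the latency metrics listed above can all be derived from the sequence g(t): the number of source tokens read before target token t is written. Below is a minimal pure-Python sketch following the standard definitions from the literature; it is independent of pysimt's own API, whose function names and signatures may differ:

```python
def average_proportion(g, src_len, tgt_len):
    # AP (Cho & Esipova, 2016): mean fraction of the source
    # consumed per emitted target token.
    return sum(g) / (src_len * tgt_len)

def average_lag(g, src_len, tgt_len):
    # AL (Ma et al., 2019): average lag behind an ideal wait-0
    # policy, computed up to the step tau where the full source
    # has been read.
    gamma = tgt_len / src_len
    tau = next(t for t, g_t in enumerate(g, 1) if g_t == src_len)
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

def consecutive_wait(g):
    # CW (Gu et al., 2017): mean length of the non-zero bursts of
    # source reads between consecutive target writes.
    waits = [b - a for a, b in zip([0] + list(g[:-1]), g)]
    nonzero = [w for w in waits if w > 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0
```

For a wait-1 policy over a 4-token source and 4-token target, g = [1, 2, 3, 4] gives AP = 0.625, AL = 1.0, and CW = 1.0.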

Installation

pysimt requires Python >= 3.7 and torch >= 1.7.0. The remaining dependencies are listed in the provided environment.yml file.

The following command will create an appropriate Anaconda environment with pysimt installed in editable mode. This allows you to modify the code in the Git checkout folder and then run experiments directly.

$ conda env create -f environment.yml

Note

If you want to use the METEOR metric for early-stopping or the pysimt-coco-metrics script to evaluate your models' performance, you need to run the pysimt-install-extra script within the pysimt Anaconda environment. This will download and install the METEOR paraphrase files under the ~/.pysimt folder.

Command-line tools

Once installed, you will have access to three command-line utilities:

pysimt-build-vocab

  • pysimt does not pre-process, tokenize, or segment the given text files automatically; all these steps should be performed by the user prior to training, and the relevant vocabulary files should then be constructed using pysimt-build-vocab.
  • Different vocabularies should be constructed for source and target language representations (unless -s is given).
  • The resulting files are in .json format.

Arguments:

  • -o, --output-dir OUTPUT_DIR: Output directory where the resulting vocabularies will be stored.
  • -s, --single: If given, a single vocabulary file for all the given training corpora files will be constructed. Useful for weight tying in embedding layers.
  • -m, --min-freq: If given an integer M, it will filter out tokens occurring fewer than M times.
  • -M, --max-items: If given an integer M, the final vocabulary will be limited to the M most frequent tokens.
  • -x, --exclude-symbols: If given, the vocabulary will not include special markers such as <bos>, <eos>. This should be used cautiously, and only for ad-hoc model implementations, as it may break the default models.
  • files: A variable number of training set corpora can be provided. If -s is not given, one vocabulary for each will be created.
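Example (the corpus file names below are illustrative; substitute your own pre-processed training files):

$ pysimt-build-vocab -o data/vocab -M 30000 train.en train.de

Since -s is not given, this creates one vocabulary file per corpus, each limited to the 30,000 most frequent tokens.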

pysimt-coco-metrics

This is a simple utility that computes BLEU, METEOR, CIDEr, and ROUGE-L using the well-known coco-caption library. The library is shipped within pysimt so that you do not have to install it separately.

Arguments:

  • -l, --language: If given a language code L, METEOR will use its language-specific resources for L. For languages not supported by METEOR, English is assumed.
  • -w, --write: For every hypothesis file given as argument, a <hypothesis file>.score file will be created with the computed metrics inside for convenience.
  • -r, --refs: List of reference files for evaluation. All reference files must contain the same number of lines.
  • systems: A variable number of hypotheses files that represent system outputs.

Example:

$ pysimt-coco-metrics -l de system1.hyps system2.hyps -r ref1
$ pysimt-coco-metrics -l de system1.hyps system2.hyps -r ref1 ref2 ref3

Note

This utility requires tokenized hypotheses and references, as further tokenization is not applied by the internal metrics. Specifically for BLEU, if you are not evaluating your models for MMT or image captioning, you may want to use sacreBLEU for detokenized hypotheses and references.

Tip

The Bleu_4 produced by this utility is equivalent to the output of multi-bleu.perl and sacreBLEU (when --tokenize none is given to the latter).

pysimt

This is the main entry point to the software. It supports two modes, namely pysimt train and pysimt translate.

Training a model
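Training is driven by an experiment configuration file. The exact option names below are assumptions based on common usage and should be checked against pysimt train --help:

$ pysimt train -C my-experiment.conf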

Translating with a pre-trained model
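Decoding takes a trained checkpoint and produces hypotheses. Again, the flags and file names here are illustrative rather than authoritative; consult pysimt translate --help for the actual interface:

$ pysimt translate [options] best_model.ckpt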

Configuring An Experiment

Models

  • A pysimt model derives from torch.nn.Module and implements specific API methods.

Contributing

pysimt is on GitHub. Bug reports and pull requests are welcome.

Citing The Toolkit

As of now, you can cite the following work if you use this toolkit. We will update this section if the software paper is published elsewhere.

@inproceedings{caglayan-etal-2020-simultaneous,
  title = "Simultaneous Machine Translation with Visual Context",
  author = {Caglayan, Ozan  and
    Ive, Julia  and
    Haralampieva, Veneta  and
    Madhyastha, Pranava  and
    Barrault, Lo{\"\i}c  and
    Specia, Lucia},
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/2020.emnlp-main.184",
  pages = "2350--2361",
}

License

pysimt is released under the MIT License.

Copyright (c) 2020 NLP@Imperial

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Sub-modules

pysimt.datasets

A dataset in pysimt inherits from torch.utils.data.Dataset and is designed to read and expose a specific type of corpus …

pysimt.evaluator
pysimt.layers

Different layer types that may be used in seq-to-seq models.

pysimt.lr_scheduler

Learning rate scheduler wrappers.

pysimt.mainloop

Training main loop.

pysimt.metrics
pysimt.models
pysimt.monitor

Training progress monitor.

pysimt.optimizer

Stochastic optimizer wrappers.

pysimt.samplers
pysimt.stranslator
pysimt.translators
pysimt.utils
pysimt.vocabulary

Vocabulary class for integer-token mapping.