This page contains the source code and the resources used in our papers on training a PoS tagger from ambiguous labels:
Please cite our EMNLP paper if you use the code/resources available on this page:
@InProceedings{wisniewski-EtAl:2014:EMNLP2014,
  author    = {Wisniewski, Guillaume and P\'{e}cheux, Nicolas and Gahbiche-Braham, Souhir and Yvon, Fran\c{c}ois},
  title     = {Cross-Lingual Part-of-Speech Tagging through Ambiguous Learning},
  booktitle = {Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  month     = {October},
  year      = {2014},
  address   = {Doha, Qatar},
  publisher = {Association for Computational Linguistics},
  pages     = {1779--1785},
  url       = {http://www.aclweb.org/anthology/D14-1187}
}
The following program can be used to extract the type constraints from Wiktionary dumps (look for the [lang]wiktionary links, where [lang] is the ISO 639 code for the language). Before using it, you will have to install the following dependencies:
Note that the program has been improved since our experiments for the EMNLP paper, so the extracted dictionaries may differ slightly from the ones we used. For the sake of reproducibility, here are the constraints used in our EMNLP paper:
The code for the ambiguous PoS tagger is available in the following archive.
Once the archive has been decompressed, it is possible to project the tags through alignment links with the following code:
lang = "fr"
# All these files are generated during data preparation
FREQUENT_SUFFIXES = "{}_frequent_suffixes.pickle".format(lang)
FREQUENT_WORDS = "{}_frequent_words.pickle".format(lang)
AL_CONSTRAINTS = "{}_al_constraints.json".format(lang)
WEAKLY_TRAIN = "{}_projected.pickle".format(lang)
# Extract information about frequent words and suffixes
create_feature_template(CORPUS, FREQUENT_WORDS, FREQUENT_SUFFIXES)
# Extract alignment constraints
extract_alignment_constraints(CORPUS, AL_CONSTRAINTS, -1)
# Label training set. The resulting dataset will be in WEAKLY_TRAIN
project_labels(WEAKLY_TRAIN, CORPUS, False, WIKI_CONSTRAINTS,
               AL_CONSTRAINTS, PRIORITY_CONSTRAINTS, "intersection")
Running this script requires the following files:
CORPUS this file contains the corpus with all the information required to transfer the PoS information. It is a pickle file in which examples are pickled one after the other. Each example is a dictionary with the following keys:
- src the tokenized source sentence
- tgt the tokenized target sentence
- al the alignment between the source and the target sentences (a dictionary that maps a source index to a target index)
- src_pos a list describing the PoS tags of the source sentence (if not available, the source sentence itself)
- tgt_pos a list describing the PoS tags of the target sentence (if not available, the target sentence itself)
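The corpus layout described above can be sketched as follows. The example values and the `read_examples` and `project` helpers are hypothetical illustrations of the format, not part of the released code; here an in-memory buffer stands in for the pickle file.

```python
import io
import pickle

# A hypothetical corpus example with the keys described above.
example = {
    "src": ["the", "cat", "sleeps"],
    "tgt": ["le", "chat", "dort"],
    "al": {0: 0, 1: 1, 2: 2},           # source index -> target index
    "src_pos": ["DET", "NOUN", "VERB"],
    "tgt_pos": ["le", "chat", "dort"],  # no target PoS: the sentence itself
}

# Examples are pickled one after the other into the same stream.
buf = io.BytesIO()
pickle.dump(example, buf)
pickle.dump(example, buf)  # a second example, appended to the same stream

def read_examples(stream):
    """Yield examples until the pickle stream is exhausted."""
    stream.seek(0)
    while True:
        try:
            yield pickle.load(stream)
        except EOFError:
            return

examples = list(read_examples(buf))

def project(ex):
    """Naive projection through the alignment links: copy each aligned
    source tag onto the corresponding target position."""
    tags = [None] * len(ex["tgt"])
    for s, t in ex["al"].items():
        tags[t] = ex["src_pos"][s]
    return tags
```

Unaligned target words keep a `None` tag in this sketch; the actual `project_labels` function combines such projected tags with the Wiktionary and priority constraints.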
WIKI_CONSTRAINTS Wiktionary constraints are in the format used in [Li et al., 2012]: each line contains a word and one PoS tag it can take, separated by a tab; the same word can appear on several lines.
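A minimal reader for this constraint format might look as follows; `read_wiki_constraints` and the sample lines are hypothetical, shown only to make the word/tab/tag layout concrete.

```python
from collections import defaultdict

def read_wiki_constraints(lines):
    """Collect the set of allowed PoS tags for each word from
    "word<TAB>tag" lines (the same word may appear several times)."""
    constraints = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, tag = line.split("\t")
        constraints[word].add(tag)
    return dict(constraints)

# Hypothetical sample in the [Li et al., 2012] format.
sample = ["walk\tVERB", "walk\tNOUN", "the\tDET"]
constraints = read_wiki_constraints(sample)
```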
PRIORITY_CONSTRAINTS Priority constraints are hand-made constraints that can be used, for instance, to account for differences in conventions or in punctuation normalization, or to correct errors in the extraction from Wiktionary. Priority constraints are stored in a JSON file that describes a mapping between a regular expression and a list of tags.
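The regexp-to-tags mapping can be sketched like this; the JSON content and the `priority_tags` helper are hypothetical examples of what such a file could contain, not the constraints shipped with the code.

```python
import json
import re

# Hypothetical priority-constraints file: a JSON object mapping a regexp
# to the list of tags any matching token is allowed to take.
priority_json = '{"^[.,;:!?]$": ["PUNCT"], "^[0-9]+$": ["NUM"]}'
priority = {re.compile(p): tags for p, tags in json.loads(priority_json).items()}

def priority_tags(token):
    """Return the tag list of the first matching priority constraint,
    or None if no constraint applies to the token."""
    for pattern, tags in priority.items():
        if pattern.match(token):
            return tags
    return None
```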
The following code can then be used to train a simple weakly supervised PoS tagger:
lang = "fr"
train_set = "fr_projected.pickle"
output = "fr_weakly_mode.pickle"
train_weakly_tagger(lang, train_set, "constraints_features", output)