RJCRI 06 : Unnatural Language Detection
In the context of web search engines, the escalation between ranking techniques and spamdexing techniques has led to the appearance of fake content in web pages. While random sequences of keywords are easily detected, web pages produced by dedicated content generators are much more difficult to detect.
Motivated by search engine applications, we focus on the problem of automatic unnatural language detection. We study both the syntactic and semantic aspects of this problem, and for each of them we present probabilistic and symbolic approaches.
@inproceedings{lavergne2006rjcri,
author = {Lavergne, Thomas},
title = {Unnatural Language Detection},
booktitle = {Young Scientists' Conference on Information Retrieval ({RJCRI}'06)},
year = {2006},
pages = {383--388},
month = {3},
location = {Lyon, France},
url = {http://www.irit.fr/ARIA/2006/383.pdf}
}
AIRWeb 06 : Tracking Web Spam with Hidden Style Similarity
Automatically generated content is ubiquitous on the web: dynamic sites built using the three-tier paradigm are good examples (e.g. commercial sites, blogs and other sites powered by web authoring software), as well as less legitimate spamdexing attempts (e.g. link farms, faked directories...).
Those pages built using the same generating method (template or script) share a common "look and feel" that is not easily detected by common text classification methods, but is more related to stylometry.
In this paper, we present a (hidden) style similarity measure based on extra-textual features in HTML source code. We also describe a method to cluster a large collection of documents according to this measure. Since the clustering algorithm is based on fingerprints, we also review fingerprinting. By conveniently sorting the generated clusters, one can efficiently trace instances of a particular automatic content generation method among web pages collected by a crawler. This is particularly useful for detecting pages across different sites that share the same design, which is often a good hint of either a spamdexing attempt or mirrored content.
@inproceedings{urvoy2006airweb,
author = {Urvoy, Tanguy and Lavergne, Thomas and Filoche, Pascal},
title = {Tracking Web Spam with Hidden Style Similarity},
booktitle = {International Workshop on Adversarial Information Retrieval on the Web ({AIRW}eb'06)},
year = {2006},
pages = {25--31},
month = {8},
location = {Seattle, Washington, {USA}},
url = {http://airweb.cse.lehigh.edu/2006/urvoy.pdf}
}
ACM Tweb : Tracking Web Spam with HTML Style Similarities
Automatically generated content is ubiquitous on the web: dynamic sites built using
the three-tier paradigm are good examples (e.g., commercial sites, blogs and other
sites edited using web authoring software), as well as less legitimate spamdexing
attempts (e.g., link farms, faked directories).
Those pages built using the same generating method (template or script) share a
common look and feel that is not easily detected by common text classification
methods, but is more related to stylometry.
In this work we study and compare several HTML style similarity measures based on
both textual and extra-textual features in HTML source code. We also propose a
flexible algorithm to cluster a large collection of documents according to these
measures. Since the proposed algorithm is based on locality sensitive hashing
(LSH), we first review this technique.
We then describe how to use the HTML style similarity clusters to pinpoint dubious
pages and enhance the quality of spam classifiers. We present an evaluation of our
algorithm on the WEBSPAM-UK2006 dataset.
@article{urv08acmtweb,
author = {Urvoy, Tanguy and Chauveau, Emmanuel and Filoche, Pascal and Lavergne, Thomas},
title = {Tracking Web Spam with {HTML} Style Similarities},
journal = {{ACM} Trans. Web},
volume = {2},
number = {1},
year = {2008},
issn = {1559-1131},
pages = {1--28},
doi = {10.1145/1326561.1326564},
publisher = {{ACM}},
address = {New York, {NY}, {USA}}
}
JADT 08 : Taxonomie de textes peu-naturels
In this paper, we give a pragmatic definition of what a natural text is. Then, we present
various types of unnatural texts and more particularly the simplest generators, which
are also the most widespread in spamdexing. Finally, we describe some statistical
tests which allow a first filtering of unnatural texts.
Dans cet article nous définissons de manière pragmatique ce qu'est un texte naturel.
Puis nous présentons différentes catégories de textes non-naturels et plus
particulièrement les méthodes de génération les plus simples qui sont aussi les plus
répandues dans le cadre du spamdexing. Enfin nous proposons quelques tests statistiques
permettant un premier filtrage des textes non-naturels.
@inproceedings{lavergne2008jadt,
author = {Lavergne, Thomas},
title = {Taxonomie de textes peu-naturels},
booktitle = {Proceedings of the 9th International Conference on Textual Data Statistical Analysis},
year = {2008},
pages = {679--687},
url = {http://www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt2008/pdf/lavergne.pdf}
}
PAN 08 : Detecting Fake Content with Relative Entropy Scoring
How can natural texts be distinguished from artificially generated ones? Fake content is commonly encountered on the Internet, ranging from web scraping to random word salads. Most of this fake content is generated for spam purposes. In this paper, we present two methods to deal with this problem. The first one uses classical language models, while the second one is a novel approach using short-range information between words.
@inproceedings{lavergne2008pan,
author = {Lavergne, Thomas and Urvoy, Tanguy and Yvon, Fran\c{c}ois},
title = {Detecting Fake Content with Relative Entropy Scoring},
booktitle = {International Workshop on Plagiarism Analysis, Authorship Identification,
and Near-Duplicate Detection ({PAN})},
year = {2008},
url = {http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-377/paper4.pdf}
}
PhD Thesis: Détection des textes non-naturels
@phdthesis{lavergne2009phd,
author = {Lavergne, Thomas},
title = {D\'etection des textes non-naturels},
school = {{ENST} Paris and Orange Labs},
month = {4},
year = {2009}
}
ACL-IJCNLP 09 : Introduction of a new paraphrase generation tool based on Monte-Carlo sampling
We propose a new method specifically designed for paraphrase generation, based on Monte-Carlo sampling, and show how this algorithm is suited to the task. Moreover, the basic algorithm presented here leaves many opportunities for future improvement. In particular, our algorithm does not constrain the scoring function, unlike Viterbi-based decoders. It is therefore possible to use global features in paraphrase scoring functions. This algorithm opens new perspectives for paraphrase generation and other natural language processing applications such as statistical machine translation.
@inproceedings{chevelu2009acl,
author = {Chevelu, Jonathan and Lavergne, Thomas and Lepage, Yves and Moudenc, Thierry},
title = {Introduction of a new paraphrase generation tool based on Monte-Carlo sampling},
booktitle = {Joint Conference of the 47th Annual Meeting of the Association for Computational
Linguistics and the 4th International Joint Conference on Natural Language
Processing (ACL-IJCNLP)},
year = {2009},
pages = {249--252},
url = {http://www.aclweb.org/anthology/P/P09/P09-2063.pdf}
}
PACLING 09 : Transformation rules and Monte-Carlo sampling: a different approach for statistical paraphrase generation
Paraphrase generation is often presented as a monolingual statistical machine translation problem. This approach cannot take advantage of the particularities of paraphrasing, such as transforming only parts of sentences. We propose a different paradigm for statistical paraphrase generation where a paraphrase is seen as the application of a set of transformation rules to a sentence. We propose a new method, adapted to this point of view, based on Monte-Carlo sampling and show how this algorithm is suited to paraphrase generation. Moreover, the basic algorithm presented here leaves many opportunities for future improvement. In particular, our algorithm does not constrain the scoring function, unlike Viterbi-based decoders. It is therefore possible to use global features in paraphrase scoring functions. This algorithm opens new perspectives for paraphrase generation and other natural language processing applications such as statistical machine translation.
@inproceedings{chevelu2009pacling,
author = {Chevelu, Jonathan and Lavergne, Thomas and Lepage, Yves and Moudenc, Thierry},
title = {Transformation rules and Monte-Carlo sampling: a different approach for
statistical paraphrase generation},
booktitle = {Conference of the Pacific Association for Computational Linguistics (PACLING)},
year = {2009},
pages = {230--235}
}
LRE: Filtering artificial texts with statistical machine learning techniques
Fake content is flourishing on the Internet, ranging from basic random word salads to web scraping. Most of this fake content is generated to feed fake web sites aimed at biasing search engine indexes: at the scale of a search engine, using automatically generated texts renders such sites harder to detect than using copies of existing pages. In this paper, we present three methods aimed at distinguishing natural texts from artificially generated ones: the first method uses basic lexicometric features, the second one uses standard language models and the third one is based on a relative entropy measure which captures short-range dependencies between words. Our experiments show that lexicometric features and language models are effective at detecting most generated texts, but fail to detect texts that are generated with high-order Markov models. By comparison, our relative entropy scoring algorithm, especially when trained on a large corpus, makes it possible to detect these “hard” text generators with a high degree of accuracy.
@article{lavergne2010lre,
author = {Lavergne, Thomas and Urvoy, Tanguy and Yvon, Fran\c{c}ois},
title = {Filtering artificial texts with statistical machine learning techniques},
journal = {Language Resources and Evaluation},
volume = {45},
number = {1},
year = {2011},
pages = {25--43},
doi = {10.1007/s10579-009-9113-0},
publisher = {Springer}
}
Efficient Learning of Sparse Conditional Random Fields for Supervised Sequence Labelling
Conditional Random Fields (CRFs) constitute a popular and efficient approach for supervised sequence labelling. CRFs can cope with large description spaces and can integrate some form of structural dependency between labels. In this contribution, we address the issue of efficient feature selection for CRFs based on imposing sparsity through an L1 penalty. We first show how sparsity of the parameter set can be exploited to significantly speed up training and labelling. We then introduce coordinate descent parameter update schemes for CRFs with L1 regularization. We finally provide some empirical comparisons of the proposed approach with state-of-the-art CRF training strategies. In particular, it is shown that the proposed approach is able to take advantage of the sparsity to speed up processing and hence potentially handle higher-dimensional models.
@article{sokolovska2010stsp,
author = {Sokolovska, Nataliya and Lavergne, Thomas and Capp\'{e}, Olivier and Yvon, Fran\c{c}ois},
title = {Efficient learning of sparse conditional random fields for supervised sequence labelling},
journal = {Journal of Selected Topics in Signal Processing},
volume = {4},
number = {6},
pages = {953--964},
year = {2010},
doi = {10.1109/JSTSP.2010.2076150},
publisher = {{IEEE}},
}
Practical Very Large Scale CRFs
Conditional Random Fields (CRFs) constitute a popular approach for supervised sequence labelling, notably due to their ability to handle large description spaces and to integrate structural dependency between labels. Taking structure into account typically implies a number of parameters and a computational effort that grow quadratically with the cardinality of the label set. In this paper, we address the issue of training very large CRFs, containing up to hundreds of output labels and several billion features. Efficiency stems here from the sparsity induced by the use of an $\ell^1$ penalty term. Based on our own implementation, we compare three recent proposals for implementing this regularization strategy. Our experiments demonstrate that very large CRFs can be trained efficiently and that larger models are able to improve the accuracy, while delivering compact parameter sets.
@inproceedings{lavergne2010acl,
author = {Lavergne, Thomas and Capp\'{e}, Olivier and Yvon, Fran\c{c}ois},
title = {Practical Very Large Scale CRFs},
booktitle = {Proceedings of the 48th Annual Meeting of the Association for Computational
Linguistics (ACL)},
year = {2010},
location = {Uppsala, Sweden},
pages = {504--513},
}
Designing an improved discriminative word aligner
The quality of statistical machine translation systems depends on the quality of the word alignments, computed during the translation model training phase. IBM generative alignment models, despite their poor quality compared to a gold standard, perform well in practice.
In this paper, we propose an improved word aligner based on a maximum entropy alignment combination model, which employs better feature engineering, $\ell^1$ regularization, and an enhanced search space to improve the quality of both alignment and translation. For the Arabic-English language pair, we are able to reduce the Alignment Error Rate by 43.4%, and achieve an improvement of $\approx 1$ BLEU point over the IBM model 4 symmetrized alignments. These improvements are attainable at a lower computational cost, using only easy-to-estimate HMM and IBM model 1 features.
An analysis of the obtained results shows that a good balance between several alignment characteristics should be maintained in order to deliver good translation quality.
@article{tomeh2011ijcla,
author = {Tomeh, Nadi and Allauzen, Alexandre and Lavergne, Thomas and Yvon, Fran\c{c}ois},
title = {Designing an Improved Discriminative Word Aligner},
year = {2011},
journal = {International Journal of Computational Linguistics and Applications (IJCLA)},
}
LIMSI@WMT11
This paper describes LIMSI’s submissions to the Sixth Workshop on Statistical Machine Translation. We report results for the French-English and German-English shared translation tasks in both directions. Our systems use n-code, an open source Statistical Machine Translation system based on bilingual n-grams. For the French-English task, we focussed on finding efficient ways to take advantage of the large and heterogeneous training parallel data. In particular, using a simple filtering strategy helped to improve both processing time and translation quality. To translate from English to French and German, we also investigated the use of the SOUL language model in Machine Translation and showed significant improvements with a 10-gram SOUL model. We also briefly report experiments with several alternatives to the standard n-best MERT procedure, leading to a significant speed-up.
@inproceedings{allauzen2011wmt,
author = {Allauzen, Alexandre and Bonneau-Maynard, H\'el\`ene and Le, Hai-Son and
Max, Aur\'elien and Wisniewski, Guillaume and Yvon, Fran\c{c}ois and
Adda, Gilles and Crego, Josep Maria and Lardilleux, Adrien and
Lavergne, Thomas and Sokolov, Artem},
title = {{LIMSI}@{WMT}11},
booktitle = {Proceedings of the 6th Workshop on Statistical Machine Translation (WMT)},
year = {2011},
pages = {309--315},
month = {7},
location = {Edinburgh, {UK}},
url = {http://www.statmt.org/wmt11/pdf/WMT35.pdf}
}
From n-gram-based to CRF-based translation models
A major weakness of extant statistical machine translation (SMT) systems is their lack of a proper training procedure. Phrase extraction and scoring processes rely on a chain of crude heuristics, a situation judged problematic by many. In this paper, we recast the machine translation problem in the familiar terms of a sequence labeling task, thereby enabling the use of enriched feature sets and exact training and inference procedures. The tractability of the whole enterprise is achieved through an efficient implementation of the conditional random fields (CRFs) model using a weighted finite-state transducers library. This approach is experimentally contrasted with several conventional phrase-based systems.
@inproceedings{lavergne2011wmt,
author = {Lavergne, Thomas and Allauzen, Alexandre and Yvon, Fran\c{c}ois and Crego, Josep Maria},
title = {From n-gram-based to {CRF}-based translation models},
booktitle = {Proceedings of the 6th Workshop on Statistical Machine Translation (WMT)},
year = {2011},
pages = {542--553},
month = {7},
location = {Edinburgh, {UK}},
url = {http://www.statmt.org/wmt11/pdf/WMT68.pdf},
}