vignettes/ptstem.Rmd
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
From Wikipedia
This paragraph gives a nice explanation of what stemming is. Much of the academic work on stemming has focused on the English language, and it’s somewhat hard to find stemming algorithms for other languages. `ptstem` tries to fix this by providing a comprehensive interface for Portuguese-language stemming algorithms.
The implemented algorithms are:
- RSLP: available in R in the `rslp` package.
- Hunspell: available in R in the `hunspell` package.
- Porter: in R, it’s implemented in the `SnowballC` package.
`ptstem` has only one important function, which is also called `ptstem`. You can easily stem a text by passing it to `ptstem`.
```r
library(ptstem)

text <- "Em morfologia linguística e recuperação de informação a stemização (do inglês, stemming) é
o processo de reduzir palavras flexionadas (ou às vezes derivadas) ao seu tronco (stem), base ou
raiz, geralmente uma forma da palavra escrita. O tronco não precisa ser idêntico à raiz morfológica
da palavra; ele geralmente é suficiente que palavras relacionadas sejam mapeadas para o mesmo
tronco, mesmo se este tronco não for ele próprio uma raiz válida. O estudo de algoritmos para
stemização tem sido realizado em ciência da computação desde a década de 60. Vários motores de
buscas tratam palavras com o mesmo tronco como sinônimos como um tipo de expansão de consulta, em
um processo de combinação."

# Stem the whole text with the default settings
ptstem(text)
```
## [1] "Em morfologia linguística e recuperação de informação a stemização (do inglês, stemming) é\no processo de reduzir palavras flexionadas (ou às vezes derivadas) ao seu tronco (stem), base ou\nraiz, geralmente uma forma da palavras escrita. O tronco não precisa ser idêntico à raiz morfologia\nda palavras; ele geralmente é suficiente que palavras relacionadas sejam mapeadas para o mesmo\ntronco, mesmo se este tronco não for ele próprio uma raiz válida. O estudo de algoritmos para\nstemização tem sido realizado em ciência da computação desde a década de 60. Vários motores de\nbuscas tratam palavras com o mesmo tronco com sinônimos com um tipo de expansão de consulta, em\num processo de combinação."
By default `ptstem` uses the rslp algorithm to stem, and it completes stems with the most frequent word in the text (this is explained later). In this example it’s a little hard to see improvements with stemming, because the text doesn’t contain many words with the same root. Let’s look at a simpler example.
## [1] "avião" "avião" "avião" "viação" "aves" "balão" "balão"
You can return the suffix-stripped words (without completion) by setting the argument `complete = FALSE`.
## [1] "avi" "avi" "avi" "viac" "ave" "bal" "bal"
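To make the completion step more concrete, here is a minimal base R sketch of the idea: every stem is replaced by the most frequent original word that produced it. The helper `complete_stems_sketch` and its tie-breaking rule (the first word seen wins) are assumptions made for illustration, not the package’s own implementation.

```r
# Hypothetical helper, not the package's own implementation.
complete_stems_sketch <- function(words, stems) {
  # Collect the original words that share each stem.
  groups <- split(words, stems)
  # Per stem, keep the most frequent word; ties go to the word seen first.
  representative <- vapply(groups, function(w) {
    counts <- table(factor(w, levels = unique(w)))
    names(counts)[which.max(counts)]
  }, character(1))
  # Map every stem back to its representative word.
  unname(representative[stems])
}

# Input words, and the stems copied from the output shown above.
words <- c("avião", "aviões", "aviação", "viação", "aves", "balão", "balões")
stems <- c("avi", "avi", "avi", "viac", "ave", "bal", "bal")

completed <- complete_stems_sketch(words, stems)
print(completed)
```

Because every word appears only once here, the tie-breaking rule decides the result, so each group is represented by its first member.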
You can also change the algorithm used to stem by setting the `algorithm` argument.
## [1] "avião" "avião" "viação" "viação" "ave" "balão" "balão"
The hunspell stemmer is not a suffix-stripping algorithm, so it can relate words that share a suffix. This happened here with the word “aviação”, which was related to “viação” instead of “avião” and “aviões”. You can also see that hunspell returns valid words even with `complete = FALSE`, but it does not necessarily return words that appear in the text, see:
## [1] "avião"
To use the Porter stemmer, simply tweak the `algorithm` argument again.
## [1] "aviã" "aviõ" "aviaçã" "viaçã" "aves" "balã" "balõ"
Because the Porter stemmer is a general algorithm, it has some problems detecting irregular forms of words. In this example, the stemming didn’t relate any words. If you hadn’t used the `complete = FALSE` argument, you wouldn’t have noticed any difference between the input and the output vectors.
## [1] "avião" "aviões" "aviação" "viação" "aves" "balão" "balões"
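The calls behind the outputs in this section can be sketched roughly as follows. The input vector is inferred from the unchanged Porter output above, and the strings `"hunspell"` and `"porter"` are assumed to be the values the `algorithm` argument accepts (they match the stemmer names reported by `performance`). The `requireNamespace` guard is only there so the snippet also runs where the package is not installed.

```r
# Input vector, inferred from the Porter output with completion shown above.
words <- c("avião", "aviões", "aviação", "viação", "aves", "balão", "balões")

# Guarded so the script still runs when ptstem is not installed.
if (requireNamespace("ptstem", quietly = TRUE)) {
  library(ptstem)

  # Default: rslp stems, completed with the most frequent word.
  print(ptstem(words))

  # Raw rslp stems, without completion.
  print(ptstem(words, complete = FALSE))

  # Assumed algorithm names, taken from the performance table.
  print(ptstem(words, algorithm = "hunspell"))
  print(ptstem(words, algorithm = "porter", complete = FALSE))
  print(ptstem(words, algorithm = "porter"))
}
```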
`ptstem` has two other arguments that can be used to ignore words in stemming.
- `n_char`: minimum number of characters a word must have to be stemmed
- `ignore`: vector of words and regexes to ignore
Sometimes a text contains words that you don’t want to stem, like proper names or words in other languages, and it’s useful to ignore them. Very short words can also lose their meaning when stemmed. The `rslp` algorithm has some rules about word lengths, but `hunspell` does not. That’s why the `n_char` argument is available.
## [1] "obam" "gost" "gost" "gost" "é" "e"
Here `rslp` stemmed “obama” to “obam” and “firmware” to “firmw”. You can choose not to stem these words by setting the `ignore` parameter.
## [1] "obama" "gost" "gost" "gost" "é" "e"
By default, `ptstem` does not stem words with fewer than three characters. If you set `n_char` to 1, every word with at least one character is stemmed.
## [1] "obam" "gost" "gost" "gost" "e" "e"
You can see that “e” and “é” were united. It’s also possible to ignore regexes, using the `ignore` argument.
## [1] "obam" "gostei" "gostou" "gostamos" "é" "e"
This doesn’t stem words that start with “go”.
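As a rough illustration of how the two filters interact, the base R sketch below flags which words would be passed on to a stemmer. The helper `should_stem` is hypothetical, and so are the details it encodes: a default of `n_char = 3` (read off the default behaviour described above) and `ignore` entries that must match the whole word. The package may apply these rules differently. The input vector is inferred from the outputs shown in this section.

```r
# Hypothetical helper, not part of ptstem: TRUE means "this word gets stemmed".
should_stem <- function(words, n_char = 3, ignore = NULL) {
  # Words shorter than n_char characters are left untouched.
  long_enough <- nchar(words) >= n_char
  # Assumption: each ignore entry is a regex that must match the whole word.
  ignored <- rep(FALSE, length(words))
  for (pattern in ignore) {
    ignored <- ignored | grepl(paste0("^(", pattern, ")$"), words)
  }
  long_enough & !ignored
}

# Input vector, inferred from the outputs shown in this section.
words <- c("obama", "gostei", "gostou", "gostamos", "é", "e")

print(should_stem(words))                    # default filters
print(should_stem(words, ignore = "obama"))  # skip a literal word
print(should_stem(words, n_char = 1))        # also stem one-letter words
print(should_stem(words, ignore = "go.*"))   # skip words starting with "go"
```

With the package itself, the matching calls would presumably look like `ptstem(words, ignore = "obama")`, `ptstem(words, n_char = 1)` and `ptstem(words, ignore = "go.*")`. The exact regex used in the original example is not shown, so `"go.*"` is a guess.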
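The goal of stemming algorithms is to group related words and to separate unrelated words. With this in mind, you can talk about two kinds of possible errors when stemming:
- Understemming: related words that were not grouped together.
- Overstemming: unrelated words that were grouped together.
To measure these errors the function `performance` was implemented. It returns a `data.frame` with 3 columns: the name of the stemmer and 2 metrics:
- UI: the understemming index, which grows as more related words end up with different stems.
- OI: the overstemming index, which grows as more unrelated words end up with the same stem.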
Remember that OI is 0 if you don’t stem. So I think the true objective of a stemming algorithm is to reduce UI without increasing OI too much.
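The trade-off can be illustrated with a small base R sketch that counts both kinds of error over all pairs of words. This is one simple pairwise formulation chosen for illustration and is not necessarily how `performance` computes its metrics. Here UI is the share of related pairs that received different stems, and OI is the share of unrelated pairs that received the same stem. The helper `stemming_errors` and the `groups` labels are hypothetical. The stems are copied from the hunspell output shown earlier.

```r
# Hypothetical helper: pairwise under- and overstemming rates.
stemming_errors <- function(groups, stems) {
  # Every unordered pair of word positions.
  pairs <- utils::combn(seq_along(stems), 2)
  same_group <- groups[pairs[1, ]] == groups[pairs[2, ]]
  same_stem <- stems[pairs[1, ]] == stems[pairs[2, ]]
  c(
    UI = sum(same_group & !same_stem) / sum(same_group),  # related, but split
    OI = sum(!same_group & same_stem) / sum(!same_group)  # unrelated, but merged
  )
}

words <- c("avião", "aviões", "aviação", "viação", "aves", "balão", "balões")
# Made-up reference grouping, for illustration only.
groups <- c("aviao", "aviao", "aviao", "viacao", "ave", "balao", "balao")
# Stems copied from the hunspell output shown earlier.
hunspell_stems <- c("avião", "avião", "viação", "viação", "ave", "balão", "balão")

hunspell_errors <- stemming_errors(groups, hunspell_stems)
unstemmed_errors <- stemming_errors(groups, words)  # no stemming at all
print(hunspell_errors)
print(unstemmed_errors)
```

Leaving the words unstemmed never merges unrelated words, so OI is 0, but every related pair stays split, so UI is at its maximum.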
The `ptstem` package provides a dataset of grouped words for the Portuguese language (found in this link). The `performance` function calculates the metrics described above on this dataset.
See results:
##                 .id         UI         OI
## 1              rslp 0.08540752 0.04929234
## 2          hunspell 0.12835530 0.03221083
## 3            porter 0.13958028 0.03221083
## 4 modified-hunspell 0.05466081 0.06295754
This is not the only approach for measuring the performance of those algorithms. The article Assessing the impact of Stemming Accuracy on Information Retrieval – A multilingual perspective describes various ways to analyse stemming performance.