
Finnish stemming and lemmatization in python
There are plenty of options for natural language processing in English. For small languages like Finnish it is a different story. Not all solutions are easy to find.
In this blog I deal with stemming and lemmatization in Finnish language. Examples are written in python 3.6.
Difference between stemming and lemmatization
Transforming a word to a generalized format is helpful in many applications of text analysis. This is because words like cat and cats mean almost the same thing.
Lemmatization can be defined as converting words to their base forms. After the conversion, the different “versions” of a word such as cat, cats, cat’s or cats’ would all be simply cat.
Stemming is the other option to convert words to a general format. Stemming is not exactly the same operation as base form conversion as it goes deeper down to the structure and science of the language. More about stemming from Wikipedia.
Here is a simple example about the difference between lemmatization and stemming.
Original word | Lemmatized word | Stemmed word |
---|---|---|
Study | Study | Study |
Studies | Study | Studi |
More focus is put on lemmatization in this article. This is because Finnish lemmatization libraries were more difficult to find.
Finnish lemmatization with voikko python library
In the GitHub page Voikko describes the use cases for the library:
“Libvoikko provides spell checking, hyphenation, grammar checking and morphological analysis for Finnish language.”
It took some trial and error to find proper installation instructions for python. Instead of using python’s pip package installer, the following line worked for Linux users. For Windows users I recommend installing Ubuntu subsytem for Windows.
sudo apt -y install -y voikko-fi python-libvoikko
After installation the libvoikko library can be imported to python scripts as usual. Here is an example how to lemmatize a single Finnish word to its base form with python.
#Import the Voikko library import libvoikko #Define a Voikko class for Finnish v = libvoikko.Voikko(u"fi") #A word that might or might not be in base form #Finnish word "kissoja" means "cats" in English word = "kissoja" #Analyze the word voikko_dict = v.analyze(word) #Extract the base form as #analyze() function returns various info for the word word_baseform = voikko_dict[0]['BASEFORM'] #Print the base form of the word #This should print "kissa", which is "cat" in English print(word_baseform)
Finnish sentence lemmatization in python
Often you would like to perform the base form conversion for a block of text or for a sentence. To achieve this you should first split the long text to list of words. The you can apply Voikko’s analyze() function for each of them. Word splitting is called word tokenization.
There are different ways of doing tokenization depending on your objective. Sometimes commas, dashes and upper case letters matter, sometimes not.
Python package nltk provides an English module for tokenization which works for Finnish in most cases. But instead, I wrote my own tokenization script to demonstrate base form conversion for multiple sentences.
#Import the Voikko library import libvoikko #Define a Voikko class for Finnish v = libvoikko.Voikko(u"fi") #Some Finnish text txt = "Tähän jotain suomenkielistä tekstiä. Väärinkirjoitettu yhdys-sana, pahus." #Pre-process the text txt = txt.lower().replace(".", "").replace(",", "") #Split to list by space character word_list = txt.split(" ") #Initialize a list for base form words bf_list = [] #Loop all words in the list for w in word_list: #Analyze the word with voikko voikko_dict = v.analyze(w) #Extract the base form, if the word is recognized if voikko_dict: bf_word = voikko_dict[0]['BASEFORM'] #If word is not recognized, add the original word else: bf_word = w #Append to the list bf_list.append(bf_word) #Print results print("Original:") print(word_list) print("Lemmatized:") print(bf_list)
Finnish stemming with python
The nltk package provides stemming for Finnish language here.
And here are some Finnish stemming examples.
#Import nltk Snowball stemmer from nltk.stem.snowball import SnowballStemmer #Create a Finnish instance stemmer = SnowballStemmer("finnish") #Print the stemmed version of some Finnish word print(stemmer.stem("koiriemme"))
As you can see, the nltk stemmer is extremely easy to use. Antoher advanatage is, with very little code you can harness the same script for other languages.
Summary – Lemmatization and stemming in Finnish
This blog offered you simple and concrete examples to lemmatize and stem Finnish words in python. Hopefully this gets you started with your text mining project.
There is no absolute truth whether you should use stemming or lemmatization. One rule of thumb is that stemming captures more semantics than lemmatization. On the other hand lemmatization is easier to understand and generalizes more.
Now harness your creativity and try yourself!