Finnish stemming and lemmatization in python

Finnish stemming and lemmatization in python for text analytics.

March 29, 2019

Mikael Ahonen Data Scientist

There are plenty of options for natural language processing in English. For small languages like Finnish it is a different story. Not all solutions are easy to find.

In this blog I deal with stemming and lemmatization in Finnish language. Examples are written in python 3.6.

Difference between stemming and lemmatization

Transforming a word to a generalized format is helpful in many applications of text analysis. This is because words like cat and cats mean almost the same thing.

Lemmatization can be defined as converting words to their base forms. After the conversion, the different “versions” of a word such as cat, cats, cat’s or cats’ would all be simply cat.

Stemming is the other option to convert words to a general format. Stemming is not exactly the same operation as base form conversion as it goes deeper down to the structure and science of the language. More about stemming from Wikipedia.

Here is a simple example about the difference between lemmatization and stemming.

Original word	Lemmatized word	Stemmed word
Study	Study	Study
Studies	Study	Studi

More focus is put on lemmatization in this article. This is because Finnish lemmatization libraries were more difficult to find.

Finnish lemmatization with voikko python library

In the GitHub page Voikko describes the use cases for the library:

“Libvoikko provides spell checking, hyphenation, grammar checking and morphological analysis for Finnish language.”

It took some trial and error to find proper installation instructions for python. Instead of using python’s pip package installer, the following line worked for Linux users. For Windows users I recommend installing Ubuntu subsytem for Windows.

sudo apt -y install -y voikko-fi python-libvoikko

After installation the libvoikko library can be imported to python scripts as usual. Here is an example how to lemmatize a single Finnish word to its base form with python.

#Import the Voikko library
import libvoikko

#Define a Voikko class for Finnish
v = libvoikko.Voikko(u"fi")

#A word that might or might not be in base form
#Finnish word "kissoja" means "cats" in English
word = "kissoja"

#Analyze the word
voikko_dict = v.analyze(word)

#Extract the base form as
#analyze() function returns various info for the word
word_baseform = voikko_dict[0]['BASEFORM']

#Print the base form of the word
#This should print "kissa", which is "cat" in English
print(word_baseform)

Finnish sentence lemmatization in python

Often you would like to perform the base form conversion for a block of text or for a sentence. To achieve this you should first split the long text to list of words. The you can apply Voikko’s analyze() function for each of them. Word splitting is called word tokenization.

There are different ways of doing tokenization depending on your objective. Sometimes commas, dashes and upper case letters matter, sometimes not.

Python package nltk provides an English module for tokenization which works for Finnish in most cases. But instead, I wrote my own tokenization script to demonstrate base form conversion for multiple sentences.

#Import the Voikko library
import libvoikko

#Define a Voikko class for Finnish
v = libvoikko.Voikko(u"fi")

#Some Finnish text
txt = "Tähän jotain suomenkielistä tekstiä. Väärinkirjoitettu yhdys-sana, pahus."

#Pre-process the text
txt = txt.lower().replace(".", "").replace(",", "")

#Split to list by space character
word_list = txt.split(" ")

#Initialize a list for base form words
bf_list = []

#Loop all words in the list
for w in word_list:
  
  #Analyze the word with voikko
  voikko_dict = v.analyze(w)
  
  #Extract the base form, if the word is recognized
  if voikko_dict:
    bf_word = voikko_dict[0]['BASEFORM']
  #If word is not recognized, add the original word
  else:
    bf_word = w
  
  #Append to the list
  bf_list.append(bf_word)
  
#Print results
print("Original:")
print(word_list)
print("Lemmatized:")
print(bf_list)

Finnish stemming with python

The nltk package provides stemming for Finnish language here.

And here are some Finnish stemming examples.

#Import nltk Snowball stemmer
from nltk.stem.snowball import SnowballStemmer

#Create a Finnish instance
stemmer = SnowballStemmer("finnish")

#Print the stemmed version of some Finnish word
print(stemmer.stem("koiriemme"))

As you can see, the nltk stemmer is extremely easy to use. Antoher advanatage is, with very little code you can harness the same script for other languages.

Summary – Lemmatization and stemming in Finnish

This blog offered you simple and concrete examples to lemmatize and stem Finnish words in python. Hopefully this gets you started with your text mining project.

There is no absolute truth whether you should use stemming or lemmatization. One rule of thumb is that stemming captures more semantics than lemmatization. On the other hand lemmatization is easier to understand and generalizes more.

Now harness your creativity and try yourself!

A Data scientist’s abc to AI ethics, part 1 – About AI and ethics

In this series of posts I’ll try to paint the borderline between AI and ethics from a bit more analytical and technically oriented perspective. My immediate aims are to restrict hype meanings, and to draw some links to related fields.

January 25, 2019

Paavo Toivanen

In the daily life we constantly encounter new types of machine actors: in social media; in the grocery store; when negotiating a loan. Some of them appear amiable and friendly, but almost all are difficult to understand deeply. Their constitution may be cryptic and inaccessible.

It’s of course a subject of interest as in which ways algorithms and machine actors impact our society. Maybe they do not remain value-free or neutral in a larger context.

Philosophical and other types of interest

The Finnish Philosophical Society’s January 2019 colloquium targeted these kinds of questions. Talks concerned AI, humanity, and society at large. Prominent topics included the existence of machine autonomy and ethics. One interesting track concerned the definitions of moral and juridic responsibility. Many weighty concepts like humanity, personhood, and the aesthetics of AI, were discussed too.

From a purely philosophical perspective, technology might be viewed as one particular type of otherness. It is something out of bounds of direct personal interest.

On the other hand the landscape around AI may appear supercharged at the moment. Even the word AI reveals many interest vectors. “Whose agenda does the ethics of AI in each case forward?” Maija-Riitta Ollila asked in her presentation.

No wonder many people with a technical background are a bit wary of the term. Often it would be more appropriate to use a less charged one – some good alternatives include machine learning, statistical analysis, and decision modeling.

Between AI and ethics

Most of the talks in the colloquium shared this very sensible view that AI as a term should be subject to critique. One moral responsibility then for tech people is just shooting down related hype.

But the landscape of AI and ethics is complex and controversial. As if to back this observation, many presenters in the colloquium openly asked the audience to correct them on technical points if they should go wrong.

For instance, cognitive and emotional modeling are named as two quite distinct areas of research within cognitive science and neuroscience. The first holds much more progress than the other, when we compare their achievements. Logic is relatively easier to simulate than emotional attitudes. We may equate this with the innate complexity of human action and information processing that this simulation platform only exemplifies.

Furthermore, as illustrated by many intriguing thought experiments, problems arise when we try to attribute an ethical or moral role to a machine actor. Some of these I’ll try to explicate in later posts.

A bit of a discomfort for me has been the relationship between AI discussion and ethics. Is the talk always morally sound? Sometimes it felt that ethics won’t fit into the world of AI marketing. If I should define ethics with a few words, I would probably state that it is deep thinking about prevalent problems of good and bad.

Some wisdom about AI

We may juxtapose this with a punchline about contemporary AI. “[The] systems are merely optimization machines, and ultimately, their target is optimization of business profit”, one fellow Data scientist wryly commented to me.

So on the surface level, computer science and mathematical problems might not connect to ethics at all. The situation may be alike in sales and marketing. Also in philosophy, formal logic on the one hand and ethics and cultural philosophy on the other are largely separate areas.

What to make of this divide? My next post will examine popular perceptions of AI in the wild.

This is the first of four posts that will handle the topics of AI and ethics from a bit more technical angle.