Finnish stemming and lemmatization in python. See python code examples and try scripts yourself. This tutorial uses python 3.

Finnish stemming and lemmatization in python

Finnish stemming and lemmatization in python for text analytics.

There are plenty of options for natural language processing in English. For small languages like Finnish it is a different story. Not all solutions are easy to find.

In this blog post I cover stemming and lemmatization for the Finnish language. The examples are written in Python 3.6.

Difference between stemming and lemmatization

Transforming a word to a generalized format is helpful in many applications of text analysis. This is because words like cat and cats mean almost the same thing.

Lemmatization can be defined as converting words to their base forms. After the conversion, the different “versions” of a word such as cat, cats, cat’s or cats’ would all be simply cat.

Stemming is the other option for converting words to a general format. It is not the same operation as base form conversion: a stemmer trims word endings with language-specific rules, so the result is not always a real word. More about stemming from Wikipedia.

Here is a simple example about the difference between lemmatization and stemming.

Original word    Lemmatized word    Stemmed word
Study            Study              Study
Studies          Study              Studi
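
The difference in the table can be sketched with a few lines of Python. This is a toy illustration only: `toy_stem` strips two hard-coded English suffixes and `TOY_LEMMAS` is a made-up lookup table. Neither reflects how a real stemmer or lemmatizer is implemented.

```python
# Toy illustration only: the suffix rules and the lookup table are made up
# for this example and are far simpler than any real stemmer or lemmatizer.

TOY_LEMMAS = {"studies": "study", "cats": "cat"}  # hypothetical mini-dictionary


def toy_stem(word):
    """Strip a known suffix by rule; the result need not be a real word."""
    w = word.lower()
    for suffix in ("es", "s"):
        # Keep at least three characters so short words are left alone
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def toy_lemma(word):
    """Look the word up in a dictionary; fall back to the word itself."""
    w = word.lower()
    return TOY_LEMMAS.get(w, w)


for word in ("Study", "Studies"):
    print(word, toy_lemma(word), toy_stem(word))
```

The rule-based function turns "Studies" into "studi", while the dictionary lookup returns "study". That is the same pattern as in the table.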

More focus is put on lemmatization in this article. This is because Finnish lemmatization libraries were more difficult to find.

Finnish lemmatization with voikko python library

In the GitHub page Voikko describes the use cases for the library:

“Libvoikko provides spell checking, hyphenation, grammar checking and morphological analysis for Finnish language.”

It took some trial and error to find proper installation instructions for python. Instead of using python’s pip package installer, the following line worked for Linux users. For Windows users I recommend installing Ubuntu subsystem for Windows.

sudo apt install -y voikko-fi python-libvoikko

 

After installation the libvoikko library can be imported to python scripts as usual. Here is an example of how to lemmatize a single Finnish word to its base form with python.

#Import the Voikko library
import libvoikko

#Define a Voikko class for Finnish
v = libvoikko.Voikko(u"fi")

#A word that might or might not be in base form
#Finnish word "kissoja" means "cats" in English
word = "kissoja"

#Analyze the word
voikko_dict = v.analyze(word)

#analyze() returns a list of analyses (dicts) for the word
#Take the base form from the first analysis
word_baseform = voikko_dict[0]['BASEFORM']

#Print the base form of the word
#This should print "kissa", which is "cat" in English
print(word_baseform)
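
The example above takes the first analysis and assumes the word is recognized. Since analyze() returns a list, it can be empty for an unrecognized word or hold several analyses for an ambiguous one. Below is a minimal sketch of a helper that handles both cases. The name `base_forms` and the stand-in data are my own, and the input is assumed to have the same shape as in the example above: a list of dicts with a 'BASEFORM' key.

```python
def base_forms(analyses, fallback):
    """Return the distinct base forms in order, or [fallback] if none exist.

    `analyses` is assumed to be a list of dicts with a 'BASEFORM' key,
    which is the shape the example above reads from analyze().
    """
    seen = []
    for analysis in analyses:
        base = analysis.get("BASEFORM")
        if base and base not in seen:
            seen.append(base)
    return seen or [fallback]


# Stand-in data so the sketch runs without Voikko installed;
# with Voikko you would pass v.analyze(word) instead.
print(base_forms([{"BASEFORM": "kissa"}], "kissoja"))
print(base_forms([], "asdfgh"))
```

Returning a list keeps the ambiguity visible, so the caller can decide whether to take the first base form or keep all of them.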

 

Finnish sentence lemmatization in python

Often you would like to perform the base form conversion for a block of text or for a sentence. To achieve this you should first split the long text into a list of words. Then you can apply Voikko’s analyze() function to each of them. Word splitting is called word tokenization.

There are different ways of doing tokenization depending on your objective. Sometimes commas, dashes and upper case letters matter, sometimes not.

The Python package nltk provides an English tokenization module which works for Finnish in most cases. Instead, I wrote my own tokenization script to demonstrate base form conversion for multiple sentences.

#Import the Voikko library
import libvoikko

#Define a Voikko class for Finnish
v = libvoikko.Voikko(u"fi")

#Some Finnish text
txt = "Tähän jotain suomenkielistä tekstiä. Väärinkirjoitettu yhdys-sana, pahus."

#Pre-process the text
txt = txt.lower().replace(".", "").replace(",", "")

#Split to list by space character
word_list = txt.split(" ")

#Initialize a list for base form words
bf_list = []

#Loop all words in the list
for w in word_list:
  
  #Analyze the word with voikko
  voikko_dict = v.analyze(w)
  
  #Extract the base form, if the word is recognized
  if voikko_dict:
    bf_word = voikko_dict[0]['BASEFORM']
  #If word is not recognized, add the original word
  else:
    bf_word = w
  
  #Append to the list
  bf_list.append(bf_word)
  
#Print results
print("Original:")
print(word_list)
print("Lemmatized:")
print(bf_list)
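
The replace-and-split preprocessing above only removes periods and commas, and it produces empty strings if the text has repeated spaces. As an alternative, here is a minimal sketch of a regular expression tokenizer that uses only the standard library. The function name `tokenize` and the pattern are my own choices for illustration, not part of Voikko or nltk.

```python
import re

# Letters/digits, optionally joined by inner hyphens ("yhdys-sana" stays whole)
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*")


def tokenize(text):
    """Lowercase the text and return its word tokens without punctuation."""
    return TOKEN_PATTERN.findall(text.lower())


txt = "Tähän jotain suomenkielistä tekstiä. Väärinkirjoitettu yhdys-sana, pahus."
print(tokenize(txt))
```

The resulting list can be fed to the same loop as word_list above. Whether hyphenated words should stay whole depends on your objective, so adjust the pattern accordingly.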

 

Finnish stemming with python

The nltk package provides a stemmer for the Finnish language.

And here are some Finnish stemming examples.

#Import nltk Snowball stemmer
from nltk.stem.snowball import SnowballStemmer

#Create a Finnish instance
stemmer = SnowballStemmer("finnish")

#Print the stemmed version of some Finnish word
print(stemmer.stem("koiriemme"))

As you can see, the nltk stemmer is extremely easy to use. Another advantage is that, with very little code, you can reuse the same script for other languages.

Summary – Lemmatization and stemming in Finnish

This blog offered you simple and concrete examples to lemmatize and stem Finnish words in python. Hopefully this gets you started with your text mining project.

There is no absolute truth whether you should use stemming or lemmatization. One rule of thumb is that stemming is simpler and needs no dictionary, but it can produce stems that are not real words. On the other hand lemmatization returns dictionary words, so its output is easier to understand.

Now harness your creativity and try yourself!

Building machine learning models with AWS SageMaker

A small group of Solita employees visited AWS London office last November and participated in a workshop. There we got to know the AWS service called SageMaker. SageMaker turned out to be easy to learn and use and in this blog post I'm going to tell more about it and demonstrate with short code snippets how it works.

AWS SageMaker

SageMaker is an Amazon service designed for building, training and deploying machine learning models easily. For each step there are tools and functions that make the development process faster. All the work can be done in Jupyter Notebook, which has pre-installed packages and libraries such as TensorFlow and pandas. One can easily access data in their S3 buckets from SageMaker notebooks, too. SageMaker provides multiple example notebooks so that getting started is very easy. I give more information about the different parts of SageMaker in this blog post, and the picture below summarises how they work together with different AWS services.

Picture of how SageMaker interacts with other AWS services during build, train and deploy phase

Dataset

In the example snippets I use the sign language MNIST dataset, which contains labeled pictures of letters of the alphabet in sign language. They are 28×28 grey-scale pictures, which means each pixel is represented as an integer value between 0 and 255. Training data contains 27 455 pictures and test data 7 127 pictures and they’re stored in S3.

For importing and exploring the dataset I simply use the pandas library. Pandas is able to read data directly from an S3 bucket:

import pandas as pd

bucket = ''
file_name = 'data-file.csv'

data_location = 's3://{}/{}'.format(bucket, file_name)

df = pd.read_csv(data_location)

From the dataset I can see that its first column is a label for picture, and the remaining 784 columns are pixels. By reshaping the first row I can get the first image:

from matplotlib import pyplot as plt
pic=df.head(1).values[0][1:].reshape((28,28))

plt.imshow(pic, cmap='gray')

plt.show()

Image with alphabet d in sign language

Build

The build phase in AWS SageMaker means exploring and cleaning the data. Keeping it in csv format would require some changes to the data if we’d like to use SageMaker built-in algorithms. Instead, we’ll convert the data into RecordIO protobuf format, which makes training with the built-in algorithms simpler and more efficient. This can be done with the following code and should be done for both training and test data:

import io

import boto3
from sagemaker.amazon.common import write_numpy_to_dense_tensor

def convert_and_upload(pixs, labels, bucket_name, data_file):
    # Serialize the pixel array and labels to RecordIO protobuf in memory
    buf = io.BytesIO()
    write_numpy_to_dense_tensor(buf, pixs, labels)
    buf.seek(0)

    # Upload the buffer to S3 under the given object key
    boto3.resource('s3').Bucket(bucket_name).Object(data_file).upload_fileobj(buf)

# The first column is the label, the remaining 784 columns are pixels
pixels_train = df.drop('label', axis=1).values
labels_train = df['label'].values

convert_and_upload(pixels_train, labels_train, bucket, 'sign_mnist_train_rec')

Of course, in this case the data is very clean already and usually a lot more work is needed in order to explore and clean it properly before it can be used to train a model. Data can also be uploaded back to S3 after the cleaning phase for example if cleaning and training are kept in separate notebooks. Unfortunately, SageMaker doesn’t provide tools for exploring and cleaning data, but pandas is very useful for that.

Train

Now that the data is cleaned, we can either use SageMaker’s built-in algorithms or use our own, provided for example by sklearn. When using algorithms other than the SageMaker built-in ones you have to provide a Docker container for the training and validation tasks. More information about it can be found in the SageMaker documentation. In this case, as we want to recognise letters from the pictures, we use the k-Nearest Neighbors algorithm, which is a simple and fast algorithm for classification tasks. It is one of the built-in algorithms in SageMaker, and can be used with very few lines of code:

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

# Names such as train_instance_count and s3_input follow version 1 of the SageMaker Python SDK
knn = sagemaker.estimator.Estimator(
    get_image_uri(boto3.Session().region_name, "knn"),
    get_execution_role(),
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    output_path='s3://{}/output'.format(bucket),
    sagemaker_session=sagemaker.Session())

knn.set_hyperparameters(**{
    'k': 10,
    'predictor_type': 'classifier',
    'feature_dim': 784,
    'sample_size': 27455
})

in_config_test = sagemaker.s3_input(
    s3_data='s3://{}/{}'.format(bucket, 'sign_mnist_test_rec'))

in_config_train = sagemaker.s3_input(
    s3_data='s3://{}/{}'.format(bucket, 'sign_mnist_train_rec'))

knn.fit({'train': in_config_train, 'test': in_config_test})

So let’s get into what happens there. Estimator is an interface for creating training tasks in SageMaker. We simply tell it which algorithm we want to use, how many ML instances we want for training, which type of instances they should be and where the trained model should be stored.

Next we define hyperparameters for the algorithm, in this case the k-Nearest Neighbors classifier. Instead of the classifier we could have a regressor for some other type of machine learning task. The four parameters shown in the snippet are mandatory, and the training job will fail without them. By tuning hyperparameters the accuracy of the model can be improved. SageMaker also provides automated hyperparameter tuning, but we won’t be using it in this example.
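
To make the role of k and the classifier predictor type more concrete, here is a minimal plain-Python sketch of the k-Nearest Neighbors idea: find the k training points closest to the query and take a majority vote of their labels. It only illustrates the principle and is not how SageMaker implements the algorithm. The function name and the tiny two-dimensional data are made up.

```python
from collections import Counter
import math


def knn_classify(train_points, train_labels, query, k):
    """Predict a label by majority vote among the k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up 2-D points; the real data has 784 features per picture
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(points, labels, (0.5, 0.5), k=3))
print(knn_classify(points, labels, (5.5, 5.5), k=3))
```

A regressor would average the values of the nearest neighbors instead of voting on their labels.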

Finally we need to define the path to the training data. We do it by using Channels which are just named input sources for training algorithms. In this case as our data is in S3, we use s3_input class. Only the train channel is required, but if a test channel is given, too, the training job also measures the accuracy of the resulting model. In this case I provided both.

For the kNN algorithm the only allowed data formats are RecordIO protobuf and CSV. If we were to use CSV format, we would need to define it in the configuration by setting the named parameter content_type to ‘text/csv;label_size=0’. As we use RecordIO protobuf, only the s3_data parameter is mandatory. There are also optional parameters, for example for shuffling data and for defining whether the whole dataset should be replicated in every instance.

When the fit function is called, SageMaker creates a new training job and logs the training process and duration into the notebook. Past training jobs with their details can be found by selecting ‘Training jobs’ in the SageMaker side panel. There you can find the given training/test data location, information about model accuracy and logs of the training job.

Deploy

The last step on our way to getting predictions from the trained model is to set up an endpoint for it. This means that we automatically set up an endpoint for real-time predictions and deploy trained model for it to use. This will create a new EC2 instance which will take data as an input and provide prediction as a response. The following code is all that is needed for creating an endpoint and deploying the model for it:

import time

def get_predictor(knn_estimator, estimator_name, instance_type, endpoint_name=None): 
    knn_predictor = knn_estimator.deploy(initial_instance_count=1, instance_type=instance_type,
                                        endpoint_name=endpoint_name)
    knn_predictor.content_type = 'text/csv'
    return knn_predictor


instance_type = 'ml.m5.xlarge'
model_name = 'knn_%s'% instance_type
endpoint_name = 'knn-ml-%s'% (str(time.time()).replace('.','-'))
predictor = get_predictor(knn, model_name, instance_type, endpoint_name=endpoint_name)

and it can be called for example in the following way:

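# Open the test file in binary mode and close it automatically after the call
with open("path_to_test_file.csv", "rb") as file:
    response = predictor.predict(file)

print(response)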

which would return the following response:

b'{"predictions": [{"predicted_label": 6.0}, {"predicted_label": 3.0}, {"predicted_label": 21.0}, {"predicted_label": 0.0}, {"predicted_label": 3.0}]}'

In that case we got five predictions, because the input file contains five pictures. In a real life case we could use API Gateway and Lambda functions to provide an interface for real-time predictions. The Lambda function can use the boto3 library to connect to the created endpoint and fetch a prediction. In API Gateway we can set up an API that calls the Lambda function once it gets a POST request and returns the prediction in the response.
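
As a rough sketch of the Lambda side, the function below calls the endpoint through the boto3 sagemaker-runtime client and parses a JSON response of the shape shown above. Several things are assumptions for illustration: the environment variable ENDPOINT_NAME, the helper name `predict_labels`, and that the POST body arrives as CSV text in event["body"].

```python
import json
import os


def predict_labels(runtime_client, endpoint_name, csv_rows):
    """Send CSV rows to the endpoint and return the predicted labels."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=csv_rows,
    )
    payload = json.loads(response["Body"].read())
    return [item["predicted_label"] for item in payload["predictions"]]


def lambda_handler(event, context):
    """Entry point for a Lambda function behind an API Gateway POST route."""
    import boto3  # imported here so the helper above can be tried without AWS

    client = boto3.client("sagemaker-runtime")
    # ENDPOINT_NAME is a hypothetical environment variable for this sketch
    labels = predict_labels(client, os.environ["ENDPOINT_NAME"], event["body"])
    return {"statusCode": 200, "body": json.dumps({"predicted_labels": labels})}
```

Keeping the endpoint call in its own function makes it possible to try the parsing logic locally with a stand-in client before deploying anything.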

Conclusions

AWS SageMaker is a very promising service that allows reading data, training a model and deploying the endpoint with less than a hundred lines of code. It provides many good functions for training but also allows using Docker for custom training jobs. Jupyter Notebook is a familiar tool to data scientists, so it’s very nice that it is used in SageMaker. SageMaker also integrates very easily with other AWS services, and allocating resources for training and endpoints is very easy. The machine learning algorithms are optimised for AWS, so their performance is very high.

The amount of code needed for training a model is not the biggest challenge in a data scientist’s everyday job, though. There are already very good libraries for that purpose, and one of the most time consuming parts is usually cleaning and altering the data so that it can be used for training. For that SageMaker doesn’t provide help.

All in all, optimised algorithms, automated hyperparameter tuning, easy integration and interaction with other AWS services save a lot of time and trouble for data scientists. Trying out SageMaker is definitely worthwhile.

 

 

A data scientist’s abc to AI ethics, part 2 – popular opinions about AI

In this series of posts I’ll try to paint the borderline between AI and ethics from a bit more analytical and technically oriented perspective. Here I start to examine how AI is perceived, and how we may start to analyze ethical agency.

Multiple images

In everyday use, AI can mean almost anything, from 3D apps to evil sci-fi characters. It’s a bit of a burden that it is associated with Terminator, for instance. Or that the words deep learning might receive godlike overtones in marketing materials.

Let’s go on with some AI related examples. On a PowerPoint slide, AI might be viewed as an economic force. For yet another example, we could look at AI regulation.

Say a society wants to regulate corporate action, or set limits to war damage with weapon treaties. Likewise, core AI activities might need legal limits and best practices. Like, how to make automatic decisions fair. My colleague Lassi wrote a nice recap about this also from an AI ethics perspective.

Now in my view, new technology won’t relieve humans from ethics or moral responsibility. Public attention will still be needed. Like Thomas Carlyle suggested, publicity has some corrective potential. It forces institutions to tackle their latent issues and ethical blind spots. Just like public reporting helps to keep corporate and government actions in check.

Then one very interesting phenomenon, at least from an analytical perspective, is people’s attitudes towards machines.

Especially in connection to ethics, it is relevant how we tend to personify things. Even while we consciously view a machine as dumb, we might transfer some ethical and moral agency to it.

A good example is my eight-year-old son, who expected to find a new friend in a Lego Boost robot. Even I harbor a level of hate towards Samsung’s Bixby™ assistant. Mine is a moral feeling too.

These attitudes can be measured to a certain extent, in order to improve some models. This I’ll touch a bit later.

Perceived moral agency

There is a new analytical concept that describes machines and us, us with machines. This concept of perceived moral agency describes how different actors are viewed.

Let’s say we see a bot make a decision. We may view it as beneficial or harmful, as ethical or unethical. We might harbor a simple question whether the bot has morals or not. A researcher may also ask how much morals the bot was perceived to have.

Here we have two levels of viewing the same thing: a question about how much a machine resembles humans, and then a less immediate one about how it is perceived in society.

I think that in the bigger picture we make chains of moral attribution, like in my Bixby case. My moral emotion is conveyed towards Samsung the company, even if my immediate feelings were triggered by Bixby the product. I attribute moral responsibility to a company, seeing a kind of secondary cause for my immediate reactions. The same kind of thing occurs when we say that the government is responsible for air pollution, for instance.

What’s more to the point, these attributive chains apply to human professionals too. An IT manager or a doctor is bound by professional ethics. Their profession in turn is bound by the consensus within that group. If a doctor’s actions are perceived as standard protocol, it is hard to see them as personal ethics or lack of it.

Design and social engineering

Medical decision assistants and other end products are the result of dozens of design choices. And sometimes design choices, if not downright misleading, voluntarily support illusions.

For instance, an emotional reaction from a chat bot. It might create an illusion that the bot “decides” to do something. We may see a bot as willing or not willing to help. This choice may even be real in some sense. A bot was given a few alternative paths of action, and it did something.

Now what is not immediately clear are a bot’s underlying restrictions. We might see a face with human-like emotions. Then we maybe assume human emotional complexity behind the facade.

Chat bots and the like illustrate the idea of social engineering. What it means is that a technical solution is designed to be easy to assimilate. If a machine exploits cultural stereotypes and roles in a smart way, it might get very far with relatively little intelligence.

A classic example is the therapist bot ELIZA from the 1960s. Users would interact via a text prompt, and ELIZA would respond quite promptly to their comments. Maybe it asked its “patient” to tell a bit more about their mother. It didn’t actually understand any sentence meanings, but it was designed to react in a grammatically correct way. As the reports go, some of the users even formed an addictive relationship with it.

The central piece of social engineering was to model ELIZA as a psychotherapist. This role aided ELIZA in directing user attention. It might also have kept them away from sizing up ELIZA and its limitations. To read more about ELIZA, you may start from its Wikipedia page.

Engagement and management

ELIZA of course was quite harmless. For toys, it is even desirable to entice the imagination. A human facade can create positive commitment in the user. This type of thing is called engagement in web marketing.

On the other hand, social engineering is hard work and not always rewarding. An interesting related tweet came from a game scriptwriter.

This scriptwriter wanted players to immerse themselves and have profound emotional experiences in her games. In her day to day work she had noticed a constant toil with her characters. Was this need for detail even bigger than, say, in a novel? Yes, she suggested.

The scriptwriter also analyzed this a bit. She noticed that repetitive out-of-context action is likely to distance a user. What’s more, it is also very likely to occur when prolonged interaction is available.

I’m tempted to think that these are the two sides of engaging a user. The catch and the aftermath.

As far as the modeling and computational perspectives go, another significant theme is the nature of automatic decisions.

The most relevant questions are these. How is the world modeled from the decision making agent’s perspective? What kind of background work does it require? How then about management? What kind of data does the agent consume? How to control data quality?

These will get a bit more detail in my next post. Stay tuned, and thanks for reading!

This is the second post about AI and ethics, in a series of four.


Experiences from FastText in a text classification project.

FastText in a text classification project

In this blog I describe how we did text classification for funding applications with FastText package.

Describing the business need for text mining

Companies applied for funding with this kind of form.

Application form could have been something like this. The form is a simplified example.

 

The documents were classified into several categories by the application handler in the process management software.

The human handler classified each application into categories such as Business development, Agriculture and Digitalization, based on the application texts.

 

The manual process was not only time consuming, but also frustrating. Reporting was the primary reason for the classification.

Text data is often the most sensitive data

We had two primary ways of getting data.

Customer’s software was developed by Solita’s team. This made it easy to access the SQL database of the testing environment. As a result, we had all numerical and structured data in our hands. Numerical data was useful for application risk prediction, but we needed text data for document classification.

The text data was encrypted in the test database. This meant that we needed a way to securely import the plain language text data from the production SQL database.

There is a good reason why the access to text data should not be easy. Text data might contain sensitive information such as personal data or business secrets.

Selecting FastText as our text mining tool

My personal experience from text mining and classification was very thin. After discussions with the team we decided to go with the FastText package. It has been designed for simple text classification by Facebook.

FastText is a fairly easy command line tool for both supervised and unsupervised learning. We used a python package which apparently doesn’t support all original features such as nearest neighbor prediction [link].

For supervised prediction you create individual text files for training and testing data [link]. After files are created, training the neural network behind FastText takes just a few lines of code. We used the supervised method to classify the applications.

Example from FastText supervised tutorial data. FastText training data has labels at the beginning of each line followed by the actual text.
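
The training file format described above can be produced with a few lines of Python. This is a minimal sketch: the helper name `to_fasttext_line`, the label names and the example texts are made up, and `__label__` is FastText's default label prefix.

```python
def to_fasttext_line(labels, text):
    """Format one training example: label tokens first, then the text."""
    label_part = " ".join("__label__" + label.replace(" ", "_") for label in labels)
    # FastText reads one example per line, so newlines in the text are flattened
    clean_text = " ".join(text.split())
    return f"{label_part} {clean_text}"


# Made-up examples; the label names are hypothetical
examples = [
    (["digitalization"], "We build a mobile app\nfor farmers."),
    (["business development", "agriculture"], "New export markets for oats."),
]

for labels, text in examples:
    print(to_fasttext_line(labels, text))
```

Spaces inside a label are replaced with underscores because a label has to be a single whitespace-free token.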

 

For unsupervised analysis you can just dump a bunch of text to a file to create word vectors [link]. Word vectors are useful for finding words similar to each other.

While English nouns have a singular and a plural form, such as dog or dogs, Finnish has koira, koirat, koirani, koiranne, koirienne, koirilatammekohan… There are literally tens of variations for each word. FastText is especially great for languages like Finnish where suffixes at the end of each word vary depending on the context. This is because, in addition to creating features from word counts, FastText can also take into account combinations of words as well as sub-word character sequences.

A model per category using a document as an observation

Each application had multiple text fields and multiple categories to automatically predict. How to approach the complex problem?

In the database there was a row for each combination of application and text field.

 

We decided to bundle all applicable text fields from the applications together. Another option would have been to make predictions for each combination of application and text field, and then select the class with most “votes” from text field predictions.

We combined all answers into a single string and made one prediction per application.
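
The two options can be sketched in plain Python. The rows, field names and labels below are made up. The first function bundles the text fields of each application into one document, which is the approach we chose. The second shows the voting alternative for a list of per-field predictions.

```python
from collections import Counter, defaultdict

# Made-up rows in the shape described above: (application_id, field_name, text)
rows = [
    (1, "summary", "Mobile app for dairy farms"),
    (1, "goals", "Digitalize milk quality reporting"),
    (2, "summary", "Export plan for oat products"),
]


def bundle_fields(rows):
    """Option 1: join all text fields of an application into one document."""
    texts = defaultdict(list)
    for app_id, _field, text in rows:
        texts[app_id].append(text)
    return {app_id: " ".join(parts) for app_id, parts in texts.items()}


def majority_vote(field_predictions):
    """Option 2: pick the label predicted for most of the text fields."""
    return Counter(field_predictions).most_common(1)[0][0]


print(bundle_fields(rows))
print(majority_vote(["digitalization", "agriculture", "digitalization"]))
```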

 

We left out text fields such as team description. Those fields did not bring significant information for the classification.

Trying to understand the labeling principles of FastText made us scratch our heads. The initial idea was to create a single classification model. That model would have included all related labels in a single training row.

In theory this could lead to a situation where all top predictions are from the same category such as Digitalization. As we wanted to get the most probable prediction from each category, we decided to train an individual model per category (Business development, Agriculture and Digitalization).

FastText supervised algorithm accuracy

The labels inside categories were unequally balanced. Some categories had even tens of labels with very few observations.

Example of label count shares for Digitalization category.

 

Class imbalance meant that prediction accuracy reached 50% to 90% for some categories by simply guessing the most frequent label. We took this as our baseline.
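
The baseline is simple to compute: count the labels and divide the count of the most common one by the total. A minimal sketch with a made-up label distribution:

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always guessing it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Made-up label distribution for one category
labels = ["none"] * 7 + ["web shop"] * 2 + ["automation"]
print(majority_baseline(labels))
```

Any model that does not beat this number adds nothing over always choosing the most frequent label.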

Eventually our model-per-category strategy produced accuracies a few percentage points higher than choosing the most common label. This only happened after we decided to return the weakest predictions back to manual processing. The probability of a prediction’s correctness was automatically given by FastText.

In our case it was enough to beat the naive strategy of choosing the most common label.

The prediction ability of FastText increases when applications with low prediction probability are returned to manual classification. This decreases the number of applications getting automated prediction.
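
The trade-off between accuracy and the share of automated applications can be sketched as follows. The predictions are made up, and the function name and tuple layout are my own choices for illustration.

```python
def evaluate_with_threshold(predictions, threshold):
    """Automate only confident predictions; send the rest to manual handling.

    `predictions` holds (predicted_label, probability, true_label) tuples.
    Returns (share_automated, accuracy_of_automated).
    """
    automated = [p for p in predictions if p[1] >= threshold]
    if not automated:
        return 0.0, None
    correct = sum(1 for predicted, _prob, actual in automated if predicted == actual)
    return len(automated) / len(predictions), correct / len(automated)


# Made-up predictions for illustration
predictions = [
    ("a", 0.95, "a"),
    ("a", 0.90, "a"),
    ("b", 0.60, "a"),
    ("a", 0.55, "b"),
]
print(evaluate_with_threshold(predictions, 0.0))
print(evaluate_with_threshold(predictions, 0.8))
```

In this toy data, raising the threshold from 0.0 to 0.8 lifts the accuracy of the automated predictions, but only half of the applications get an automated prediction.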

Summarizing the FastText classification experiment

Apparently the application handlers don’t pay too much attention to which label they choose. This made us question the whole process. What is the value of reports that are based on application handler’s hunch? And if the labeling criteria are not uniform, how could a machine find any patterns?

Let’s say there are 2000 annual applications. One of the labels gets selected 30 times per year. Binomial probability calculation reveals that the 95% confidence interval for 30 labels is actually from about 20 to 40. A decision maker might think that a series of 20, 30 and 40 during a three year range indicates an ascending trend for the label. But in reality, it’s just a matter of random variation. In one of the categories 15 out of 20 labels had this few or fewer observations.
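
The interval can be checked with a short calculation. The sketch below uses the normal approximation of the binomial distribution, which is my choice for brevity and not necessarily the exact method used in the project. With 2000 applications and 30 expected selections it gives a range of roughly 19 to 41, in line with the rounded figures above.

```python
import math


def binomial_count_interval(n, expected_count, z=1.96):
    """Approximate 95% interval for a label count using the normal approximation."""
    p = expected_count / n
    sd = math.sqrt(n * p * (1 - p))
    return expected_count - z * sd, expected_count + z * sd


low, high = binomial_count_interval(2000, 30)
print(round(low, 1), round(high, 1))
```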

FastText favored more common labels as it increased the overall accuracy. This came with the cost that some labels never got predictions.

When the solution has run in production for a while, it is time to see if the handlers ever make the effort to correct the machine’s initial recommendation. If not, some labels will never end up in the reports.

There is an endless number of solutions to automate such document classification. In our project the fast testing cycle to try different approaches was the key. The goal was not to make it perfect, but to improve the existing situation.

Whatever the prediction accuracy will be, this kind of text mining experiment provides valuable information for the organization.

A Machine Learning Example For Business

This article gives you a practical example of predictive machine learning. This post is mainly targeted at business people who want to harness machine learning for business but also understand how it actually works. The aim is to demonstrate three things: tell a concrete example without going too deep into theory, show the difference between machine learning and reporting, and demonstrate the benefits of machine learning to business.

Introduction

According to one definition, machine learning means that a machine can learn without having been explicitly programmed for the task. This doesn’t mean that the program code would evolve by itself, but that different results are produced in different situations depending on the data and the set of rules that are applied.

For those who are interested in technical part, the R code and data is available at the end of the post.

Machine learning use cases

Let’s quickly see some machine learning use cases.

#1 Predicting by variables

Problem. How to predict the number of visitors in an event?

Simple solution. Calculate the average from past events.

Machine learning solution. Take advantage of known event related variables when predicting number of visitors. These variables might be weather, day of week and ticket price. This makes the prediction more accurate than overall average.

Predicting by variables

#2 Clustering

Problem. How to identify customer segments?

Simple solution. Make an educated guess by using fixed values.

Machine learning solution. The algorithm usually needs some initial parameters such as the number of clusters to identify. Segmentation process could be automated and standardized. The segmentation could be run once a day, week or month without any manual work. Another advantage is that there’s no need to set hard limits such as If customer’s annual income is more than 20 000€ and age more than 50, she’s in segment A. The algorithm will set these limits for you.

Machine learning clustering example

#3 Image recognition and classification

Problem. How to recognize objects, emotions or text from an image?

Simple solution. Do it manually.

Machine learning solution. The algorithm has to be “trained” with already known cases to recognize which kinds of pixel and color combinations mean a certain object. Image recognition could be used to identify data tables from hand written documents to transform the data to digital format. You could also detect other kinds of objects such as a human face and take the analysis further by recognizing the person’s emotions.

Image recognition example

A business problem: Online store deliveries

In the more detailed example, I will go through a case similar to #1, predicting by variables, as it is a very common scenario in business analytics. Also many of the concepts can be used even in image recognition once the image pixels have been converted to tabular data.

Let’s imagine that we are running an online shop that orders nutrition supplements from a wholesaler and sells the products to consumers.

Machine learning on supply chain management

The online store wants to maximize the share of wholesale orders that arrive on time. This will increase predictability and end user satisfaction, which will lead to bigger profits.

Sample data

In the sample data a record corresponds to an order placed with a wholesaler in the past. The first column is the week when the order was made (week). Then there are the day of week (order_day), order batch size (order_size), sub contractor delivering the order (sub_contractor) and the info whether the delivery was late or not (is_late).

Sample data for machine learning

Usually collecting and cleaning the data is by far the most time consuming part. That has already been done in this example, so we can move to analysis. For example, maybe in the original data the size of the order was the number of items, but for this case the variable has been expressed as S (small), M (medium) or L (large).

There are dozens of predictive machine learning algorithms that work in different ways, but a typical situation is to create a table like this, where the last column is the variable that you want to predict. For historical data the predicted variable is known already. All or part of the other variables are predictor variables.

Result variable in sample data

Predicting from reports is not systematic

In traditional reporting you could analyze the rate of late-arriving orders from the 20 rows of data introduced above by doing basic data aggregation and seeing what percentage of orders have arrived late on average for each variable combination.

Summary report

Using reports to forecast the future is problematic, however. If there’s only one record for a combination such as Mon, L and Express in row 1, the probability is always 0 or 1, which can’t be true. The summary report also does not handle individual variables (columns) as independent entities, and that’s why there is no way of evaluating the probability for previously non-existing combinations such as Tue, M and Go.
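The limitation can be seen by rebuilding the summary report from the article's 20 sample rows. This is a small Python sketch rather than the article's own R code, and the variable names are chosen for illustration.

```python
from collections import defaultdict

# Sample data from the article: week,order_day,order_size,sub_contractor,is_late
RAW = """
1,Mon,S,Go,Yes 1,Mon,L,Express,No 1,Tue,L,Go,Yes 1,Tue,L,Express,Yes
2,Mon,L,Go,No 2,Mon,M,Express,No 2,Tue,S,Go,Yes 2,Tue,L,Express,No
3,Mon,M,Go,No 3,Mon,S,Express,No 3,Tue,L,Go,Yes 3,Tue,M,Express,Yes
4,Mon,S,Go,Yes 4,Mon,S,Express,No 4,Tue,S,Go,Yes 4,Tue,M,Express,No
5,Mon,M,Go,No 5,Mon,M,Express,No 5,Tue,L,Go,Yes 5,Tue,L,Express,Yes
"""
rows = [r.split(",") for r in RAW.split()]

# Group the outcomes (1 = late, 0 = on time) by variable combination.
groups = defaultdict(list)
for week, day, size, sub, late in rows:
    groups[(day, size, sub)].append(1 if late == "Yes" else 0)

# Late rate per combination, as in a summary report.
report = {combo: sum(v) / len(v) for combo, v in groups.items()}
for combo, rate in sorted(report.items()):
    print(combo, f"n={len(groups[combo])}", f"late rate={rate:.2f}")

# Combinations seen only once can only show a rate of 0 or 1.
singles = [c for c, v in groups.items() if len(v) == 1]
print("seen once:", singles)

# A combination that never occurred has no rate at all.
print(("Tue", "M", "Go") in report)
```

The report has nothing to say about Tue, M and Go, because a plain aggregation can only describe combinations that already exist in the data.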

And the most important part: There’s no way of evaluating how these insights fit into new wholesale orders.

Concept of training and testing in machine learning

The secret of predictive machine learning is to split the process in two phases:

  1. Training the model
  2. Testing the model

The purpose is that after these two stages you should know with confidence whether your model works for the particular case or not.

For testing and training, the data has to be split in two parts: training data and testing data. Let’s use a rule of thumb and take 70% of rows for training and 30% for testing. Note that all the data is historical and thus the actual outcomes for all records are known.

First 14 rows (70%) are for model training.

Training data

In the training phase the selected algorithm produces a model. Depending on the selected method, the model could be for example:

  • An algorithm.
  • A formula such as 2x+y+3z-1.
  • A lookup table.
  • A decision tree.
  • Any other kind of data structure.
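Two of the forms listed above can be sketched in a few lines of Python. The formula is the expression from the list. The lookup table entries are hypothetical and only show the shape such a model could take.

```python
# A model as a formula (the expression from the list above).
def formula_model(x, y, z):
    return 2 * x + y + 3 * z - 1

# A model as a lookup table; the entries are hypothetical.
lookup_model = {("Mon", "Express"): "No", ("Tue", "Go"): "Yes"}

print(formula_model(1, 2, 3))
print(lookup_model[("Tue", "Go")])
```

In both cases the model is simply something that turns predictor values into a prediction.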

The important thing is that the produced model should be something that you can use to make predictions in the testing phase. This is a useful position to be in: you can make a prediction for each record in the testing data with the model and also see what the actual outcome for that observation was.

The trained model will be tested with the last 6 rows (30%).

Testing data
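The 70/30 split can be sketched in Python as follows. The sketch assumes the rows are ordered by week, so that the training orders were placed before the testing orders. The variable names are illustrative.

```python
# Sample data from the article: week,order_day,order_size,sub_contractor,is_late
RAW = """
1,Mon,S,Go,Yes 1,Mon,L,Express,No 1,Tue,L,Go,Yes 1,Tue,L,Express,Yes
2,Mon,L,Go,No 2,Mon,M,Express,No 2,Tue,S,Go,Yes 2,Tue,L,Express,No
3,Mon,M,Go,No 3,Mon,S,Express,No 3,Tue,L,Go,Yes 3,Tue,M,Express,Yes
4,Mon,S,Go,Yes 4,Mon,S,Express,No 4,Tue,S,Go,Yes 4,Tue,M,Express,No
5,Mon,M,Go,No 5,Mon,M,Express,No 5,Tue,L,Go,Yes 5,Tue,L,Express,Yes
"""
rows = [r.split(",") for r in RAW.split()]

# Order by week so the model is trained on orders made before the tested ones.
# sorted() is stable, so rows within the same week keep their original order.
rows = sorted(rows, key=lambda r: int(r[0]))

split = len(rows) * 70 // 100      # 70% of 20 rows = 14
train, test = rows[:split], rows[split:]
print(len(train), len(test))
```

Integer arithmetic is used for the split point so that the 70% boundary is not affected by floating point rounding.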

Selecting the machine learning algorithm

Normally you would pre-select multiple algorithms and train a model with each of them. The best performing one could be chosen after the testing phase. For simplicity, I selected just one for this example.

The algorithm is called the naive Bayesian classifier. I will break the complicated name down into its components.

Naive. The algorithm is naive or “stupid” as it doesn’t take into account interactions between variables such as order_day and sub_contractor.

Bayesian. A field of statistics named after Thomas Bayes. The idea is that the probability of an outcome can be estimated from historical observations.

Classifier. The predicted variable is_late is categorical, not a number.

The naive Bayesian classifier algorithm will be used to predict is_late when the predictor variables are known beforehand: order_day, order_size and sub_contractor. In its basic form naive Bayes works with categorical variables, which is why it suits our needs well.

The naive Bayes classifier is quite a simple algorithm and gives the same result every time when using the same data. It is possible to explain the logic behind each prediction. In some cases, this is very valuable.

The week column won’t be used for prediction and its purpose will be explained at the end of the next chapter.

Training and testing the model

Without going too deep into the details, you can run something like this in a statistical programming language such as R:

model <- naiveBayes(is_late~order_day+order_size+sub_contractor, data=data.train)

And this kind of data will be stored in the computer’s memory:

Naive bayes model output in R

The point is not to learn inside out what these numbers mean, but to understand that this is the model and that the software knows how to apply it to make predictions for new orders:

predictions <- predict(model, newdata=data.test)

By doing some copy and paste operations, the predictions can be appended to testing data:

Naive bayes predictions

If the value in the Yes column is more than 0.5, the order is predicted to arrive late.
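To show what happens inside such a model, here is a from-scratch Python sketch of a categorical naive Bayes classifier. It is not the R package's implementation. It assumes no smoothing of the counts, and it uses exact fractions so that a 50/50 case stays exactly 50/50. The function names fit and predict_proba are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Sample data from the article: week,order_day,order_size,sub_contractor,is_late
RAW = """
1,Mon,S,Go,Yes 1,Mon,L,Express,No 1,Tue,L,Go,Yes 1,Tue,L,Express,Yes
2,Mon,L,Go,No 2,Mon,M,Express,No 2,Tue,S,Go,Yes 2,Tue,L,Express,No
3,Mon,M,Go,No 3,Mon,S,Express,No 3,Tue,L,Go,Yes 3,Tue,M,Express,Yes
4,Mon,S,Go,Yes 4,Mon,S,Express,No 4,Tue,S,Go,Yes 4,Tue,M,Express,No
5,Mon,M,Go,No 5,Mon,M,Express,No 5,Tue,L,Go,Yes 5,Tue,L,Express,Yes
"""
rows = [tuple(r.split(",")[1:]) for r in RAW.split()]  # drop the week column
train, test = rows[:14], rows[14:]

def fit(records):
    # Count each class and each (column, value, class) combination.
    class_counts = Counter(r[-1] for r in records)
    value_counts = Counter((i, v, r[-1]) for r in records
                           for i, v in enumerate(r[:-1]))
    return class_counts, value_counts

def predict_proba(model, features):
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        # Prior P(class) times P(value | class) for every predictor,
        # treating the predictors as independent (the "naive" part).
        score = Fraction(n, total)
        for i, v in enumerate(features):
            score *= Fraction(value_counts[(i, v, label)], n)
        scores[label] = score
    norm = sum(scores.values())  # assumes at least one class has a non-zero score
    return {label: s / norm for label, s in scores.items()}

model = fit(train)
for *features, actual in test:
    p_yes = predict_proba(model, features)["Yes"]
    prediction = "Yes" if p_yes > Fraction(1, 2) else "No"
    print(features, f"P(Yes)={float(p_yes):.3f}", prediction, actual)
```

The whole model is a set of counts, which is why each prediction can be traced back to the training rows behind it.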

In our case the order of records matters: The model should be trained with observations that have occurred before the observations in test data. The number of week (`week`) is a helper column to order the data for train and test splitting.

Evaluating the goodness of the model

A good way to evaluate the performance of the model in this example is to calculate how many predictions we got correct. The actual value can be seen from the is_late column and the prediction from the Prediction column.

Because I had the power to create the data for this example, all 6 predictions were correct. For the second record the probability for both being late and not being late was 50%. If this had been interpreted as Yes, the prediction would have been incorrect, and the accuracy would have fallen to 83.3% or 5/6.

Naive bayes prediction 50-50

Evaluating the business benefit

So far we have:

  1. Identified a business problem.
  2. Combined and cleaned possibly relevant data.
  3. Selected a suitable algorithm.
  4. Trained a model.
  5. Tested the model with historical data.
  6. Evaluated the prediction accuracy.

Now it’s time to evaluate the business benefit.

We are confident that our model is between 83.3% and 100% accurate for unseen observations, where only the order day, order size and sub contractor are known before placing the order. Let’s trust the midpoint and say that the accuracy is 91.7%.
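The accuracy range and its midpoint can be checked with a few lines of Python. The P(Yes) values below were worked out by hand from the sample data with an unsmoothed naive Bayes calculation. They are not copied from the article's figures, and only the 50/50 value for the second record is stated in the text.

```python
from fractions import Fraction

# (P(Yes), actual is_late) for the six test orders; hand-calculated values.
results = [
    (Fraction(75, 79), "Yes"),
    (Fraction(1, 2), "No"),
    (Fraction(5, 17), "No"),
    (Fraction(1, 16), "No"),
    (Fraction(25, 27), "Yes"),
    (Fraction(2, 3), "Yes"),
]

def accuracy(results, tie_label):
    correct = 0
    for p_yes, actual in results:
        if p_yes == Fraction(1, 2):
            predicted = tie_label          # exact 50/50 case
        else:
            predicted = "Yes" if p_yes > Fraction(1, 2) else "No"
        correct += predicted == actual
    return Fraction(correct, len(results))

best = accuracy(results, tie_label="No")    # tie read as "not late"
worst = accuracy(results, tie_label="Yes")  # tie read as "late"
midpoint = (best + worst) / 2
print(float(best), round(float(worst), 3), round(float(midpoint), 3))
```

With only six test records, one prediction moves the accuracy by almost 17 percentage points, so these figures are rough.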

You can count that in the test data 3/6 or 50% of orders arrived late. Had you used your trained machine learning model, you could have placed only those 3 or 4 orders that were predicted to arrive on time, plus a couple of additional orders that are not described in the test data.

In the real world the situation would be more complex, but you get the idea that accuracy of 91.7% is more than 50%. With these numbers you are probably also able to estimate the impact in terms of money.

Implementing the model to daily work

Here are some ways to harness the solution for business.

Data study. Improve the company’s process by experimenting to find the optimal order types.

Calculator. Let employees enter the three predictor variables into a web calculator before placing an order to evaluate the risk of receiving products late.

Automated orders. Embed the calculation as part of the company’s IT infrastructure. The system can make the optimal orders automatically.

As time goes by, more historical orders can be used in training and testing. More data makes the model more reliable.

The model can be retrained once a day with the order data from the past year, for example. Frequent training and limiting the data to only the latest observations don’t make the machine literally learn; rather, they adjust the model to give the best guess for the given situation.

Summary

A predictive machine learning model offers a way to see the future with the expectation that the environment remains somewhat unchanged. The business decisions don’t need to be based on hunches and prediction methods can be evaluated systematically.

A thoughtfully chosen machine learning model can make good choices in routine tasks, often with greater accuracy than humans. Machine learning also makes it possible to estimate the monetary impact of decisions even before the actions have been taken.

Machine learning won’t replace traditional reporting, nor is it in conflict with it. It is still valuable to know key performance indicators from the past month or year.

Even though there are endless ways to do machine learning, predicting a single variable is a pretty safe place to start.

Data

week,order_day,order_size,sub_contractor,is_late
1,Mon,S,Go,Yes
1,Mon,L,Express,No
1,Tue,L,Go,Yes
1,Tue,L,Express,Yes
2,Mon,L,Go,No
2,Mon,M,Express,No
2,Tue,S,Go,Yes
2,Tue,L,Express,No
3,Mon,M,Go,No
3,Mon,S,Express,No
3,Tue,L,Go,Yes
3,Tue,M,Express,Yes
4,Mon,S,Go,Yes
4,Mon,S,Express,No
4,Tue,S,Go,Yes
4,Tue,M,Express,No
5,Mon,M,Go,No
5,Mon,M,Express,No
5,Tue,L,Go,Yes
5,Tue,L,Express,Yes

R Code

library(e1071)
library(plyr)
library(dplyr)

f.path <- 'C:/Users/user/folder/data.csv'
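#Read text columns as factors so that train and test data share the same levels
df <- read.csv(f.path, sep=",", stringsAsFactors=TRUE)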
df

#Make a summary table
df.sum <- df
df.sum$is_late <- ifelse(df.sum$is_late=="Yes",1,0)
agg <- aggregate(df.sum[,'is_late'], list(df.sum$order_day, df.sum$order_size, df.sum$sub_contractor), mean)
names(agg) <- c("order_day","order_size","sub_contractor","is_late_probability")
agg$is_late_probability <- round(agg$is_late_probability,2)
agg

#Split to train and test data
data.train <- df[1:14, ]
data.train
data.test <- df[15:20, ]
data.test

#Count outcomes by each predictor variable
table(data.train[1:14, 'order_day'], data.train[1:14, 'is_late'])
table(data.train[1:14, 'order_size'], data.train[1:14, 'is_late'])
table(data.train[1:14, 'sub_contractor'], data.train[1:14, 'is_late'])

#Fit naive bayes model
fit <- naiveBayes(is_late~order_day+order_size+sub_contractor, data=data.train)
fit

#Predict probabilities
probs <- predict(fit, data.test, type = "raw")
probs <- data.frame(probs)
probs

#Get more probable option
Prediction <- ifelse(probs$Yes>0.5,"Yes","No")

#Paste predicted probabilities to original data
data.test.new <- cbind(data.test, probs, Prediction)

#Prediction rate
sum(data.test.new$Prediction == data.test.new$is_late) / nrow(data.test.new)

Why are deep learning models so popular?

Deep learning (i.e. big neural networks) plays a central role in the ongoing boom of artificial intelligence and data science. Last year, a partly neural network based AI beat a human grandmaster for the first time in Go, a complex board game. Judging by the hype, it feels like deep neural networks can be found in every other state-of-the-art AI solution.

In practice they have their downsides, but although they are not the be-all-end-all of machine learning algorithms, neural networks are versatile and useful. In our client projects, we have leveraged deep learning in image recognition and multivariate time series forecasting tasks, for example. What qualities make neural networks efficient?

On a high level, a neural network (and most other supervised machine learning algorithms) can be seen as a device that takes in numerical inputs and spits out numerical outputs.

When the model is built, the general structure of what is inside the device, as well as the structure of its inputs and outputs, is specified. At this point, the model already “works” in the sense that it can take inputs to produce outputs, but the results are random.

Then, the model is trained by continuously feeding it with actual data, that is, correct answers to the problem at hand. During this training process the parameters of the model (the bolts and cogs inside the device) are adjusted in very small increments. In time, the algorithm converges and learns the relationships in the training data. In the end, the model learns how to map input data to outputs using logic similar to that underlying the training data it was fed. After training, the model can be used to make predictions based on new input data, something that the model has not seen before.
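The adjustment "in very small increments" can be illustrated with the smallest possible model, a single parameter w in the prediction w * x. This is a sketch of plain gradient descent on made-up data where the correct answer is always 2 * x. It starts from zero instead of a random value so that the run is repeatable.

```python
# Training data: inputs with their correct answers (here y = 2 * x).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0                 # the single parameter, before training
learning_rate = 0.01    # keeps every adjustment small

for step in range(1000):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad   # nudge w in the direction that reduces error

print(round(w, 3))   # w ends up very close to 2
```

A real neural network repeats the same kind of nudge for every one of its parameters at once.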

What goes in, what goes out?

The inputs to the device could be virtually anything that can be represented as arrays of numbers: images, time series data, videos, free text articles after being transformed into numerical representations, you name it. The outputs can also take various shapes.

The output could be a single number, say a weather forecast on a given hour. Or it could be an array of several figures, like the pixel coordinates of an identified suspected cancer in an x-ray image received as input.

The end result does not even have to be numeric, even though a neural network only crunches numbers. For example, the network can produce an array of likelihood estimates that are converted into a categorical classification in the end.

Given a picture, the model could say it is a cat with 80% certainty, a dog with 15% certainty or a car with 5% certainty. Although almost anything can be represented in numerical format in some manner, deciding how the numerical representations of the inputs and outputs are actually done is usually not easy. This preprocessing step is an important part of the data science workflow.
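One common way to turn raw network outputs into likelihood estimates and then into a class label is the softmax function followed by picking the largest value. The article does not name the method, so softmax is an assumption here, and the raw output numbers are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability, then normalise to sum 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["cat", "dog", "car"]
raw_outputs = [2.0, 0.3, -0.8]   # hypothetical raw network outputs
probs = softmax(raw_outputs)

# The predicted class is the label with the highest probability.
best = max(range(len(labels)), key=lambda i: probs[i])
print(labels[best], [round(p, 2) for p in probs])
```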

Anatomy of a neural network

In the case of neural networks, what is inside the device is a large amount of interconnected simple processing units called neurons. A neuron takes a number, squeezes it through some non-linear function and then outputs the result. In a deep neural network, the neurons are organized into layers that succeed each other. The input signal is first sent to the first layer of neurons, which send their outputs to all the neurons in the next layer. This process continues layer after layer until the output layer is reached. The construct is inspired by biological neurons, the main components of the central nervous system.

The connections between neurons are also given so called weights, which are basically little valves that determine how much of each input the unit propagates up the network. Adjusting these weights is what actually happens during the training process and what allows neural networks to fit specific problems. A deep neural network could contain millions of weights, so the good news is that adjusting the weights can be automated efficiently (using backpropagation and gradient descent methods).
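The flow of a signal through layers of neurons can be sketched in plain Python. The layer sizes, weights and biases below are hypothetical, untrained values, and tanh stands in for "some non-linear function".

```python
import math

def layer(inputs, weights, biases):
    # Each neuron sums its weighted inputs, adds a bias and applies a
    # non-linear function (tanh here).
    return [math.tanh(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]

# Hypothetical weights: 2 inputs -> 3 hidden neurons -> 1 output neuron.
hidden_w = [[0.5, -0.2], [0.1, 0.8], [-0.7, 0.3]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[1.0, -1.0, 0.5]]
output_b = [0.2]

def forward(inputs):
    hidden = layer(inputs, hidden_w, hidden_b)   # first layer
    return layer(hidden, output_w, output_b)     # output layer

print(forward([1.0, 2.0]))
```

Training would change the numbers in hidden_w, hidden_b, output_w and output_b, while the structure of the network stays the same.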

Advantages of neural networks

The idea of training a machine to transform numerical representations of inputs to outputs applies to most machine learning models, so what makes neural networks special? Three reasons come to mind.

First, the structure of a neural network is specified only very broadly before the model is trained, which gives a lot of room for the model to adjust during training.

In statistical terms, large neural networks can be thought of as being somewhere in between parametric and nonparametric models. In a parametric model, for instance a traditional regression, the number of parameters in the model is strictly determined before fitting the model. In a nonparametric model, the model structure is determined more broadly, and the training process can adjust the number of parameters as well as their values. Thus, in a nonparametric model, there is more freedom for the model structure to adjust to the problem being solved. In neural networks, the number of parameters (weights) is strictly determined beforehand, which would imply that they are parametric models. However, the number of weights can be enormous and the training process could allow many of the weights to zero out, effectively blocking certain paths through the network. For these reasons, deep and broad neural networks resemble nonparametric models in practice. The nonparametric nature gives neural networks structural freedom to adapt to many kinds of problems.

Second, since neural networks consist of chained little functions that perform nonlinear transformations at each step, they are inherently nonlinear models.

This allows them to model many problems better since many real-world relationships are nonlinear. (For example, the area of a square shaped field increases quadratically instead of linearly as its width increases.)

Third, the vanilla version of a neural network that has been discussed so far can be adjusted to make it a better fit to certain problems.

Convolutional neural networks, for example, are good at making broad abstractions from very detailed inputs and work especially well for image recognition problems. In recurrent neural networks, the neuron layers have feedback loops, which essentially means that the network is able to remember previous inputs. This trait makes them a good fit for time series forecasting and natural language processing.
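The "memory" of a recurrent network can be shown with a single recurrent neuron. This is a minimal sketch with hypothetical, untrained weights, not a full recurrent network.

```python
import math

def run_rnn(sequence, w_in=0.8, w_state=0.5):
    # The weights are hypothetical, untrained values.
    state = 0.0
    states = []
    for x in sequence:
        # The new state depends on the current input AND the previous state.
        state = math.tanh(w_in * x + w_state * state)
        states.append(state)
    return states

# Same final input, different history: the final states differ.
print(run_rnn([0.0, 0.0, 1.0])[-1])
print(run_rnn([1.0, 1.0, 1.0])[-1])
```

Both sequences end with the same input, but the outputs differ because the state carries information about the earlier inputs.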

Advanced deep learning models – the ones that are used in solutions that are able to beat humans in complex games or drive vehicles – combine these basic architectures. For instance, the model could begin with convolutional layers that are good at abstracting information. This network could be followed by a recurrent network that has a memory and the ability to learn sequential and spatial relationships. Finally, a regular fully connected layer could produce the final output.

The downsides

If deep learning is so powerful, then why don’t we dump all other machine learning algorithms in its favor? One of the biggest caveats of neural networks is the fact that they are black boxes: it is usually impossible to intuitively explain why a neural network has given a certain prediction. Sometimes, the intermediate outputs that the neurons produce can be analyzed and explained, but many times this is not the case. Other algorithms, such as traditional linear models or tree-based models like random forest, can usually be analyzed and explained better.

Another downside is that neural networks need large amounts of training data and can take a long time to learn. This was a huge problem in the past but has been mitigated somewhat by technological advancement. One of the reasons for the surge in deep learning’s popularity can be attributed to improvements in GPU computational power and the advancements of cloud computing. Today, it is possible to train complex deep learning models in a matter of hours, not weeks.

How to get started?

Even though they are technically formidable when you dive into details, neural networks are not that hard to experiment with. A good way to get your hands dirty, if you know Python or R, is to first google and follow through hands-on coding tutorials, like this one. It is also a good idea to register at www.kaggle.com and build a neural network solution to one of the simpler problems, say, “Titanic: Machine Learning from Disaster”.

Once familiar with the basics, try something a little more advanced by following along these inspiring blog posts, for example: