How to choose your next machine learning project

Three steps to be intentionally agnostic about tools: reduce technical debt, increase stakeholder trust and make the objective clear. Build a machine learning system because it adds value, not because it is a hammer in search of nails.

As data enthusiasts, we love to talk, read and hear about machine learning. It certainly delivers value to some businesses. However, it is worth taking a step back: do we treat machine learning as a hammer in search of nails? Maybe a simple heuristic does the job with substantially lower technical debt than a machine learning system.

Do machine learning like the great engineer you are, not like the great machine learning expert you aren’t.

Google developers. Rules of ML.

In this article, I present a structured approach for choosing the next data science project so that it aligns with business goals. It combines objectives and key results (OKRs), value-feasibility estimates and other suggestions to stay focused. It is especially useful for data science leads, business intelligence leads and data consultants.

Why data science projects require a structured approach

ML solves complex problems with data that has a predictive signal for the problem at hand. It does not create value by itself.

So, we love to talk about machine learning (ML) and artificial intelligence (AI). On the one hand, decision makers get excited and make it a goal: “We need to have AI & ML”. On the other hand, the same goes for data scientists who claim: “We need to use a state-of-the-art method”. Being excited about technology has its upsides, but it is worth taking a step back for two reasons.

  1. Choosing a complex solution without defining a goal creates more issues than it solves. Keep it simple and minimize technical debt. Make it easy for a future person to maintain the system, because that person might be you.
  2. A method without a clear goal fails to create business value and erodes trust. Beyond the hype around machine learning, we do data science to create business value. Ignoring this gives executives a reason to cut funding for the next data project.

This is nothing new, but it does not hurt to be reminded of it. When I read about an exciting method, I want to learn and apply it right away. What is great for personal development might not be great for the business. Instead, start with the what before thinking about the how.

In the next section, I give some practical advice on how to structure the journey towards your next data project. The approach helps me focus on what the business needs to solve next instead of which ML method is in the news.

How to choose the next data science project

“Rule #1: Don’t be afraid to launch a product without machine learning.”

Google developers. Rules of ML.

Imagine you draft the next data science cases at your company. Which project should you choose next? Here are three steps to structure the journey.


Step 1: Write data science project cards

The data science project card helps you focus on business value and lets you stay intentionally agnostic about methodologies in the early stages.

Summarize each idea in a data science project card that covers five parts: an objective (what), its key results (how), ideal and available data (needs), a value-feasibility diagram (impact) and possible extensions. These five parts contain all you need to structure project ideas. What works for me is to imagine the end product that solves a business need before I put it into a project card.

Find the project card templates as Markdown or PowerPoint slides.

I summarize the data science project in five parts.

  1. An objective addresses a specific problem that links to a strategic goal/mission/vision, for example: “Enable data-driven marketing to get ahead of competitors”, “Automate fraud detection for affiliate programs to let marketing focus on core tasks” or “Build an automated monthly demand forecast to safeguard company expansion”.
  2. Key results list measurable outcomes that mark progress towards achieving the objective, for example: “80% of the marketing team uses a dashboard daily”, “Cover 75% of affiliate fraud compared to the previous 3-month average” or “Cut ‘out-of-stock’ warnings by 50% compared to the previous year’s average”.
  3. Data describes properties of the ideal or available dataset, for example: “Transaction-level data of the last 2 years with details, such as timestamp, ip and user agent” or “Product-level sales including metadata, such as location, store details, receipt id or customer id”.
  4. Extensions explore follow-up projects, for example: “Apply demand forecast to other product categories” or “Take insights from basket analysis to inform procurement.”
  5. The value-feasibility diagram puts the project into a business perspective by visualizing its value, its feasibility and the uncertainties around them. The smaller the area, the more certain the project’s value or feasibility.

To provide details, I describe a practical example of how I use these parts to explore data science projects. The journey starts by meeting the marketing team to hear about their work, needs and challenges. If a need can be addressed with data, they become the end users and project target group. Already at this point, I try to sketch the outcome and ask the team how valuable it would be, which gives a first estimate of the value.

Next, I take the company’s strategic goals and formulate an objective that links to them following OKR principles. This aligns the project with mid-term business goals, makes it part of the strategy and increases buy-in from top-level managers. Then I get back to the marketing team to define key results that let us reach the objective.

I then draft an ideal dataset and compare it, together with the data owners or the marketing team itself, to what is actually available. That helps to get a sense of feasibility. If I am uncertain about value or feasibility, I increase the area in the diagram. It is less about being precise and more about being able to compare projects with each other.

Step 2: Sort projects along value and feasibility

Value-feasibility helps to prioritize projects, takes a business perspective and increases stakeholder buy-in.

Ranking each project along value and feasibility makes it easier to see which one to prioritize. The areas visualize the uncertainty about value and feasibility. The larger they stretch along an axis, the less certain I am about that dimension. If they are more dot-shaped, I am confident about a project’s value and its feasibility.

Projects with their estimated value and feasibility
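To give an impression of how such a diagram can be drawn, here is a minimal matplotlib sketch. The three projects and their uncertainty ranges are hypothetical numbers for illustration, not the output of a scoring method:

```python
# A minimal value-feasibility diagram: each project is an ellipse whose
# extent along an axis visualizes the uncertainty on that dimension.
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

# (value, feasibility, value uncertainty, feasibility uncertainty), 0-1 scale
projects = {
    "A": (0.7, 0.8, 0.10, 0.10),  # valuable and feasible -> do it now
    "B": (0.5, 0.5, 0.20, 0.30),  # stepping stone towards C
    "C": (0.8, 0.2, 0.40, 0.20),  # uncertain value, not feasible today
}

fig, ax = plt.subplots()
for name, (value, feas, d_value, d_feas) in projects.items():
    ax.add_patch(Ellipse((feas, value), width=d_feas, height=d_value, alpha=0.3))
    ax.annotate(name, (feas, value), ha="center", va="center")

ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_xlabel("Feasibility")
ax.set_ylabel("Value")
plt.show()
```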

Note that some frameworks evaluate adaptation and desirability separately from value and feasibility. But a project yields low value when it scores low on either adaptation or desirability. So, I estimate value with business value, adaptation and desirability in mind, without mentioning them explicitly.

Data science projects tend to be long-term: low feasibility today and uncertain, but potentially high, future value. Breaking down visionary, less feasible projects into parts that add value in themselves can produce a data science roadmap. For example, project C, which has uncertain value and is not feasible as of today, requires project B to be completed first. Still, the valuable and feasible project A should be prioritized now. Thereafter, aim for B on your way to C. Overall, this overview helps to link projects and build a mid-term data science roadmap.

Related data science projects combined into a roadmap

Here is an example of a roadmap that starts with descriptive data science cases and progresses towards more advanced analytics such as forecasting. That gives a prioritization and helps to draft a budget.

Step 3: Iterate around the objective, method, data and value-feasibility

Be intentionally agnostic about the method first, then opt for the simplest one, check the data and implement. Fail fast, log rigorously and aim for the key results.

Implementing data science projects has so many degrees of freedom that an exhaustive review is beyond the scope of this article. Nevertheless, I collected some statements that can help throughout the project.

  1. Don’t be afraid to launch a product without machine learning. And do machine learning like the great engineer you are, not like the great machine learning expert you aren’t. (Google developers. Rules of ML.)
  2. Focus on a few customers with general properties instead of specific use cases (Zhenzhong Xu, 2022. The four innovation phases of Netflix’s trillions-scale real-time data infrastructure.)
  3. Keep the first model simple and get the infrastructure right. Any heuristic or model that gives quick feedback suits early project stages. For example, start with linear regression or a heuristic that predicts the majority class for imbalanced datasets (see the sketch after this list). Build and test the infrastructure around those components and replace them once the surrounding pipelines work (Google developers. Rules of ML. Mark Tenenholtz, 2022. 6 steps to train a model.)
  4. Hold the model fixed and iteratively improve the data. Embrace a data-centric view where data consistency is paramount. This means reducing the noise in your labels and features so that the existing predictive signal is carved out for any model (Andrew Ng, 2021. MLOps: From model-centric to data-centric AI).
  5. Each added component adds another potential point of failure. Therefore, expect failures and log every moving part of your system.
  6. Test your evaluation metric and ensure you understand what “good” looks like (Raschka, 2020. Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning.)
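As a sketch of rule #3, here is what a majority-class baseline can look like in practice. The dataset is synthetic and the 5% fraud rate is made up; the point is that any real model first has to beat this trivial heuristic:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: ~5% positives (e.g. fraudulent transactions).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# The baseline predicts the majority class ("not fraud") for every sample:
# accuracy looks great, recall is zero. Build the surrounding pipelines
# against this model first, then swap in something smarter.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("accuracy:", baseline.score(X_test, y_test))                # ~0.95
print("recall:", recall_score(y_test, baseline.predict(X_test)))  # 0.0
```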

There are many more best practices to follow and they might work differently for each of us. I am curious to hear yours!

Conclusion

In this article, I outlined a structured approach for data science projects. It helps me channel efforts into projects that fit business goals and choose appropriate methods. Applying complex methods like machine learning independently of business goals risks accruing technical debt and, at worst, jeopardizes investments.

I propose three steps to take action:

  1. Write a project card that summarizes the objective of a data science case and employs goal-setting tools like OKR to engage business-oriented stakeholders.
  2. Sort projects along value and feasibility to reasonably prioritize.
  3. Iterate around the objective, method, data and value-feasibility, and follow guiding industry principles that have emerged over the last years.

The goal is to translate data science use cases into something more tangible, bridging the gap between business and tech. I hope that these techniques empower you for your next journey in data science.

Happy to hear your thoughts!

Materials for download

Download the data science project template, structure and generic roadmap as PowerPoint slides here. You can also find a Markdown version of the project template here.

Pose detection to help seabird research – Baltic Seabird Hackathon

Team Solita participated in the Baltic Seabird Hackathon in Gothenburg last week. Based on the huge amount of material and data available, we decided to introduce pose detection as a method to understand seabird behavior and interactions. The results were promising, yet still leave room for improvement.

Baltic Seabird Hackathon

Some weeks ago we decided to participate in the Baltic Seabird Hackathon in Gothenburg. The hackathon was organised by AI INNOVATION of Sweden, Baltic Seabird Project, WWF, SLU and Chalmers University of Technology. In practice, we spent a few weeks preparing ourselves, going through the massive dataset and creating some models to work with the data. Finally, we travelled to Gothenburg and spent two days there to finalise our models, present the results and, of course, spend time with the other teams and network with nice people. In this post we will dive a bit deeper into the process of creating the prediction model for pose detection and the results we were able to achieve.

Initially we didn’t know that much about the seabirds, but during those couple of weeks we got to learn wonderful details about the birds, their living habits and social interactions. I bet you didn’t know that the oldest birds are over 45 years old! During the hackathon days in Gothenburg we had many seabird experts available to discuss with and ask more challenging questions about the birds. In addition, machine learning and technical experts were available to support the work in the provided data factory platform. We decided to work in an AWS sandbox environment, because it was a more natural choice for us.

Our team was selected for its cross-functional expertise in design, data, data science and software development, and for its ability to work in a multi-site setup. During the hackathon we had three members working in Gothenburg and two members working remotely from Sweden and Finland.

So what did we try and achieve?

Material available

For the hackathon we received some 2000 annotated images and 100+ hours of video from two different camera locations on the island of Stora Karlsö. The cameras were first installed in 2019, so all this material was quite new. The videos started at the beginning of May, when the first birds arrive at the same ledge as they do every year, and covered the life of the birds until the beginning of August, when most of them had already left.

The images and videos were in Full HD resolution, i.e. 1920×1080, which gives a really good starting point. The cameras filmed from above, and most of the videos and images looked like the example below. The annotated birds were the ones on the top ledge. There were also videos and images from night time, which made prediction a bit harder.

Our idea and approach

The initial ideas from the seabird experts were related to identifying different events in the video clips. They were interested in finding out automatically when an egg was laid, when birds were leaving for and coming back from fishing trips, and when they were doing other activities.

We thought that implementing these requirements would be quite straightforward with the big annotation set, so we decided to try something else and take a slightly different approach. Also, out of personal interest, we wanted to investigate what pose detection of the birds could offer the scientists.

First some groundwork – Object detection

Before being able to detect the poses of the birds, one needs to identify where the birds are and what kind of birds they are. We were provided with over 2000 annotated images containing annotations for adult birds, chicks and eggs. There were far fewer annotated chicks and eggs than adult birds, so we decided to focus on the adult birds. With the eggs there was also the issue that the ledge color is similar to the egg color, which makes it much harder to separate the eggs from the ledge.

We decided to use the ImageAI (https://github.com/OlafenwaMoses/ImageAI) Python library for object detection. It has been built with simplicity in mind, so it was fast and easy to adopt given the existing annotation set. All we had to do was transform the existing annotations into PascalVOC format. After the initial setup we trained the model with about 200 images, because we didn’t want to spend too much time on the object detection phase. There is a good tutorial on how to do it with your custom annotation set: https://github.com/OlafenwaMoses/ImageAI/blob/master/imageai/Detection/Custom/CUSTOMDETECTIONTRAINING.md
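For reference, the training setup looks roughly like the snippet below, following the linked tutorial. The folder name and class list are our assumptions; the data directory is expected to contain train/ and validation/ subfolders with images and PascalVOC annotations:

```python
from imageai.Detection.Custom import DetectionModelTrainer

trainer = DetectionModelTrainer()
trainer.setModelTypeAsYOLOv3()
trainer.setDataDirectory(data_directory="seabird_dataset")  # hypothetical folder
trainer.setTrainConfig(
    object_names_array=["adult"],  # we focused on adult birds only
    batch_size=4,
    num_experiments=50,            # number of training epochs
    train_from_pretrained_model="pretrained-yolov3.h5",  # transfer learning
)
trainer.trainModel()
```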

Even with very lightweight training we easily got over 95% precision for the detections. This was enough for our original approach of focusing on poses rather than activities. Otherwise we would probably have continued to develop the object detection model further to identify the different activities happening on the ledge, which is how some of the other teams decided to proceed.

Based on these bounding boxes we were able to create 640×640 clips of each bird. We utilised FFMPEG to crop the video clips.
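Conceptually, the cropping step looks like the following minimal sketch that wraps FFMPEG. The bounding-box center would come from the object detection step; the file names and coordinates here are made up:

```python
import subprocess

def crop_clip(src, dst, cx, cy, size=640, frame_w=1920, frame_h=1080):
    """Cut a size x size region centered on (cx, cy), clamped to the frame."""
    x = max(0, min(frame_w - size, cx - size // 2))
    y = max(0, min(frame_h - size, cy - size // 2))
    subprocess.run(
        ["ffmpeg", "-i", src, "-filter:v", f"crop={size}:{size}:{x}:{y}", dst],
        check=True,
    )

crop_clip("ledge.mp4", "bird_01.mp4", cx=800, cy=400)
```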

Now we got some action – Pose detection

For years now there have been research and models on detecting human poses from images and videos. Based on these concepts, Jake Graving and Daniel Chae have developed DeepPoseKit (https://github.com/jgraving/DeepPoseKit) for detecting animal poses. They have also focused on making the pose detection much faster than in previous libraries. DeepPoseKit is written in Python and uses TensorFlow and Keras in the background. You can read the paper about DeepPoseKit here: https://elifesciences.org/articles/47994

The process for utilising DeepPoseKit has 4 main steps:

  1. Create the annotation set. This defines the resolution and color of the images used as the basis for the model. The skeleton (the joints and their connections to each other) also needs to be defined in a csv file as a parent-child hierarchy; see the sketch after this list. For the resolution, it is probably easiest if the annotation set closely matches what you expect to get from the videos. That way you don’t need to adjust the frames during the prediction phase. For the color scale, you should at least consider whether the model works more reliably in grayscale or in RGB color space.

  2. Annotate the images in the annotation set. This is the brutal work: it requires you to go through the images one by one and mark all the skeleton keypoints. The GUI DeepPoseKit provides is pretty simple to use.

  3. Train the model. This definitely takes some time, even with a GPU. There is also support for augmented data, so you can really improve the model during training.

  4. Create predictions based on the model.
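As an example of the skeleton definition in step 1, this is roughly what the parent-child csv could look like for the simplified 3-keypoint skeleton described later in this post. The column names follow the DeepPoseKit example annotation sets as we recall them, so verify them against your library version:

```python
import pandas as pd

# Simplified 3-keypoint skeleton: beak and tail hang off the head.
# "swap" is used for mirrored keypoints (e.g. left/right wing) and
# stays empty in this simple chain.
skeleton = pd.DataFrame([
    {"name": "head", "parent": "", "swap": ""},
    {"name": "beak", "parent": "head", "swap": ""},
    {"name": "tail", "parent": "head", "swap": ""},
])
skeleton.to_csv("skeleton.csv", index=False)
```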

You can later increase the size of the annotation set by adding more images. Training can also be continued from an existing model, so the library is pretty flexible.

Because the development of DeepPoseKit is still in its early phases, there are at least two considerable constraints to remember:

  • The library can only detect individual poses, so if you have multiple animals in the same frames, you need additional steps to separate the animals

  • DeepPoseKit only supports image resolutions that can be repeatedly divided by 2 (e.g. 320×320, 640×640)

Because of these limitations and considering our source material, we came up with the following process:

We decided to create separate clips for each identified bird, run pose detection on these clips and, in the end, combine the individual pose detection predictions back into the original video.

To get started we needed the annotation set. We decided to use the provided sample script (https://github.com/jgraving/DeepPoseKit/blob/master/examples/step1_create_annotation_set.ipynb) that takes in a video and picks random frames from it. We originally started with 100 images and increased that to 400 during the hackathon.
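The referenced notebook handles the sampling with DeepPoseKit’s own tooling; a rough OpenCV equivalent of the same idea, with made-up file names, would be:

```python
import os
import cv2
import numpy as np

os.makedirs("annotation_frames", exist_ok=True)
cap = cv2.VideoCapture("ledge.mp4")
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

# Pick 100 random frames spread over the whole video.
rng = np.random.default_rng(0)
for i, idx in enumerate(sorted(rng.choice(n_frames, size=100, replace=False))):
    cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
    ok, frame = cap.read()
    if ok:
        cv2.imwrite(f"annotation_frames/frame_{i:04d}.png", frame)
cap.release()
```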

For the skeleton, we ambitiously decided to model 16 keypoints. This turned out to be quite a task, but we managed to do it. In the end we also created a simplified version of the skeleton and annotations with only 3 keypoints (beak, head and tail). The original skeleton included the eyes and different parts of the wings and legs.

This is what the annotations for the complex skeleton look like:

The simplified skeleton model has only 3 keypoints:

With these 2 annotation sets we were able to create 2 models (simple with 3 keypoints and complex with 16 keypoints).

To train the model we pretty much followed the sample script provided by the developers of DeepPoseKit (https://github.com/jgraving/DeepPoseKit/blob/master/examples/step3_train_model.ipynb). Due to the limited time available, we could not work much with the augmented data, which could have improved the accuracy of the models. Running an epoch with 45 steps took about 5-6 minutes for the complex model on an AWS p3.2xlarge instance (1 GPU). We managed to run around 45 epochs in total, ending with a final validation loss of around 25. Because development is never such a straightforward process, we had to restart the training of the model from scratch a few times during the hackathon.
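Condensed from that notebook, the training step looks roughly like the sketch below. Exact class names and arguments may differ between DeepPoseKit versions, so treat this as an outline rather than a recipe:

```python
from deepposekit.io import DataGenerator, TrainingGenerator
from deepposekit.models import StackedDenseNet

data_generator = DataGenerator("annotation_set.h5")  # annotated frames + skeleton
train_generator = TrainingGenerator(
    generator=data_generator,
    downsample_factor=3,
    sigma=5,                # spread of the Gaussian drawn around each keypoint
    validation_split=0.1,
)
model = StackedDenseNet(train_generator, n_stacks=2, growth_rate=32)
model.fit(batch_size=16, epochs=45, steps_per_epoch=45)
```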

The results

When the model was about ready, we ran the detections on a few different videos we had available. Once again we followed the example in the DeepPoseKit library (https://github.com/jgraving/DeepPoseKit/blob/master/examples/step4b_predict_new_data.ipynb). Basically, we ran through the individual clips and created the skeleton predictions frame by frame. Once we had this data together, we transformed the prediction coordinates from the clip resolution (640×640) back to the original video resolution (1920×1080). In addition to the original script, we fine-tuned the graphs a bit and included, for example, an order number for each skeleton. In the end we had a csv file containing, for each identified object, each frame in the video and each identified skeleton keypoint, the keypoint coordinates and a confidence percentage. We also added the radius and angle in degrees between keypoints, and the distance between connected keypoints. The radius could later be used to analyse, for example, in which direction a bird is moving its head. In practice, one identified bird generated 160 rows of data per second. Below is a sample of the generated dataset.
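Since each 640×640 clip is cropped from the full frame without scaling, mapping the predictions back to the original video resolution amounts to a translation by the crop offset. A minimal sketch, with made-up coordinates and names of our own:

```python
def clip_to_video_coords(keypoints, crop_x, crop_y):
    """Map (x, y, confidence) keypoints from clip coordinates to the
    1920x1080 video frame, given the clip's top-left crop corner."""
    return [(x + crop_x, y + crop_y, conf) for x, y, conf in keypoints]

# A beak detected at (320, 100) in a clip cropped at (800, 240):
print(clip_to_video_coords([(320, 100, 0.93)], crop_x=800, crop_y=240))
# -> [(1120, 340, 0.93)]
```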

The results looked more promising when the setup in the camera area was simpler. Below is an example of two birds’ poses visualised with the complex model, and the results seem quite OK. The predictions follow the movement of a bird pretty well.

The challenges are more obvious when we add more birds to the frame:

The problem is clear if you look at one of the identified birds and its generated 640×640 clip. Because the birds are so close to each other, one frame contains multiple birds and the model starts to mix parts of different birds together.

The video above also shows that the bird in the upper right corner is not correctly modeled when it spreads its wings. This is an indication that the annotation set does not include enough varied poses of the birds, and thus the model doesn’t learn those poses.

If we instead use the simpler model on the busy video, it behaves a bit better. Still, it is far from optimal.

So at this point we were puzzling over how to improve the model precision and started to look for additional methods.

Shape detection to the rescue?

One of the options that came to our minds was to try to leave only the identified bird visible in the 640×640 frames. The core of the idea was that if only an individual bird were visible in the frame, the pose detection would not get confused by other birds. Another team had partly used this method to rule out all the birds in the distance (the upper part of the image). Due to the shape of the birds, nothing standard such as a vignette filter would work out of the box.

So we headed out to look for better alternatives and found Mask R-CNN (https://github.com/matterport/Mask_RCNN). Its approach is somewhat similar to the pose detection: you first have to annotate a lot of pictures and then train the model. Due to the limited time available, we had to try Mask R-CNN with just 20 annotated images.

After a very quick training it seemed as if the model had a really low validation loss. But unfortunately the results were not that good. As you can see from the video below, only parts of the birds were identified by the shape detection (shapes marked in blue).

So we think this is a relevant idea, but unfortunately we didn’t have time to verify it.

Another idea we had was to detect some kind of pattern that would help identify the birds that form a couple. We played around with the idea that by estimating a density map of each individual bird we could identify the couples, which would show up as areas of high density. If the birds are tracked, then the birds with a high density output would be classified as a couple. This would open up a lot of possibilities for the scientists to track the couples and do research on their patterns.

For this task, the first thing to do is to put a single marker on each individual bird. So instead of tagging the bird as a whole, we tagged the head of the bird, which is a single point: imagine the background as black, with the top of the head of each individual bird marked with a color. The deep learning architectures used for this were UNET and FCRN (Fully Convolutional Regression Networks), which are the common architectures for estimating density maps. We got the idea from this blog post (https://towardsdatascience.com/objects-counting-by-estimating-a-density-map-with-convolutional-neural-networks-c01086f3b3ec), where density maps are used to count the number of objects. We ended up using this approach to identify both the couples and the number of birds. Sadly, there was not enough time to get reasonable results. But the idea was very much appreciated by the judges and could be something they could think about and move forward with.
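To illustrate the core of the density-map idea: the training target is built by placing a single marker on each bird’s head and smoothing it with a Gaussian, and a UNET/FCRN is then trained to regress that map. A minimal sketch of the target generation, with made-up head coordinates:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(head_points, shape=(1080, 1920), sigma=8):
    """Build a density-map target from single-point head annotations."""
    target = np.zeros(shape, dtype=np.float32)
    for x, y in head_points:
        target[y, x] = 1.0  # one marker per bird head
    return gaussian_filter(target, sigma=sigma)

# Two birds sitting close together form one dense blob -> a couple candidate.
dm = density_map([(400, 300), (412, 306), (900, 700)])
print(round(dm.sum()))  # the map integrates to ~the number of birds (3)
```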

Another idea would be to use some kind of tagging of the birds. That would work as long as the birds remain on the ledge, but in general it might be a bit challenging as the videos are very long and the birds move around the ledge to some extent.

What next?

Well, with these results we won the 3rd prize in the hackathon. According to the jury, the biggest achievements were related to the pose detection and the possibilities it opens up for science. It seems that not much research has been done on the social interactions of these seabirds, and our pose detection model could help with that.

It was clear to all of us that we wanted to donate the money, and we have now decided to give it as a scholarship to a student who will take the models and develop them further for the benefit of seabird science. It will be interesting to see what the models can tell us about the life of the Baltic seabirds, their social interactions and, in general, the socioeconomics of the Baltic Sea.

On behalf of our Team Solita (Mari Harju, Jani Turunen, Kimmo Kantojärvi, Zeeshan Dar and Layla Husain),

Kimmo & Zeeshan