On the International Day of Forests we would like to shed some light on Random Forest. No, it is not just any ‘random forest’: it is the name of a machine learning algorithm that harnesses the power of many trees to produce reliable results, just as a real green forest derives its strength from the collective value of all its trees. A random forest is made of many decision trees that vote together on the final decision of the whole forest. The algorithm is used for both classification and regression with great success, thanks to its robustness against overfitting.
Let’s say you came across a site with millions of finds. They are a collection of small items with different shapes, sizes, colours, materials and levels of complexity. You know that each item must fall into one of the following categories:
| Class   | Shape     | Size   | Colour | Material | Complexity |
|---------|-----------|--------|--------|----------|------------|
| Obelisk | Rectangle | Small  | Yellow | Gold     | Complex    |
| Shield  | Rectangle | Medium | Brown  | Bronze   | Complex    |
| Tablet  | Rectangle | Small  | Brown  | Mud      | Simple     |
| Box     | Cubic     | Medium | Yellow | Gold     | Simple     |
Now you want to classify each of your finds automatically based on their characteristics. You can start by building a decision tree similar to this one:
The tree starts by looking at the colour of each item, sending the brown items down one branch and the yellow items down another. The second level of the tree has two branches: one looks at the material to separate the shields from the tablets, while the other distinguishes the boxes from the obelisks by their complexity. This decision tree can classify all the items with a 100% success rate given clean and complete data. However, real data is always messy and incomplete. If an object is brown but its material information is missing, the tree cannot make a conclusive choice between shield and tablet. This problem, among others, is where Random Forest shines.
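To make this concrete, here is a minimal sketch of such a tree in Python, assuming scikit-learn and pandas are available; the data frame simply re-encodes the toy table above, and all variable names are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# One row per class, taken straight from the table above.
finds = pd.DataFrame({
    "Shape":      ["Rectangle", "Rectangle", "Rectangle", "Cubic"],
    "Size":       ["Small", "Medium", "Small", "Medium"],
    "Colour":     ["Yellow", "Brown", "Brown", "Yellow"],
    "Material":   ["Gold", "Bronze", "Mud", "Gold"],
    "Complexity": ["Complex", "Complex", "Simple", "Simple"],
})
labels = ["Obelisk", "Shield", "Tablet", "Box"]

# Decision trees need numeric input, so encode each category as an integer.
encoder = OrdinalEncoder()
X = encoder.fit_transform(finds)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, labels)

# Classify a new find: a small, simple, brown mud rectangle.
new_find = encoder.transform(pd.DataFrame(
    [["Rectangle", "Small", "Brown", "Mud", "Simple"]],
    columns=finds.columns,
))
print(tree.predict(new_find))  # -> ['Tablet']
```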
When building a Random Forest, the algorithm trains each tree on a random subset of the data using a random subset of the attributes. To classify a given item, the algorithm then runs it through the whole forest and aggregates the votes of the trees, taking the majority class as the final classification result (for regression, the trees’ predictions are averaged instead).
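As a sketch of the same idea, scikit-learn’s RandomForestClassifier exposes both sources of randomness directly: bootstrap sampling of the rows and a random subset of attributes at each split. This continues from the snippet above, and the parameter values are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,  # number of trees in the forest
    bootstrap=True,    # each tree is trained on a random sample of the rows
    max_features=2,    # each split considers a random subset of 2 attributes
    random_state=0,
)
forest.fit(X, labels)

# Each tree votes; the forest reports the majority class.
print(forest.predict(new_find))  # -> ['Tablet']
```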
The following figure shows an example of a few trees within the same forest:
This forest can classify the item correctly even with the material information missing, because at least two of its trees do not depend on the material attribute at all. The downside of this approach is the computational cost of producing a classification from every tree and then aggregating the votes across the forest.
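If you are curious how those votes combine, you can tally the prediction of each fitted tree yourself. This is a sketch continuing from the snippets above; `estimators_` is scikit-learn’s list of the individual fitted trees:

```python
from collections import Counter

# Each individual tree predicts an index into forest.classes_,
# so map the indices back to class names before counting.
votes = Counter(
    forest.classes_[int(t.predict(new_find)[0])]
    for t in forest.estimators_
)
print(votes)
# Most trees vote 'Tablet'; the few that never saw a tablet in their
# bootstrap sample vote for another class.
```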