LION COMMUNITY USAGE CASE
LIONbook: Classification trees and forests: Handwritten digit recognition.
This is an exercise associated with the LIONbook (chapter 6).
Recognizing handwritten digits is a classic and very
useful use case in learning from data. One starts from 16x16 pixel images of digits
from the US Postal Service Zip Code Database, each labeled
with its class (0, 1, 2, ..., 9), and learns a map that generalizes to new digits
not present in the learning
database. This is why the original set is randomly split into two subsets, one for training and the other for
testing the performance (zip-train.csv and zip-test.csv).
Using classification trees and forests in LIONoso is seamless if you are already familiar
with the user interface.
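The random split into training and test subsets mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea: the files zip-train.csv and zip-test.csv are already split, and the function name, the 25% test fraction and the fixed seed are assumptions made for the example, not values taken from the exercise.

```python
import random

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and return (train, test).

    test_fraction and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for labeled digit records.
rows = list(range(100))
train, test = split_rows(rows)
print(len(train), len(test))
```

Because the test rows are never seen during training, the error measured on them estimates how well the learned map generalizes to new digits.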
Step 1: Load data files and classification tree (CART), connect and set parameters
Classification trees are available in LIONoso through the CART tree factory
in the Models/CART Tree folder.
To load the data files and the CART tree factory (the creator of classification trees), drag the corresponding "CSV file" and
"CART tree factory" nodes to the workbench on the right, and connect them by drawing an arrow.
The files zip-train.csv and
zip-test.csv contain the intensity features
for all digits (in a 16x16 array indexed as 0.0, 0.1, ..., 1.0, 1.1, ..., 15.15,
with all intensity levels scaled so that the values are between -1 and 1).
These data were made available by the neural network
group at AT&T research labs (thanks to Yann LeCun).
The parameters for the "CART tree factory" are: inputs: 0.0 0.1 ... 15.15, output: Class.
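For readers who want to check the column naming outside LIONoso, the 256 input names can be generated programmatically. The sketch below assumes that the first index is the row and the second the column of the 16x16 image, which the exercise does not state explicitly; the helper name pixel_columns is hypothetical.

```python
def pixel_columns(size=16):
    """Return the names "0.0", "0.1", ..., "15.15" in row-major order.

    Treating the first index as the row is an assumption.
    """
    return [f"{row}.{col}" for row in range(size) for col in range(size)]

INPUT_COLUMNS = pixel_columns()   # inputs of the CART tree factory
OUTPUT_COLUMN = "Class"           # output of the CART tree factory

print(len(INPUT_COLUMNS), INPUT_COLUMNS[:3], INPUT_COLUMNS[-1])
```

Note that the names are strings, not numbers: "0.1" and "0.10" denote two different pixels.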
Click "Start training" to create the tree; the resulting model has the same icon as the factory, but without the gear symbol.
Connect zip-test.csv to the newly created CART model to produce the table with the
predicted classification results. The output classification will be in the "Class" column, while
the target output (remember we are dealing with supervised classification) will be saved
in the "Class-target" column.
Step 4: Analyze and visualize the classification of the test set
Drag the data column Class-target to create a bar chart,
then drag the variable Class onto the "subclass" destination area in the bar chart.
Each subclass corresponding to a different classification of the given input classes
will now appear as a separate bar.
The various input classes (0,1,...,9) will now be split by the different output classifications.
Most cases are classified correctly; some classification errors are present, as expected.
It is interesting to see how the different classes are confused.
For example, class "3" tends to be confused with "5" (in 20 cases), "7" is confused with "4" (in six cases), etc.
A different visualization of the same results can be obtained by stacking the different subclasses.
Each different classification will now appear as a stripe in a single vertical bar.
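The data behind such a bar chart is a count of output classifications for each target class. A minimal Python sketch follows, assuming the two columns have been read into plain lists; the values are toy data, not results from the zip-test.csv experiment, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def predictions_by_target(targets, predictions):
    """For each target class, count how often each class was predicted."""
    breakdown = defaultdict(Counter)
    for target, predicted in zip(targets, predictions):
        breakdown[target][predicted] += 1
    return breakdown

# Toy stand-ins for the "Class-target" and "Class" columns.
targets     = [3, 3, 3, 5, 7, 7, 4]
predictions = [3, 5, 3, 5, 4, 7, 4]

breakdown = predictions_by_target(targets, predictions)
for target in sorted(breakdown):
    # One group of bars (or one stacked bar) per target class.
    print(target, dict(breakdown[target]))
```

Each inner counter corresponds to the subclasses of one bar: the entry for the target class itself holds the correct classifications, the other entries hold the confusions.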
For a more detailed analysis, drag the Error analyzer
node from the "table manipulation and creation" folder onto the output classification table, to obtain
a detailed error analysis table and a confusion matrix table.
By right-clicking on the confusion matrix and picking new panel -> Heatmap, the confusion
among the various classes can be color-coded as follows.
Experiment with different choices for the coloring, observe how most cases fall on the diagonal
(correct classifications), and try to convince yourself that most confusions (cases falling
in off-diagonal boxes) resemble the confusions that human readers also make.
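To make the structure of a confusion matrix concrete, the sketch below builds one from target and predicted labels, computes the fraction of cases on the diagonal, and lists the largest off-diagonal cells. It is not the Error analyzer's implementation; the function names are hypothetical and the labels are toy data.

```python
def confusion_matrix(targets, predictions, n_classes=10):
    """matrix[t][p] counts cases of target class t classified as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(targets, predictions):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of cases on the diagonal (correct classifications)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0

def top_confusions(matrix, k=3):
    """Return up to k (target, predicted, count) off-diagonal cells, largest first."""
    cells = [(count, t, p)
             for t, row in enumerate(matrix)
             for p, count in enumerate(row)
             if t != p and count > 0]
    cells.sort(key=lambda c: (-c[0], c[1], c[2]))
    return [(t, p, count) for count, t, p in cells[:k]]

# Toy labels, not results from the exercise.
targets     = [3, 3, 3, 3, 5, 5, 7, 7, 4]
predictions = [3, 5, 5, 3, 5, 5, 4, 7, 4]

matrix = confusion_matrix(targets, predictions)
print(accuracy(matrix))
print(top_confusions(matrix))
```

A heatmap is simply this matrix with each cell colored by its count, so a good classifier shows a bright diagonal and faint off-diagonal cells.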
Advanced users can now repeat the above experiment with Democratic forests
[LIONbook] obtained by training many randomized trees and combining their output
classifications in a democratic manner. The appropriate node is the Democratic forest factory.
Are the results obtained with democratic forests better than the results obtained with trees?
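Combining classifications "in a democratic manner" can be read as a plain majority vote among the trees. The sketch below illustrates that reading only; it is not the Democratic forest factory's implementation, the trees are stub functions, and breaking ties in favor of the smallest label is an assumption.

```python
from collections import Counter

def democratic_vote(tree_predictions):
    """Return the class predicted by most trees.

    Ties go to the smallest label (an assumed tie-breaking rule).
    """
    counts = Counter(tree_predictions)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)

def forest_predict(trees, sample):
    """Ask every tree for its classification and combine the votes."""
    return democratic_vote([tree(sample) for tree in trees])

# Stub "trees": each is a callable returning a class for a sample.
trees = [lambda s: 3, lambda s: 5, lambda s: 3]
print(forest_predict(trees, sample=None))
```

The benefit of voting comes from the randomization: trees that make different mistakes tend to be outvoted by the ones that are right.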
Visualize the tree
It is of interest to see how the tree "works". In particular, one can analyze how the purity
of the set of cases ending up in a node of the tree increases from the root to the leaves.
The above visualization can be obtained by clicking on the nodes of the tree produced in the
lionmode web service.
Be patient: a lot of data needs to be transferred, so the interaction can be slow.
Here is a faster
visualization by lionmode on the iris data set
considered in a
previous exercise.
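The increase in purity from the root to the leaves can also be checked numerically. The sketch below uses two common measures, the fraction of the majority class and the Gini impurity; which criterion the CART tree factory actually uses is not stated in this exercise, so both are assumptions, and the node contents are toy labels.

```python
from collections import Counter

def purity(labels):
    """Fraction of cases in the node that belong to its most common class."""
    if not labels:
        return 0.0
    return Counter(labels).most_common(1)[0][1] / len(labels)

def gini_impurity(labels):
    """1 - sum of squared class frequencies; 0 for a perfectly pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Toy class labels of the cases reaching three nodes on one root-to-leaf path.
root  = [3, 3, 3, 3, 5, 5, 5, 5]
inner = [3, 3, 3, 5]
leaf  = [3, 3, 3]

for name, node in (("root", root), ("inner", inner), ("leaf", leaf)):
    print(name, purity(node), gini_impurity(node))
```

Along the path the purity grows and the impurity shrinks, which is what the splits of a classification tree are chosen to achieve.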
The above visualizations have been produced with the
lionmode web service,
an automated service running in the cloud to identify the best possible model for a given task
and to deliver feedback about the relevance of the various input features.
Example data files.
[LIONbook] The LION way
Roberto Battiti and Mauro Brunato. LIONlab, University of Trento, Feb 2014.
Download the LIONoso-ready data file: