
    Random Forest Toolbox
    ---------------------

    Roman Juranek <ijuranek@fit.vutbr.cz>
    Department of Computer Graphics and Multimedia FIT, BUT, Brno

    Version 2015-01-28
    
    Development of this software was funded by TA ČR.


This package contains a simple implementation of the random forest framework.
It has been designed to be as extensible and configurable as possible, and it
provides only the necessary functions.


Requirements
------------

As far as I know, the functions use only basic MATLAB functionality, so they
should run on any MATLAB installation. Octave is not currently supported.


TODOs and Ideas
---------------

* Finish tutorial slides
* Continue training of existing forest
* README is still not finished
* Utility functions
  forest information - tree depth, number of leaves, etc.
  tree display
* Advanced examples
  Detection, recognition, etc.
* Use parallel toolbox for acceleration
* GPU acceleration on GPU enabled machines
* Treatment of large scale problems
** Feature evaluation in one-by-one manner
** Distribution to workers


Bugs
----

There are no bugs known to me right now. However, if you experience an unexpected
behaviour, please, let me know. Send me (ijuranek@fit.vutbr.cz) instructions
how to replicate the bug (and if possible suggestions for an update).


Change log
----------

2013-05-03
* Improved MATLAB compatibility (many errors fixed)
* Better documentation
* Switched to logical indexing in training and prediction - faster and better
* GINI index calculation
* Demo of feature evaluators
2013-08-16
* Conic section features added (random_conic, conic_eval functions)
* Demo script update
* Training now supports simultaneous usage of more feature types. I.e. one
  can use linear_features mixed with conic sections in a single classifier
  ('help random_forest_train' for more details).
* random_planes bug fixed (planes are now uniformly sampled)
* Tree structure changed from nested structures to linear arrays.
2014-01-14
* Logical indexing changed back to normal indices.
* Samples can be stored as matrices or structure arrays
* Some parameter names changed to be more convenient. Note this is not backward compatible.
* Parameters can be passed in a structure instead of as function arguments
* Better tree compacting function
* Splits can be represented by matrices, cell arrays or structure arrays.
* Updates in linear features and conic sections
* Better and faster tree traversal during prediction
2014-03-03
* Some functions moved to private folder
* The way of parametrizing functions in options changed
* Node functions still not fully tested
* Batch evaluation of splits - can be useful in large scale problems
* Better documentation - README update
2015-01-28
* Small bugfixes
* Sample weights (NOTE: Still need to update docs!)

Training and Prediction
-----------------------

Function random_forest_train trains a model T.

    OPTS = random_forest_train;
    T = random_forest_train(X, Y, OPTS);

X and Y hold the training instances and labels, and OPTS is a structure of
training options that configures the training process.

    ntrees           [5] Number of trees to train
    splits           [1000] Number of random splits to generate in each node
    batch            [Inf] Number of splits evaluated in one batch
    min_samples      [20] Minimal number of samples that can reach a node
    max_depth        [8] Maximal depth of a node
    split_gen        [{'default',{}}] Generator(s) of random splits
    split_eval       [{'default_eval',{}}] Function(s) that evaluates splits
    impurity         [{'infogain',{}}] Split scoring function(s)
    predictor        [{'class_prob',{}}] Predictor to calculate in leaf nodes
    user_fnc         [{}] User function(s)
    compact          [true] Compact representation of the tree
    verbose          [false] Print info during the training

The model T can be used in prediction function.

    [L,P,I] = random_forest_predict(T, X, TREES);

X is a matrix of data instances and TREES holds indices of the trees to
evaluate (in case only a subset of the trees should be used). The output is the
class labels L, the posterior probabilities P and, optionally, I, a cell array
of indices of the samples reaching each node (which may be required in some
applications).
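Putting the two calls together, a minimal session might look like the
following sketch (the toy data, class labels and option values here are purely
illustrative):

```matlab
% Illustrative two-class toy data: 100 instances in 2 dimensions
X = [randn(50,2); randn(50,2) + 2];
Y = [ones(50,1); 2*ones(50,1)];

% Start from the default options and override a few fields
opts = random_forest_train;
opts.ntrees  = 10;
opts.verbose = true;

T = random_forest_train(X, Y, opts);

% Predict labels and posteriors, here using only the first 5 trees
[L, P] = random_forest_predict(T, X, 1:5);
```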

The basic philosophy of the toolbox is that the representation of data and
splits is left entirely to the user. The user must therefore ensure that the
split functions understand how the data are represented.

A simple example of using the toolbox can be found in the demo directory.


Representation of the Forest
----------------------------

During training, the forest T is represented as a nested structure; it is
linearized at the end.

    T = random_forest_train(X, Y, OPTS);

The structure of T is the following:
    .classes      Vector of class labels
    .split_eval   Split evaluation function(s)
    .split_params Parameters for evaluation
    .tree         Cell array of trees
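Assuming the fields above, a trained forest can be inspected directly, for
example as in this sketch (the exact contents of the fields depend on the
training options):

```matlab
T = random_forest_train(X, Y, opts);

fprintf('Forest with %d trees over %d classes\n', ...
        numel(T.tree), numel(T.classes));
disp(T.classes);   % vector of class labels seen during training
```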

Representation of Data Instances
--------------------------------

Data for training or prediction can be represented as:

    1/ NxK matrix (N data instances of K dimensions)
    2/ Structure array of arbitrary dimensions with arbitrary fields

Labels for training must be an Nx1 column vector where N is the number of
training instances. Other representations may also be possible but have not
been tested.
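For illustration, the two representations might be set up as follows (a
sketch; the chosen split generator and evaluator must understand whichever
form is used):

```matlab
% 1/ Matrix form: N=100 instances of dimension K=5
X = rand(100, 5);
Y = randi(3, 100, 1);    % Nx1 column vector of class labels

% 2/ Structure array form: one struct per instance,
%    with arbitrary user-defined fields
S = struct('pos', num2cell(rand(100,1)), ...
           'val', num2cell(randn(100,1)));
```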


Split Generators and Evaluators
---------------------------------

random_forest_train requires functions for random split generation and
evaluation. By default, the functions random_planes (with default parameters)
and lin_features_eval are used. Split generators and their parameters are
specified in a cell array.

    opts.split_gen = {func1, {p1, p2,...}, func2, {p1, p2,...}, ...};
    opts.split_eval = {func1, {p1, p2,...}, func2, {p1, p2,...}, ...};

The functions must be in the following form:

    function F = generate(N, p1, p2, ...);
    function H = evaluate(X, IND, F, p1, p2, ...)

Example (since these are anonymous functions, the function handles themselves
are passed in the options rather than name strings):

    % Splits compare a random dimension against a random threshold
    cmp_splits = @(n,dims,mx) [randi(dims,n,1), 2*mx*rand(n,1)-mx];
    cmp_eval = @(x,i,f) bsxfun(@gt, x(i,f(:,1)), f(:,2)');
    ...
    opts.split_gen = {cmp_splits, {2,10}};
    opts.split_eval = {cmp_eval, {}};

Impurity Measures
-----------------

The user can specify one or more impurity measures for scoring splits in the
'impurity' option. During the training of a node, one of them is selected
uniformly at random. Two standard measures are bundled with the toolbox:

    1/ infogain     Class label entropy
    2/ gini         GINI diversity index

Any function of the following form can be used.

    function U = my_impurity(Y, IND, CLASSES, ...);

Where Y(IND) are the class labels reaching the node and CLASSES is a vector of
all class labels. Additional arguments (passed as varargin) can be specified
in the 'impurity' option if required (for example sample weights, or any other
info). Example:

    opts.impurity = {'gini', {1e-4}};
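To illustrate the prototype, the sketch below is a hand-written measure that
computes the class-label entropy of the samples reaching a node (the quantity
the bundled 'infogain' measure scores; the function name is hypothetical):

```matlab
function U = my_impurity(Y, IND, CLASSES)
    % Entropy of the class labels Y(IND); 0 for a pure node
    n = numel(IND);
    U = 0;
    for c = CLASSES(:)'          % iterate over all class labels
        p = nnz(Y(IND) == c) / n;
        if p > 0
            U = U - p * log2(p);
        end
    end
```

It would then be selected via opts.impurity = {'my_impurity', {}};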


User Data in Nodes
------------------

After every node (internal or leaf) is trained, user functions are executed.
They are specified in the user_fnc option. A user function must follow this
prototype.

    function udata = user_func(X, Y, IND, CLASSES, ...);

Where X and Y are the training instances and labels, IND is an index vector of
the instances reaching the node, and CLASSES is a vector of all classes in the
training set. Arguments passed as varargin can be specified in the options.

    opts.user_fnc = {'func_1', {param1, param2,...}, 'func_2', {param1, param2},...};

The result of the i-th function is stored in the udata{i} field of the node.
An example user function could store the mean and standard deviation of the
samples in the node.

    function udata = sample_mean(X, ~, IND, ~)
        % Per-dimension mean and std of the samples reaching this node
        udata.mu = mean(X(IND,:));
        udata.std = std(X(IND,:));

    opts.user_fnc = {'sample_mean', {}};
