c45
This library implements the C4.5 decision tree learning algorithm. C4.5 is an extension of the ID3 algorithm that uses information gain ratio instead of information gain for attribute selection, which avoids bias towards attributes with many values (see below for implementation details).
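As a quick, minimal sketch of the underlying measure (plain Prolog for illustration, not the library's internal code), gain ratio divides the information gain of a candidate split by its split information, which penalizes splits into many small branches. The sketch assumes the split is given as a list of class label sub-lists, one per branch; as noted in the features list below, the library additionally restricts the computation to examples with known values for the attribute being evaluated:

% entropy(+Labels, -Entropy): Shannon entropy (base 2) of a list of class labels
entropy(Labels, Entropy) :-
    length(Labels, Total),
    msort(Labels, Sorted),
    run_length_counts(Sorted, Counts),
    entropy_sum(Counts, Total, 0, Entropy).

entropy_sum([], _, Entropy, Entropy).
entropy_sum([Count| Counts], Total, Acc0, Entropy) :-
    P is Count / Total,
    Acc1 is Acc0 - P * log(P) / log(2.0),
    entropy_sum(Counts, Total, Acc1, Entropy).

% run_length_counts(+SortedList, -Counts): count of each run of equal terms
run_length_counts([], []).
run_length_counts([Term| Terms], [Count| Counts]) :-
    run(Term, Terms, 1, Count, Rest),
    run_length_counts(Rest, Counts).

run(Term, [Term| Terms], Count0, Count, Rest) :-
    !,
    Count1 is Count0 + 1,
    run(Term, Terms, Count1, Count, Rest).
run(_, Rest, Count, Count, Rest).

% gain_ratio(+Labels, +Partition, -GainRatio): Partition is the list of
% label sub-lists, one per branch of the candidate split
gain_ratio(Labels, Partition, GainRatio) :-
    entropy(Labels, Entropy),
    length(Labels, Total),
    split_sums(Partition, Total, 0, 0, WeightedEntropy, SplitInfo),
    SplitInfo > 0,
    GainRatio is (Entropy - WeightedEntropy) / SplitInfo.

split_sums([], _, WeightedEntropy, SplitInfo, WeightedEntropy, SplitInfo).
split_sums([Subset| Subsets], Total, WE0, SI0, WeightedEntropy, SplitInfo) :-
    length(Subset, N),
    W is N / Total,
    entropy(Subset, SubsetEntropy),
    WE1 is WE0 + W * SubsetEntropy,
    SI1 is SI0 - W * log(W) / log(2.0),
    split_sums(Subsets, Total, WE1, SI1, WeightedEntropy, SplitInfo).

For example, gain_ratio([yes,yes,yes,no,no], [[yes,yes,no], [yes,no]], GainRatio) evaluates a two-way split of five examples.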
The library implements the classifier_protocol defined in the classifier_protocols library. It provides predicates for learning a decision tree from a dataset, optionally pruning it, using it to make predictions, and exporting it as a list of predicate clauses or to a file.
Datasets are represented as objects implementing the
dataset_protocol protocol from the classifier_protocols library.
See the test_files directory for examples.
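As a rough illustration only, a dataset object might look like the following; note that the predicate names used here (attributes/1 and examples/1) are assumptions made for this sketch, not the actual dataset_protocol interface, which is documented in the classifier_protocols library and exercised by the test_files examples:

:- object(play_tennis,
    implements(dataset_protocol)).

    % assumed predicate: the list of attribute names
    attributes([outlook, temperature, humidity, wind]).

    % assumed predicate: labeled examples as Class-AttributeValuePairs
    examples([
        no  - [outlook-sunny,    temperature-hot,  humidity-high, wind-weak],
        no  - [outlook-sunny,    temperature-hot,  humidity-high, wind-strong],
        yes - [outlook-overcast, temperature-hot,  humidity-high, wind-weak],
        yes - [outlook-rain,     temperature-mild, humidity-high, wind-weak]
    ]).

:- end_object.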
API documentation
Open the ../../apis/library_index.html#c45 link in a web browser.
Loading
To load all entities in this library, load the loader.lgt file:
| ?- logtalk_load(c45(loader)).
Testing
To test this library's predicates, load the tester.lgt file:
| ?- logtalk_load(c45(tester)).
Implemented features
Information gain ratio for attribute selection (avoids bias towards attributes with many values)
Handling of discrete (categorical) attributes with multi-way splits
Handling of continuous (numeric) attributes with binary threshold splits (selects the threshold with the highest gain ratio from midpoints between consecutive sorted values)
Handling of missing attribute values (represented using anonymous variables): examples with missing values for an attribute are distributed to all branches during tree construction; gain ratio computation uses only examples with known values for the attribute being evaluated; prediction with missing values uses majority voting by exploring all possible branches and selecting the most common class
Tree pruning using pessimistic error pruning (PEP): estimates error rates using the upper confidence bound of the binomial distribution (Wilson score interval) with a configurable confidence factor (default 0.25) and minimum instances per leaf (default 2); replaces subtrees with leaf nodes when doing so would not increase the estimated error; helps reduce overfitting and improve generalization (a sketch of this error estimate follows this list)
Export of learned decision trees as predicate clauses
Pretty-printing of learned decision trees
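The pessimistic error estimate used by the pruning feature above can be sketched as follows (a minimal plain Prolog illustration, not the library's internal code). Given Errors misclassified examples out of Total at a node, and the normal deviate Z for the chosen confidence factor (Z ≈ 0.6745 for the default 0.25), the estimate is the upper bound of the Wilson score interval:

% pessimistic_error(+Errors, +Total, +Z, -UpperBound)
% upper bound of the Wilson score interval for the observed error rate;
% assumes Total > 0
pessimistic_error(Errors, Total, Z, UpperBound) :-
    F is Errors / Total,                 % observed error rate
    Z2 is Z * Z,
    Center is F + Z2 / (2 * Total),
    Margin is Z * sqrt(F * (1 - F) / Total + Z2 / (4 * Total * Total)),
    UpperBound is (Center + Margin) / (1 + Z2 / Total).

A subtree is then replaced by a leaf when the leaf's estimated number of errors (its example count times this bound) does not exceed the sum of the estimates for the subtree's leaves.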
Limitations
No incremental learning (the tree must be rebuilt from scratch when new examples are added)
References
Quinlan, J.R. (1986). Induction of Decision Trees. Machine Learning, 1(1), 81-106. https://doi.org/10.1007/BF00116251
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers. https://doi.org/10.1016/C2009-0-27846-9
Mitchell, T.M. (1997). Machine Learning. McGraw-Hill. Chapter 3: Decision Tree Learning.
Usage
To learn a decision tree from a dataset:
| ?- c45::learn(play_tennis, Tree).
To prune a learned tree with the default parameters (confidence factor 0.25, minimum instances per leaf 2):
| ?- c45::learn(breast_cancer, Tree),
c45::prune(breast_cancer, Tree, PrunedTree).
To prune a learned tree with custom values for both the confidence factor and the minimum instances per leaf:
| ?- c45::learn(breast_cancer, Tree),
c45::prune(breast_cancer, Tree, 0.1, 3, PrunedTree).
To export the tree as a list of predicate clauses:
| ?- c45::learn(play_tennis, Tree),
c45::tree_to_clauses(play_tennis, Tree, classify, Clauses).
To export the tree to a file:
| ?- c45::learn(play_tennis, Tree),
c45::tree_to_file(play_tennis, Tree, classify, 'tree.pl').
To print the tree to the current output:
| ?- c45::learn(play_tennis, Tree),
c45::print_tree(Tree).
To predict the class for a new instance (as a list of attribute-value pairs):
| ?- c45::learn(play_tennis, Tree),
c45::predict(Tree, [outlook-sunny, temperature-hot, humidity-high, wind-weak], Class).
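Since missing attribute values are represented using anonymous variables, an instance with an unknown value is classified the same way, with the class chosen by majority voting over all compatible branches (here, leaving the temperature unspecified):

| ?- c45::learn(play_tennis, Tree),
     c45::predict(Tree, [outlook-sunny, temperature-_, humidity-high, wind-weak], Class).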