Symbolic Regression employing Genetic Programming

Description

This work concerns the implementation of a Genetic Programming (GP) algorithm for solving the Symbolic Regression problem, which consists in, given a set of inputs, fit a mathematical model that best represents the original function, both in terms of accuracy and simplicity. This approach builds a model by combining building blocks in order to achieve a final solution, measuring its accuracy.

Problem Modeling

One of the most important steps when building up a Genetic Programming algorithm concerns the problem modeling. Hence, as in GP each individual is a mapping from a set of values to one single output value, the representation here is defined as a binary tree, since it provides a simple way to build and analyse functions by simply walking through the tree. Thus, internal nodes represent mathematical operators while leaf nodes represent terminals (variables and constants). However, it is important to mention that Genetic Programming approaches do not necessarily require a binary tree modeling, we have used such a structure in this work simply because it best fits our needs.

The set of operators supported by the regression module is presented below.

Binary Operators
- + : Addition operator;
- - : Subtraction operator;
- * : Multiplying operator;
- / : Division operator.
Unary Operators
- sen : Sin function;
- cos : Cos function;
- log : Logarithmic function;
- sqrt : Square root function;
- pow : Exponential function;

Implementation

This project was developed using Python 2 and the only exception stands for the experimental analysis as Python 3 was used for plotting purposes. In this section we introduce some relevant project decisions concerning the implementation.

The set of mathematical operators used are addition, subtraction, multiplication, division, sine, and cosine, since these operators have presented better results compared to an extended set. The reason for this reduction is to favor sufficiency, closure, and parsimony properties of Genetic Programming approaches. However, the code has been entirely modeled in such a way to extend the set of operators if needed.
Constant terminals are real numbers generated randomically in a range from 0 to 1. Such an interval was chosen due to the fact that they have proved to be enough for the convergency of the solution. Higher constants may lead to withdrawal from the best solution.
Operators such as sine and cosine use only one operand. Therefore, we always use the left tree node as the argument for trigonometric functions and the right node is left empty.
We also established a safe criteria for the division operator: in case of division by zero, the output is always set to zero.
The maximum depth of the tree is set to 7 as default and this value has not been evaluated in the experimental analysis.
We implemented a penalty measure for individuals that exceed the maximum depth of the tree, a situation derived from crossover and mutation. The penalty is applied to the fitness, boosting it to the maximum integer value that can be represented in Python.

Project Structure

The code was entirely designed so that future extensions could be possible, hence providing a generic Genetic Programming algorithm for different problems, which would only require individual and fitness calculus changes. The structure adopted is shown below.

chromossome.py : defines a Chromossome class that represents the tree (i.e the genotype) of an individual. It also provides methods inherent to binary trees such as tree traversal, search and replacement of a node, random subtrees generation, etc.
individual.py : defines an Individual class that represents an individual of a population, which is composed by its genotype (Chromossome) and the associated fitness. It also provides methods for the calculation of the fitness of a given problem instance and for applying apply mutation to its genotype.
population.py : defines a Population class that represents a population of individuals. It provides methods for addition and search among the set of individuals and also methods to generate the initial population, which has also been modularized in order to allow the inclusion of different approaches of population generation, even though Ramped Half-and-Half is the standard choice in this project. Furthermore, this class also provides methods to calculate the fitness of the entire population, which uses those methods defined in the Individual class.
statistics.py : defines a Statistics class that provides a simple and easy way to store and calculate statistics of each algorithm generation.
utils.py : defines an Utils class that provides a variety of common functions used in all other modules, such as checking whether a terminal is a constant, random probability generation, random integer value generation, random real value generation, etc.
ioutils.py : defines an IOUtils class that provides a simple interface to file manipulation.
genprog.py : defines a Genetic Programming class that represents the Genetic Programming itself. It coordinates the algorithm flow: generation of an initial population and the evolution of generations through selection and application of genetic operators. Besides methods regarding the training phase, this module also provides a method to perform the testing phase, which evaluates an instance of the problem using the mathematical model built in the learning process.
syregression.py : is the main program. This is the module that is executed by the user, which is responsible for parsing the parameters and performing modules calls in order to start the training and testing phases through the utilization of a Genetic Programming instance.

Execution & Parameters

In the project root directory, use the following command for running the software:

$ cd src
$ python syregression [−h] −−train TRAIN −−test TEST −−out OUT
                      [−−pop−size POP SIZE] [−−cross −prob CROSS PROB]
                      [−−mut−prob MUT PROB] [−−max−depth MAX DEPTH]
                      [−−seed SEED] [−−nvariables NVARIABLES] [−−ngen NGEN]
                      [−−tour−size TOUR SIZE] [−−elitism ELITISM]

The parameters listed above are:

--h, --help: Show help menu.
--train: Training file path (Required)
--test: Test file path (Required)
--out: Output file path (Required)
--pop-size: Population size (Default: 30)
--cross-prob: Probability of crossing-over (Default: 0.90)
--mut-prob: Probability of mutation (Default: 0.05)
--max-depth: Maximum tree depth (Default: 7)
--seed: Seed for random numbers generation (Default: 1)
--nvariables: Number of variables used to build up the mathematical function (Default: 2)
--ngen: Number of generations (Default: 30)
--tour-size: K size defined for tournament selection (Default: 2)
--elitism: Enable the utilization of elitism operators (Default: True)

Input & Output

The input files are given in CSV (Comma-Separated Values) format and their contents are all floating-point values. Hence, let n be the number of entries in a row of the input file. The first n - 1 values correspond to the input variables of a function, that is x₁, x₂, ..., x_{n - 1} , while the last entry refers to the y output of the function when variables x₁, x₂, ..., x_{n - 1} are applied to it.

The output consists of two files: out.txt and out.txt.csv, where out is the output filed defined by --out parameter. The first one stores a log in human-readable format with pertinent information regarding all Genetic Programming generations. The second one produces a CSV file with the same information, which is the input to the scripts used to perform the experimental analysis. We show below an output example.

Results

For information on the results please check the in which we present graphics and the set of experiments run in order to validate the algorithm.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
datasets		datasets
doc		doc
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Symbolic Regression employing Genetic Programming

Description

Problem Modeling

Implementation

Project Structure

Execution & Parameters

Input & Output

Results

About

Releases

Packages

Languages

License

caiozanatelli/SymbolicRegression

Folders and files

Latest commit

History

Repository files navigation

Symbolic Regression employing Genetic Programming

Description

Problem Modeling

Implementation

Project Structure

Execution & Parameters

Input & Output

Results

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages