ICDM2018Submission/VFDT-split-time-prediction

VFDT-split-time-prediction

Installation:

  • Clone the adapted MOA repository from here
  • MOA provides several algorithms derived from the Hoeffding-Tree. These have been adapted as well and can use the local split-time prediction. The main split-time prediction algorithms are located in HoeffdingTree.java.
  • If you want to get only the modified files to integrate them into your local MOA version, they are located here
  • Build MOA (the easiest way is to use an IDE such as IntelliJ)

Using the local split-time prediction

Select the type of split-time prediction you want to use in the properties of the Hoeffding-Tree, then run your experiments.

Datasets

Artificial:

Gaussian distributions with random initial positions, weights and standard deviations are generated in d-dimensional space. The weight controls the partitioning of the examples among the Gaussians.

This dataset was generated using MOA with the following parameters: 10 million instances, 100 dimensions, 50 Gaussians, 50 classes, 100 centroids.
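
The Gaussian generator described above can be illustrated with a minimal sketch. It is not MOA's implementation: the class `GaussianMixtureSketch`, its fields, and the choice of isotropic Gaussians with centers in the unit cube are assumptions made for illustration. It only shows how the weights control which Gaussian an instance is drawn from.

```java
import java.util.Random;

// Hypothetical sketch, not MOA's generator: all names here are illustrative.
final class GaussianMixtureSketch {
    final double[][] centers; // one random center per Gaussian, in [0, 1)^d
    final double[] stdDevs;   // one random standard deviation per Gaussian
    final double[] weights;   // unnormalized selection weights
    final int[] labels;       // class label attached to each Gaussian
    final double weightSum;
    final Random rnd;

    GaussianMixtureSketch(int numGaussians, int dims, int numClasses, long seed) {
        rnd = new Random(seed);
        centers = new double[numGaussians][dims];
        stdDevs = new double[numGaussians];
        weights = new double[numGaussians];
        labels = new int[numGaussians];
        double sum = 0.0;
        for (int g = 0; g < numGaussians; g++) {
            for (int d = 0; d < dims; d++) {
                centers[g][d] = rnd.nextDouble();
            }
            stdDevs[g] = rnd.nextDouble();
            weights[g] = rnd.nextDouble();
            labels[g] = rnd.nextInt(numClasses);
            sum += weights[g];
        }
        weightSum = sum;
    }

    // Picks a Gaussian with probability proportional to its weight.
    int pickGaussian() {
        double r = rnd.nextDouble() * weightSum;
        for (int g = 0; g < weights.length; g++) {
            r -= weights[g];
            if (r < 0) {
                return g;
            }
        }
        return weights.length - 1; // guard against floating-point rounding
    }

    // Returns the feature values followed by the class label in the last slot.
    double[] nextInstance() {
        int g = pickGaussian();
        int dims = centers[g].length;
        double[] x = new double[dims + 1];
        for (int d = 0; d < dims; d++) {
            x[d] = centers[g][d] + stdDevs[g] * rnd.nextGaussian();
        }
        x[dims] = labels[g];
        return x;
    }
}
```

For example, `new GaussianMixtureSketch(50, 100, 50, 42L).nextInstance()` would return an array of 101 values: 100 features plus the label.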

The random tree generator (RTG) in MOA constructs a decision tree by splitting randomly along the attributes and assigning a random class to each leaf. Numeric and nominal attributes are supported, and the tree depth can be predefined. Instances are generated by uniform sampling along each attribute. Traversing the tree with the instance determines the corresponding class label.

This dataset was generated using MOA with the following parameters: 5 million instances, 100 numeric dimensions, 100 nominal dimensions, 25 classes, max tree depth 15.
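
The random-tree labeling described above can be sketched as follows. This is a simplified illustration, not MOA's code: the class `RandomTreeSketch` is hypothetical, it handles numeric attributes only, and it always grows a full tree down to the maximum depth.

```java
import java.util.Random;

// Hypothetical sketch of a random-tree labeler; numeric attributes only.
final class RandomTreeSketch {
    // Inner nodes test one attribute against a threshold; leaves hold a class.
    static final class Node {
        int attribute;
        double threshold;
        int label;
        Node left;
        Node right;

        boolean isLeaf() {
            return left == null;
        }
    }

    private final Node root;
    private final int dims;
    private final Random rnd;

    RandomTreeSketch(int dims, int numClasses, int maxDepth, long seed) {
        this.dims = dims;
        this.rnd = new Random(seed);
        this.root = build(0, maxDepth, numClasses);
    }

    // Recursively builds a full tree with random splits and random leaf classes.
    private Node build(int depth, int maxDepth, int numClasses) {
        Node n = new Node();
        if (depth >= maxDepth) {
            n.label = rnd.nextInt(numClasses);
            return n;
        }
        n.attribute = rnd.nextInt(dims);
        n.threshold = rnd.nextDouble();
        n.left = build(depth + 1, maxDepth, numClasses);
        n.right = build(depth + 1, maxDepth, numClasses);
        return n;
    }

    // Traverses the tree with the given feature values to find the class label.
    int classify(double[] x) {
        Node n = root;
        while (!n.isLeaf()) {
            n = x[n.attribute] < n.threshold ? n.left : n.right;
        }
        return n.label;
    }

    // Samples each attribute uniformly from [0, 1) and appends the label.
    double[] nextInstance() {
        double[] x = new double[dims + 1];
        for (int d = 0; d < dims; d++) {
            x[d] = rnd.nextDouble();
        }
        x[dims] = classify(x);
        return x;
    }
}
```

Because the label is a deterministic function of the features, the stream is noise-free; only the tree itself is random.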

This generator yields instances with 24 boolean features, 17 of which are irrelevant. The remaining features correspond to the segments of a seven-segment LED display. The goal is to predict the digit displayed on the LED display, where each feature has a 10% chance of being inverted. Drift is generated by swapping the relevant features with irrelevant ones. We used the LEDDrift generator in MOA (7 drifting dimensions, 10% noise).
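
A minimal sketch of the LED stream described above is given below. It is an illustration under assumptions, not MOA's LEDDrift generator: the class `LedDriftSketch`, the segment order (top, top-left, top-right, middle, bottom-left, bottom-right, bottom), applying noise only to the seven segment attributes, and the way drift swaps segment `i` with the irrelevant attribute in column `i + 17` are all choices made here for illustration.

```java
import java.util.Random;

// Hypothetical sketch of an LED stream with attribute-swap drift.
final class LedDriftSketch {
    static final int NUM_ATTRIBUTES = 24;
    static final int NUM_SEGMENTS = 7;

    // Lit segments per digit 0..9 in the order:
    // top, top-left, top-right, middle, bottom-left, bottom-right, bottom.
    static final int[][] SEGMENTS = {
        {1, 1, 1, 0, 1, 1, 1}, {0, 0, 1, 0, 0, 1, 0}, {1, 0, 1, 1, 1, 0, 1},
        {1, 0, 1, 1, 0, 1, 1}, {0, 1, 1, 1, 0, 1, 0}, {1, 1, 0, 1, 0, 1, 1},
        {1, 1, 0, 1, 1, 1, 1}, {1, 0, 1, 0, 0, 1, 0}, {1, 1, 1, 1, 1, 1, 1},
        {1, 1, 1, 1, 0, 1, 1}
    };

    // column[i] is the output column that logical attribute i is written to.
    private final int[] column = new int[NUM_ATTRIBUTES];
    private final double noise;
    private final Random rnd;

    LedDriftSketch(double noise, long seed) {
        this.noise = noise;
        this.rnd = new Random(seed);
        for (int i = 0; i < NUM_ATTRIBUTES; i++) {
            column[i] = i;
        }
    }

    // Swaps the columns of the first k segments with k irrelevant attributes.
    void drift(int k) {
        for (int i = 0; i < k && i < NUM_SEGMENTS; i++) {
            int other = i + (NUM_ATTRIBUTES - NUM_SEGMENTS);
            int tmp = column[i];
            column[i] = column[other];
            column[other] = tmp;
        }
    }

    // Generates the 24 boolean features for the given digit (0..9).
    boolean[] generate(int digit) {
        boolean[] x = new boolean[NUM_ATTRIBUTES];
        for (int i = 0; i < NUM_ATTRIBUTES; i++) {
            boolean bit;
            if (i < NUM_SEGMENTS) {
                bit = SEGMENTS[digit][i] == 1;
                if (rnd.nextDouble() < noise) {
                    bit = !bit; // invert the segment with probability `noise`
                }
            } else {
                bit = rnd.nextBoolean(); // irrelevant attribute: random bit
            }
            x[column[i]] = bit;
        }
        return x;
    }
}
```

With `noise = 0.1` this mirrors the 10% inversion rate mentioned above, and `drift(7)` moves all seven relevant attributes, so a learner has to rebuild its splits on the new columns.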

Real-world:

Ten of the colorful buildings next to the famous Rialto bridge in Venice are encoded in a normalized 27-dimensional RGB histogram. The images were obtained from time-lapse videos captured by a webcam with a fixed position. The recordings cover 20 consecutive days during May-June 2016. Continuously changing weather and lighting conditions affect the representation, generating natural concept drift.
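
One plausible way to obtain a normalized 27-dimensional RGB histogram is sketched below. The binning is an assumption, since the description above does not specify it: the sketch uses three bins per color channel in a joint histogram (3 x 3 x 3 = 27 bins), and the class `RgbHistogramSketch` is hypothetical.

```java
// Hypothetical sketch: joint RGB histogram with 3 bins per channel (assumed).
final class RgbHistogramSketch {
    // Pixels are packed as 0xRRGGBB integers; the result sums to 1
    // whenever at least one pixel is given.
    static double[] histogram(int[] pixels) {
        double[] h = new double[27];
        for (int p : pixels) {
            int r = (p >> 16) & 0xFF;
            int g = (p >> 8) & 0xFF;
            int b = p & 0xFF;
            // Map each 0..255 channel value to a bin index 0..2.
            int bin = (r * 3 / 256) * 9 + (g * 3 / 256) * 3 + (b * 3 / 256);
            h[bin] += 1.0;
        }
        if (pixels.length > 0) {
            for (int i = 0; i < h.length; i++) {
                h[i] /= pixels.length; // normalize counts to relative frequencies
            }
        }
        return h;
    }
}
```

Normalizing by the pixel count makes the features independent of the image region's size, so only the color distribution, which shifts with weather and lighting, changes over time.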

The Airline data set was inspired by the regression data set from Ikonomovska. The task is to predict whether a given flight will be delayed, based on seven attributes encoding various information on the scheduled departure. This dataset is often used to evaluate concept drift classifiers.

This dataset assigns cartographic variables such as elevation, slope, soil type, ... of 30 x 30 meter cells to different forest cover types. Only forests with minimal human-caused disturbances were used, so the resulting forest cover types are more a result of ecological processes. It is often used as a benchmark for drift algorithms. We used the normalized version, which can also be found [here](http://moa.cms.waikato.ac.nz/datasets/).

One million randomly drawn poker hands are represented by five cards, each encoded by its suit and rank. The class is the resulting poker hand, such as one pair, full house, and so forth. In its original form this dataset has no drift, since the poker hand definitions do not change and the instances are randomly generated. However, we used the version presented in PAW, in which virtual drift is introduced by sorting the instances by rank and suit. Duplicate hands were also removed. We used the normalized version, which can also be found here.

Loosli et al. used pseudo-random deformations and translations to extend the well-known MNIST database to eight million instances. The ten handwritten digits are encoded in 782 binary features.

This dataset consists of eleven million simulated particle collisions. The goal of this binary classification problem is to distinguish between a signal process producing Higgs bosons and a background process. The data consist of low-level kinematic features recorded as well as some derived high-level indicators.
