This repository has been archived by the owner on Nov 22, 2018. It is now read-only.

Add some tutorial for unsupervised learning #76

Open
ddovod opened this issue Nov 26, 2016 · 20 comments
@ddovod

ddovod commented Nov 26, 2016

Hi!
I'm having some trouble understanding the unsupervised learning API (the IBody class).
Could you please provide some information about it? A tutorial section or documentation for this class would be nice!
Thank you a lot!

@jeremystucki
Collaborator

jeremystucki commented Nov 27, 2016

We are currently working on better documentation.
For now, I've added a quick draft that should help you.

Feel free to ask any questions.

We will keep this issue open until we have a better documentation.

@janhohenheim janhohenheim changed the title Add some tutorial for unsipervised learning Add some tutorial for unsupervised learning Nov 27, 2016
@ddovod
Author

ddovod commented Nov 27, 2016

Yes, this is exactly what I'm looking for! I'm a complete beginner in these things, and any info is useful.
And thank you guys for this project; it seems very clean and useful!

@ddovod
Author

ddovod commented Nov 27, 2016

Another question, about performance. Is it normal for supervised learning to take a long time on the simple iris dataset with a Core i5 6600 CPU? I didn't wait for it to finish, but it ran for about 15+ minutes.

@janhohenheim
Owner

Which dataset?
Can you send us your categorization code?

It definitely shouldn't take that long.

@Mafii
Collaborator

Mafii commented Nov 27, 2016

@ddovod If you're using Visual Studio, did you compile in Debug or Release mode? Running Hippocrates in Debug mode will reduce its performance by a big margin. Just re-compile in Release mode and try it again.
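For a CMake-based build, the equivalent of Visual Studio's Release configuration can be sketched like this (a minimal sketch; the out-of-source build directory name is arbitrary):

```shell
# Configure an out-of-source build with optimizations enabled;
# CMAKE_BUILD_TYPE=Release selects the compiler's Release flags (e.g. -O3 -DNDEBUG).
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
```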

@ddovod
Author

ddovod commented Nov 28, 2016

I'm using CMake, and after set(CMAKE_BUILD_TYPE Release) it runs faster on simple tasks, but on the iris dataset it still takes a long time.
I'm using the classic dataset from https://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machine-learning-and-statistics-spring-2012/datasets/iris.csv, just replacing the last column with numeric class labels.
My code is:

// Assumes IrisResult is an enum defined elsewhere in the caller's code.
#include <algorithm>
#include <cmath>
#include <fstream>
#include <iostream>
#include <random>
#include <sstream>
#include <string>
#include <vector>

void loadIris(Training::Data<IrisResult>& trData, Training::Data<IrisResult>& testData)
{
    // Parse the comma-separated file into rows of floats.
    std::vector<std::vector<float>> data;
    std::ifstream file("iris.csv");
    std::string buf;
    while (std::getline(file, buf)) {
        data.push_back({});
        std::stringstream ss(buf);
        float val;
        while (ss >> val) {
            data.back().push_back(val);
            if (ss.peek() == ',')
                ss.ignore();
        }
    }

    // std::random_shuffle is deprecated (and removed in C++17); use std::shuffle.
    std::mt19937 rng(std::random_device{}());
    std::shuffle(data.begin(), data.end(), rng);

    // 80/20 train/test split; the last value of each row is the class label.
    const std::size_t split = static_cast<std::size_t>(data.size() * 0.8);
    for (std::size_t i = 0; i < split; i++) {
        Training::Data<IrisResult>::Set set;
        set.input = std::vector<float>(data[i].begin(), data[i].end() - 1);
        set.classification = static_cast<IrisResult>(std::round(data[i].back()));
        trData.AddSet(set);
    }
    for (std::size_t i = split; i < data.size(); i++) {
        Training::Data<IrisResult>::Set set;
        set.input = std::vector<float>(data[i].begin(), data[i].end() - 1);
        set.classification = static_cast<IrisResult>(std::round(data[i].back()));
        testData.AddSet(set);
    }
}

int main()
{
    Training::Data<IrisResult> trData;
    Training::Data<IrisResult> testData;
    loadIris(trData, testData);

    Training::NeuralNetworkTrainer trainer;
    auto champ = trainer.TrainSupervised(trData, 150);
    std::cout << "Finished training in " << trainer.GetGenerationsPassed() << " generations\n";
    std::cout << "Result: " << Tests::TestingUtilities::TestNetwork(champ, testData) << std::endl;
    return 0;
}

It has been running for 20 minutes or so and hasn't finished yet.

I've been trying this code on a reduced version of this dataset (20 random objects for training, 10 for testing), and this is my output:

ddovod@/build: time ./neat
Finished training in 4693 generations
Result: 1

real	1m2.466s
user	0m42.032s
sys	0m3.176s

Maybe I'm doing something wrong?
Thank you a lot!

@janhohenheim
Owner

I'm going to look at this in more detail today, but at first glance it seems that your inputs are not between -1.0 and 1.0, which is assumed by our library. The intended usage is to divide by the theoretically highest value, as shown here.

But upon thinking about this, I decided that this is not a clean solution and your code should work as-is. I'm going to change the lib accordingly in the next few hours (#77). I would appreciate it if you could wait a moment and not change your code, so you can beta test the new feature.

@ddovod
Author

ddovod commented Nov 28, 2016

Yes, of course. I can check it this evening.

@ddovod
Author

ddovod commented Nov 28, 2016

I divided all the values by 10.0, except the class labels, and it still takes a lot of time. For the reduced dataset (20 training / 10 testing) I get the following output:

ddovod@/build: time ./neat
Finished training in 3417 generations
Result: 1

real	0m48.722s
user	0m48.696s
sys	0m0.012s

Is it normal to need 3417 generations for this dataset? Maybe you have some reference numbers for classic problems, e.g. "for binary classification with 100 training objects, 500 generations should be enough"? That would be very helpful.

@janhohenheim
Owner

janhohenheim commented Nov 28, 2016

Thank you for your feedback; know that it means a lot to us!

I have just updated the development branch, so your original code without the division should work.
Would you mind sharing your new results with us? Let's hope they're a bit faster this time :)

If there are no visible improvements, I'm going to implement #81 and then compare the results.

@ddovod
Author

ddovod commented Nov 28, 2016

Okay, it works much faster now!
Reduced dataset (faster and without errors):

ddovod@/build: time ./neat 
Finished training in 285 generations
Result: 0

real	0m0.587s
user	0m0.584s
sys	0m0.000s

But with the full dataset (120/30) there are a lot of wrong answers (maybe it's overfitting; AFAIK ANNs overfit easily on linear classification tasks), although it runs significantly faster ("Result" here is the number of bad predictions on the test data):

ddovod@/build: time ./neat 
Finished training in 1183 generations
Result: 28

real	0m20.655s
user	0m20.652s
sys	0m0.000s

There's one more issue, with the NeuralNetwork class. My compiler is g++ 6.2, and I get a compilation error:

[ 68%] Building CXX object CMakeFiles/neat.dir/src/main.cpp.o
In file included from /home/ddovod/_private/_ml/practice/neat/src/main.cpp:3:
In file included from /home/ddovod/_private/_ml/practice/neat/Hippocrates/Tests/TestingUtilities/Sources/Headers/testing_utilities.hpp:7:
In file included from /home/ddovod/_private/_ml/practice/neat/Hippocrates/Core/Sources/Headers/training/neural_network_trainer.hpp:6:
/home/ddovod/_private/_ml/practice/neat/Hippocrates/Core/Sources/Headers/trained/classifier.hpp:11:17: error: call to implicitly-deleted default constructor of 'Hippocrates::Trained::NeuralNetwork'
        Classifier() : NeuralNetwork(){ };
                       ^
/home/ddovod/_private/_ml/practice/neat/Hippocrates/Core/Sources/Headers/trained/neural_network.hpp:6:23: note: default constructor of 'NeuralNetwork' is implicitly deleted because base class 'Phenotype::NeuralNetwork' has no default constructor
class NeuralNetwork : public Phenotype::NeuralNetwork {
                      ^
1 error generated.

It can be fixed by adding a default constructor, after which everything works fine.

@ddovod
Author

ddovod commented Nov 28, 2016

Sorry, my bad; here is the correct full-dataset result:

ddovod@/build: time ./neat 
Finished training in 439 generations
Result: 0

real	0m0.961s
user	0m0.936s
sys	0m0.000s

Looks like it works fine! Thank you a lot! I will experiment with it further and maybe ask some dumb questions here. Is that okay? :)

@janhohenheim
Owner

janhohenheim commented Nov 28, 2016

Oh wow, these results really make me proud :)

I will look into adding your dataset as an integration test.
Is it fine if I use some of the code you provided in your snippet?

We are all more than happy if you experiment around and ask silly questions. We still haven't invited beta testers, so we need a lot of beginner feedback.
If you have any questions on the usability or find parts of the library to be confusing, please ask.

@jeremystucki
Collaborator

jeremystucki commented Nov 28, 2016

Thank you for helping us.

This test looks ideal for the project.
I would like to use your code in our tests, if that's okay with you.

You could also open a pull request if you want to add the test yourself.

@ddovod
Author

ddovod commented Nov 28, 2016

Yes, sure, thank you a lot :)
I'm seeing some strange things and will investigate them, so I'll return with results later this week.
Your project is very interesting to me; it's almost the only maintained NEAT-related project on GitHub, so I'll be glad if I can be useful to it. And I can open a pull request with the iris dataset and a related classification test tomorrow.

@ddovod
Author

ddovod commented Nov 29, 2016

Okay guys, another question. Why did you restrict this library to C++1z? There aren't many places where it's really needed; maybe C++14 would be enough, and it has good support in GCC and Clang (and libc++).

@ddovod
Author

ddovod commented Nov 29, 2016

Here are just a few tools I cannot use when working with Hippocrates:

  • CLion (actually I write code in Emacs, but I sometimes use CLion for debugging tasks)
  • clang and related utilities (sanitizers)
  • Xcode, which some people use (sometimes me too, mostly at work)

Right now I want to compile and run some tests with AddressSanitizer, but libc++ has no support for filesystem and a lot of other features in the experimental directory, which is a bit sad.

@ddovod
Author

ddovod commented Nov 29, 2016

And maybe it was my mistake, but it still spends a lot of time on my initial problem. I don't know why, but today I cleaned and compiled the test again, and it consumes a lot of time and memory:
https://travis-ci.org/ddovod/Hippocrates/jobs/179871731
I have no idea about the reason.

@janhohenheim
Owner

janhohenheim commented Nov 29, 2016

As you can see here, C++1z is just as well supported.
libc++ does have support for experimental; you just have to build it yourself and pass a certain build parameter, which is a pain. Perhaps we will change logging and JSON parsing to use well-tested libraries that do not use the experimental TS, which would eliminate those dependencies.

CLion's syntax highlighting has been out of date for three years, and support for it is not planned, as that would heavily bottleneck our coding style.

@ddovod
Author

ddovod commented Nov 29, 2016

Okay, it's not a big problem, I think.
I just compiled and ran all the tests with AddressSanitizer (clang-3.9 + libstdc++) and it seems okay. I want to compare the results of the iris problem solution with this library; I'll post the results here soon.
