refactor output stream and / or python plotting script, with conversion format for other plotting scripts outside the lib #110

Open
2 tasks
beniz opened this issue Jan 30, 2015 · 29 comments

Comments

@beniz
Collaborator

beniz commented Jan 30, 2015

A few issues:

  • regular and surrogates default output streams differ and thus at the moment require plotting with different scripts;
  • there are existing plotting scripts with nice visual goodies, and it would make sense to format the output for them

EDIT: relevant comment, #106 (comment)

@nikohansen
Collaborator

I volunteer to move the plotting functionality from the cma.py code into a stand alone python module.

@beniz
Collaborator Author

beniz commented Jan 30, 2015

Thanks!

This may also be the right moment to decide on a standard format for CMA output. In particular, what is the reason for using one file per subfigure, as you described in #106 (comment)?

At the moment the multiple-file design does not fit well with the libcmaes output model. The lib allows for custom and distinct progress and output functions, with defaults provided anyway. The output function writes to a single output stream, and in terms of design and performance I'd favor keeping it this way. But there may be other elements to consider as well.

@nikohansen
Collaborator

I am open to any considerations about the best format. The many-files format is, I agree, a little ugly, but makes reading in vectors with unknown length simple, in particular when future extension cannot be ruled out (for example, I added recently an output file for the eigenvalues of the correlation matrix). Otherwise one needs to have a syntax or special format to discover the dimension and possibly identify groups.

I am a little stuck with the described format, because all of my code complies with it (5 implementations of CMA-ES and 3 implementations of plotting the data). I am not likely to change all eight implementations unless for a very good reason. It should be simple though to write a transformer one-to-many-files and/or the other way around.
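The point about vectors of unknown length can be illustrated with a minimal Python sketch. It assumes whitespace-separated rows, five leading meta-data columns (the rule mentioned later in this thread), and header lines starting with `%` or `#`; the function name `read_rows` and the sample data are hypothetical and not taken from any of the existing implementations.

```python
import io

def read_rows(stream, n_meta=5):
    """Parse whitespace-separated rows; the vector length is inferred per row."""
    rows = []
    for line in stream:
        line = line.strip()
        if not line or line.startswith(("%", "#")):
            continue  # skip blank lines and (assumed) header/comment lines
        values = [float(tok) for tok in line.split()]
        # The first n_meta columns are meta-data, the rest is the vector.
        rows.append((values[:n_meta], values[n_meta:]))
    return rows

# Made-up sample standing in for one per-vector data file.
sample = io.StringIO(
    "% iteration evaluation sigma void void x...\n"
    "1 10 0.5 0 0 1.0 2.0 3.0\n"
    "2 20 0.4 0 0 0.9 1.8 2.7\n"
)
rows = read_rows(sample)
dimension = len(rows[0][1])  # discovered from the data, never declared
```

With one vector per file, the reader needs no syntax to mark where a vector starts or ends: everything after the meta-data columns belongs to it.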

@beniz
Collaborator Author

beniz commented Jan 31, 2015

Understood. It is possible to describe the format in a generic, higher-level language and have it parsed the way we want, to one or many files and back into memory. One widely used tool for defining structured formats across most platforms and languages is 'protocol buffers', https://code.google.com/p/protobuf/ (and Python tutorial: https://developers.google.com/protocol-buffers/docs/pythontutorial).

Format descriptions are independent of language and platform, and provide objects to be filled in memory and written to file. The format is extensible and accommodates optional as well as new variables and structures without breaking compatibility.

I am totally familiar with protocol buffers and can make a format description proposal that would match the legacy one (yours) while retaining the ability to choose between one or more files, as well as to re-acquire the data without needing to bother about the number of columns, etc.

In short, the description for the first few columns of your outcmaesfit.dat file could be something like:

message CMAFitLine {
  required int32 iteration = 1;
  required double feval = 2;
  required double sigma = 3;
}

message CMAFit {
  repeated CMAFitLine fitline = 1;
}

One drawback is that in serialized (and possibly compressed) form the data file would not be human readable anymore, though some implementations do support writing raw data. The serialized form would be a plus in high dimensions however.

@nikohansen
Collaborator

How do you describe a field with variable length / number of data?

@nikohansen
Collaborator

I am generally not quite in favor of writing encoded/compressed data. The use case to have a quick look at the data file is just too common.

@beniz
Collaborator Author

beniz commented Jan 31, 2015

> The use case to have a quick look at the data file is just too common.

Totally agreed, this is one thing I'd need to check per implementation. Also, not saying we must go down this road, just a proposal at this stage.

@beniz
Collaborator Author

beniz commented Jan 31, 2015

> How do you describe a field with variable length / number of data?

The repeated keyword does this, as in the very short example above. The size is then dynamically obtained from the in-memory object obtained from parsing the file.

@beniz
Collaborator Author

beniz commented Feb 4, 2015

Below is a first proposal for an extendable output format based on protocol buffers. The following decisions & assumptions apply:

  • ability to read / write with protocol buffer code (i.e. one line of code) in serialized form
  • ability to write in non-serialized form, but losing the ability to read it back without custom code
  • usage of a column-based storage, i.e. every array (keyword repeated) stores values in time, as opposed to a row-based representation as in the previous tiny 'sketch'. This is open for discussion of course; I simply deemed it more practical for plotting, because one gets a full vector at once instead of having to read everything back line by line
  • removed the void and 0 in your legacy format since I understand they are used to mark the beginning of the vector entries. However they can easily be added back when writing the output in human readable form
  • put sigma in header even if not present in outcmaesxmean.dat, just thought it'd make better sense
  • ability to extend format with custom output (see example with accuracy, e.g. optimizing for a machine learning application)
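The difference between the row-based sketch and the column-based storage chosen above can be shown with plain Python containers. This is only an illustration of the layout idea, not protocol buffer code; the class names `FitRow` and `FitColumns` and the helper `rows_to_columns` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FitRow:
    """Row-based: one record per iteration."""
    iteration: int
    evaluation: int
    sigma: float

@dataclass
class FitColumns:
    """Column-based: one growing list per quantity (like a 'repeated' field)."""
    iteration: list = field(default_factory=list)
    evaluation: list = field(default_factory=list)
    sigma: list = field(default_factory=list)

def rows_to_columns(rows):
    """Transpose a list of per-iteration records into per-quantity lists."""
    cols = FitColumns()
    for r in rows:
        cols.iteration.append(r.iteration)
        cols.evaluation.append(r.evaluation)
        cols.sigma.append(r.sigma)
    return cols

rows = [FitRow(1, 10, 0.5), FitRow(2, 20, 0.4), FitRow(3, 30, 0.3)]
cols = rows_to_columns(rows)
# A plotting routine can now use cols.iteration and cols.sigma directly
# as x and y data, without walking the records one by one.
```

The trade-off is that appending one iteration touches every column, whereas the row-based layout appends a single record.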

Here is the format proposal:

message Header
{
 repeated int32 iteration = 1;
 repeated int32 evaluation = 2;
 repeated double sigma = 3;
 optional int32 seed = 4;
 optional string date = 5;
}

message CMAFit
{
 optional Header head = 1;
 repeated double axis_ratio = 3;
 repeated double bestever = 4;
 repeated double best = 5;
 repeated double median = 6;
 repeated double worst = 7;
 repeated double more_data = 8; // XXX: or use extensions
}

message CMAXRecentBest
{
 optional Header head = 1;
 repeated double fitness = 3;
 repeated double xbest = 4;
}

message CMAXMean
{
 optional Header head = 1;
 repeated XMean xmean = 2;
}

message CMAAXLen
{
 optional Header head = 1;
 repeated double max_axis_length = 3;
 repeated double min_axis_length = 4;
 repeated SqrtEigenVals all_axes_length = 5;
}

message CMAStdDev
{
 optional Header head = 1;
 repeated Stds stds = 3;
}

message XMean
{
 repeated double x = 1;
}
message SqrtEigenVals
{
 repeated double sqrteigenval = 1;
}

message Stds
{
 repeated double std = 1;
}

message LegacyCMAOutput
{
 required CMAFit fit = 1;
 required CMAXRecentBest recentbest = 2;
 required CMAXMean xmean = 3;
 required CMAAXLen axlen = 4;
 required CMAStdDev std = 5;
}

message UniqueCMAOutput
{
 required Header head = 1;
 required CMAFit fit = 2;
 required CMAXRecentBest recentbest = 3;
 required CMAXMean xmean = 4;
 required CMAAXLen axlen = 5;
 required CMAStdDev std = 6;
 extensions 100 to 150; // for custom output additions
}

and example of a custom extension:

import "out.proto";

extend UniqueCMAOutput
{
 repeated double accuracy = 100;
}

Besides discussion, corrections and improvements, a next step could be for me to open a new independent git repository with support for the output format, protocol buffers with Python and C++ code procedures for using the format in typical CMA implementations.

@nikohansen
Collaborator

I spotted two possible additions:

  • The Header could have optionally the dimension
  • CMAXMean could have optionally the fitness of the mean

I wouldn't put sigma in the header, it fits best into CMAStdDev and second in CMAAXLen. The main reason why the legacy has sigma and axis_ratio in CMAFit is that they do not depend on dimension. That is, there is a single file holding all possibly relevant data that do not depend on dimension and are therefore easy to manage also with very large dimension.

Having max_axis_length and min_axis_length instead of axis_ratio is maybe better. The legacy has (a) chosen data which are often plotted without further processing, that is, I can do the plotting in two Python lines or so and (b) adheres to the (weird) first-five-columns-are-meta-data rule also for CMAFit. That's why we see axis_ratio.

I guess my concern about human readability remains.

@beniz
Collaborator Author

beniz commented Feb 5, 2015

> I guess my concern about human readability remains.

A human readable output could be worked out for both the single and multiple files formats. In this case, of course one of the only remaining advantages of a structured format such as the one above is the clarity within the code.

Taking a look at the future, a few points that could be taken into consideration in the present discussion:

  • ability to store the search state of the optimizer in serialized form (e.g. in-memory)
  • ability to exchange search states across machines for distributed computations
  • ability to reuse search states in other applications, including optimizers.

There is probably no full use for this in the very near future, but I am considering ways in which it could become a building block for later use.

@beniz
Collaborator Author

beniz commented Feb 13, 2015

The extendable output format comes with some difficulties, one of which is to keep the ability to serialize it to disk incrementally, i.e. without keeping the full history object in memory. I believe this is a good thing to have in the mid-term, but right now there is enough to do without introducing such a big piece of code immediately. Plus I'd like to release the series of bug fixes as a new release.

Therefore, I am implementing a first path to fulfilling this issue as follows:

  • in-lib capability of a 'full' output function that writes to a single file all data required by the legacy format (i.e. worst candidate, best ever, ...);
  • a python script containing a conversion function from a single file to the multiple files of the legacy format.

This should make it possible to work with the minimal Python workflow of #116.

INFO: in 'legacy_106' branch (106 is a mistake, should have called it 116 or 110...)
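A conversion from a single file to the multiple legacy files, as planned above, might look roughly like the following sketch. The column layout of the single file is an assumption made for illustration (eight scalar columns followed by the mean vector), as is the padding of the vector file to five meta-data columns with zeros; only the two output file names come from this thread. It is not the code of the actual conversion script.

```python
import os
import tempfile

# Assumed scalar columns of the single file, in this order:
# iteration evaluation sigma axis_ratio bestever best median worst
N_SCALARS = 8

def convert(single_path, out_dir):
    """Split a single output file into two legacy-style files."""
    fit_path = os.path.join(out_dir, "outcmaesfit.dat")
    xmean_path = os.path.join(out_dir, "outcmaesxmean.dat")
    with open(single_path) as src, \
            open(fit_path, "w") as fit, \
            open(xmean_path, "w") as xmean:
        for line in src:
            tokens = line.split()
            if len(tokens) <= N_SCALARS:
                continue  # blank line or row without vector entries
            scalars, x = tokens[:N_SCALARS], tokens[N_SCALARS:]
            fit.write(" ".join(scalars) + "\n")
            # Pad to five meta-data columns (assumed placeholder value 0).
            xmean.write(" ".join(scalars[:2] + ["0", "0", "0"] + x) + "\n")
    return fit_path, xmean_path

# Usage with made-up data in a temporary directory.
work_dir = tempfile.mkdtemp()
single_path = os.path.join(work_dir, "full.dat")
with open(single_path, "w") as f:
    f.write("1 10 0.5 1.0 3.0 3.0 4.0 5.0 0.1 0.2\n")
    f.write("2 20 0.4 1.1 2.0 2.0 2.5 3.0 0.3 0.4\n")
fit_path, xmean_path = convert(single_path, work_dir)
```

Tokens are copied as strings rather than parsed to floats, so the number formatting of the source file is preserved in the converted files.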

beniz pushed a commit that referenced this issue Feb 13, 2015
…rt the legacy format to plotting CMA-ES results + worst candidate, ref #110
beniz pushed a commit that referenced this issue Feb 13, 2015
@nikohansen
Collaborator

> one of which is to keep the ability to serialize it to disk incrementally

I assume this doesn't prevent us from monitoring a run online, right? In practice, this is what I always do, unless the objective function is extremely cheap (which is virtually never the case).

In general, if the output format will not allow incremental writing, I have doubts that it will ever meet the performance objectives you have for the library.
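Incremental writing, as discussed here, can be sketched in a few lines of Python: each iteration appends one text line and flushes it, so the file on disk is always a valid prefix of the full run and can be plotted while the optimizer is still running. The function name `append_iteration` and the four-column row are hypothetical, chosen only for this illustration.

```python
import os
import tempfile

def append_iteration(path, iteration, evaluation, sigma, fvalue):
    """Append one row and push it to disk so that a reader sees it at once."""
    with open(path, "a") as f:
        f.write(f"{iteration} {evaluation} {sigma} {fvalue}\n")
        f.flush()
        os.fsync(f.fileno())

path = os.path.join(tempfile.mkdtemp(), "run.dat")
lines_seen = []
for it in range(1, 4):
    append_iteration(path, it, 10 * it, 1.0 / it, 100.0 / it)
    # A monitoring process could re-read the file at this point.
    with open(path) as f:
        lines_seen.append(sum(1 for _ in f))
```

Nothing beyond the current row is kept in memory, which is what a column-based serialized object makes difficult.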

@beniz
Collaborator Author

beniz commented Feb 14, 2015

Yes this is correct, the output to multiple files will not be incremental for now, precluding the online monitoring with the legacy plotting functions. This is until the full generic format gets implemented with incremental serialization to disk.

The reason why I switched to an easier immediate solution yesterday is that for incremental serialization to function properly, the format above needs a full 'line-based' refactoring, which will complicate the plotting code as well. I still believe this is the way to go in the future, but not immediately as I need to focus on more important tasks, such as the profile likelihood in eigenspace.

My immediate target is the simple workflow in Python along with (not online) legacy plotting capability so that results can be more easily compared across implementations.

Let me know if you believe this is not a good intermediate decision.

@nikohansen
Collaborator

But, if I understand correctly then, that prevents online monitoring altogether, e.g. of a remote job?

@beniz
Collaborator Author

beniz commented Feb 14, 2015

Depends if I can get the one to multiple file conversion script to work on a partially filled output. Should be able to though...
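One way to make a converter tolerate a partially filled output is to process only lines that are terminated by a newline, and to ignore a trailing line that the optimizer may still be writing. The following is a minimal sketch under that assumption; `complete_lines` is a hypothetical helper, not part of the conversion script.

```python
def complete_lines(text):
    """Return the newline-terminated, non-blank lines of text."""
    lines = text.split("\n")
    # After the split, the last element is "" if the text ended with a
    # newline; otherwise it is an unterminated, possibly half-written line.
    # In both cases it is dropped.
    return [ln for ln in lines[:-1] if ln.strip()]

# Made-up snapshot of a file taken in the middle of a write.
snapshot = "1 10 0.5 3.0\n2 20 0.4 2.5\n3 30 0."
rows = [ln.split() for ln in complete_lines(snapshot)]
```

Here the third row is skipped because its last value is incomplete; it would be picked up on the next conversion, once the line has been finished.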

@beniz beniz self-assigned this Feb 14, 2015
beniz pushed a commit that referenced this issue Feb 16, 2015
@beniz
Collaborator Author

beniz commented Feb 16, 2015

Added a legacy format conversion tool to branch 'legacy_106', as python/cma_legacyplt.py. It has a convert function, and can also be used from the command line:

python cma_legacyplt.py ros_full.dat

where ros_full.dat is obtained with:

./tests/test_functions -fname rosenbrock -dim 20 -full_fplot -fplot ros_full.dat

At the moment, I am able to plot from the converted files with the plotcmaesdat script for Octave, but not from the Python code, i.e. with

import cma
cma.plot()

which yields the following error:

WARNING (module=cma, class=CMADataLogger, method=load):  reading from file "outcmaesaxlencorr.dat" failed
WARNING (module=cma, class=CMADataLogger, method=load):  no data for outcmaesaxlencorr.dat
/home/beniz/research/siminole/dev/libcmaes/python/cma.py:6403: RuntimeWarning: invalid value encountered in less
  dfit[dfit < 1e-98] = np.NaN
/home/beniz/research/siminole/dev/libcmaes/python/cma.py:6430: RuntimeWarning: invalid value encountered in less
  sgn[np.abs(dat.f[:, 5]) < 1e-98] = 0
/home/beniz/research/siminole/dev/libcmaes/python/cma.py:6431: RuntimeWarning: invalid value encountered in less
  idx = np.where(sgn < 0)[0]
/home/beniz/research/siminole/dev/libcmaes/python/cma.py:6437: RuntimeWarning: invalid value encountered in less
  start_idx = 1 + np.where((dsgn < 0) * (sgn[1:] < 0))[0]
/home/beniz/research/siminole/dev/libcmaes/python/cma.py:6438: RuntimeWarning: invalid value encountered in greater
  stop_idx = 1 + np.where(dsgn > 0)[0]
Traceback (most recent call last):
  File "test_legacy.py", line 2, in <module>
    cma.plot()
  File "/home/beniz/research/siminole/dev/libcmaes/python/cma.py", line 6795, in plot
    x_opt, fontsize)
  File "/home/beniz/research/siminole/dev/libcmaes/python/cma.py", line 6153, in plot
    self.plot_divers(iabscissa, foffset)
  File "/home/beniz/research/siminole/dev/libcmaes/python/cma.py", line 6491, in plot_divers
    text(dat.f[idx, iabscissa][-1], dfit[idx][-1],
IndexError: index out of bounds

beniz pushed a commit that referenced this issue Feb 16, 2015
@nikohansen
Collaborator

For some reason I have no ./tests/test_functions (anymore).

@beniz
Collaborator Author

beniz commented Feb 16, 2015

The only logical explanation would be that you are missing gflags and therefore test_functions doesn't get built.

@nikohansen
Collaborator

Right, how can I reproduce the problem then? If you can provide ros_full.dat, I should be fine.

beniz pushed a commit that referenced this issue Feb 16, 2015
@beniz
Collaborator Author

beniz commented Feb 16, 2015

Now you can do it from Python with p.set_full_fplot(True), where p is a CMAParametersXX object (XX=NB, PB, ...). This is on branch legacy_106.

@nikohansen
Collaborator

OK, the reason for the failure is that the first f-value is nan. I will prepare a fix and also fix the runtime warnings.

@nikohansen
Collaborator

The median and largest f-value are both 0.0 in iteration 0, which I would consider to be a semi-bug.

@nikohansen
Collaborator

On the contrary, the axis ratio in iteration 0 could rather be 1.0 instead of nan (this was the reason for one of the runtime warnings).
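A reader of such data can also guard against these iteration-0 values before plotting. The sketch below is an illustration only and not the fix applied to the plotting code: it drops rows whose f-value is nan and substitutes 1.0 for a nan axis ratio, as suggested above. The function name `sanitize` and the three-column row are hypothetical.

```python
import math

def sanitize(rows):
    """rows: list of (iteration, fvalue, axis_ratio) tuples."""
    cleaned = []
    for iteration, fvalue, axis_ratio in rows:
        if math.isnan(fvalue):
            continue  # no f-value to plot for this row
        if math.isnan(axis_ratio):
            axis_ratio = 1.0  # replacement value suggested in this thread
        cleaned.append((iteration, fvalue, axis_ratio))
    return cleaned

nan = float("nan")
raw = [(0, nan, nan), (1, 5.0, nan), (2, 4.0, 1.2)]
cleaned = sanitize(raw)
```

Filtering before any comparison such as `dfit < 1e-98` also avoids the "invalid value encountered" runtime warnings, which comparisons with nan trigger.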

@nikohansen
Collaborator

Fix for plotting with Python is available here.

@beniz
Collaborator Author

beniz commented Feb 17, 2015

The two commits above fix the initial values of the median, worst and condition number.

I've tested the new plotting script, and it works just fine, though the plot still disappears too quickly to be seen. Here is what I do:

import cmaplt
cmaplt.plot()

In order to see the plot, I remove the pyplot.ion() call at https://github.com/CMA-ES/plotting-cma-data/blob/master/src/cmaplt.py#L1071

@nikohansen
Collaborator

I can't reproduce this and don't quite understand why this is the case :-( Does cmaplt.pyplot.show() or cmaplt.pyplot.gcf() have any effect? For the latter, do you see a figure coming up and what is the figure number in the window? Does cmaplt.pyplot.ioff(); cmaplt.pyplot.show() work (after cmaplt.plot())?

@beniz
Collaborator Author

beniz commented Feb 23, 2015

So, in lcmaes_interface.py, replacing the plot high level function https://github.com/beniz/libcmaes/blob/dev/python/lcmaes_interface.py#L83 with the one below does the trick (no other combination would work on my machine):

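def plot(file=None):
    # Plot the given file, or the module-level default fplot_current.
    cmaplt.plot(file if file else fplot_current)
    # Turn interactive mode off so that show() blocks and the figure stays open.
    cmaplt.pylab.ioff()
    cmaplt.pylab.show()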

The same applies to simple.py, so I'll commit this change.

beniz pushed a commit that referenced this issue Feb 23, 2015
beniz pushed a commit that referenced this issue Feb 23, 2015
@beniz
Collaborator Author

beniz commented Feb 23, 2015

The legacy format generator and conversion tools are now in the 'dev' branch, ready for the next release. I've tested again against your new cmaplt.py and it works just fine.

I've added a python/README.legacy file with short explanations on how to plot in legacy format. I believe this is mostly for comparison with other existing implementations and because the graphs are a bit more informative and beautiful.

Unless there are other details, I believe this ticket should be fulfilled for now.

andrewsali pushed a commit to andrewsali/libcmaes that referenced this issue Jan 31, 2016
…rt the legacy format to plotting CMA-ES results + worst candidate, ref CMA-ES#110
andrewsali pushed a commit to andrewsali/libcmaes that referenced this issue Jan 31, 2016
andrewsali pushed a commit to andrewsali/libcmaes that referenced this issue Jan 31, 2016
andrewsali pushed a commit to andrewsali/libcmaes that referenced this issue Jan 31, 2016
andrewsali pushed a commit to andrewsali/libcmaes that referenced this issue Jan 31, 2016