Make mesa scalable! #798

Closed
rithwikjc opened this issue Feb 26, 2020 · 22 comments

Comments

@rithwikjc

What's the problem this feature will solve?
Currently mesa is a great tool for visualizing and studying ABMs (and the best in Python), but in my experience, like other tools I've tried, it prevents us from taking full advantage of ABMs. What is required for this is scalability. ABMs are currently a hot area of research; to make sure mesa doesn't die out, it should be able to deal with

  • Larger number of agents
  • Larger numbers of time steps

Currently the basic mesa package simply crashes for large numbers of agents and large numbers of steps, mainly owing to the datacollector getting overloaded.

Describe the solution you'd like
An alternate datacollector class could be made that stores/writes data periodically, so as not to crash the system when it runs out of memory. More support for and documentation of parallel processing would also help. Documentation is important here: without it, anything that is implemented currently just goes unused. I think both of these need to be addressed quickly and systematically. I am willing to help in any way to make mesa future-proof and a proper research tool.

Additional context
I was working on my master's thesis, which needs ABMs to run for up to $10^6$ to $10^8$ steps, and mesa simply failed at this. I've had to write my own code to even just run the model without memory overload.

@Corvince
Contributor

Corvince commented Feb 27, 2020

Hey there,

first of all if you really want to scale up ABMs I think Python is the wrong modelling language. That said I fully agree that the "cost of mesa" should be as small as possible. From my experience there are several performance gains for the Grid implementation and I am currently working on a different implementation of them.

For the datacollection I don't see an urgent problem. If I understand you correctly you were simply running out of memory. However, data collection in mesa is pretty explicit. You have to call datacollector.collect() yourself whenever you want to collect data. So if you only want to do it periodically you can do so yourself. The datacollector itself has no sense of time and implementing that functionality would require rewriting how it works (not saying it is a bad idea, but I don't see how the work would pay off).

An alternative would be to write the data of the collector to the disk or a database. But again, this is something that can already be done explicitly. But I agree that this could be done more user-friendly with a dedicated function of the datacollector (with the question if values should also remain in memory or not). Note, however, that this will likely incur some performance cost, because IO and/or writing to a database is rather slow (compared to doing nothing).
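As a rough illustration of the "write to disk instead of keeping everything in memory" idea, here is a minimal standard-library sketch. `CsvStreamWriter`, its `buffer_size` parameter and the column names are hypothetical and not part of Mesa; the only point is that rows are buffered briefly and then appended to a file, so memory use stays bounded.

```python
import csv
import os
import tempfile


class CsvStreamWriter:
    """Hypothetical helper (not a Mesa class): buffer rows, append them to a CSV file."""

    def __init__(self, path, fieldnames, buffer_size=1000):
        self.path = path
        self.fieldnames = list(fieldnames)
        self.buffer_size = buffer_size
        self._buffer = []
        # Start a fresh file that contains only the header row.
        with open(self.path, "w", newline="") as f:
            csv.DictWriter(f, fieldnames=self.fieldnames).writeheader()

    def add(self, row):
        """Queue one row (a dict); flush to disk once the buffer is full."""
        self._buffer.append(row)
        if len(self._buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        """Append all buffered rows to the file and clear the buffer."""
        if not self._buffer:
            return
        with open(self.path, "a", newline="") as f:
            csv.DictWriter(f, fieldnames=self.fieldnames).writerows(self._buffer)
        self._buffer.clear()


# Usage: 2,500 rows are written, but at most 1,000 are held in memory at once.
out_path = os.path.join(tempfile.mkdtemp(), "agent_data.csv")
writer = CsvStreamWriter(out_path, ["step", "agent_id", "wealth"])
for step in range(250):
    for agent_id in range(10):
        writer.add({"step": step, "agent_id": agent_id, "wealth": step + agent_id})
writer.flush()  # write whatever is left in the buffer
```

The trade-off mentioned above still applies: a smaller `buffer_size` lowers memory use but means more frequent file I/O.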

Lastly I had some good experiences with pickling whole models. This also allows you to restart your model from a later state, if it ever crashes.

@jackiekazil
Member

@rithwikjc to add to @Corvince's comments...

Any chance you can create a "hello world" like model that illustrates the break down? We have heard various feedback here and there, but have never had a model illustrate it. This would be helpful. (Or maybe modify one of the models in the model folder?)

RE: Datacollector updates -- I could see that. (I think our issue is that we have a lot of things in motion and not enough contributors, so it is harder to prioritize something like that.)

RE: Parallel Processing -- This is a tricky one, because splitting a model over multiple cores is known to introduce unintended artifacts into the model's outcomes. I am sure there is research out there about mitigating this, but I wouldn't want to build something off of one paper, since real research would depend on how this thing works. It would take a very intensive research project to ensure that the only artifacts are speed gains and, if there were other artifacts, that a model creator was aware of them and chose to live their best life with them. (I hope a PhD student comes along and builds a dissertation off of this and contributes it back. #shameless)

@jackiekazil
Member

Also... probably not what you are looking for, but BatchRunnerMP is a multi-processing class for running batches.

@Corvince
Contributor

Corvince commented Mar 5, 2020

RE: Parallel Processing -- At first glance multiprocessing seems like a natural fit for ABMs, since a lot of the agents' tasks could theoretically be done in parallel. However, the interesting aspects of ABMs (and where they differ from other models) come from the interaction of agents, and this is known to be tricky to do in parallel. I also don't think it is usually worth it. Even if you managed to speed up a single model run by a factor x (equal to the number of available cores), you usually conduct several model runs anyway. There is then no speed gain from parallelizing within a run compared with simply running x ordinary runs in parallel, one per core (thus, BatchRunnerMP).

@rithwikjc
Author

Hey @Corvince

Just curious. Which language would be suited for something like an ABM with, let's say, agents in the order of 100s and time steps in the order of millions? Since the interactions have to happen sequentially in the population, I don't see how a different language can really improve things in many cases.

Indeed, it was a memory error. 😄
Agreed. I am currently using my own data-writing function, which is activated at a certain frequency, rather than the datacollector class. Still, I believe the datacollector should at least have an option to write to a file periodically, rather than leaving that as a manual task for library users.

Woah, pickling whole models as in, writing the entire model at some state to a pkl? Do you suggest that as a good way of storing the state of a model? I have never really used pickle before and general advice was to stick to simpler file I/O.

@rithwikjc
Author

@jackiekazil

Sure, makes sense. I would retract my suggestion to improve the parallel processing part for now. 😄
And I get the issues with the contributions. I'll try to help. What exactly did you mean by this? What do you mean by the breakdown?

Any chance you can create a "hello world" like model that illustrates the break down?

@rithwikjc
Author

rithwikjc commented Mar 5, 2020

usually you conduct several model runs. And then there is no more speed gain if you compare x model runs sequentially or in parallel

@Corvince Thanks, I've wondered about this.

@Corvince
Contributor

Corvince commented Mar 7, 2020

Agreed. I am currently using my own data-writing function, which is activated at a certain frequency, rather than the datacollector class. Still, I believe the datacollector should at least have an option to write to a file periodically, rather than leaving that as a manual task for library users.

Just to put the amount of work it requires for a user/modeler and the amount of work it would require for mesa developer into perspective:

As a modeler, I would do something like this:

# inside the model.step function
if self.schedule.steps % 1000 == 0:
    df_out = self.datacollector.get_agent_vars_dataframe()
    df_out.to_csv(f"awesome_model_run_{self.schedule.steps}.csv")
    self.datacollector = DataCollector(self.datacollector.model_reporters, self.datacollector.agent_reporters)

That is exactly 4 lines of code. However, there are several design decisions I took that might not work for everyone or are unsafe. Let's go through the lines

if self.schedule.steps % 1000 == 0:

I want to output every 1000 steps. Obviously this would be different for everyone. As I said before, I definitely see the general use of this functionality, but right now the DataCollector class does not know about the model or the schedule, so there is no straightforward way to implement this.

df_out = self.datacollector.get_agent_vars_dataframe()

I am only interested in the agent variables, but if I wanted to save also model variables things get more complicated regarding file names (see below)

df_out.to_csv(f"awesome_model_run_{self.schedule.steps}.csv")

This is the crucial line. I made a total of four design decisions here:

  1. filename. Trivial enough
  2. file extension/format. We are just saving a pandas dataframe, so we should probably allow all of those formats?
  3. Saving to a new file every time. If we want to append we would need to write things differently. For csv appending is trivial, but not for every file format
  4. Overwriting any existing files. This is something that should not be the default. Users should be required to make this decision consciously, and mesa should not potentially cause data loss. However, this means we need to handle existing file names. Should we just abort? Append _1 to the filename? Have an "overwrite" option?

self.datacollector = DataCollector(self.datacollector.model_reporters, self.datacollector.agent_reporters)

I am resetting the datacollector every time I am saving to file. This might not always be desired so it should be optional.

I hope this shows that something as simple as four lines of a modeler's work translates into much more work for a library developer, if done properly. It would also mean we are potentially offering so many customization options that it takes more time to read and understand the datacollector.save docstring than to just save things in a custom way. And do not forget that we would also need tests for this function, which is a bit of work when dealing with I/O.
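To make the "existing file" decision from point 4 concrete, here is one possible sketch of the three options mentioned (abort, append `_1`, overwrite). The function name `resolve_output_path` and the `if_exists` parameter are hypothetical and not an existing or planned Mesa API.

```python
import os
import tempfile


def resolve_output_path(path, if_exists="error"):
    """Hypothetical helper: decide what to do when `path` already exists.

    if_exists: "error" (refuse), "overwrite" (reuse the path),
               or "suffix" (append _1, _2, ... before the extension).
    """
    if if_exists not in ("error", "overwrite", "suffix"):
        raise ValueError(f"unknown if_exists policy: {if_exists!r}")
    if not os.path.exists(path) or if_exists == "overwrite":
        return path
    if if_exists == "error":
        raise FileExistsError(f"{path} exists; pass if_exists='overwrite' or 'suffix'")
    # "suffix": find the first free name of the form <root>_<n><ext>.
    root, ext = os.path.splitext(path)
    counter = 1
    while os.path.exists(f"{root}_{counter}{ext}"):
        counter += 1
    return f"{root}_{counter}{ext}"


# Usage with a throwaway directory.
demo_dir = tempfile.mkdtemp()
target = os.path.join(demo_dir, "run.csv")
first = resolve_output_path(target)             # nothing exists yet, path is used as-is
open(first, "w").close()                        # create the file
second = resolve_output_path(target, "suffix")  # picks run_1.csv instead
```

Defaulting to `"error"` follows the argument above that overwriting should be a conscious choice, but it is only one of several defensible defaults.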

Just to put @jackiekazil's comment about a lack of contributors into perspective. But contributions are always welcome!

@Corvince
Contributor

Corvince commented Mar 7, 2020

Woah, pickling whole models as in, writing the entire model at some state to a pkl? Do you suggest that as a good way of storing the state of a model? I have never really used pickle before and general advice was to stick to simpler file I/O.

If you want to save the "real" state of a model this is actually the only (very easy) way to do so!
Simple enough:

import pickle
with open("filename.p", "wb") as f:
    pickle.dump(model, f)

This also happens quite fast and later you can do

import pickle
with open("filename.p", "rb") as f:
    model = pickle.load(f)

And you can continue your model from the saved state (just continue with model.step()). I think if you have your model running 1e6 steps it is quite advisable to save your complete state, so you don't have to re-run everything if you want to run it further or forgot to collect some attributes. I would just remove/reset the datacollector before saving, otherwise you are saving a lot of redundant data.

The only caveats are that loading pickle files from untrusted sources is insecure (not a concern if you only load your own files) and that they are not interoperable outside of Python.

I just tested this and needed to increase the recursion limit with

import sys
sys.setrecursionlimit(10000)
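If raising the limit globally feels too broad, one option is to scope it to the pickling call. The following is a sketch only: `recursion_limit` and `dump_with_limit` are hypothetical helper names, and whether a higher limit is needed at all depends on the model's object graph and the Python version.

```python
import os
import pickle
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit, restoring the old value afterwards."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


def dump_with_limit(obj, path, limit=10000):
    """Pickle `obj` to `path` while the higher recursion limit is in effect."""
    with recursion_limit(limit):
        with open(path, "wb") as f:
            pickle.dump(obj, f)


# Usage with a plain dict standing in for a model object.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "model.p")
limit_before = sys.getrecursionlimit()
dump_with_limit({"step": 1000, "agents": list(range(5))}, checkpoint_path)
limit_after = sys.getrecursionlimit()  # same value as limit_before
```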

@pmbaumgartner

I'm going to add a few thoughts on here based on my brief experience with Mesa.

  1. Has there been any profiling done in the mesa codebase? Or are there plans to make profiling tools available for models? I ask because as a beginner, it was hard for me to understand which parts of the code were taking the longest time and why. In practice, it was always the get_agent_vars_dataframe, but I'm not sure which part of that code is actually taking so long or consuming the most memory.
  2. Have the maintainers thought about building a flavor of mesa that uses Cython to define the agents and models? Having an option to code models using Cython, while more work for the modeler, might be the way to increase the scalability when necessary.

@Corvince
Contributor

Thank you for your feedback!

  1. That sounds like a good addition for the "useful snippets" section of the docs. It's a bit dangerous since it might encourage premature optimizations, but can also be a good way to identify bottlenecks.
    In your case btw I guess you are frequently creating a dataframe (every step?). Maybe it is sufficient to do this after the model has finished?
  2. I don't know if this has ever been discussed, or whether Cython would really be a good way to do this. As I said before, I don't expect Mesa to be a bottleneck for most models, and if really high performance is required one should probably do everything in C or another language (the amount of work done by Mesa is relatively small).
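On the profiling question in point 1, a snippet along these lines could serve as a starting point. It uses only the standard library's `cProfile` and `pstats`; `expensive_step` and `run` are toy stand-ins for a real model's step and run loop, not Mesa functions.

```python
import cProfile
import io
import pstats
import random


def expensive_step(rng, n_agents=200):
    """Stand-in for a model step: one random draw per agent."""
    return sum(rng.random() for _ in range(n_agents))


def run(steps=200, seed=1):
    """Stand-in for a full model run."""
    rng = random.Random(seed)
    return [expensive_step(rng) for _ in range(steps)]


# Profile one complete run.
profiler = cProfile.Profile()
profiler.runcall(run)

# Collect the ten most expensive entries, sorted by cumulative time, into a string.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

In a real model, replacing the call to `run` with the model's own run loop shows whether time goes into agent logic, data collection, or dataframe creation.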

@rithwikjc
Author

Thank you @Corvince for the very elaborate comments. These are surely helpful. In fact I am using something like,

if self.schedule.steps % write_frequency == 0:

as you mentioned, to limit the data writing. I also understand the trade-off between complexity and learning curve/ease of use. However, I felt that if mesa is to be used for research purposes, it would be better to have much more scalability built into it, as optional features. But I understand the problems you are raising. I will try to contribute to the repository if I find elegant ways to tackle these issues.

A small but significant improvement in the meantime could be adding tips such as the ones you've mentioned here to the documentation, so that new users don't feel completely lost or helpless when coding models that operate at larger scales. I will see about augmenting the documentation as well, as I think that would be one of the best additions at this point.

Thanks for the helpful comments regarding pickling as well. I imagine it can be very helpful for my application (and again could possibly be added to the documentation as helpful tips). I was very apprehensive about pickling, since almost everywhere pickling objects is frowned upon for safety/security reasons. I feel safer, however, to proceed now and try it, as I guess the benefits could be huge for me.

@pmbaumgartner has also made some really good suggestions. Some profiling can be immensely helpful for improving the performance of models that work at high scales. I have used snakeviz to improve performance on my model drastically. It could be a valuable addition for research applications. I am not savvy enough to comment on Cython, however.

@snunezcr

snunezcr commented Apr 2, 2020

I second the spirit of the starting post. My recent efforts to provide a reasonable model for COVID-19 require representing a large number of agents to make it realistic.

https://github.com/snunezcr/COVID19-mesa

While I understand Python may not be the best tool for more than 10^5 agents, there may be significant opportunities to utilize multiprocessing libraries.

@Corvince
Contributor

Corvince commented Apr 4, 2020

Hi @snunezcr thanks for your feedback and sharing your model. It looks very interesting.

I took a look at the performance of your model and I actually don't see any model slowness caused by mesa. It runs relatively slow, because you are generating a lot of random numbers, which is rather expensive (especially in a setting where you only generate a lot of single numbers, i.e. generating 1000 x 1 random number is much slower than generating 1 x 1000 random numbers).

On my machine running your starter model for 50 steps takes about 7 seconds. Profiling revealed that in your move function you have this line

self.curr_dwelling = poisson(self._model.avg_dwell).rvs()

If I change that to

self.curr_dwelling = poisson.rvs(self._model.avg_dwell)

the run time goes down to 3.5 seconds. Apparently "freezing" distributions in scipy is rather expensive, interestingly because the docstring is always being generated. If you do it only once there is not much difference for rvs, but you are doing it quite a lot.

Furthermore, changing the agent's susceptible-stage step function from this:

if (self.detection.rvs()) and (self.astep >= self.model.days_detection):
    pass

to checking the second condition first, so that the random number is only generated when it is actually needed:

if (self.astep >= self.model.days_detection) and (self.detection.rvs()):
    pass

my run time for 50 steps goes down to 0.5 seconds.
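The effect of the reordering can be seen by counting random draws instead of timing them. This is a toy sketch: `CountingRandom`, the detection probability of 0.1, the 1000 agents and the threshold of 40 steps are all made-up values for illustration, not taken from the COVID19-mesa model.

```python
import random


class CountingRandom:
    """Wraps random.Random and counts how many draws were requested."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self.draws = 0

    def bernoulli(self, p):
        """Return True with probability p, counting the draw."""
        self.draws += 1
        return self._rng.random() < p


def count_draws(random_first, n_agents=1000, days_detection=40, steps=50):
    """Run the toy loop and return (number of random draws, number of detections)."""
    rng = CountingRandom(seed=0)
    detected = 0
    for step in range(steps):
        for _ in range(n_agents):
            if random_first:
                # The random draw happens on every call, even when the step check fails.
                hit = rng.bernoulli(0.1) and step >= days_detection
            else:
                # `and` short-circuits: no draw unless the cheap check passes.
                hit = step >= days_detection and rng.bernoulli(0.1)
            detected += hit
    return rng.draws, detected


draws_random_first, _ = count_draws(random_first=True)   # 50 * 1000 draws
draws_cheap_first, _ = count_draws(random_first=False)   # only steps 40..49 draw
```

Note that the two orderings consume the random stream differently, so individual runs with the same seed will not produce identical trajectories, even though the detection rule is logically the same.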

However, the visualization is indeed much slower than that. It seems to be related to the chart: if you deactivate the chart, the grid view produces minimal overhead. Maybe we could investigate why the charts are relatively slow.

@snunezcr

snunezcr commented Apr 4, 2020

Hello @Corvince ,

Thank you for looking into the profiling aspects of the model. This is extremely useful. I am surprised about the behavior of distributions in scipy as well. Did not expect that. I do worry at the back of my mind about whether the cost is related to some guarantees that ensure proper behavior of the distribution. Small experiments I have performed using your method do not indicate significant differences, so, for the moment, I will go ahead with it.

While the charts are useful, my primary concern is to scale up to ensemble computations with fixed parameters, up to 100 instances, and then compute averages.
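An ensemble of this kind can be sketched in a few lines. The model below is a deliberately trivial stand-in (not the COVID19-mesa model), and `run_instance`, `ensemble_mean` and all parameter values are hypothetical; the point is only the pattern of one seed per instance followed by step-wise averaging.

```python
import random
from statistics import fmean


def run_instance(seed, steps=20, n_agents=50, p_infect=0.05):
    """Toy stand-in for one model run: returns the infected count after each step."""
    rng = random.Random(seed)
    infected = 1
    series = []
    for _ in range(steps):
        susceptible = n_agents - infected
        # Each susceptible agent becomes infected with probability p_infect.
        infected += sum(rng.random() < p_infect for _ in range(susceptible))
        series.append(infected)
    return series


def ensemble_mean(n_instances=100, **params):
    """Run independent instances with different seeds and average step by step."""
    runs = [run_instance(seed, **params) for seed in range(n_instances)]
    return [fmean(step_values) for step_values in zip(*runs)]


mean_series = ensemble_mean(n_instances=100, steps=20)
```

Because the instances are independent, the list comprehension in `ensemble_mean` is the natural place to parallelize, for example with `concurrent.futures.ProcessPoolExecutor.map`, without touching the model itself.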

Thanks again.

@a-mpch

a-mpch commented May 2, 2020

Hey @Corvince

An alternative would be to write the data of the collector to the disk or a database. But again, this is something that can already be done explicitly. But I agree that this could be done more user-friendly with a dedicated function of the datacollector (with the question if values should also remain in memory or not). Note, however, that this will likely incur some performance cost, because IO and/or writing to a database is rather slow (compared to doing nothing).

New around here, but I think I can take on a function that does that. Is there anything you would recommend taking into account?

@Corvince
Contributor

Corvince commented May 2, 2020

There has been some initial work by @dmasad on this thread:
https://groups.google.com/forum/m/?utm_medium=email&utm_source=footer#!msg/projectmesa-dev/5U6x3vVFwR4/x2NUF8pYAgAJ

There is a link to a GitHub gist and note my improvements in the first reply (you can ignore my other comments cheering on a non-existent solution).

Apart from that, I think the most difficult thing is to decide if and when you want to overwrite data in the database. Sometimes you just change parameters and want to compare, but sometimes your model itself changes and you want to get rid of the old data. So I would say it is easy to store the data itself, but not to identify the model behind the data. Be sure to keep this in mind.
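One possible way to tie stored data to the model that produced it is to derive an identifier from the parameters plus a model version label and store it alongside every row. This is a sketch under that assumption: `run_id`, the `model_version` argument and the example parameter names are hypothetical, and the version label would have to be maintained by the modeler.

```python
import hashlib
import json


def run_id(params, model_version):
    """Hypothetical helper: stable identifier for a (parameters, model version) pair."""
    # sort_keys makes the identifier independent of dict ordering.
    payload = json.dumps({"params": params, "version": model_version}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


a = run_id({"n_agents": 100, "avg_dwell": 4}, model_version="0.1")
b = run_id({"avg_dwell": 4, "n_agents": 100}, model_version="0.1")  # same params, other key order
c = run_id({"n_agents": 100, "avg_dwell": 4}, model_version="0.2")  # model code changed
```

Here `a` and `b` are identical, so reruns with the same setup can be grouped or replaced, while `c` differs and marks data from a changed model.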

@a-mpch

a-mpch commented May 2, 2020

@Corvince The link does not work. Could you link it again or paste the gist and your comments here?

@Corvince
Contributor

Corvince commented May 3, 2020

Oh the link only works if I am logged in. You can search for "data collection profiling" here

Direct link to the gist:
https://gist.github.com/dmasad/ea6416772a66601e2eddd1ee379da46b

My changes:

    def record_data(self):
        self.insert_sql = "INSERT INTO agent_data VALUES (?, ?, ?, ?)"

        self.c.execute("BEGIN TRANSACTION;")

        values = [(self.schedule.steps, a.unique_id, a.x, a.y) for a in self.schedule.agents]

        self.c.executemany(self.insert_sql, values)

        self.conn.commit()
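For readers without access to the gist, here is a self-contained sketch of the same batched-insert pattern using the standard library's `sqlite3`. The table name `agent_data` comes from the snippet above; the column names and types, the in-memory database and the dictionary of toy agents are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
conn.execute(
    "CREATE TABLE agent_data (step INTEGER, agent_id INTEGER, x INTEGER, y INTEGER)"
)


def record_data(conn, step, agents):
    """Insert one row per agent for this step inside a single transaction."""
    rows = [(step, agent_id, x, y) for agent_id, (x, y) in agents.items()]
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany("INSERT INTO agent_data VALUES (?, ?, ?, ?)", rows)


# Toy "agents": a dict of id -> (x, y) position, standing in for schedule.agents.
agents = {agent_id: (agent_id % 10, agent_id // 10) for agent_id in range(100)}
for step in range(5):
    record_data(conn, step, agents)

n_rows = conn.execute("SELECT COUNT(*) FROM agent_data").fetchone()[0]
```

Grouping all rows of a step into one `executemany` call and one commit avoids paying the transaction overhead once per agent.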

@EwoutH
Member

EwoutH commented Jan 21, 2024

For everyone interested in this, there is now an effort to create a vectorized subset of Mesa:

@jackiekazil
Member

@EwoutH thank you for connecting dots!

@EwoutH
Member

EwoutH commented Sep 3, 2024

I'm going to close this issue as completed, now that we have mesa-frames officially under the Mesa umbrella:

If there are any specific issues or ideas for Mesa performance or scalability, feel free to open a new issue or discussion!


For anyone encountering this issue, there's a ChatGPT 4o generated summary:

  1. Scalability Challenges: The primary challenge identified was the limitation of Mesa in handling large numbers of agents and time steps, primarily due to memory overload caused by the DataCollector. Suggestions were made to implement features like periodic data writing to alleviate this issue.
  2. Alternative Solutions: It was discussed that users could manually implement solutions for periodic data saving, such as writing data to disk periodically within the model's step() function. While this approach is feasible, it requires custom implementation by each user.
  3. Parallel Processing: There was agreement that parallel processing in ABMs is complex and could introduce unintended artifacts, making it less effective in practice. Batch processing was recommended as an alternative.
  4. Pickling and Performance Optimization: The use of pickling to save the entire model state was highlighted as a valuable tool for large-scale simulations. Additionally, optimization techniques like avoiding unnecessary computations and profiling were suggested to improve performance.
  5. Documentation Enhancements: The need for better documentation was emphasized, particularly to help new users manage larger models and utilize performance-enhancing techniques. Contributions to the documentation were encouraged.
  6. Mesa-Frames Initiative: The development of mesa-frames was introduced as a significant step towards improving Mesa's scalability. This new subset of Mesa aims to offer faster performance through data-frame-based operations, and it has now been officially integrated under the Mesa project umbrella.

@EwoutH EwoutH closed this as completed Sep 3, 2024