This repository has been archived by the owner on Dec 2, 2020. It is now read-only.

Infochimps merge #129

Closed
wants to merge 93 commits into from

Conversation


@Marsup commented Jul 25, 2013

As discussed in #123, here is a full merge from infochimps-labs/cube@master to the current master.

Considering the size of the merge, I hope to involve more people in the review so that we don't miss any critical part, even though the tests continue to pass (minus the one that was already failing).

I especially hope that people from @infochimps-labs can give their feedback, since they know their codebase much better than I do.

Marsup and others added 30 commits August 10, 2012 10:14
metalog:
* metalog.info('foo', {...}) for progress recording -- sent to log by default
* metalog.minor('foo', {...}) for verbose recording -- sent nowhere by default
* metalog.event('foo', {...}) cubifies the event and sends it to info
  - metalog.event('foo', {...}, 'silent') to cubify but not log
* retargetable:
  - metalog.loggers.minor = metalog.log to log minor events
  - metalog.loggers.info  = metalog.silent to quash logging
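To make the retargetable design above concrete, here is a minimal, self-contained sketch of a metalog-style logger. This is an illustrative re-implementation, not the project's actual module; the names follow the commit message, and the "cubify" shape is an assumption:

```javascript
// Minimal sketch of a retargetable metalog-style logger.
// Illustrative only -- not the project's actual metalog module.
var metalog = {
  log:    function(name, hash) { console.log(name, JSON.stringify(hash)); },
  silent: function() {},                     // no-op sink
  loggers: {},                               // retargetable sinks
  info:   function(name, hash) { this.loggers.info(name, hash); },
  minor:  function(name, hash) { this.loggers.minor(name, hash); },
  event:  function(name, hash, mode) {
    var cubified = { type: name, time: new Date(), data: hash };  // "cubify" (assumed shape)
    if (mode !== 'silent') this.info(name, cubified);             // log unless silenced
    return cubified;
  }
};
metalog.loggers.info  = metalog.log;     // progress events logged by default
metalog.loggers.minor = metalog.silent;  // verbose events quashed by default
```

Retargeting is then a plain assignment, e.g. `metalog.loggers.minor = metalog.log` to start logging minor events, or `metalog.loggers.info = metalog.silent` to quash logging.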

also separated the db test helpers into their own file and added fixture support.
* authentication.authenticator(name) gives you the requested authenticator
  - server.js uses options['authenticator']
  - 'allow_all' is default in bin/collector-config.js etc.

* authenticator.check(request, auth_ok, auth_no)
  - calls auth_ok if authenticated, auth_no if rejected
  - staples an 'authorized' member to the request object:
    - eg we use 'request.authorized.admin' to govern board editing
    - in the mongo_cookie authenticator, it's the user record

* mongo_cookie authenticator compares bcrypted cookie to a stored hashed secret
  - you must set the cookie and store that db record in your host client; see
    'test/authenticator-test.js' for format. Rails+devise snippet available on request.
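The check contract above can be sketched as follows. This is a hypothetical authenticator, not the mongo_cookie implementation; the `x-demo-token` header and its value are invented for illustration:

```javascript
// Hypothetical authenticator following the check(request, auth_ok, auth_no)
// contract described above. The 'x-demo-token' header and its value are
// invented for illustration only.
var demo_authenticator = {
  check: function(request, auth_ok, auth_no) {
    var token = request.headers && request.headers['x-demo-token'];
    if (token === 'letmein') {
      // staple an 'authorized' member onto the request,
      // e.g. request.authorized.admin governs board editing
      request.authorized = { admin: true };
      auth_ok(request);
    } else {
      auth_no(request);
    }
  }
};
```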
* pr2_authentication:
  Added authentication: allow_all, read_only, or mongo_cookie
* marsup/mongodb-native-parser:
  Upgrade mongodb driver and use native BSON parser

Conflicts:
	package.json
moved test.js to test_helper.js and renamed it where used.
Some mixture of these changes and the upgraded mongo npm package seems to have made the intermittent failures in test/metrics-test.js go away.
Untangled and commented the code, so if you are still seeing those bugs, this may be a better starting point.
…ting server.

* test_helper.with_server -- starts the server, runs the tests once the server is up, and stops the server when the tests are done
* merged test/test_db.js into test/test_helper.js and documented it thoroughly.
…inct, all tiers cascade to tensec, horizons)
@Marsup
Author

Marsup commented Sep 20, 2013

For the record, I still experience a partially unresponsive evaluator if I ask for data past the horizon. I think the computation never completes because the start and end dates are equal, so any further request on the same collection will stall.

Another thing I noticed is pretty poor performance on grouped queries. A MongoDB distinct on several million un-indexed rows is expected to take a while, but still...

@RandomEtc
Collaborator

I hear that. I'm running this with personal data but I haven't thrown
Square data at it yet. How do you feel about merging into a 0.3 branch and
refining things there?

@Marsup
Author

Marsup commented Sep 20, 2013

Have you given any thought to my new configuration proposal?
Do you want to complete your checklist up there before moving to 0.3.x, or will it be done along the way?
I don't think it's a perfect release and there are still known bugs, but at least we'd move on with smaller patches rather than this long-running commit list. So why not, but maybe don't publish to npm just yet.

@Marsup
Author

Marsup commented Oct 24, 2013

@RandomEtc After running the service for a while, I was forced to remove a "feature" Infochimps added: cascading the cache to the lowest tiers is a very bad idea. It eats MongoDB storage voraciously, which is far too expensive for an insignificant benefit, considering a few seconds or minutes is not that long to re-compute. So I went back to the way things were in the current cube release. You might want to consider this before doing a release...

@hustonhoburg

I understand your feedback and that's definitely a valid concern, but want to clarify a little. First, as far as speed, a response time of seconds to minutes per query, given 20 to 30 queries on a page, wasn't acceptable for our UI requirements. So, stored metrics did offer some speed improvements. Although it was an added bonus, our intent was not to cache calculations for speed, but to store data. Hopefully I can offer some insights on why we did it that way.

For our use, event data vastly outsized metric data, so we purposefully capped the events collection to make event records fall out. To preserve the data contained in those events, we saved the metrics at the lowest tier. With the lowest tier, we could build back up a higher tier metric, like 5 minute or 1 hour, using those low tier 10 second metrics. We stored our permanent data in metrics with ephemeral events, as opposed to the previous situation of permanent events with ephemeral metric caches.

So assuming that one has large event record data sizes with many events per 10 second tier, storing only the metrics should use much less data. In our use case, it meant we were able to roll up thousands of multi-kB events into a handful of small, sub-kB metrics per query. We also had a separate "cleaner" cron job to remove metrics older than a day or so, to keep the data size down. We wrote our version with the intent of optimizing for high throughput while keeping a small, unsharded mongo. Storing metrics actually ended up being significantly more storage efficient for us.
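The rebuild-from-low-tier idea can be sketched in a few lines. This is a hypothetical rollup, not Infochimps' actual code: it sums stored 10-second metrics into their enclosing higher-tier buckets.

```javascript
// Hypothetical sketch of rebuilding a higher tier from stored 10-second
// metrics: each {time, value} pair is summed into its enclosing bucket.
// Not the actual Infochimps implementation.
function rollup(tensecMetrics, tierMillis) {
  var buckets = {};
  tensecMetrics.forEach(function(m) {
    var start = Math.floor(m.time / tierMillis) * tierMillis;  // bucket floor
    buckets[start] = (buckets[start] || 0) + m.value;
  });
  return buckets;
}

// three 10s metrics collapse into a single 5-minute bucket:
var fiveMin = 5 * 60 * 1000;
rollup([{ time: 0, value: 2 }, { time: 10000, value: 3 }, { time: 20000, value: 5 }], fiveMin);
// yields { '0': 10 }
```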

I can see how for other data shapes / use cases, storing all metrics may not make sense. We definitely cut a couple corners to fulfill our use case because not everyone wants to predefine queries, drop events, and handle dense data. We lost some of the flexibility offered by cube in order to meet our needs. It sounds like our version didn't fit your use case. I'm glad you were able to change it to better meet your needs.

I hope that cleared up our intentions. If not, I'm happy to clarify further.

@Marsup
Author

Marsup commented Oct 24, 2013

Hello Houston!

First, reading it back, my previous comment sounds more accusatory than it was meant to be, so sorry for that.

Now I can see why you would do that. In my case I keep everything, with no capped collections at all, and I also have many events for any given time, so our situations should be similar; you might understand my pain seeing metrics grow horribly fast :)

The difference might be that I'm on a sharded mongo, with many evaluators answering queries at the same time, and we mostly stream metrics (which is a difference in our fork), so full time ranges are not queried that often.

Your version definitely improved many things and I'm grateful for that. I'm just worried such a default setting would disappoint newcomers, as it fills up several GB for only a few days or weeks of metrics. Ideally this cascading behavior should be configurable, but that'll be the subject of another pull request ;)

Anyway, it's nice to see you're still following things here!

@RandomEtc
Collaborator

Thanks both for keeping the discussion going. I'm sorry I've been silent on most matters. I haven't had a chance to try this new branch on real data so I'm hesitant to express too strong of an opinion about it.

I'm open to landing it as a 0.3-pre branch here so there's a clearer target for new contributions/optimizations/docs. What do you think?

@Marsup
Author

Marsup commented Oct 24, 2013

Well, this thread has lasted long enough, I think :)

You have raised many concerns along the way, so I would say create that branch and close this pull request. But let's not forget anything here; maybe create a bunch of separate issues to track every doubt/task that needs to be dealt with before the final release.

@RandomEtc
Collaborator

Sounds like a plan. Thanks for your help triaging issues!

@simonlopez

When is this planned to be merged?

@jeffhuys

This merge is (in my opinion) very important; why hasn't it been merged yet?

@RandomEtc
Collaborator

Hi @jeffhuys - this branch is stalled mainly because I ran out of time/bandwidth for the project. But also because we only have one person (@Marsup) who has run this code to date. I'd like to run it before I merge it but I haven't had time.

If you've followed our discussion above, and the related issue where I asked for community input into the merge, you can see I was very optimistic about bringing all the Infochimps changes into the Square cube repo, but the actual process of doing this was a lot more complex and time consuming than I imagined. We don't run Cube in an official/production capacity at Square any more, so it's more or less a volunteer side-project for me.

Major apologies to @Marsup for not using this integration work yet.

Next step remains setting up a new branch here for this work, and getting a few more people to try it out for their use-case. I'll try to get that done soon, including trying it out at Square. Until then, please comment on this thread if you've run @Marsup's version and let us know how it goes.

@consense

Hi,
I've been running https://github.com/Marsup/cube/tree/full-merge in a testing/dev environment for the last few weeks and so far no problems. The amount of data is relatively low, though: ~100,000 events.

Just as a side note, the revert to a plain JS object for the config in that branch made my life a lot easier.
Thanks everyone for the effort on this.

@Marsup
Author

Marsup commented Jan 23, 2014

I'm confident as well; I've had this branch running in production since September (with a few modifications since then, as the commit history will tell you). But beware: it doesn't only contain Infochimps modifications, so not everything is documented as it should be.

I'm also glad I came to reason on the config: imposing cfg in cube's core was not very clever, even though it's a very nice module and I still use it for my cube runners.

@aganov

aganov commented Feb 27, 2014

I'm going to test this branch with 150K events/day. @Marsup, can you tell me what the job of the "warmer" is, and is there any way to use only one config instead of three that are almost the same?

@Marsup
Author

Marsup commented Feb 27, 2014

I'm not the one who conceived it, so I'll describe it as best I can; I'm not using it either.
If you store your expressions in a specific collection, it will regularly pick them up and keep your metrics cache up to date even without a client asking for it.
The way expressions are stored is inherited from the time cube had a kind of dashboard; there is no documentation about it AFAIK.

As for configuration, no, I'm afraid not, but you can still use cube as a module and apply the slight variations programmatically.

@ticean

ticean commented Apr 1, 2014

Hello. I've been using the current Cube version for some time. Great work and thanks for open sourcing this project.

Question 1: How confident are you that the InfoChimps branch will be merged, and on what timeline? I need to add some additional features in our project (authentication). Since this branch contains a pluggable authentication system, it makes sense for me to go ahead and use what's here if it will be mainlined. It looks like a lot of energy has been put into this branch, but it's long-running and hard for me to judge what's going on from the outside looking in. :)

Question 2: How can I override the configuration when using Cube as a library now? It looks like this line will always include the configuration file from Cube. Maybe I'm missing some cfg functionality that handles this? I'm trying to override with env vars according to cfg's readme, but they don't seem to register. A simple example would really help.

Thanks.

@RandomEtc
Collaborator

@ticean my intention was to make a merge branch and update our readme to encourage people to try it out. Unfortunately Cube has become less and less of my day-job here at Square and since starting this merge, despite the heroic efforts of @Marsup (thank you!) I haven't carved out the time to make much progress.

Also since we started this project infochimps was acquired, so I suspect they haven't been able to give it the attention they wanted either.

I still have this on a TODO list, and hope to get to it one day soon, though I realize we are very likely to be losing goodwill and attention by letting this branch linger.

Enough excuses...

For using cube as a library, here's an example collector script that we have in our internal cube repo:

```javascript
#!/usr/bin/env node

var options = require("../config/collector"),
    cube = require("cube"),
    server = cube.server(options);

server.register = function(db, endpoints) {
  cube.collector.register(db, endpoints);
};

server.start();
```

The require for cube is the stock one from npm. The ./config/collector.js file looks something like:

```javascript
module.exports = {
  "mongo-host": "127.0.0.1",
  "mongo-port": 27017,
  "mongo-database": "cube_development",
  "http-port": 1080
};
```

Hope that gets you started. If you have a chance to check out @Marsup's branch, please do; any feedback on that will help others work out which version to use. Until then, now we have 3 versions...

https://xkcd.com/927/

@ticean

ticean commented Apr 2, 2014

Hi @RandomEtc. Thanks for the help and the quick reply! You helped me realize that I was testing with the wrong branch. This PR is based on Marsup:infochimps-merge but I'd mistakenly branched square:infochimps-merge for testing. So my bad there. 😁

Now, using @Marsup's branch, I'm able to override the config like I need to (that wasn't possible in square:infochimps-merge).

I had some problems with the horizon feature not returning results when the request is "past_horizon". I see metalogging output, but the server doesn't return a response and hangs. I don't think I'm interested in this feature anyway, so I was able to work around it by removing the horizon configuration, which disables the feature. Some docs about horizons would be helpful, but you've already mentioned this a few times in the thread, so I know you know that. :)

Things are otherwise working well now that I'm using this branch. I'll keep testing and let you know if anything else comes up.

@ticean

ticean commented Apr 3, 2014

Ok, after more hands-on time with this code I found some issues.

  1. Collections aren't created automatically now. This comment informs me that I have to create them manually. Cube's flexibility to create collections as it receives events is a really good feature, and it's really sad to see it dropped.
  2. More evaluator hangs. I've hit cases where metalogging logs an error and subsequent requests aren't handled. I think it's because callbacks aren't called on error? Like here. It would be better for the process to crash than to hang.
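The hang described in point 2 matches a classic Node pitfall: an error path that returns without invoking its callback leaves the waiting request hanging forever. A hedged sketch of the pattern, where `db.find` and `compute` are hypothetical stand-ins rather than cube's actual API:

```javascript
// Sketch of the bug pattern from point 2: if the error branch is missing,
// a db error means the callback is never invoked and the request hangs.
// db.find and compute are hypothetical stand-ins, not cube's actual API.
function compute(db, query, callback) {
  db.find(query, function(error, rows) {
    if (error) return callback(error);  // without this line, callers stall
    callback(null, rows.length);
  });
}
```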

@Marsup
Author

Marsup commented Apr 4, 2014

  • I have had similar issues with horizons during my tests and haven't found a way to make it work. I don't use this feature, so it's highly unlikely I'll spend time on it, though I encourage you to give it a try.
  • I have never had to create collections manually, and you shouldn't have to either. I don't understand the meaning of that comment, since there are no schema files anywhere in the project.
  • You seem to be having trouble with your MongoDB; I have never encountered this error/hang. That doesn't mean the code is right, but you should check your mongo.

Beware that full-merge is not exactly the same thing as this pull request; I've piled up other modifications for my own needs.

@zuk

zuk commented Jul 7, 2014

Hey guys, can someone explain the status of this merge for those of us wanting to start using cube with all of the infochimps work merged in? What would be the best place to start? Clone this branch and go from there? Use the Marsup fork? GitHub is kind of useless in situations like this one, where the repo network gets complicated :(

@RandomEtc
Collaborator

I am declaring Cube-maintenance-failure for myself. I have updated the README here to indicate that nobody at Square is actively developing or maintaining Cube. Since I have failed to make progress on this branch I encourage people to help @Marsup with his integration branch and fork if you have any new features or bug fixes. I will be closing all issues here in a moment.

@RandomEtc RandomEtc closed this Jul 30, 2014
@zuk

zuk commented Jul 31, 2014

For the record, I'm up and running with the @Marsup branch. Working great so far.

@RandomEtc
Collaborator

Excellent, that's great news @zuk. I'm open to updating the status and the Cube homepage with more info in future if you, @Marsup and others want to publish a new version. Thanks for letting us know!
