Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

array-extras adds set methods #299

Closed
wants to merge 3 commits into from
Closed

Conversation

mattparker
Copy link

Adds methods intersect(), union() and diff() to Y.Array as part of the array-extras module. Addresses ticket #2532209.

Intersection of two or more arrays.

@method intersect
@param {Array} First array (or array-like thing - is passed through Y.Array() first.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@param tags require a type and a parameter name, not just type and description. Also, do you think you could format your doc comments to match the formatting of the other doc comments in array-extras.js? I worked hard to keep them consistent. :)

arrays. This function will receive two arguments: the first
will be an item from the first array, the second will be an item
from the second array. This might be useful for arrays of objects,
for example. Note that the return value may contain duplicate values - you
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems unintuitive to me that the returned array might include duplicate values. If a value exists in the intersection, there's no reason (other than implementation laziness) to re-add it to the intersection if it's encountered again.

Edit: Urk, the diff changed before I posted this comment, so GitHub added it to the wrong line. Sorry.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of my test cases is 'numbers near'. it's a definite edge case, although with dates might not be so.

E.g.
A.intersect([1,5,7], [2, 5, 9], function (a, b) {return Math.abs(a-b) < 2;});

should give [1,2,5]. but it goes wrong if you try and use unique using the same comparison function - in this case, 1 and 2 will be considered the same and 2 will be removed.

With numbers it looks a bit unlikely, but I could see a 'dates +- 2 days' type use case.

(Edit: sorry, I won't push again!)

@rgrove
Copy link
Contributor

rgrove commented Oct 8, 2012

I have some reservations about the complexity of the implementations of intersect() and diff(), but it's going to take me a bit to write up my thoughts and associated sample code. I'll try to get that done by the end of the day.

@mattparker
Copy link
Author

OK, thanks Ryan. It's already a little while past the end of my day...

@rgrove
Copy link
Contributor

rgrove commented Oct 8, 2012

I felt like intersect() should reuse more of the code that already exists in array-extras, so I took a pass at doing that. Here's my take on an implementation that tries to reduce the complexity by increasing reuse:

A.intersect = function (array) {
  var args = Array.prototype.slice.call(arguments, 1),
      i, len, testFn;

  if (typeof args[args.length - 1] === 'function') {
    testFn = args.pop();
  }

  return A.filter(A.unique(array, testFn), function (value) {
    return A.every(args, function (otherArray) {
      if (testFn) {
        for (i = 0, len = otherArray.length; i < len; i++) {
          if (testFn(value, otherArray[i])) {
            return true;
          }
        }
      } else {
        return A.indexOf(otherArray, value) > -1;
      }
    });
  });
};

(note that this ensures uniqueness of the items in the first array, which I think makes sense, but I think @mattparker disagrees).

I think something similar could be done for diff().

I'd also recommend that you mention in the API docs that these methods are going to be really slow for large arrays, or for large numbers of arrays, since the default implementation relies on indexOf(). In cases where people want to compute the intersection of arrays that contain only strings, or only numbers, a custom implementation using an object hash would be much faster.

@mattparker
Copy link
Author

There were two reasons why I ended up as it is, although they may not be good reasons. I'd initially thought of this as a standalone sub-module (array-set or something) and wasn't going to depend on array-extras. Obviously if they end up all together this is irrelevant (and may be in any case).

Secondly, this was where I realised how shocking the performance of native array methods like each() are, compared to simple for() loops. I did a bit of testing on on jsperf and then came across others and (for each(), anyway), it was pretty significant. So I decided to bloat a bit for the sake of speed. Again, that may be wrong.

I've done some simple jsperf tests again now: http://jsperf.com/a-intersect-variations The bloated, non-reuse is up to twice as fast on the few tests I've run so far. So I think there's a trade-off between code reuse and performance on the page (or k-weight vs on-the-page) - I'm happy to go either way. To be consistent I might want to look at use of indexOf too (http://jsperf.com/indexof-native)

It did also make me wonder whether we could/should add a switch in to Y.Array (Y.Array.useNative(false) to turn off the use of the native each() etc implementations where/while their performance remains poor).

On the uniqueness thing: I think we're talking about slightly different things. From your snippet Ryan it seems like you're unique-ing the first array passed in, before intersecting with the second. I was talking about not unique-ing the result after intersection.

I'll try a better use-case example of why I don't think we should unique() anything in this method, but leave it up to the user. I have two arrays of objects: the first has objects which represent flowers; the second has objects which represent cars. Both flowers and cars have a property 'colour'. I want to use intersect to get an array of things where I know there's a car with a matching coloured flower (1).

var matchingThings = Y.Array.intersect(flowers, cars, function (oFlower, oCar) {
    return oFlower.colour === oCar.colour;
});

If I use my comparison function in a call to unique on the first array before intersecting, I'll reduce all my flowers down to a set that has just one of each colour. If I use it afterwards, I'll merge all my cars and flowers. Neither are what I want.

I'm actually using these methods with Models and ModelLists, and my past experience has been: 1. expect objects (especially dates), and 2. don't assume too much.

In general, the comparison function we pass into this method is for comparing two items from different arrays, not for comparing two items from the same array. Granted that with strings and numbers and in lots of object use-cases these could be the same; but I don't think we should assume it.

We should note in the docs that calling unique first on your input arrays might reduce their size and increase performance; and that they may want to afterwards too.

(1) So I can book the publicity photos, because as we all know flowers sell cars.

@rgrove
Copy link
Contributor

rgrove commented Oct 9, 2012

So I decided to bloat a bit for the sake of speed. Again, that may be wrong.

As I mentioned above, this intersection algorithm is inherently slow regardless of how the iteration is performed. The only way to make it significantly faster would be to have knowledge of the contents and sortedness of the arrays, which are things we can't assume if we want it to be generic.

The specific speed differences in the jsperf test cases drop significantly if you remove the A.unique() call (for parity) and use a for loop instead of A.every(), and the implementation remains simpler.

I'm not overly concerned about k-weight here. I'm concerned about the readability and maintainability of the code.

I'm actually using these methods with Models and ModelLists, and my past experience has been: 1. expect objects (especially dates), and 2. don't assume too much.

There are much more efficient algorithms that could be used to determine the intersection of two ModelLists. I'm very wary of trying to stretch a generic intersect() implementation so that it can support a use case like that when using it for that use case would actually be a bad idea.

I think there's a compelling case for a generic intersect() method that relies on indexOf(), but I'm wary of the direction you've taken it with the comparison function. That functionality seems like it belongs in a different function somewhere else in the library, not as a low-level utility in array-extras.

@mattparker
Copy link
Author

Maybe I'm over-arguing this!

The specific speed differences in the jsperf test cases drop significantly if you remove the A.unique() call (for parity) and use a for loop instead of A.every(), and the implementation remains simpler.

Yes, fair enough, sorry.

There are much more efficient algorithms that could be used to determine the intersection of two ModelLists. I'm very wary of trying to stretch a generic intersect() implementation so that it can support a use case like that when using it for that use case would actually be a bad idea.

(My use case examples have got muddled. and I forgot what I actually did. I'm not doing that for real! Sorry).

I suppose my worry is that once an intersect() is there it gets used but then doesn't do enough for reasonable use-cases. I really think we ought to be able to accommodate arrays of Dates or other simple objects without making it too awful to read and maintain.

What do you think about:

  • using existing array-extras as you outline
  • retaining optional comparison function
  • not doing any unique() calls - leave that to the user if they need it before and/or after
  • limiting to 2 arrays at a time (ie. not supporting intersect(arr1, arr2, arr3, arr4))

Also, do you have a view on the methods that ought to be in this: intersect(), union() and diff() seem reasonable to me; the original ticket mentioned xor too, for example?

And separately, do you have a view on being able to optionally 'turn off' use of native methods in Y.Array?

Matt

@rgrove
Copy link
Contributor

rgrove commented Oct 9, 2012

I suppose my worry is that once an intersect() is there it gets used but then doesn't do enough for reasonable use-cases. I really think we ought to be able to accommodate arrays of Dates or other simple objects without making it too awful to read and maintain.

The methods in array-extras are intended to be low-level utilities that can be layered or used in custom implementations to do more complex things. Trying to do too much here would mean we'd have to do it poorly, which is why I think intersect() would be better off serving the generic case of doing strict equality comparisons via indexOf() and leaving more specific use cases up to other more efficient implementations elsewhere in the library or in userland code.

What do you think about:

using existing array-extras as you outline

+1, obviously. :)

retaining optional comparison function

I think I've come around to not liking the comparison function. It encourages the use of this low-level intersect() implementation for specific cases where a better algorithm would often be more efficient. I'm not violently opposed to it, but I think I'd prefer to omit it.

not doing any unique() calls - leave that to the user if they need it before and/or after

I still think that returning a result that contains duplicate values is unexpected behavior and less useful than returning a deduped result set, but I'm interested in hearing what others think about this.

limiting to 2 arrays at a time (ie. not supporting intersect(arr1, arr2, arr3, arr4))

Definitely opposed to this. Much of the value of intersect() comes from being able to compute the intersection of any number of arrays. What would be the benefit in limiting it?

Also, do you have a view on the methods that ought to be in this: intersect(), union() and diff() seem reasonable to me; the original ticket mentioned xor too, for example?

intersect(), union(), and diff() seem like plenty to me. I don't think xor() would be widely useful.

And separately, do you have a view on being able to optionally 'turn off' use of native methods in Y.Array?

This is already possible. You can set the global useNativeES5 config option to false to force YUI to use shim methods instead of ES5 methods.

These are my opinions as former maintainer of the array utils, but I'm no longer the gatekeeper. I'd love to hear opinions from the YUI team or anyone else.

@mattparker
Copy link
Author

OK.

limiting to 2 arrays at a time (ie. not supporting intersect(arr1, arr2, arr3, arr4))

Definitely opposed to this. Much of the value of intersect() comes from being able to compute the intersection of any number of arrays. What would be the benefit in limiting it?

Fine. Benefit was some slight simplification of the code, and I didn't think there was going to be any performance benefit in accepting lots. But I'd prefer to keep it with lots.

This is already possible. You can set the global useNativeES5 config option to false to force YUI to use shim methods instead of ES5 methods.

Thank you, didn't know that.

These are my opinions as former maintainer of the array utils, but I'm no longer the gatekeeper. I'd love to hear opinions from the YUI team or anyone else.

Yes. I'll pause now until we get some other input...

@rgrove
Copy link
Contributor

rgrove commented Dec 6, 2012

We discussed this in today's YUI Open Roundtable. The recording should show up eventually at http://www.youtube.com/watch?v=rKrzggbBpfY

General consensus among those on the hangout was that intersect(), union() and diff() will probably be useful as core utilities in array-extras, but that we should keep the implementations low-level and not support custom comparison functions, at least initially. If it turns out that people really want the comparison functions, we can add them later or implement them in a higher-level utility.

@mattparker
Copy link
Author

Hey,

Thanks, I watched the video - sorry I didn't realise this would be on the agenda.

I'm happy to work on these along the lines discussed. I still think that even the low-level version should support arrays of Dates and object literals with a custom comparator, but am quite happy to be out-voted.

The original idea for this was as a separate array-set gallery module. I might re-introduce this, with versions with comparator functions. I might also start thinking about more performant versions when something is known about the data. And an array-set module could include even more esoteric set operations like xor or whatever.

Matt

@rgrove
Copy link
Contributor

rgrove commented Dec 7, 2012

Sounds good Matt! I like the idea of a separate higher-level module with more functionality.

I wish we'd had you in the hangout. It'd be good if the agenda for these things was set further in advance to ensure that everyone can attend; I only heard about the agenda in IRC a few minutes before the hangout began.

@davglass
Copy link
Member

@mattparker @rgrove Would a new PR work for this or do you want to keep it here? I figured a new one on a new branch would be best since we don't get all the old history.

Trying to clean up older Pull Requests

@mattparker
Copy link
Author

Yes a new PR would work: I got a bit confused with which branch I should be on which is partly why nothing's happened. Should it be in dev-3.x?

@davglass
Copy link
Member

Since this is new functionality, issue it against dev-3.x, thanks!

Just make sure you reference this one in your new one so we have "history".

@davglass davglass closed this Jan 16, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants