Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash typeclass #1712

Merged
merged 87 commits into from
Oct 4, 2017
Merged

Hash typeclass #1712

merged 87 commits into from
Oct 4, 2017

Conversation

ctongfei
Copy link
Contributor

@ctongfei ctongfei commented May 30, 2017

The following modifications are made:

  • A Hash type class that extends Eq
    • Hash#on is not implemented to avoid the diamond inheritance problem. Use the method Hash.by in the companion object instead.
  • Common Hash instances that are compatible with Scala's default are written for nearly all types in cats.kernel.instances:
    • For simple types T, their Hash instances are directly coded using T#hashCode;
    • For generic types (e.g. List, Stream etc.), their Hash instances require Hash instances on the type of their elements, and the Hash instances on generic types is equivalent to the Scala's default hashCode if the Hash instances on element types are from the universal T#hashCode;
    • For tuples, their Hash instances follow the same rule mentioned above, but are automatically generated in the KernelBoiler.scala file.
    • For Set, it is not implemented yet because there's an issue (SetPartialOrder[A] uses the universal == instead of Eq #1710) regarding the Order/Eq instances on Sets. Implemented using universal equals/hashCode.
  • Added automatic test cases in cats.kernel.laws.
  • Added a Contravariant instance in cats-core for Hash.
  • Added a type alias in the package object of cats for Hash.
  • Adding Hash results in ambiguous implicit resolution for Eq (since both Hash and Order are Eqs). Resolved by introducing precedence of implicits in KernelBoiler.scala. This file now produces 3 files with different implicit precedences.
  • Added a hash method on any type that has a Hash instance.

@ctongfei ctongfei changed the title Hash typeclass (#1690) Hash typeclass May 30, 2017
@ctongfei
Copy link
Contributor Author

ctongfei commented Jun 4, 2017

I need some help here: why does the JS version builds but the JVM version failed on serializability?

@durban
Copy link
Contributor

durban commented Jun 6, 2017

I guess it's due to Java serialization not existing on JS. Or at least it isn't tested: https://github.com/typelevel/cats/blob/master/laws/src/main/scala/cats/laws/SerializableLaws.scala#L33.

@ctongfei
Copy link
Contributor Author

ctongfei commented Jun 6, 2017

@durban So basically it means that the Hash typeclass current does not serialize correctly on JVM. Do you have any idea how to fix this?

@@ -17,7 +17,9 @@ class BigDecimalGroup extends CommutativeGroup[BigDecimal] {
override def remove(x: BigDecimal, y: BigDecimal): BigDecimal = x - y
}

class BigDecimalOrder extends Order[BigDecimal] {
class BigDecimalEq extends Order[BigDecimal] with Hash[BigDecimal] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not change the class name to preserve binary compatibility?

@@ -4,8 +4,8 @@ package instances
package object bigDecimal extends BigDecimalInstances // scalastyle:ignore package.object.name

trait BigDecimalInstances {
implicit val catsKernelStdOrderForBigDecimal: Order[BigDecimal] =
new BigDecimalOrder
implicit val catsKernelStdEqForBigDecimal: Order[BigDecimal] with Hash[BigDecimal] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not change this name for binary compatibility?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I will revert these.

@@ -4,8 +4,8 @@ package instances
package object bigInt extends BigIntInstances // scalastyle:ignore package.object.name

trait BigIntInstances {
implicit val catsKernelStdOrderForBigInt: Order[BigInt] =
new BigIntOrder
implicit val catsKernelStdEqForBigInt: Order[BigInt] with Hash[BigInt] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same binary compat issue.

@@ -17,8 +17,9 @@ class BigIntGroup extends CommutativeGroup[BigInt] {
override def remove(x: BigInt, y: BigInt): BigInt = x - y
}

class BigIntOrder extends Order[BigInt] {
class BigIntEq extends Order[BigInt] with Hash[BigInt] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same.

@@ -6,14 +6,16 @@ import scala.collection.immutable.BitSet
package object bitSet extends BitSetInstances

trait BitSetInstances {
implicit val catsKernelStdPartialOrderForBitSet: PartialOrder[BitSet] =
new BitSetPartialOrder
implicit val catsKernelStdEqForBitSet: PartialOrder[BitSet] with Hash[BitSet] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please don't change name

case Left(_) => false
case Right(yy) => B.eqv(xx, yy)
}
case Left(xx) => Left(A.hash(xx)).##
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not allocate just to get the hashcode.

Something like Left(xx) => A.hash(xx) * 65521 Right(xx) => B.hash(xx) seems fine:
mixing the left with a prime about half the size of the int space (taken from the largest 16 bit here):
http://www.risc.jku.at/people/hemmecke/AldorCombinat/combinatse20.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to have the default Hash instance for Either[A, B] perform the exact same way as the universal hash (##) given the Hash instance for A and B are universal hashes.
So maybe I'll copy the function implementation for case classes in Scala?
Basically I want EitherHash[A, B](DefaultHashInstance[A], DefaultHashInstance[B])(Left(a)) == Left(a).##.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe I'll copy the function implementation for case classes in Scala?

👍

@durban
Copy link
Contributor

durban commented Jun 21, 2017

@ctongfei Well, since Hash itself is Serializable, we should find out which object is the one which causes the serialization failure ... -Dsun.io.serialization.extendedDebugInfo=true is a useful system property for debugging things like this (it prints the path of references from the original serialized object to the one which is not serializable).

@ctongfei ctongfei changed the title Hash typeclass Hash typeclass (WIP) Aug 1, 2017
/**
* Returns the hash code of the given object under this hashing scheme.
*/
def hash(x: A): Int
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider using Long and move away from the JVM legacy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fommil What are the justifications of using Long? JVM-default Int works good enough I think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Object#hashCode: Int in Java is basically used for distribution within Set or Map-like data structures and for that purpose Int will remain sufficient for some time, for one because the size of an array in Java can't exceed Int.MaxValue either and if you instantiate an Array<Int> of size Int.MaxValue that's around ~ 8 GB of RAM.

So I'm not seeing the size of an array evolving to a Long, for one because of backwards compatibility, but also it's probably counter productive to allow contiguous memory blocks bigger than that. For now at least.
I'm guessing that's the reason other languages use 32 bit integers (or sometimes less) for their hash codes as well. For example: https://hackage.haskell.org/package/hashable-1.2.6.1/docs/Data-Hashable.html

The questions are:

  1. besides doing distribution in data structures like HashMap / HashSet, are there any other use cases?
  2. doesn't Long have performance implications? it's an important question to answer, as HashMaps tend to be abused
  3. are there any languages using Long for their universal hashCode already and if so, I'd be interested in why?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

64 bit JVMs store both Int and Long as 64 bit numbers, so there is no advantage to use Int for space preserving reasons. However if you want to make use of existing hashCode algorithms that return Int then that would be a good reason to keep it that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that for now we should stay with what Object#hashCode is doing: returning an Int. Agree with @alexandru .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree about using Int. If you have a Long you can easily get an Int from int, but not the other way around. Some nice algorithms using hashes want more than 32 bits. For instance hyperloglog with only 32 bits would not be great.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to using Long here. If you want to just give me an Int upcast to Long that's on you, but it would be better to enable better algorithms.

@kailuowang kailuowang added this to the 1.0.0-RC1 milestone Aug 9, 2017
Copy link
Member

@alexandru alexandru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the whole PR seems fine to me.

trait HashInstances {
implicit val catsFunctorContravariantForHash: Contravariant[Hash] =
new Contravariant[Hash] {
/** Derive an `Hash` for `B` given an `Hash[A]` and a function `B => A`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nitpick, I obsess over details, but comment style isn't compatible with either ScalaDoc or JavaDoc. This title needs a new-line.


def hash(implicit A: Hash[A]): HashProperties = new HashProperties(
name = "hash",
parent = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this have EqLaws as parent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OrderLaws's parent is also None. But EqLaws are listed below in the props. I copied that behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like we should have EqLaws as a parent.

Copy link
Contributor Author

@ctongfei ctongfei Sep 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find EqLaws in the kernel-laws package.

/**
* Returns the hash code of the given object under this hashing scheme.
*/
def hash(x: A): Int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Object#hashCode: Int in Java is basically used for distribution within Set or Map-like data structures and for that purpose Int will remain sufficient for some time, for one because the size of an array in Java can't exceed Int.MaxValue either and if you instantiate an Array<Int> of size Int.MaxValue that's around ~ 8 GB of RAM.

So I'm not seeing the size of an array evolving to a Long, for one because of backwards compatibility, but also it's probably counter productive to allow contiguous memory blocks bigger than that. For now at least.
I'm guessing that's the reason other languages use 32 bit integers (or sometimes less) for their hash codes as well. For example: https://hackage.haskell.org/package/hashable-1.2.6.1/docs/Data-Hashable.html

The questions are:

  1. besides doing distribution in data structures like HashMap / HashSet, are there any other use cases?
  2. doesn't Long have performance implications? it's an important question to answer, as HashMaps tend to be abused
  3. are there any languages using Long for their universal hashCode already and if so, I'd be interested in why?

case Left(_) => false
case Right(yy) => B.eqv(xx, yy)
}
case Left(xx) => Left(A.hash(xx)).##
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe I'll copy the function implementation for case classes in Scala?

👍

@durban
Copy link
Contributor

durban commented Aug 26, 2017

@ctongfei If you move EitherInstances1#EitherEq to the toplevel (i.e., make it so that it isn't a nested class), that fixes the serialization problem. (Since the problem is caused by the outer class reference not being serializable.)

@codecov-io
Copy link

codecov-io commented Aug 29, 2017

Codecov Report

Merging #1712 into master will increase coverage by 0.02%.
The diff coverage is 98.47%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #1712      +/-   ##
=========================================
+ Coverage   95.57%   95.6%   +0.02%     
=========================================
  Files         248     252       +4     
  Lines        4454    4570     +116     
  Branches      115     117       +2     
=========================================
+ Hits         4257    4369     +112     
- Misses        197     201       +4
Impacted Files Coverage Δ
core/src/main/scala/cats/syntax/hash.scala 0% <0%> (ø)
core/src/main/scala/cats/instances/hash.scala 100% <100%> (ø)
...aws/src/main/scala/cats/kernel/laws/HashLaws.scala 100% <100%> (ø)
...nel/src/main/scala/cats/kernel/instances/set.scala 100% <100%> (ø) ⬆️
...src/main/scala/cats/kernel/instances/boolean.scala 100% <100%> (ø) ⬆️
...el/src/main/scala/cats/kernel/instances/char.scala 100% <100%> (ø) ⬆️
kernel/src/main/scala/cats/kernel/Hash.scala 100% <100%> (ø)
.../src/main/scala/cats/kernel/instances/stream.scala 100% <100%> (ø) ⬆️
testkit/src/main/scala/cats/tests/Helpers.scala 100% <100%> (ø) ⬆️
.../src/main/scala/cats/kernel/instances/double.scala 75% <100%> (+1.66%) ⬆️
... and 27 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a915afb...601e18d. Read the comment docs.

…s.kernel.instances.CharOrder in trait cats.kernel.instances.CharInstances has a different result type in current version, where it is cats.kernel.Order rather than cats.kernel.instances.CharOrder
@ctongfei
Copy link
Contributor Author

It seems that the current version has some binary incompatibility issues with Scala 2.10/2.11 as is indicated by Travis CI:

filter with: ProblemFilters.exclude[ReversedMissingMethodProblem]
("cats.kernel.instances.MapInstances.catsKernelStdHashForMap")

etc. because these are new methods added to existing traits.
Should we just wait until cats 1.0.0?

@kailuowang
Copy link
Contributor

as of now we haven't decided if we want to break kernel bin compat in 1.0.0. @johnynek thoughts?

@ctongfei
Copy link
Contributor Author

ctongfei commented Sep 18, 2017

Apart from binary incompatibility issues with JVM Scala 2.10/2.11, I think that this PR is ready to be merged now.

There are several lines not covered by tests (I don't think they should):
cats/syntax/hash.scala Line 9: This line should be inlined by the macro.
cats/laws/discipline/Eq.scala Line 93: This line should never be run: it means that arbitrary values are not properly generated.

Thanks :-)

@johnynek
Copy link
Contributor

can you add the mima exclusions as was done in #1925 ?

@ctongfei
Copy link
Contributor Author

@johnynek I don't fully understand the rules of these binary checks? Could you elaborate? Thanks!

@kailuowang
Copy link
Contributor

@ctongfei in sbt console if you run

++2.11.11
project kernelJVM
mimaReportBinaryIssues

you will get a report of the violating classes as well as the filters needed to make them exceptions
Add those filters to here https://github.com/typelevel/cats/blob/master/build.sbt#L239 and you should be good to go.

@kailuowang
Copy link
Contributor

@ctongfei do you want to merge in master?

johnynek
johnynek previously approved these changes Oct 4, 2017
Copy link
Contributor

@johnynek johnynek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great.

When we merge with master, I think we are green.

Thanks for carrying this one so far! I look forward to someone building HashSet and HashMap based on this Hash typeclass (not to mention small bloom filters or hll).

@LukaJCB LukaJCB merged commit c131fe4 into typelevel:master Oct 4, 2017
@alexandru
Copy link
Member

👍

@ctongfei
Copy link
Contributor Author

ctongfei commented Oct 4, 2017

Thanks everyone :-)

@kailuowang
Copy link
Contributor

Thank you @ctongfei !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants