-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GAWB-2056] The status of the last run is showing green when there are errors in them #709
Conversation
…most recent submission, not the most recent workflow
@@ -731,14 +759,6 @@ trait WorkspaceComponent { | |||
WorkspaceGroups(toGroupMap(realmAclRecs), toGroupMap(accessGroupRecs)) | |||
} | |||
} | |||
|
|||
private def groupByWorkspaceId(runningSubmissions: Seq[(UUID, Int)]): Map[UUID, Int] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These methods seemed useful so I cats-ified and generalized them and moved to DriverComponent
.
* }}} | ||
*/ | ||
def groupPairsK[F[_], A, B](pairs: Seq[(A, F[B])])(implicit M: MonoidK[F]): Map[A, F[B]] = | ||
groupPairs(pairs)(M.algebra[B]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something tells me it'll be a while before this PR gets merged! (i have to wrap my head around this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sorry if this got a little fancy.. it's not really too bad. Check out:
https://github.com/typelevel/cats/blob/master/docs/src/main/tut/typeclasses/semigroupk.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bunch of feedback here.
- Check out
CollectionUtils
- seems like a good place for these to go. - I don't understand this.
- Link to the HTML version of the documentation: it contains hugely important comment lines that aren't rendered inside GitHub!
- sbt console gives me this when I try to define groupPairsK:
<pastie>:18: warning: higher-kinded type should be enabled
by making the implicit value scala.language.higherKinds visible.
This can be achieved by adding the import clause 'import scala.language.higherKinds'
or by setting the compiler option -language:higherKinds.
See the Scaladoc for value scala.language.higherKinds for a discussion
why the feature should be explicitly enabled.
groupPairs
returns a map. You're then dereferencing the key defined byM.algebra[B]
? I don't know what that does and I'd have thought that this would give you a result type ofF[B]
.- Your
groupPairsK
example is equivalent totoMap
, which doesn't help me understand. Perhaps
scala> groupPairsK(Seq(("a", Foo(1).some), ("b", Foo(2).some), ("b", Foo(3).some)))
res0: Map[String,Option[Foo]] = Map(b -> Some(Foo(2)), a -> Some(Foo(1)))
would be more helpful, assuming I've read the Cats documentation (and therefore know that in the absence of a quantified F
(= Option
) means it'll just do orElse
).
TLDR: I am 👍 on groupPairs
and groupTriples
, though they should go into CollectionUtils
. I am 👎 on the K-versions; they rely on non-obvious behaviour (e.g. orElse
for Option
) that make them hard to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback here. After reflecting a bit, maybe it would have been better to leave groupByWorkspaceIdThenStatus
and groupByWorkspaceId
as is -- I didn't really need to change them for this bug. I'm still learning how to scala in a large team, and readability is important.
Digging in to your points:
- +1
- see below
- ok
- Huh, I haven't seen that warning before. I had to enable the
-feature
compiler flag to see it. I guess it's complaining about theF[_]
type parameter. The scaladoc is kind of funny:
Why control it? Higher kinded types in Scala lead to a Turing-complete type system, where compiler termination is no longer guaranteed. They tend to be useful mostly for type-level computation and for highly generic design patterns. The level of abstraction implied by these design patterns is often a barrier to understanding for newcomers to a Scala codebase. Some syntactic aspects of higher-kinded types are hard to understand for the uninitiated and type inference is less effective for them than for normal types. Because we are not completely happy with them yet, it is possible that some aspects of higher-kinded types will change in future versions of Scala. So an explicit enabling also serves as a warning that code involving higher-kinded types might have to be slightly revised in the future.
4,5. So here's why I added the K version. In this case I wanted to group triples where the map value is an Option[java.sql.Timestamp]
. Therefore I need a Monoid instance for java.sql.Timestamp
, which doesn't exist. It wouldn't make much sense to define a Monoid for timestamps -- what would you do, add them together? However, in this case I'm guaranteed to have no key conflicts because of the SQL structure (group by
, etc). So I can just use a Monoid for Option which just takes one or the other, regardless of the value inside the "box". That's exactly what MonoidK[Option]
does.
Furthermore, I'm not sure how I would implement this in terms of groupTriples
without using MonoidK
. Perhaps this would be more clear?
private def groupByWorkspaceIdThenStatus(workflowDates: Seq[(UUID, String, Option[Timestamp])]): Map[UUID, Map[String, Option[Timestamp]]] = {
// bla bla comment about bringing a monoid into scope with Option.orElse behavior
implicit val optionUniversalMonoid: Monoid[Option[Timestamp]] = MonoidK[Option].algebra[Timestamp]
CollectionUtils.groupTriples(workflowDates)
}
Then at least the MonoidK
stuff is localized in a private method with some explanation specific to the use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and pushed a commit with ^ those changes, please take a look.
val workflowDatesByWorkspaceByStatus: Map[UUID, Map[String, Option[Timestamp]]] = groupByWorkspaceIdThenStatus(workflowDates) | ||
val runningSubmissionCountByWorkspace: Map[UUID, Int] = groupByWorkspaceId(runningSubmissions) | ||
val submissionDatesByWorkspaceByStatus = groupTriplesK(submissionDates) | ||
val runningSubmissionCountByWorkspace = groupPairs(runningSubmissions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the rename from groupByWorkspaceIdThenStatus
-> groupTriplesK
and groupByWorkspaceId
-> groupPairs
loses a lot of readability here. Can you keep the function definitions for readability, even if they just point to other ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
* }}} | ||
* */ | ||
def groupPairs[A, B: Monoid](pairs: List[(A, B)]): Map[A, B] = | ||
pairs.foldMap { case (a, b) => Map(a -> b) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
groupByTuplesFlatten
above could call this, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also these should probably prefer Seq
to List
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re Seq: sure
I guess this would work:
//A saner group by than Scala's.
def groupByTuples[A, B]( tupleSeq: Seq[(A,B)] ): Map[A, Seq[B]] = {
tupleSeq.toList.foldMap { case (a, b) => Map(a -> Seq(b)) }
}
def groupByTuplesFlatten[A, B]( tupleSeq: Seq[(A, Seq[B])] ): Map[A, Seq[B]] = {
groupPairs(tupleSeq)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would it prefer Seq
? Do you expect it to be taking arbitrary Seq
s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seq is less specific and we use it all over the place. We'd have to start jamming .toList
everywhere were we to use List.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe people shouldn't have overused Seq
and then that wouldn't be an issue
We have the same problem too and it's annoying. We occasionally run into issues where the thing really only works properly on List but things compile due to Seq and someone jammed something bad in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant just groupByTuplesFlatten
anyway, but oh well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rtitle Cuz the Cats folk understand the value in saying what you mean ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your input, Jeff! When we wholesale switch from Seq to List I'll let you know so you can say I told you so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change these to take Seq
(and just call toList
on them).
I won't touch groupByTuples/groupByTuplesFlatten because that would require introducing monoids and I don't want to break any code.
// structure (group by, etc). | ||
// | ||
// TL/DR: The following line brings into scope a Monoid[Option[Timestamp]] which combines values | ||
// using Option.orElse. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think part of my confusion here is that you're introducing Monoid, which is used to combine things, using a non-combining operation (orElse
), in a situation where you never need to combine two of them anyway because the UUID/String combo is guaranteed to be unique. This is a pretty head-bendy way to achieve the desired result, even if it does work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least the orElse
is explicit now. Before, in this code:
...mapValues { case Seq((_, _, timestamp)) => timestamp })
it's still expecting a unique UUID/String combo, and if that were not the case, there would be a runtime MatchError
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think this is less readable than doing it explicitly but I'm going on vacation tomorrow and thus can't afford to spend more time arguing about it :) Instead, if you replace your comment with the following one, I'll shut up and thumb this. In the long-term, we need a centralised place for these explanations, because throwing them in comments on their first use won't help new hires understand them if they happen upon the uncommented second or third use first. But we can worry about that (a little) later.
The function groupTriples, called below, transforms a Seq((T1, T2, T3)) to a Map(T1 -> Map(T2 -> T3)). It does this by calling foldMap, which in turn requires a monoid for T3. In our case, T3 is an Option[Timestamp], so we need to provide an implicit monoid for Option[Timestamp].
There isn't really a sane monoid implementation for Timestamp (what would you do, add them?). Thankfully it turns out that the UUID/String pairs in workflowDates are always unique, so it doesn't matter what the monoid does because it'll never be used to combine two Option[Timestamp]s. It just needs to be provided in order to make the compiler happy.
To do this, we use the universal monoid for Option, MonoidK[Option]. Note that the inner Option takes no type parameter: MonoidK doesn't care about the type inside Option, it just calls orElse on the Option for its "combine" operator. Finally, the call to algebra[Timestamp] turns a MonoidK[Option] into a Monoid[Option[Timestamp]] by leaving the monoid implementation alone (so it still calls orElse) and poking the Timestamp type into the Option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok :)
// join workflow on workflow.submissionId = submission.id | ||
// where submission.workspaceId in (:workspaceIds)) v | ||
// group by 1, 2 | ||
// having (status = 'Failure' or (status = 'Succeeded' and count(v.*) = 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@helgridly based on our conversation how does this revised query look to you?
select workspaceId, status, max(subEndDate)
from (
select submission.id, submission.workspaceId, workflow.status, max(workflow.statusLastChangedDate) as subEndDate
from submission
join workflow on workflow.submissionId = submission.id
where submission.workspaceId in (:workspaceIds)
group by 1, 2, 3) v
group by 1, 2
having (status = 'Failure' or (status = 'Succeeded' and count(v.id) = 1))
Explanation:
- inner query returns the most recent workflow status change date, per workflow status and submission
- outer query returns the most recent workflow status change date where:
- the status is Failure; or
- the status is Succeeded and that is the only status in the submission
I can code this up and try to break it with tests too, just thought I'd post the SQL first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's right, though SQL isn't my native language!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually after some testing I think it's not right. SQL is not my native language either (or maybe it's my slick mapping). :( Still working through it, expect another iteration...
@helgridly I think this is ready for another look, thanks |
}.groupBy { case (submissionId, workspaceId, _, _) => | ||
(submissionId, workspaceId) | ||
}.map { case ((submissionId, workspaceId), recs) => | ||
(submissionId, workspaceId, recs.map(_._3).max, recs.map(_._4).max) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The third element is supposed to be a count, right? recs.length
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I meant max
. sum
would work as well. This is here so the value will be >0 if the submission contains any failed workflows, and it will be 0 if it contains all successful workflows.
count
would not work since it would just count all the rows, regardless of the workflow status (in SQL count(0) == count(1) == count(*) since 0 and 1 are both non-null values).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I meant sum
-- sorry. This jumped out at me because the outerSubmissionDateQuery
line calls this value numFailures
. If it's just a 1/0 then maybe just call exists
here so the later line can be hasFailures
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, +1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure .exists
would work in this context -- I just changed it to .count
to better reflect the variable name/type.
Could you explain the explain plan diagram? :p Looking at it I see the words "full table scan" twice which makes me think maybe we're missing an index somewhere. |
Sure:
The below diagram is how it is executed in code:
|
// using Option.orElse. | ||
|
||
implicit val optionTimestampMonoid: Monoid[Option[Timestamp]] = MonoidK[Option].algebra[Timestamp] | ||
CollectionUtils.groupTriples(workflowDates.toList) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need the toList
here any more, now you've swapped them to use Seq
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks
👍 pending two final nitpicks |
https://broadinstitute.atlassian.net/browse/GAWB-2056 (see latest comments)
Problem:
In "get workspace" calls, Rawls was returning the timestamps of the most recent successful and failed workflows. This is used in the UI to display either green or red in the workspace summary and list pages (red if latest failed workflow timestamp > latest successful workflow timestamp; green otherwise).
However, these workflows might be in the same submission. In the UI, a submission is red if any of its workflows failed. This led to confusing behavior: a workspace might be green in the Summary tab, but have only failed submissions in the Monitor tab.
Solution:
Change semantics of
WorkspaceSubmissionStats
to return the timestamps of the most recent successful and failed submission, taking into account the logic that a submission is a failure if any of its workflows are a failure.Tested via unit tests and manually using a UI. Also confirmed the generated slick query matches the query in the comments.
Also, query explain plan:
dev
and
alpha
DBs in Google Cloud Console