Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add responders improvements #3128

Merged
merged 48 commits into from
Oct 27, 2023
Merged

Add responders improvements #3128

merged 48 commits into from
Oct 27, 2023

Conversation

joeyorlando
Copy link
Contributor

@joeyorlando joeyorlando commented Oct 5, 2023

What this PR does

https://www.loom.com/share/c5e10b5ec51343d0954c6f41cfd6a5fb

Summary of backend changes

  • Add AlertReceiveChannel.get_orgs_direct_paging_integrations method and AlertReceiveChannel.is_contactable property. These are needed to be able to (optionally) filter down teams, in the GET /teams internal API endpoint (here), to just teams that have a "contactable" Direct Paging integration
  • engine/apps/alerts/paging.py
    • update these functions to support new UX. In short direct_paging no longer takes a list of ScheduleNotifications or an EscalationChain object
    • add user_is_oncall helper function
    • add _construct_title helper function. In short if no title is provided, which is the case for Direct Pages originating from OnCall (either UI or Slack), then the format is f"{from_user.username} is paging <team.name (if team is specified> <comma separated list of user.usernames> to join escalation"
  • engine/apps/api/serializers/team.py - add number_of_users_currently_oncall attribute to response schema (code)
  • engine/apps/api/serializers/user.py - add is_currently_oncall attribute to response schema (code)
  • engine/apps/api/views/team.py - add support for two new optional query params only_include_notifiable_teams and include_no_team (code)
  • engine/apps/api/views/user.py
    • in the GET /users internal API endpoint, when specifying the search query param now also search on teams__name (code; this is a new UX requirement)
    • add support for a new optional query param, is_currently_oncall, to allow filtering users based on.. whether they are currently on call or not (code)
    • remove check_availability endpoint (no longer used with new UX; also removed references in frontend code)
  • engine/apps/slack/scenarios/paging.py and engine/apps/slack/scenarios/manage_responders.py - update Slack workflows to support new UX. Schedules are no longer a concept here. When creating a new alert group via /escalate the user either specifies a team and/or user(s) (they must specify at least one of the two and validation is done here to check this). When adding responders to an existing alert group it's simply a list of users that they can add, no more schedules.
  • add Organization.slack_is_configured and Organization.telegram_is_configured properties. These are needed to support this new functionality in the AlertReceiveChannel model.

Summary of frontend changes

  • Refactor/rename EscalationVariants component to AddResponders + remove grafana-plugin/src/containers/UserWarningModal (no longer needed with new UX)
  • Remove grafana-plugin/src/models/user.ts as it seemed to be a duplicate of grafana-plugin/src/models/user/user.types.ts

Related to https://github.com/grafana/incident/issues/4278

TODO

  • make frontend changes
  • update Slack backend functionality
  • update public documentation
  • add/update e2e tests

Post-deploy To-dos

  • update dev/ops/production Slack bots to update /escalate command description (should now say "Direct page a team or user(s)")

Checklist

  • Unit, integration, and e2e (if applicable) tests updated
  • Documentation added (or pr:no public docs PR label added if not required)
  • CHANGELOG.md updated (or pr:no changelog PR label added if not required)

@joeyorlando joeyorlando requested a review from a team October 5, 2023 17:20
for schedule in schedules:
# pass user list to list_users_to_notify_from_ical
schedule_oncall_users = list_users_to_notify_from_ical(schedule, events_datetime=events_datetime)
oncall_users.update({schedule.pk: schedule_oncall_users})
oncall_users.update({schedule: schedule_oncall_users})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's much easier here to just return the schedule object in case we need to reference other fields on it (like in the TeamSerializer where we need to reference schedule.team)

@@ -219,10 +221,24 @@ class UserView(
"^username",
"^slack_user_identity__cached_slack_login",
"^slack_user_identity__cached_name",
"^teams__name",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one of the new requirements for the user search functionality in the Direct Paging dialog box will be to search for users based on the teams they currently belong to

@joeyorlando joeyorlando marked this pull request as draft October 5, 2023 17:39
Comment on lines 137 to 175
def check_user_availability(user: User) -> typing.List[AvailabilityWarning]:
"""Check user availability to be paged.

Return a warnings list indicating `error` and any additional related `data`.
"""
warnings: typing.List[AvailabilityWarning] = []
if not user.notification_policies.exists():
warnings.append(
{
"error": PagingError.USER_HAS_NO_NOTIFICATION_POLICY,
"data": {},
}
)

is_on_call = False
schedules = OnCallSchedule.objects.filter(
Q(cached_ical_file_primary__contains=user.username) | Q(cached_ical_file_primary__contains=user.email),
organization=user.organization,
)
schedules_data: ScheduleWarnings = {}
for s in schedules:
# keep track of schedules and on call users to suggest if needed
oncall_users = list_users_to_notify_from_ical(s)
schedules_data[s.name] = set(u.public_primary_key for u in oncall_users)
if user in oncall_users:
is_on_call = True
break

if not is_on_call:
# user is not on-call
# TODO: check working hours
warnings.append(
{
"error": PagingError.USER_IS_NOT_ON_CALL,
"data": {"schedules": schedules_data},
}
)

return warnings
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we no longer need this (nor the related API endpoint) since there is now an attribute on the User model, is_currently_oncall, that achieves this same functionality

"username": user.username,
"avatar": user.avatar_url,
"avatar_full": user.avatar_full_url,
"important": important,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this important key here is really the only reason I modified this method to return typing.List[PagedUser] instead of the original typing.List[User] (obviously don't have the "important" context in the prior approach)

@@ -242,8 +242,6 @@ def get_limited_alerts(self, obj):
alerts = obj.alerts.order_by("-pk")[:100]
return AlertSerializer(alerts, many=True).data

@extend_schema_field(UserShortSerializer(many=True))
@extend_schema_field(PagedUserSerializer(many=True))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change basically just adds the "important" field on these user objects (necessary for the new mockups/UX)

Comment on lines -43 to -46
schedules = ScheduleReferenceSerializer(many=True, required=False, default=list)
context: SerializerContext

escalation_chain_id = serializers.CharField(required=False, default=None)
escalation_chain = serializers.HiddenField(default=None) # set in DirectPagingSerializer.validate
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • get rid of schedules + escalation_chain_id (and by virtue escalation_chain), these concepts are no longer relevant
  • introduce team

@@ -68,7 +68,7 @@ def get_warnings(self, obj):

def get_on_call_now(self, obj):
# Serializer context is set here: apps.api.views.schedule.ScheduleView.get_serializer_context
users = self.context["oncall_users"].get(obj.pk, [])
users = self.context["oncall_users"].get(obj, [])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oncall_users set on this context object now has User objects as keys rather than the string pks

Comment on lines +40 to +41
except DirectPagingUserTeamValidationError:
raise BadRequest(detail=DirectPagingUserTeamValidationError.DETAIL)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validates that the user specifies alteast a team or one user (can specify both of course, but both can't be null)

Comment on lines +277 to +282
if is_currently_oncall_query_param == "true":
# client explicitly wants to filter out users that are on-call
queryset = queryset.filter(pk__in=_get_oncall_user_ids())
elif is_currently_oncall_query_param == "false":
# user explicitly wants to filter out on-call users
queryset = queryset.exclude(pk__in=_get_oncall_user_ids())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure whether behaviour like this is better suited to live in the FilterSet class? I put it here because I was unsure how, within the UserFilter class, to get access to self.schedules_with_oncall_users (or at a minimum, self.request.user)

@@ -168,6 +168,7 @@ function renderFormControl(
}
}

// TODO: make this generic to accept form data type
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grafana/grafana-oncall-frontend is there an easy way to update a class component to be generic? (I'd prefer not to rewrite the component in this PR (as it makes use of React.Comonent.forceUpdate))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean for the data to provide it under <T> or so - or some other params as well? Something like class GForm<T> extends React.Component<GFormProps<T>, {}> { where T will be the type for data.

I'm afraid for that is going to take some time until you figure out all the types passed from the other places such as schedules, maintenance and so on... Atm it's not very clear what their shape looks like. Feel free to open a tech debt/refactoring issue, we can work on it in the near future.

Copy link
Contributor Author

@joeyorlando joeyorlando Oct 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I tried making the component generic using the approach you proposed and it mostly works but it'd require a bit of refactoring on some parts of the existing component. I created this issue to refactor this in a separate PR 👍

(I'll go ahead and remove my TODO comment here)

"avatar": user.avatar_url,
"avatar_full": user.avatar_full_url,
"important": important,
"teams": [{"pk": t.public_primary_key, "name": t.name} for t in user.teams.all()],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

teams is needed here as Grafana Incident uses this for some of it's UI/UX

@joeyorlando
Copy link
Contributor Author

going to update the /teams and /users internal API route Swagger docs in a subsequent PR (as this one is already too large 🙈 )

Copy link
Contributor

@matiasb matiasb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes make sense to me, added a few questions/comments.
Also, wondering what the plan would be for the existing Teams implementation of the paging feature (yet to be enabled).

engine/apps/alerts/models/alert_receive_channel.py Outdated Show resolved Hide resolved
Comment on lines 248 to 282
@classmethod
def get_orgs_direct_paging_integrations(
cls, organization: "Organization"
) -> models.QuerySet["AlertReceiveChannel"]:
return cls.objects.filter(
organization=organization,
integration=AlertReceiveChannel.INTEGRATION_DIRECT_PAGING,
)

@property
def is_contactable(self) -> bool:
"""
Returns true if:
- the integration has more than one channel filter associated with it
- the default channel filter has at least one notification method specified or an escalation chain associated with it
"""
if self.channel_filters.count() > 1:
return True

default_channel_filter = self.default_channel_filter
if not default_channel_filter:
return False

notify_via_slack = self.organization.slack_is_configured and default_channel_filter.notify_in_slack
notify_via_telegram = self.organization.telegram_is_configured and default_channel_filter.notify_in_telegram

notify_via_chatops = notify_via_slack or notify_via_telegram
custom_messaging_backend_configured = default_channel_filter.notification_backends is not None

return (
default_channel_filter.escalation_chain is not None
or notify_via_chatops
or custom_messaging_backend_configured
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this only makes sense for direct paging integrations? Should any other integrations is_contactable value be True if there is some schedule or user to be notified?

engine/apps/alerts/paging.py Outdated Show resolved Hide resolved
engine/apps/alerts/paging.py Outdated Show resolved Hide resolved
engine/apps/api/serializers/user.py Outdated Show resolved Hide resolved
@joeyorlando
Copy link
Contributor Author

So this only makes sense for direct paging integrations? Should any other integrations is_contactable value be True if there is some schedule or user to be notified?

@matiasb it's a new concept that, for now, is only applicable for direct paging integrations, correct. If you are concerned that this might cause confusion, I could move it instead as a utility method in apps.alerts.paging.

Copy link
Contributor

@matiasb matiasb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks 👍

@joeyorlando joeyorlando merged commit 697248d into dev Oct 27, 2023
20 of 21 checks passed
@joeyorlando joeyorlando deleted the jorlando/direct-paging branch October 27, 2023 16:12
@joeyorlando joeyorlando mentioned this pull request Oct 30, 2023
3 tasks
joeyorlando added a commit that referenced this pull request Oct 31, 2023
# Which issue(s) this PR fixes

Closes grafana/support-escalations#8143

Fix a few minor issues introduced in #3128:

- Fix slow `GET /users` internal API endpoint related to [this
change](https://github.com/grafana/oncall/blob/dev/engine/apps/api/views/user.py#L239)
- Fix slow `GET /teams` internal API endpoint. Introduced a `short`
query parameter that only invokes
`apps.schedules.ical_utils.get_oncall_users_for_multiple_schedules` when
`short=false`.
- Order results from `GET /teams` internal API endpoint by name
(ascending)
- Fix search issue when searching for teams in the add responders popup
window (this was strictly a frontend issue)
- CSS changes to add responders dropdown to fix lonnnggg results list:
  **Before**
<img width="377" alt="Screenshot 2023-10-31 at 10 06 20"
src="https://github.com/grafana/oncall/assets/9406895/246c7c3b-7bea-44a1-afec-a38144c2c2d1">
  **After**
<img width="444" alt="Screenshot 2023-10-31 at 10 48 12"
src="https://github.com/grafana/oncall/assets/9406895/b5602a22-c691-4dc7-bd3d-e4d6b76d04a0">



## Still todo

The `apps.schedules.ical_utils.get_oncall_users_for_multiple_schedules`
method is still very slow when an instance has a lot of users (ex.
`ops`). Ideally we should refactor this method to be more efficient
because we still need to call this method under some circumstances. Ex.
to populate this dropdown when Direct Paging a user (note that it didn't
finish loading here on `ops`):
<img width="1037" alt="Screenshot 2023-10-30 at 18 14 59"
src="https://github.com/grafana/oncall/assets/9406895/9d91a57c-5db8-4ff9-862a-cd3755f52690">



## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
brojd pushed a commit that referenced this pull request Sep 18, 2024
# What this PR does

https://www.loom.com/share/c5e10b5ec51343d0954c6f41cfd6a5fb

## Summary of backend changes
- Add `AlertReceiveChannel.get_orgs_direct_paging_integrations` method
and `AlertReceiveChannel.is_contactable` property. These are needed to
be able to (optionally) filter down teams, in the `GET /teams` internal
API endpoint
([here](https://github.com/grafana/oncall/pull/3128/files#diff-a4bd76e557f7e11dafb28a52c1034c075028c693b3c12d702d53c07fc6f24c05R55-R63)),
to just teams that have a "contactable" Direct Paging integration
- `engine/apps/alerts/paging.py`
- update these functions to support new UX. In short `direct_paging` no
longer takes a list of `ScheduleNotifications` or an `EscalationChain`
object
  - add `user_is_oncall` helper function
- add `_construct_title` helper function. In short if no `title` is
provided, which is the case for Direct Pages originating from OnCall
(either UI or Slack), then the format is `f"{from_user.username} is
paging <team.name (if team is specified> <comma separated list of
user.usernames> to join escalation"`
- `engine/apps/api/serializers/team.py` - add
`number_of_users_currently_oncall` attribute to response schema
([code](https://github.com/grafana/oncall/pull/3128/files#diff-26af48f796c9e987a76447586dd0f92349783d6ea6a0b6039a2f0f28bd58c2ebR45-R52))
- `engine/apps/api/serializers/user.py` - add `is_currently_oncall`
attribute to response schema
([code](https://github.com/grafana/oncall/pull/3128/files#diff-6744b5544ebb120437af98a996da5ad7d48ee1139a6112c7e3904010ab98f232R157-R162))
- `engine/apps/api/views/team.py` - add support for two new optional
query params `only_include_notifiable_teams` and `include_no_team`
([code](https://github.com/grafana/oncall/pull/3128/files#diff-a4bd76e557f7e11dafb28a52c1034c075028c693b3c12d702d53c07fc6f24c05R55-R70))
- `engine/apps/api/views/user.py`
- in the `GET /users` internal API endpoint, when specifying the
`search` query param now also search on `teams__name`
([code](https://github.com/grafana/oncall/pull/3128/files#diff-30309629484ad28e6fe09816e1bd226226d652ea977b6f3b6775976c729bf4b5R223);
this is a new UX requirement)
- add support for a new optional query param, `is_currently_oncall`, to
allow filtering users based on.. whether they are currently on call or
not
([code](https://github.com/grafana/oncall/pull/3128/files#diff-30309629484ad28e6fe09816e1bd226226d652ea977b6f3b6775976c729bf4b5R272-R282))
- remove `check_availability` endpoint (no longer used with new UX; also
removed references in frontend code)
- `engine/apps/slack/scenarios/paging.py` and
`engine/apps/slack/scenarios/manage_responders.py` - update Slack
workflows to support new UX. Schedules are no longer a concept here.
When creating a new alert group via `/escalate` the user either
specifies a team and/or user(s) (they must specify at least one of the
two and validation is done here to check this). When adding responders
to an existing alert group it's simply a list of users that they can
add, no more schedules.
- add `Organization.slack_is_configured` and
`Organization.telegram_is_configured` properties. These are needed to
support [this new functionality
](https://github.com/grafana/oncall/pull/3128/files#diff-9d96504027309f2bd1e95352bac1433b09b60eb4fafb611b52a6c15ed16cbc48R271-R272)
in the `AlertReceiveChannel` model.

## Summary of frontend changes
- Refactor/rename `EscalationVariants` component to `AddResponders` +
remove `grafana-plugin/src/containers/UserWarningModal` (no longer
needed with new UX)
- Remove `grafana-plugin/src/models/user.ts` as it seemed to be a
duplicate of `grafana-plugin/src/models/user/user.types.ts`

Related to grafana/incident#4278

- Closes #3115
- Closes #3116
- Closes #3117
- Closes #3118 
- Closes #3177 

## TODO
- [x] make frontend changes
- [x] update Slack backend functionality
- [x] update public documentation
- [x] add/update e2e tests

## Post-deploy To-dos
- [ ] update dev/ops/production Slack bots to update `/escalate` command
description (should now say "Direct page a team or user(s)")

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
brojd pushed a commit that referenced this pull request Sep 18, 2024
# Which issue(s) this PR fixes

Closes grafana/support-escalations#8143

Fix a few minor issues introduced in #3128:

- Fix slow `GET /users` internal API endpoint related to [this
change](https://github.com/grafana/oncall/blob/dev/engine/apps/api/views/user.py#L239)
- Fix slow `GET /teams` internal API endpoint. Introduced a `short`
query parameter that only invokes
`apps.schedules.ical_utils.get_oncall_users_for_multiple_schedules` when
`short=false`.
- Order results from `GET /teams` internal API endpoint by name
(ascending)
- Fix search issue when searching for teams in the add responders popup
window (this was strictly a frontend issue)
- CSS changes to add responders dropdown to fix lonnnggg results list:
  **Before**
<img width="377" alt="Screenshot 2023-10-31 at 10 06 20"
src="https://github.com/grafana/oncall/assets/9406895/246c7c3b-7bea-44a1-afec-a38144c2c2d1">
  **After**
<img width="444" alt="Screenshot 2023-10-31 at 10 48 12"
src="https://github.com/grafana/oncall/assets/9406895/b5602a22-c691-4dc7-bd3d-e4d6b76d04a0">



## Still todo

The `apps.schedules.ical_utils.get_oncall_users_for_multiple_schedules`
method is still very slow when an instance has a lot of users (ex.
`ops`). Ideally we should refactor this method to be more efficient
because we still need to call this method under some circumstances. Ex.
to populate this dropdown when Direct Paging a user (note that it didn't
finish loading here on `ops`):
<img width="1037" alt="Screenshot 2023-10-30 at 18 14 59"
src="https://github.com/grafana/oncall/assets/9406895/9d91a57c-5db8-4ff9-862a-cd3755f52690">



## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants