Maintenance windows (Fleet in your calendar) #17230

Closed
11 of 15 tasks
noahtalerman opened this issue Feb 28, 2024 · 40 comments
Assignees
Labels
~csa Issue was created by or deemed important by the Customer Solutions Architect. #g-endpoint-ops Endpoint ops product group P2 Prioritize as urgent :product Product Design department (shows up on 🦢 Drafting board) story A user story defining an entire feature
Milestone

Comments

@noahtalerman
Member

noahtalerman commented Feb 28, 2024

Goal

User story 1
As an IT admin,
I want Fleet to create an event in my end users' calendars if they're failing policies
so that I don't need to nudge them at inconvenient times when they're failing policies.
User story 2
As a security engineer,
I want Fleet to create an event in my end users' calendars if they're failing policies
so that I don't have to allowlist the CEO and worry that they'll never update.

Context

Changes

Product

  • UI changes: Figma
  • REST API changes: Draft PR
  • Permissions changes: If the permissions for choosing which policies trigger calendar events is the same as choosing which policies fire tickets/create webhooks, then let's use the same line in the Manage access table. If the permissions are different, break out a new line.
  • Outdated documentation changes:
    • REST API docs: See draft PR
    • Update policy automations docs with a new "Calendar events" section. Keep this section as short as possible and link to the article.
    • Scan the policy automations docs to see if there's now outdated language (ex. reference to outdated UI elements).
  • Website redirects:
    • fleetdm.com/learn-more-about/creating-service-accounts
    • fleetdm.com/learn-more-about/calendar-events: Article on fleetdm.com
  • Changes to paid features or tiers: Calendar integration is available in Fleet Premium

Engineering

  • Technical discussion is summarized in this document.
  • Database schema migrations: TODO
  • Load testing: TODO

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

In addition to what's below, see section 10 of the eng doc

Risk assessment

This feature adds:

  • DB migration and tables.

  • A job to go over all hosts and:

    • Schedule events (as calendar meeting slots).
    • Monitor the slots of all hosts for changes.
  • Risk level: High

  • Risk description:
    The main risk is performance.
    Other risks are logic bugs and potential interference with the performance and DB access of other jobs.

Manual testing steps

  • Requires load testing: Yes. Need to validate:
    • No harm done to other jobs and DB access.
    • The feature works properly with many hosts.
      • Max 20,000 hosts with calendar events at the same time and webhooks firing for all of them. (@noahtalerman to update this number if needed)

New things we will need to check/do:

  • Have many agents report failing policies so we can schedule remediation slots for them. Agents will need to switch from failing to passing.
  • Create a Google Calendar environment with many users (thousands?).
  • A way to check that meeting slots were actually set and, when moved, are handled correctly. ( @xpkoala to add )

Configuring load test with real calendar

  • Enable plus addressing (so user+1@example.com is treated like user@example.com) by setting the undocumented env variable FLEET_GOOGLE_CALENDAR_PLUS_ADDRESSING=1
  • Create a team policy that will always fail with osquery-perf hosts containing query: select 0
  • Enroll osquery-perf hosts to that team
  • Decide on real people who will donate their calendars for load testing
  • Update the emails on the hosts using a script calling PUT fleet/hosts/:id/device_mapping and using plus addressing to ensure emails are unique
  • Enable calendar integration globally by using JSON from 1Password (Fleet in your calendar service account)
  • Enable team calendar integration for the failing policy
  • Were all events created in a reasonable time?
  • Is the cron job running every 5 minutes, or does it need much longer to finish?
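
The email-update step above could be scripted roughly as follows. This is a minimal sketch, not the script the team used: the `/api/v1` prefix, the `{"email": ...}` request body, the bearer-token header, and all function names are assumptions for illustration. It only builds the requests; sending them is left to the caller.

```python
import json
import urllib.request


def plus_address(base_email: str, n: int) -> str:
    """Return base_email with "+n" appended to the local part."""
    local, domain = base_email.rsplit("@", 1)
    return f"{local}+{n}@{domain}"


def assign_emails(host_ids, donors):
    """Spread hosts across donor calendars; every host gets a unique address."""
    return {
        host_id: plus_address(donors[i % len(donors)], i)
        for i, host_id in enumerate(host_ids)
    }


def build_device_mapping_request(server_url, token, host_id, email,
                                 api_prefix="/api/v1"):  # prefix is an assumption
    """Build (but do not send) a PUT fleet/hosts/:id/device_mapping request."""
    url = f"{server_url}{api_prefix}/fleet/hosts/{host_id}/device_mapping"
    body = json.dumps({"email": email}).encode()  # body shape is an assumption
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Because plus addressing maps every `user+N@example.com` back to `user@example.com`, a handful of donated calendars can stand in for thousands of distinct host emails.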

Configuring load test with mock calendar

  • Create a team policy that will always fail with osquery-perf hosts containing query: select 0
  • Enroll osquery-perf hosts to that team
  • Update the emails on the hosts using a script calling PUT fleet/hosts/:id/device_mapping
  • Start up and configure mock calendar server. See /tools/calendar/README.md
  • After events are created, move them to the current time to test webhooks firing. Did all webhooks fire in a reasonable time?
    • Note: The calendar cron job only checks calendar events every 30 minutes. You may need to update the updated_at time in the MySQL calendar_events table to force a sooner check.

Modifying calendar event

The user should be able to modify the calendar event. Some situations to test:

  • Move event to the past -> Fleet should create a new event
  • Make event all-day -> Fleet should create a new event
  • Make the event 0 minutes long -> Fleet should fire webhook within the first 5 minutes of event starting
  • Add a guest to the event and decline yourself -> Fleet still treats this event as valid
  • Change timezone of the event -> Fleet still treats this event as valid, and webhook should fire at the right time
  • Move the event to a different calendar -> Fleet should create a new event (Fleet only has access to user's primary calendar)
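
Two of the cases above (a 0-minute event and a timezone change) come down to when the webhook is considered due. A minimal sketch, assuming the check compares absolute instants and ignores event duration; `webhook_due` is a hypothetical name and not Fleet's implementation:

```python
from datetime import datetime, timedelta, timezone


def webhook_due(event_start: datetime, now: datetime) -> bool:
    """True once the event has started; the event's duration is ignored."""
    if event_start.tzinfo is None or now.tzinfo is None:
        raise ValueError("expected timezone-aware datetimes")
    # Aware datetimes compare by instant, so the event's display
    # timezone does not change the result.
    return event_start <= now
```

With this rule, an event shown as 10:00 at UTC-5 and one shown as 15:00 UTC are due at the same moment, and a zero-length event is picked up by the first check that runs after its start.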

Cleanup

  • If the global setting is removed, all calendar events are removed from the MySQL DB
  • If the team setting is disabled, all calendar events for that team are deleted
  • calendar_events rows that have not been updated in 48 hours are deleted (updated_at column)
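
The 48-hour rule can be checked in a test with a small helper like the one below. This is a sketch under assumptions: events are represented as `(event_id, updated_at)` pairs, and a row updated exactly 48 hours ago is kept. The source does not specify the boundary behavior.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=48)


def stale_event_ids(events, now):
    """Return ids of events whose updated_at is more than 48 hours before now.

    events: iterable of (event_id, updated_at) with timezone-aware datetimes.
    """
    cutoff = now - STALE_AFTER
    return [event_id for event_id, updated_at in events if updated_at < cutoff]
```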

Interesting corner cases

  • User (email) has 2 hosts on separate teams that are failing policies -- only 1 event for 1 host should be created, and 1 webhook fired.
  • Host email changes to another user -- the cleanup job should delete the existing event. A new one is created if one doesn't exist already.
  • Host transferred to another team -- the cleanup job should delete the existing event.
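
The first corner case (one event per user even when several of their hosts fail) amounts to deduplicating by email. A minimal sketch; the tuple shape, the case-insensitive email match, and the lowest-host-id tie-break are all assumptions made for illustration, not documented Fleet behavior:

```python
def pick_hosts_for_events(failing_hosts):
    """Choose one host per email from (host_id, team_id, email) tuples.

    Returns {lowercased_email: host_id}; the lowest host id wins (assumption).
    """
    chosen = {}
    for host_id, _team_id, email in failing_hosts:
        key = email.lower()
        if key not in chosen or host_id < chosen[key]:
            chosen[key] = host_id
    return chosen
```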

Testing notes

Confirmation

  1. Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. QA (@____): Added comment to user story confirming successful completion of QA.
@noahtalerman noahtalerman added story A user story defining an entire feature :product Product Design department (shows up on 🦢 Drafting board) labels Feb 28, 2024
noahtalerman added a commit that referenced this issue Mar 1, 2024
- Add "Fleet gets in your calendar" (#17230)
- "Declaration (DDM) profiles" (#14550) before "App deployment" (#14921) 
  - Deploy apps => Deploy security agents
  - Pushes deploy security agents to Q2 (2024-04-22)

Note: Upcoming activity (unified queue) won't guarantee first-in-first-out in Q1
@nonpunctual
Contributor

is there a point at which this will add .ics files sent to users as part of the feature? The vast majority of enterprises use Microsoft email services.

@noahtalerman
Member Author

is there a point at which this will add .ics files sent to users as part of the feature?

@nonpunctual yes. Outlook will likely come after Google Calendar.

@noahtalerman
Member Author

Hey @sharon-fdm I moved your Figma comments below.

Please ask questions and make comments in the GitHub issue here so they're all in one place and easy to find :)

Screenshot 2024-03-06 at 10 35 02 AM

Yes. The plan is to create one meeting for each end user. Even if they have more than one host.

Screenshot 2024-03-06 at 10 36 12 AM

Correct.

In Fleet, a host can have many emails (end users) associated with it.

Fleet will filter this list of emails by emails w/ the matching domain configured by the IT admin (see where the IT admin will configure the domain in Figma here)
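
As a rough illustration of that filtering step (a sketch only; the function name and the case-insensitive comparison are assumptions, not Fleet's code):

```python
def emails_for_domain(emails, domain):
    """Keep only the emails whose domain matches the configured one."""
    suffix = "@" + domain.lower()
    return [email for email in emails if email.lower().endswith(suffix)]
```

Matching on `"@" + domain` rather than on the bare domain avoids accepting look-alike domains such as `notexample.com` when `example.com` is configured.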

cc @getvictor

@noahtalerman
Member Author

cc @rachaelshaw ^^ (forgot to @ mention you)

@sharon-fdm
Collaborator

Thanks @noahtalerman.

@noahtalerman
Member Author

noahtalerman commented Mar 6, 2024

Screenshot 2024-03-06 at 10 41 38 AM

@getvictor you made me realize we could simplify the current plan:

Screenshot 2024-03-06 at 10 41 51 AM

Instead of only scheduling calendar events on Monday (after first enabling), if the end user doesn't have a calendar event, we always schedule one on the upcoming Friday.

Here's how that would look:

Screenshot 2024-03-06 at 10 48 29 AM

This makes the experience more consistent for the end user and IT admin. The expectation becomes, once the end user starts failing one or more policies (no matter what day it is), the calendar event is going to show up on Friday.

@rachaelshaw and Victor, what do y'all think?
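
For reference, the "upcoming Friday" rule proposed in this comment can be computed as below. This is a sketch of the proposal as stated here, not of what eventually shipped; treating a Friday as "schedule for next week's Friday" is an assumption, since the comment does not say what happens on the day itself.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() counts Monday as 0


def upcoming_friday(today: date) -> date:
    """Return the next Friday strictly after today."""
    days_ahead = (FRIDAY - today.weekday()) % 7
    if days_ahead == 0:
        days_ahead = 7  # assumption: on a Friday, pick the following Friday
    return today + timedelta(days=days_ahead)
```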

@noahtalerman
Member Author

Screenshot 2024-03-06 at 10 50 43 AM

@getvictor this makes sense to me.

If we removed the calendar event after the event has started, then I think I would be confused as an end user. Did the IT team do its thing?

@nonpunctual
Contributor

nonpunctual commented Mar 6, 2024

Sorry to barge in here... maybe I am misunderstanding the intent.

My opinion is that Friday is not a great day to use as a default. Lots of orgs:

  • don't do a lot on Fridays
  • don't have people come in on Fridays
  • have specific policies against doing important things on Fridays
    • this includes NOT doing stuff to people's computers
    • why? because if something goes wrong it means someone will have to work on the weekend
  • Microsoft Patch Tuesday is a thing for a reason...

Ideally, the feature should allow admins to pick their default / starting day.

Thanks.

@getvictor getvictor added the Epic DO NOT USE. Auto-created by ZenHub, cannot be disabled. label Mar 6, 2024
Sampfluger88 pushed a commit that referenced this issue Mar 7, 2024
- Add "Fleet gets in your calendar" (#17230)
- "Declaration (DDM) profiles" (#14550) before "App deployment" (#14921)
  - Deploy apps => Deploy security agents
  - Pushes deploy security agents to Q2 (2024-04-22)

Note: Upcoming activity (unified queue) won't guarantee
first-in-first-out in Q1
...
@nonpunctual
Contributor

nonpunctual commented Mar 7, 2024

I did not go looking for this article... It's #1 on Hacker News: https://deploybot.com/blog/no-deployments-on-fridays-a-good-practice-for-software-development-teams

@noahtalerman noahtalerman removed the Epic DO NOT USE. Auto-created by ZenHub, cannot be disabled. label Mar 7, 2024
@getvictor getvictor added :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. #g-endpoint-ops Endpoint ops product group P2 Prioritize as urgent labels Mar 7, 2024
lucasmrod added a commit that referenced this issue Mar 27, 2024
#17230

Fix for the following scenarios:
- Team has only one policy with calendar enabled. Events are created on
user calendars. Then the user disables the calendar on such policy.
Expected behavior: Events on the user calendar should be cleaned up in
that scenario.
- Policy `platform` is edited (which removes `policy_membership`
entries) and we'd like to have the calendar event removed for the hosts
that do not apply anymore.

To cover these scenarios I changed `ds.GetTeamHostsPolicyMemberships` so
that it also returns hosts that have a calendar event AND have no
results on policies (returned as passing=1).
E.g. this could happen if there ARE calendar events for a team but with
a platform that doesn't match the host (so it has no results).
roperzh pushed a commit that referenced this issue Apr 1, 2024
- In Fleet 4.48, we'll ship declaration (DDM) profiles (#14550)
- OS updates w/ DDM (#17230) will ship in 4.49
- Update error message so users know OS updates w/ DDM are coming soon.
Figma is also updated
[here](https://www.figma.com/file/t3j8CGAHR1x1YGjuFLlMst/%2314550-Add-declaration-(DDM)-profiles-for-macOS?type=design&node-id=476%3A11294&mode=design&t=aMjkgv7PGEbePjmH-1).
- In the [Figma wireframes
here](https://www.figma.com/file/JDbJcLRGRs7c7gKDxAfios/%2317295-Use-new-Software-Update-(DDM)-for-macOS-Sonoma-(14)-and-higher?type=design&node-id=348%3A892&mode=design&t=kkpRKOYrvJxfFbM5-1)
for (#17295) add designs for new error message copy so we make the
change when we ship OS updates w/ DDM.
getvictor added a commit that referenced this issue Apr 1, 2024
…rallel. (#17987)

#17230 

This fix addresses the unreleased bug where calendar cleanup job can
take too long, causing the subsequent job to miss a calendar event. The
event deletions now occur in parallel.

Also, reducing max bandwidth for accessing Google calendar by 10% to
prevent potential rate-limiting corner cases.

# Checklist for submitter

- [ ] Changes file added for user-visible changes in `changes/` or
`orbit/changes/`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- [ ] Added/updated tests
- [x] Manual QA for all new/changed functionality
@lukeheath lukeheath added :product Product Design department (shows up on 🦢 Drafting board) and removed :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. labels Apr 4, 2024
@nonpunctual
Contributor

From customer conversation:

Screenshot 2024-04-05 at 4 54 30 PM

@noahtalerman
Member Author

noahtalerman commented Apr 11, 2024

@noahtalerman, reminder to update the docs w/ link to videos to set up Fleet in your calendar: https://www.loom.com/share/9fbdff2998be4877b95ec6702c6c062c?sid=6602e703-aa5a-450a-b092-b5d28eb6e311

rachaelshaw added a commit that referenced this issue Apr 12, 2024
REST API updates for #17230.

---------

Co-authored-by: Noah Talerman <47070608+noahtalerman@users.noreply.github.com>
@noahtalerman
Member Author

TODO: Worth doing a scan of the policy automations docs to see if there's now outdated language (ex. reference to outdated UI elements).

TODO: What would happen if you enable calendar automations for a "No team" policy? Add something to GitOps that adds a global policy w/ automations enabled.

TODO: If the permissions for choosing which policies trigger calendar events is the same as choosing which policies fire tickets/create webhooks, then let's use the same line in the Manage access table. If the permissions are different, break out a new line.

@rachaelshaw when you get the chance, can you please take these on? Thanks!

@noahtalerman
Member Author

What would happen if you enable calendar automations for a "No team" policy?

Hey @getvictor do you know what happens?

@noahtalerman
Member Author

noahtalerman commented May 8, 2024

fleetdm.com/learn-more-about/google-workspace-service-accounts: Google help article

It looks like we ended up adding the following redirect instead: https://github.com/fleetdm/fleet/blob/main/website/config/routes.js#L489

fleetdm.com/learn-more-about/creating-service-accounts

I updated the issue description to reflect this.

cc @rachaelshaw

@getvictor
Member

What would happen if you enable calendar automations for a "No team" policy?

Hey @getvictor do you know what happens?

Nothing would happen. We don't display calendar automation for global policies in the UI, and we don't process global policies in our calendar cron job.

We could either return an error when someone tries to set calendar for a global policy, or simply always set it disabled for a global policy (without returning an error).

@noahtalerman noahtalerman changed the title Fleet in your calendar Maintenance windows (Fleet in your calendar) May 9, 2024
@noahtalerman
Member Author

FYI @noahtalerman ^^

@noahtalerman
Member Author

Hey @marko-lisica when you get the chance (break during wireframes), can you please take on the TODOs in the "Permissions changes" and "Documentation changes" sections?

Thanks!

@noahtalerman
Member Author

can you please take on the TODOs in the "Permissions changes" and "Documentation changes" sections?

Hey @marko-lisica I can take this!

@noahtalerman
Member Author

Doc PR is here: #19232

No permissions doc changes needed. Policy automations are already a row in the permissions table:

Screenshot 2024-05-23 at 12 56 52 PM

noahtalerman added a commit that referenced this issue May 23, 2024
Doc updates for the "Maintenance windows (Fleet in your calendar)" story
(#17230)
@noahtalerman
Member Author

Docs are merged!

@fleet-release
Contributor

Calendar events bloom,
Policy fails find their room,
Fleet makes deadlines swoon.

@nonpunctual
Contributor

related: #21351


9 participants