Fleet product groups employ scrum, an agile methodology, as a core practice in software development. This process is designed around sprints, which last three weeks to align with our release cadence.
Each sprint is marked by five essential ceremonies:
- Sprint kickoff: On the first day of the sprint, the team, along with stakeholders, selects items from the backlog to work on. The team then commits to completing these items within the sprint.
- Daily standup: Every day, the team convenes for updates. During this session, each team member shares what they accomplished since the last standup, their plans until the next meeting, and any blockers they are experiencing. Standups should last no longer than fifteen minutes. If additional discussion is necessary, it takes place after the standup with only the required participants.
- Weekly estimation sessions: The team estimates backlog items once a week (three times per sprint). These sessions help schedule work completion and align the roadmap with business needs. They also provide estimated work units for upcoming sprints. The EM is responsible for the point values assigned to each item and ensures they are as realistic as possible.
- Sprint demo: On the last day of each sprint, all engineering teams and stakeholders come together to review completed work. Engineers are allotted 3-10 minutes to present their accomplishments, as well as any pending tasks. (These meetings are recorded and posted publicly to YouTube or other platforms, so participants should avoid mentioning customer names. For example, instead of "Fastly", you can say "a publicly-traded hosting company", or use the customer's codename.)
- Sprint retrospective: Also held on the last day of the sprint, this meeting encourages discussions among the team and stakeholders around three key areas: what went well, what could have been better, and what the team learned during the sprint.
Each product group has a dedicated sprint board:
New tickets are estimated, specified, and prioritized on the roadmap:
Our scrum boards are exclusively composed of four types of scrum items:
User stories: These are simple, concise descriptions of features or requirements from the user's perspective, marked with the `story` label. They keep our focus on delivering value to our customers. Occasionally, due to ZenHub's ticket sub-task structure, the term "epic" may be seen. However, we treat these as regular user stories.
Sub-tasks: These smaller, more manageable tasks contribute to the completion of a larger user story. Sub-tasks are labeled with `~sub-task` and enable us to break down complex tasks into more detailed, easier-to-estimate work units. Sub-tasks are always assigned to exactly one user story.
Timeboxes: Tasks that are scoped to be completed within a pre-defined amount of time are marked with the `timebox` label. Timeboxes are research or investigation tasks necessary to move a prioritized user story forward, sometimes called "spikes" in scrum methodology. We use the term "timebox" because it better communicates the purpose. Timeboxes are always assigned to exactly one user story.
Bugs: Representing errors or flaws that result in incorrect or unexpected outcomes, bugs are marked with the `bug` label. Like user stories and sub-tasks, bugs are documented, prioritized, and addressed during a sprint. Bugs may be estimated or left unestimated, as determined by the product group's engineering manager.
Our sprint boards do not accommodate any other type of ticket. By strictly adhering to these four types of scrum items, we maintain an organized and focused workflow that consistently adds value for our users.
- Sprint ceremonies
- User story discovery
- Eng together
- Group weeklies
- Eng leadership weekly
- Eng product bi-weekly
- Product development process review
- Stay in alignment across the whole organization.
- Build teams, not groups of people.
- Provide substantial time for engineers to do focused work.
- Support the Maker Schedule by keeping meetings to a minimum.
- Each individual must have a weekly or biweekly sync 1:1 meeting with their manager. This is key to making sure each individual has a voice within the organization.
- Favor async communication when possible. This is very important to make sure every stakeholder on a project can have a clear understanding of what's happening or what was decided, without needing to attend every meeting (e.g., if a person is sick, on vacation, or life just happened).
- If an async conversation is not proving to be effective, never hesitate to hop on or schedule a call. Always document the decisions made in a ticket, document, or whatever makes sense for the conversation.
This meeting is to disseminate engineering-wide announcements, promote cohesion across groups within the engineering team, and connect with engineers (and the "engineering-curious") in other departments. Held monthly for one hour.
Everyone at the company is welcome to attend. All engineers are asked to attend. The subject matter is focused on engineering.
- Engineering KPIs review
- “Tech talks”
- At least one engineer from each product group demos or discusses a technical aspect of their recent work.
- Everyone is welcome to present on a technical topic. Add your name and tech talk subject in the agenda doc included in the Eng Together calendar event.
- Structured and/or unstructured social activities
User story discovery meetings are scheduled as needed to align on large or complicated user stories. Before a discovery meeting is scheduled, the user story must be prioritized for product drafting and go through the design and specification process. When the user story is ready to be estimated, a user story discovery meeting may be scheduled to provide more dedicated, synchronous time for the team to discuss the user story than is available during weekly estimation sessions.
All participants are expected to review the user story and associated designs and specifications before the discovery meeting.
- Product Manager
- Product Designer
- Engineering Manager
- Backend Software Engineer
- Frontend Software Engineer
- Product Quality Specialist
- Product Manager: Why this story has been prioritized
- Product Designer: Walk through user journey wireframes
- Engineering Manager: Review specifications and any defined sub-tasks
- Software Engineers: Clarifying questions and implementation details
- Product Quality Specialist: Testing plan
A chance for deeper, synchronous discussion on topics relevant across product groups like “Frontend weekly”, “Backend weekly”, etc.
Anyone who wishes to participate.
- Discuss common patterns and conventions in the codebase
- Review difficult frontend bugs
- Write engineering-initiated stories
Engineering leaders discuss topics of importance that week. Prepare agenda, announcements, and tech talks before the monthly Eng Together meeting.
- Engineering Managers
- Director of Product Development
- Review Engineering KPIs.
- Review each product group's ZenHub board.
- Proceed to agenda.
- Engineer hiring
- Process discussion
- New documentation needs
Engineering and product bi-weekly sync to discuss process, roadmap, and scheduling.
- Head of Product
- Product Managers (optional)
- Director of Product Development
- Engineering Managers (optional)
- Product to engineering handoff process
- Q4 product roadmap
- Optimizing development processes
A once-per-sprint review of the bugs, drafting, and sprint boards to make sure that the current state of the boards reflects the process as defined in the handbook, and to identify any changes needed to the documented process.
- Head of Product
- Product Operations
- Director of Product Development
- Review bugs board
- Review drafting board
- Review sprint boards
- How is the process working? Are any changes needed?
Engineering-initiated stories are types of user stories created by engineers to make technical changes to Fleet. Technical changes should improve the user experience or contributor experience. For example, optimizing SQL that improves the response time of an API endpoint improves user experience by reducing latency. A script that generates common boilerplate, or automated tests to cover important business logic, improves the quality of life for contributors, making them happier and more productive, resulting in faster delivery of features to our customers.
It is important to frame engineering-initiated user stories the same way we frame all user stories. Stay focused on how this technical change will drive value for our users.
Engineering-initiated stories follow the user story drafting process. Once your user story is created using the new story template, add the `~engineering-initiated` label, assign it to yourself, and work with an EM or PM to progress the story through the drafting process.
We prefer the term engineering-initiated stories over technical debt because the user story format helps keep us focused on our users.
- Create a new feature request issue in GitHub.
- Ensure it is labeled with `~engineering-initiated` and the relevant product group. Remove any
- Assign it to yourself. You will own this user story until it is either prioritized or closed.
- Schedule a time with an EM and/or PM to present your story. Iterate based on feedback.
- You, your EM, or your PM can bring this to Feature Fest for consideration. All engineering-initiated changes go through the same drafting process as any other story.
We aspire to dedicate 20% of each sprint to technical changes, but may allocate less based on customer needs and business priorities.
Fleet's documentation for contributors can be found in the Fleet GitHub repo.
This section outlines the release process at Fleet.
The current release cadence is once every three weeks and is concentrated around Wednesdays.
To ensure release quality, Fleet has a freeze period for testing beginning the Tuesday before the release at 9:00 AM Pacific. Effective at the start of the freeze period, new feature work will not be merged into `main`.
Bugs are exempt from the release freeze period.
To begin the freeze, open the repo on Merge Freeze and click the "Freeze now" button. This will freeze the `main` branch and require any PRs to be manually unfrozen before merging. PRs can be manually unfrozen in Merge Freeze using the PR number.
Any Fleetie can unfreeze PRs on Merge Freeze if the PR contains documentation changes or bug fixes only. If the PR contains other changes, please confirm with your manager before unfreezing.
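The unfreeze rules above can be sketched as a small decision function. This is an illustrative sketch only, not part of the Fleet codebase; the `PR` type and `canSelfUnfreeze` helper are hypothetical.

```go
package main

import "fmt"

// PR is a hypothetical summary of a pull request's contents
// during the freeze period.
type PR struct {
	DocsOnly   bool // contains documentation changes only
	BugFixOnly bool // contains bug fixes only
}

// canSelfUnfreeze reports whether any Fleetie may unfreeze the PR
// without checking with their manager, per the freeze policy above.
func canSelfUnfreeze(pr PR) bool {
	return pr.DocsOnly || pr.BugFixOnly
}

func main() {
	fmt.Println(canSelfUnfreeze(PR{DocsOnly: true})) // true: docs-only PRs may be unfrozen
	fmt.Println(canSelfUnfreeze(PR{}))               // false: feature work needs manager approval
}
```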
Before kicking off release QA, confirm that we are using the latest versions of dependencies we want to keep up-to-date with each release. Currently, those dependencies are:
- Go: Latest minor release
- Check the version included in Fleet.
- Check the latest minor version of Go. For example, if we are using `go1.19.8` and there is a new minor version `go1.19.9`, we will upgrade.
- If the latest minor version is greater than the version included in Fleet, file a bug and assign it to the release ritual DRI and the current oncall engineer. Add the `~release blocker` label. We must upgrade to the latest minor version before publishing the next release.
- If the latest major version is greater than the version included in Fleet, create a story and assign it to the release ritual DRI and the current oncall engineer. This will be considered for an upcoming sprint. The release can proceed without upgrading the major version.
In Go versioning, the number after the first dot is the "major" version, while the number after the second dot is the "minor" version. For example, in Go 1.19.9, "19" is the major version and "9" is the minor version. Major version upgrades are assessed separately by engineering.
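The comparison described above can be sketched in Go. This is an illustrative helper, not Fleet code; it follows the Go team's terminology used above, where in `go1.19.9` the major version is 19 and the minor version is 9.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGoVersion splits a version string like "go1.19.9" into its
// "major" (19) and "minor" (9) parts, per Go's own terminology.
func parseGoVersion(v string) (major, minor int, err error) {
	parts := strings.Split(strings.TrimPrefix(v, "go"), ".")
	if len(parts) != 3 || parts[0] != "1" {
		return 0, 0, fmt.Errorf("unexpected version format: %q", v)
	}
	major, err = strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, err
	}
	minor, err = strconv.Atoi(parts[2])
	if err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// needsMinorUpgrade reports whether latest is a newer minor release
// of the same major version as current. Major version bumps are
// excluded: those are tracked as a story, not a release blocker.
func needsMinorUpgrade(current, latest string) bool {
	curMaj, curMin, err1 := parseGoVersion(current)
	latMaj, latMin, err2 := parseGoVersion(latest)
	if err1 != nil || err2 != nil {
		return false
	}
	return curMaj == latMaj && latMin > curMin
}

func main() {
	fmt.Println(needsMinorUpgrade("go1.19.8", "go1.19.9")) // true: upgrade before the release
	fmt.Println(needsMinorUpgrade("go1.19.9", "go1.20.1")) // false: major upgrade, assessed separately
}
```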
- macadmins-extension: Latest release
- Check the latest version of the macadmins-extension.
- Check the version included in Fleet.
- If the latest stable version of the macadmins-extension is greater than the version included in Fleet, file a bug and assign it to the release ritual DRI and the current oncall engineer.
- Add the `~release blocker` label.
Note: Some new versions of the macadmins-extension include updates that require code changes in Fleet. Make sure to note in the bug that the update should be checked for anything, like new tables, that requires code changes in Fleet.
Our goal is to keep these dependencies up-to-date with each release of Fleet. If a release is going out with an old dependency version, it should be treated as a critical bug to make sure it is updated before the release is published.
We merge bug fixes and documentation changes during the freeze period, but we do not merge other code changes. This minimizes code churn and helps ensure a stable release. To merge a bug fix, first unfreeze the PR in Merge Freeze by clicking the "Unfreeze 1 pull request" text link.
It is sometimes necessary to delay the release to allow time to complete partially merged feature work. In these cases, an exception process must be followed before merging during the freeze period.
- The engineer requesting the feature work merge exception during freeze notifies their Engineering Manager.
- The Engineering Manager notifies the QA lead for the product group and the release ritual DRI.
- The Engineering Manager, QA lead, and release ritual DRI must all approve the feature work PR before it is unfrozen and merged.
After each product group finishes their QA process during the freeze period, the EM @ mentions the release ritual DRI in the #help-qa Slack channel. When all EMs have certified that they are ready for release, the release ritual DRI begins the release process.
Documentation on completing the release process can be found here.
After each Fleet release, the new release is deployed to Fleet's dogfood (internal) instance.
How to deploy a new release to dogfood:
- Head to the Tags page on the fleetdm/fleet Docker Hub: https://hub.docker.com/r/fleetdm/fleet/tags
- In the Filter tags search bar, type in the latest release (e.g., v4.19.0).
- Locate the tag for the new release and copy the image name. An example image name is "fleetdm/fleet:v4.19.0".
- Head to the "Deploy Dogfood Environment" action on GitHub: https://github.com/fleetdm/fleet/actions/workflows/dogfood-deploy.yml
- Select "Run workflow" and paste the image name into the "The image tag wished to be deployed." field.
Note that this action will not handle down migrations. Always deploy a newer version than is currently deployed.
Note that "fleetdm/fleet:main" is not an image name; use the commit hash in place of "main" instead.
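The image-name rule above can be captured in a small helper. This is a hypothetical sketch for illustration; in practice you simply paste the image name into the workflow field.

```go
package main

import "fmt"

// imageName builds the Docker image reference to paste into the
// "Deploy Dogfood Environment" workflow. It rejects "main" because
// "fleetdm/fleet:main" is not a published image tag; a release tag
// or commit hash must be used instead. (Hypothetical helper.)
func imageName(ref string) (string, error) {
	if ref == "" || ref == "main" {
		return "", fmt.Errorf("use a release tag (e.g. v4.19.0) or a commit hash, not %q", ref)
	}
	return "fleetdm/fleet:" + ref, nil
}

func main() {
	name, err := imageName("v4.19.0")
	fmt.Println(name, err) // fleetdm/fleet:v4.19.0 <nil>
}
```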
Immediately after publishing a new release, we close out the associated GitHub issues and milestones.
- Rename current milestone: In GitHub, change the current milestone name from
Remove milestone from unfinished items: If you see any items in columns other than "Ready for release" tagged with the current milestone, remove that milestone tag. These items didn't make it into the release.
Prep release items: Make sure all items in the "Ready for release" column have the current milestone and sprint tags. If not, select all items in the column and apply the appropriate tags.
Move user stories to drafting board: Select all items in "Ready for release" that have the `story` label. Apply the `:product` label and remove the `:release` label. These items will move back to the product drafting board.
Confirm and close: Make sure that all items with the `story` label have left the "Ready for release" column. Select all remaining items in the "Ready for release" column and move them to the "Closed" column. This will close the related GitHub issues.
Confirm and celebrate: Now, head to the Drafting board. Find all `story` issues with the current milestone (these are the ones you just moved). Move them to the "Confirm and celebrate" column. Product will close the issues during their confirm and celebrate ritual.
Close GitHub milestone: Visit GitHub's milestone page and close the current milestone.
Create next milestone: Create a new milestone for the next versioned release,
Remove the freeze: Open the repo in Merge Freeze and click the "Unfreeze" button.
Announce that `main` is unfrozen and the milestone has been closed in #help-engineering.
See the internal Google Doc for the engineers in the rotation.
Fleet team members can also subscribe to the shared calendar for calendar events.
New engineers are added to the oncall rotation by their manager after they have completed onboarding and at least one full release cycle. We aim to alternate the rotation between product groups when possible.
The oncall rotation may be adjusted with approval from the EMs of any product groups affected. Any changes should be made before the start of the sprint so that capacity can be planned accordingly.
The oncall engineer is a second-line responder to questions raised by customers and community members.
The community contact (Kathy) is responsible for the first response to GitHub issues, pull requests, and Slack messages in the #fleet channel of osquery Slack, and other public Slacks. Kathy and Zay are responsible for the first response to messages in private customer Slack channels.
We respond within 1 hour (during business hours) for interactions and ask the oncall engineer to address any questions sent their way promptly. When Kathy is unavailable, the oncall engineer may sometimes be asked to take over the first response duties. Note that we do not need to have answers within 1 hour; we need to at least acknowledge the request and collect any additional necessary information, while researching/escalating to find answers internally. See Escalations for more on this.
Response SLAs help us measure and guarantee the responsiveness that a customer can expect from Fleet. But SLAs aside, when a Fleet customer has an emergency or other time-sensitive situation ongoing, it is Fleet's priority to help them find a solution quickly.
PRs from Fleeties are reviewed by auto-assignment of codeowners, or by selecting the group or reviewer manually.
PRs should remain in draft until they are ready to be reviewed for final approval; this means the feature is complete, with tests already added. This helps keep our active list of PRs relevant and focused. It is OK and encouraged to request feedback while a PR is in draft to engage the team.
All PRs from the community are routed through the oncall engineer. For documentation changes, the community contact (Kathy) is assigned by the oncall engineer. For code changes, if the oncall engineer has the knowledge and confidence to review, they should do so. Otherwise, they should request a review from an engineer with the appropriate domain knowledge. It is the oncall engineer's responsibility to monitor community PRs and make sure that they are moved forward (either by review with feedback or merge).
The oncall engineer is encouraged to attend some of the customer success meetings during the week. Post a message to the #g-cx Slack channel requesting invitations to upcoming meetings.
This has a dual purpose of providing more context for how our customers use Fleet. The engineer should actively participate and provide input where appropriate (if not sure, please ask your manager or organizer of the call).
The oncall engineer is asked to read, understand, test, correct, and improve at least one doc page per week. Our goals are to (1) ensure accuracy and verify that our deployment guides and tutorials are up to date and work as expected, and (2) improve the readability, consistency, and simplicity of our documentation, with empathy towards first-time users. See Writing documentation for writing guidelines, and don't hesitate to reach out to #g-digital-experience on Slack for writing support. A backlog of documentation improvement needs is kept here.
Engineering managers are asked to be aware of the oncall rotation and schedule a light workload for engineers while they are oncall. While it varies week to week considerably, the oncall responsibilities can sometimes take up a substantial portion of the engineer's time.
The remaining time after fulfilling the responsibilities of oncall is free for the engineer to choose their own path. Please choose something relevant to your work or Fleet's goals to focus on. If unsure, feel free to speak with your manager.
- Do training/learning relevant to your work.
- Improve the Fleet developer experience.
- Hack on a product idea. Note: Experiments are encouraged, but not all experiments will ship! Check in with the product team before shipping user-visible changes.
- Create a blog post (or other content) for fleetdm.com.
- Try out an experimental refactor.
At the end of your oncall shift, you will be asked to share how you spent your time.
Oncall engineers do not need to actively monitor Slack channels, except when called in by the Community or Customer teams. Members of those teams are instructed to mention `@oncall` in #help-engineering to get the attention of the oncall engineer to continue discussing any issues that come up. In some cases, the Community or Customer representative will continue to communicate with the requestor. In others, the oncall engineer will communicate directly (team members should use their judgment and discuss on a case-by-case basis how to best communicate with community members and customers).
When the oncall engineer is unsure of the answer, they should follow this process for escalation.
To achieve quick "first-response" times, you are encouraged to say something like "I don't know the answer and I'm taking it back to the team," or "I think X, but I'm confirming that with the team (or by looking in the code)."
How to escalate:
Spend 30 minutes digging into the relevant code (osquery, Fleet) and/or documentation (osquery, Fleet). Even if you don't know the codebase (or even the programming language), you can sometimes find good answers this way. At the least, you'll become more familiar with each project. Try searching the code for relevant keywords, or filenames.
Create a new thread in the #help-engineering channel, tagging `@zwass`, and provide the information turned up in your research. Please include possibly relevant links (even if you didn't find what you were looking for there). Zach will work with you to craft an appropriate answer or find another team member who can help.
The oncall engineer changes each week on Wednesday.
A Slack reminder should notify the oncall of the handoff. Please do the following:
The new oncall engineer should change the `@oncall` alias in Slack to point to them. In the search box, type "people" and select "People & user groups." Switch to the "User groups" tab. Click `@oncall`. In the right sidebar, click "Edit Members." Remove the former oncall, and add yourself.
Hand off newer conversations (Slack threads, issues, PRs, etc.). For more recent threads, the former oncall can unsubscribe from the thread, and the new oncall should subscribe. The former oncall should explicitly share each of these threads and the new oncall can select "Get notified about new replies" in the "..." menu. The former oncall can select "Turn off notifications for replies" in that same menu. It can be helpful for the former oncall to remain available for any conversations they were deeply involved in, so use your judgment on which threads to hand off. Anything not clearly handed off remains the responsibility of the former oncall engineer.
In the Slack reminder thread, the oncall engineer includes their retrospective. Please answer the following:
What were the most common support requests over the week? This can potentially give the new oncall an idea of which documentation to focus their efforts on.
Which documentation page did you focus on? What changes were necessary?
How did you spend the rest of your oncall week? This is a chance to demo or share what you learned.
At Fleet, we take customer incidents very seriously. After working with customers to resolve issues, we will conduct an internal postmortem to determine any documentation or coding changes to prevent similar incidents from happening in the future. Why? We strive to make Fleet the best osquery management platform globally, and we sincerely believe that starts with sharing lessons learned with the community to become stronger together.
At Fleet, we do postmortem meetings for every production incident, whether it's a customer's environment or on fleetdm.com.
Before running the postmortem meeting, copy this Postmortem Template document and populate it with some initial data to enable a productive conversation.
Invite all stakeholders, typically the team involved and QA representatives.
Follow the document topic by topic. Keep the goal in mind: capture action items for addressing the root cause and making sure a similar incident will not happen again.
Distinguish between the root cause of the bug, which by that time has been solved and released, and the root cause of why this issue reached our customers. These could be different issues. (For example, the root cause of the bug was a coding issue, but the root causes (plural) of the event may be that the test plan did not cover a specific scenario, a lack of testing, and a lack of metrics to identify the issue quickly.)
Each action item will have an owner who is responsible for creating a GitHub issue promptly after the meeting. This GitHub issue should be prioritized with the relevant PM/EM.
At Fleet, we consider an outage to be a situation where new features or previously stable features are broken or unusable.
- Occurrences of outages are tracked in the Outages spreadsheet.
- Fleet encourages embracing the inevitability of mistakes and discourages blame games.
- Fleet stresses the critical importance of avoiding outages because they make customers' lives worse instead of better.
Fleet, as a Go server, scales horizontally very well. It’s not very CPU or memory intensive. However, there are some specific gotchas to be aware of when implementing new features. Visit our scaling Fleet page for tips on scaling Fleet as efficiently and effectively as possible.
The load testing page outlines the process we use to load test Fleet, and contains the results of our semi-annual load test.
To provide the most accurate and efficient support, Fleet will only target fixes based on the latest released version. Fleet will not backport fixes to older releases.
| | Version supported for bug fixes | Version supported for support/troubleshooting |
|---|---|---|
| Community | Latest version only | Current major version |
| Premium | Latest version only | All versions |
If you're assigned a community pull request for review, it is important to keep things moving for the contributor. The goal is to not go more than one business day without following up with the contributor.
A PR should be merged if:
- It's a change that is needed and useful.
- The CI is passing.
- Tests are in place.
- Documentation is updated.
- Changes file is created.
For PRs that aren't ready to merge:
- Thank the contributor for their hard work and explain why we can't merge the changes yet.
- Encourage the contributor to reach out in the #fleet channel of osquery Slack to get help from the rest of the community.
- Offer code review and coaching to help get the PR ready to go (see note below).
- Keep an eye out for any updates or responses.
Sometimes (typically for Fleet customers), a Fleet team member may add tests and make any necessary changes to merge the PR.
If everything is good to go, approve the review.
For PRs that will not be merged:
- Thank the contributor for their effort and explain why the changes won't be merged.
- Close the PR.
When merging a pull request from a community contributor:
- Ensure that the checklist for the submitter is complete.
- Verify that all necessary reviews have been approved.
- Merge the PR.
- Thank and congratulate the contributor.
- Share the merged PR with the team in the #help-promote channel of Fleet Slack to be publicized on social media. Those who contribute to Fleet and are recognized for their contributions often become great champions for the project.
Whenever a PR is proposed to change our tables' schema (e.g., to schema/tables/screenlock.yml), the change also has to be reflected in our osquery_fleet_schema.json file.
The website team will periodically update the json file with the latest changes. If the changes should be deployed sooner, you can generate the new json file yourself by running these commands:
```
cd website
./node_modules/sails/bin/sails.js run generate-merged-schema
```
When adding a new table, make sure it does not already exist with the same name. If it does, consider changing the new table name or merging the two tables if that makes sense.
If a table is added to our ChromeOS extension but does not exist in osquery, or if it is a table added by fleetd, add a note that mentions it, as in this example.
Fleet uses a human-oriented quality assurance (QA) process to make sure the product meets the standards of users and organizations.
Automated tests are important, but they can't catch everything. Many issues are hard to notice until a human looks empathetically at the user experience, whether in the user interface, the REST API, or the command line.
The goal of quality assurance is to identify corrections and improvements before release:
- Edge cases
- Error message UX
- Developer experience using the API/CLI
- Operator experience looking at logs
- API response time latency
- UI comprehensibility
- Data accuracy
- Perceived data freshness
To try Fleet locally for QA purposes, run `fleetctl preview`, which defaults to running the latest stable release.
To target a different version of Fleet, use the `--tag` flag to target any tag in Docker Hub, including any git commit hash or branch name. For example, to QA the latest code on the `main` branch of fleetdm/fleet, you can run `fleetctl preview --tag=main`.
To start a preview without starting the simulated hosts, use the `--no-hosts` flag (e.g., `fleetctl preview --no-hosts`).
For each bug found, please use the bug report template to create a new bug report issue.
For unreleased bugs in an active sprint, a new bug is created with the `~unreleased bug` label. The `:release` label and associated product group label are added, and the engineer responsible for the feature is assigned. If QA is unsure who the bug should be assigned to, it is assigned to the EM. Fixing the bug becomes part of the story.
You can read our guide to diagnosing issues in Fleet on the debugging page.
All bugs in Fleet are tracked by QA on the bugs board in ZenHub.
The lifecycle stages of a bug at Fleet are:
The above are all the possible states for a bug as envisioned in this process. These states each correspond to a set of GitHub labels, assignees, and boards.
See Bug states and filters at the end of this document for descriptions of these states and links to each GitHub filter.
Quickly reproducing bug reports is a priority for Fleet.
When a new bug is created using the bug report form, it is in the "inbox" state.
In this state, the bug review DRI (QA) is responsible for going through the inbox and documenting reproduction steps, asking the reporter for more reproduction details, or asking the product team for more guidance. QA has one business day to move the bug to the next step (reproduced).
For community-reported bugs, this may require QA to gather more information from the reporter. QA should reach out to the reporter if more information is needed to reproduce the issue. Reporters are encouraged to provide timely follow-up information for each report. If two weeks have passed since the last communication, QA will ping the reporter for a status update. After four weeks without a response, QA will close the issue. Reporters are welcome to re-open the closed issue if more investigation is warranted.
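The follow-up timeline above can be sketched as a small helper. This is illustrative only; `staleBugAction` is a hypothetical function, not part of any Fleet tooling.

```go
package main

import "fmt"

// staleBugAction returns QA's next step for a community-reported
// bug, given the number of days since the last communication with
// the reporter: ping at two weeks, close at four weeks.
// (Hypothetical helper for illustration.)
func staleBugAction(daysSinceLastContact int) string {
	switch {
	case daysSinceLastContact >= 28:
		return "close issue (reporter may re-open)"
	case daysSinceLastContact >= 14:
		return "ping reporter for status"
	default:
		return "wait"
	}
}

func main() {
	fmt.Println(staleBugAction(10)) // wait
	fmt.Println(staleBugAction(15)) // ping reporter for status
	fmt.Println(staleBugAction(30)) // close issue (reporter may re-open)
}
```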
Once reproduced, QA documents the reproduction steps in the description and moves it to the reproduced state. If QA or the engineering manager feels the bug report may be expected behavior, or if clarity is required on the intended behavior, it is assigned to the group's product manager. See on GitHub.
QA has a weekly check-in with product to go over the inbox items. QA is responsible for proposing “not a bug”, closing due to lack of response (with a polite message), or raising other relevant questions. All of these require product agreement.
QA may also propose that a reported bug is not actually a bug. A bug is defined as “behavior that is not according to spec or implied by spec.” If agreed that it is not a bug, then it's assigned to the relevant product manager to determine its priority.
QA has reproduced the issue successfully. It should now be transferred to engineering.
Remove the “reproduce” label, add the label of the relevant team (e.g. #g-cx, #g-mdm, #g-infra, #g-website), and assign it to the relevant engineering manager. (Make your best guess as to which team. The EM will re-assign if they think it belongs to another team.) See on GitHub.
Fleeties do not have to wait for QA to reproduce the bug. If you're confident it's reproducible, it's a bug, and the reproduction steps are well-documented, it can be moved directly to the reproduced state.
If a bug requires input from product, the :product label is added, the bug is assigned to the product group's PM, and it is moved to the "Product drafting" column of the bugs board. It will stay in this state until product closes the bug, or removes the :product label and assigns it to an EM.
A bug is in engineering after it has been reproduced and assigned to an EM. If a bug meets the criteria for a critical bug, the ~critical bug label is added, and it is moved to the "Current release" column of the bugs board. If the bug is a ~critical bug, the EM follows the critical bug notification process.
If the bug does not meet the criteria of a critical bug, the EM will determine if there is capacity in the current sprint for this bug. If so, the :release label is added, and it is moved to the "Current release" column on the bugs board. If there is no available capacity in the current sprint, the EM will move the bug to the "Sprint backlog" column, where it will be prioritized for the next sprint.
When fixing the bug, if the proposed solution requires changes that would affect the user experience (UI, API, or CLI), notify the EM and PM to align on the acceptability of the change.
Fleet always prioritizes bugs into a release within six weeks. If a bug is not prioritized in the current release, and it is not prioritized in the next release, it is removed from the "Sprint backlog" and placed back in the "Product drafting" column with the :product label. Product will determine if the bug should be closed as accepted behavior, or if further drafting is necessary.
Bugs will be verified as fixed by QA when they are placed in the "Awaiting QA" column of the relevant product group's sprint board. If the bug is verified as fixed, it is moved to the "Ready for release" column of the sprint board. Otherwise, the remaining issues are noted in a comment, and it is moved back to the "In progress" column of the sprint board.
This filter returns all "bug" issues opened after the specified date. Replace the date with the YYYY-MM-DD date one week ago. See on GitHub.
This filter returns all "bug" issues closed after the specified date. Replace the date with the YYYY-MM-DD date one week ago. See on GitHub.
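As an illustration (the exact saved filters live in GitHub; the date below is a placeholder), the GitHub search syntax behind these two filters looks like:

```
is:issue label:bug created:>=2023-01-01
is:issue label:bug closed:>=2023-01-01
```

The first query returns bugs opened since the given date, and the second returns bugs closed since that date.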
When a release is in testing, QA should use the Slack channel #help-qa to keep everyone aware of issues found. All bugs found should be reported in the channel after the bug issue is first created.
When a critical bug is found, the Fleetie who labels the bug as critical is responsible for following the critical bug notification process below.
All unreleased bugs are addressed before publishing a release. Released bugs that are not critical may be addressed during the next release per the standard bug process.
Product may add the ~release blocker label to user stories to indicate that the story must be completed to publish the next version of Fleet. Bugs are never labeled as release blockers.
A critical bug is a bug with the ~critical bug label. A critical bug is defined as behavior that:
- Blocks the normal use of a workflow
- Prevents upgrades to Fleet
- Causes irreversible damage, such as data loss
- Introduces a security vulnerability
We need to inform customers and the community about critical bugs immediately so they don’t trigger it themselves. When a bug meeting the definition of critical is found, the bug finder is responsible for raising an alarm. Raising an alarm means pinging @here in the #help-product channel with the filed bug.
If the bug finder is not a Fleetie (e.g., a member of the community), then whoever sees the critical bug should raise the alarm. (We would expect this to be customer experience in the community Slack or QA in the bug inbox, though it could be anyone.) Note that the bug finder here is NOT necessarily the first person who sees the bug. If you come across a bug you think is critical, but it has not been escalated, raise the alarm!
Once raised, product confirms whether or not it's critical and defines expected behavior. If it is outside of working hours for the product team, or if no one from product responds within 1 hour, fall back to the #help-p1 channel.
Once the critical bug is confirmed, customer experience needs to ping both customers and the community to warn them. If CX is not available, the on-call engineer is responsible for doing this. If a quick fix or workaround exists, that should be communicated as well for those who have already upgraded.
When a critical bug is identified, we will then follow the patch release process in our documentation.
We track the success of this process by observing the throughput of issues through the system and identifying where buildups (and therefore bottlenecks) are occurring. The metrics are:
- Number of bugs opened this week
- Total # bugs open
- Bugs in each state (inbox, acknowledged, reproduced)
- Number of bugs closed this week
Each week these are tracked and shared in the weekly KPI sheet by Luke Heath.
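A minimal sketch of how these metrics relate (hypothetical numbers; Fleet's actual KPIs live in a spreadsheet, and the class and field names below are illustrative, not part of any Fleet tooling):

```python
from dataclasses import dataclass, field

@dataclass
class WeeklyBugKPIs:
    opened: int              # number of bugs opened this week
    closed: int              # number of bugs closed this week
    total_open: int          # total bugs open at end of week
    by_state: dict = field(default_factory=dict)  # e.g. {"inbox": 4, "reproduced": 7}

    def net_change(self) -> int:
        # Positive means more bugs came in than went out: a buildup,
        # and therefore a potential bottleneck in the process.
        return self.opened - self.closed

week = WeeklyBugKPIs(opened=12, closed=9, total_open=34,
                     by_state={"inbox": 4, "acknowledged": 2, "reproduced": 7})
print(week.net_change())  # 3: the open-bug count grew by three this week
```

Tracking the per-state counts alongside the net change shows not just that a buildup is occurring, but at which stage of the lifecycle it is happening.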
In the above process, any reference to "product" refers to Mo Zhu, Head of Product. Any reference to "QA" refers to Reed Haynes, Product Quality Specialist.
The infrastructure product group is responsible for deploying, supporting, and maintaining all Fleet-managed cloud deployments.
The following are quick links to infrastructure-related README files in both public and private repos that can be used as a quick reference for infrastructure-related code:
The infrastructure team follows industry best practices when designing and deploying infrastructure. For containerized infrastructure, Google has published a best practices document that serves as the reference for these practices.
Many of these practices must be implemented in Fleet directly, and engineering will work to ensure that feature implementation follows these practices. The infrastructure team will make itself available to provide guidance as needed. If a feature is not compatible with these practices, an issue will be created with a request to correct the implementation.
The 24/7 on-call (aka infrastructure on-call) is responsible for alarms related to fleetdm.com, Fleet sandbox, and Fleet managed cloud, as well as delivering 24/7 support for Fleet Ultimate customers. The infrastructure (24/7) on-call responsibility rotates in shifts of one week. The people in the rotation are:
- Zachary Winnerman
- Robert Fairburn
Escalations (in order):
- Luke Heath
- Zach Wasserman (Fleet app)
- Eric Shaw (fleetdm.com)
- Mike McNeil
The first responder on-call will take ownership of the @infrastructure-oncall alias in Slack first thing Monday morning. The previous week's on-call will provide a summary in the #g-infra Slack channel with an update on alarms that came up the week before, open issues with or without direct end-user impact, and other issues to keep an eye out for.
Expected response times: 1 hour during business hours, and less than 4 hours outside of business hours.
For fleetdm.com and sandbox alarms, if the issue is not user-facing (e.g., provisioner or deprovisioner errors, temporary errors in osquery), the on-call engineer will proceed to address the issue. If the issue is user-facing (e.g., the user noticed the error first-hand through the Fleet UI), the on-call engineer will identify the user and contact them to let them know that we are aware of the issue and working on a resolution. They may also request more information from the user if it is needed. They will cc the EM and PM of the #g-infra group on any user correspondence.
For Fleet managed cloud alarms that are user-facing, the first responder should collect the email address of the customer and all available information on the error. If the error occurs during business hours, the first responder should make their best effort to understand where in the app the error might have occurred. Assistance can be requested in #help-engineering by sharing the known details of the issue; when available, a frontend or backend engineer can help identify what might be causing the problem. If the error occurs outside of business hours, the on-call engineer will contact the user to let them know that we are aware of the issue and working on a resolution. It’s more helpful to say something like “we saw that you received an error while trying to create a query” than “your POST /api/blah failed”.
Escalation of issues will be done manually by the first responder according to the escalation contacts mentioned above. An outage issue (template available) should be created in the Fleet confidential repo addressing:
- Who was affected and for how long?
- What expected behavior occurred?
- How do you know?
- What near-term resolution can be taken to recover the affected user?
- What is the underlying reason or suspected reason for the outage?
- What are the next steps Fleet will take to address the root cause?
All infrastructure alarms (fleetdm.com, Fleet managed cloud, and sandbox) will go to #help-p1.
The information needed to evaluate and potentially fix any issues is documented in the runbook.
When an infrastructure on-call engineer is out of the office, Zach Wasserman will serve as a backup to on-call in #help-p1. All absences must be communicated in advance to Luke Heath and Zach Wasserman.
Engineering is responsible for managing third-party accounts required to support engineering infrastructure.
We use the official Fleet Apple developer account to notarize installers we generate for Apple devices. Whenever Apple releases new terms of service, we are unable to notarize new packages until the new terms are accepted.
When this occurs, we will begin receiving the following error message when attempting to notarize packages: "You must first sign the relevant contracts online." To resolve this error, follow the steps below.
1. Visit the Apple developer account login page.
2. Log in using the credentials stored in 1Password under "Apple developer account".
3. Contact the Head of Business Operations to determine which phone number to use for 2FA.
4. Complete the 2FA process to log in.
5. Accept the new terms of service.
The following rituals are performed by the directly responsible individual (DRI) at the frequency specified for each ritual.
| Ritual | Frequency | Description | DRI |
|---|---|---|---|
| Pull request review | Daily | Engineers go through pull requests for which their review has been requested. | Luke Heath |
| Engineering group discussions | Weekly | See "Group Weeklies". | Zach Wasserman |
| On-call handoff | Weekly | Hand off the on-call engineering responsibilities to the next on-call engineer. | Luke Heath |
| Vulnerability alerts (fleetdm.com) | Weekly | Review and remediate or dismiss vulnerability alerts for the fleetdm.com codebase on GitHub. | Eric Shaw |
| Vulnerability alerts (frontend) | Weekly | Review and remediate or dismiss vulnerability alerts for the Fleet frontend codebase (and related JS) on GitHub. | Zach Wasserman |
| Vulnerability alerts (backend) | Weekly | Review and remediate or dismiss vulnerability alerts for the Fleet backend codebase (and all Go code) on GitHub. | Zach Wasserman |
| Freeze ritual | Every three weeks | Go through the process of freezing the | |
| Release ritual | Every three weeks | Go through the process of releasing the next iteration of Fleet. | Luke Heath |
| Create patch release branch | Every patch release | Go through the process of creating a patch release branch, cherry-picking commits, and pushing the branch to github.com/fleetdm/fleet. | Luke Heath |
| Bug review | Weekly | Review bugs that are in QA's inbox. | Reed Haynes |
| QA report | Every three weeks | Every release cycle, on the Monday of release week, the DRI for the release ritual is updated on the status of testing. | Reed Haynes |
| Release QA | Every three weeks | Every release cycle, by end of day Friday of release week, all issues move to "Ready for release" on the #g-mdm and #g-cx sprint boards. | Reed Haynes |
The following Slack channels are maintained by this group: