The health mechanism is targeted at real-time alerting, so that the user knows
when something bad has happened to a PCI device. It aims to:
- Provide alert debug information
- Enable self healing
- If the problem needs vendor support, provide a way to gather all needed
  debugging information.

The main idea is to unify and centralize driver health reports in the
generic devlink instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The devlink health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, FW error, RX/TX
error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by devlink.
A device driver can provide specific callbacks for each "health reporter",
such as:
 - Recovery procedures
 - Diagnostics and object dump procedures
 - OOB (out-of-box) initial parameters
Different parts of the driver can register different types of health reporters
with different handlers.
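
The reporter concept described above can be modelled in a few lines of Python.
This is an illustrative sketch only, not the kernel's C API: the
`HealthReporter` and `Device` classes, their field names and the reporter names
"tx" and "fw" are all hypothetical. It only shows that each reporter carries
its own optional callbacks and initial parameters.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class HealthReporter:
    """Hypothetical model of one reporter; not the kernel structure."""
    name: str                                       # error/health type
    recover: Optional[Callable[[], bool]] = None    # recovery procedure
    diagnose: Optional[Callable[[], dict]] = None   # diagnostics procedure
    dump: Optional[Callable[[], dict]] = None       # object dump procedure
    grace_period_ms: int = 0                        # initial parameter (assumed name)
    auto_recover: bool = True                       # initial parameter (assumed name)


class Device:
    """Hypothetical stand-in for a devlink instance holding reporters."""

    def __init__(self) -> None:
        self.reporters: Dict[str, HealthReporter] = {}

    def create_reporter(self, reporter: HealthReporter) -> HealthReporter:
        # One reporter per error/health type.
        if reporter.name in self.reporters:
            raise ValueError(f"reporter {reporter.name!r} already exists")
        self.reporters[reporter.name] = reporter
        return reporter


dev = Device()
# Different parts of a driver register different reporters with different handlers.
dev.create_reporter(HealthReporter("tx", recover=lambda: True, grace_period_ms=500))
dev.create_reporter(HealthReporter("fw", diagnose=lambda: {"state": "ok"},
                                   dump=lambda: {"regs": [0, 1]},
                                   auto_recover=False))
```

Callbacks left as `None` represent operations the driver chose not to
implement for that reporter.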

Once an error is reported, devlink health performs the following actions:
  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored)
  * An auto recovery attempt is made, depending on:
    - The auto-recovery configuration
    - The grace period vs. the time passed since the last recovery
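
The sequence of actions above can be sketched as a small Python model. It is
not the kernel implementation: the `Reporter` class and its attribute names are
hypothetical, a plain list stands in for the trace events buffer, and the
sketch assumes that auto recovery is skipped when the last successful recovery
happened less than one grace period ago.

```python
import time


class Reporter:
    """Hypothetical model of how a health report is handled."""

    def __init__(self, recover=None, dump=None, auto_recover=True,
                 grace_period_s=0.0):
        self.recover = recover
        self.dump = dump
        self.auto_recover = auto_recover
        self.grace_period_s = grace_period_s
        self.trace = []            # stands in for the trace events buffer
        self.error_count = 0
        self.recover_count = 0
        self.healthy = True
        self.saved_dump = None
        self.last_recovery = None  # time of the last successful recovery

    def report(self, msg, now=None):
        """Handle one error report; return True if auto recovery succeeded."""
        now = time.monotonic() if now is None else now
        self.trace.append(msg)                      # log the event
        self.error_count += 1                       # update status and statistics
        self.healthy = False
        if self.saved_dump is None and self.dump:   # keep only the first dump
            self.saved_dump = self.dump()
        if not self.auto_recover or self.recover is None:
            return False
        if (self.last_recovery is not None
                and now - self.last_recovery < self.grace_period_s):
            return False                            # still inside the grace period
        if self.recover():
            self.last_recovery = now
            self.recover_count += 1
            self.healthy = True
            return True
        return False


r = Reporter(recover=lambda: True, dump=lambda: {"regs": [1, 2]},
             grace_period_s=5.0)
first = r.report("tx timeout", now=100.0)   # no earlier recovery: recovers
second = r.report("tx timeout", now=102.0)  # 2 s after recovery: skipped
third = r.report("tx timeout", now=106.0)   # 6 s after recovery: recovers
```

Passing `now` explicitly keeps the example deterministic; without it the model
falls back to `time.monotonic()`.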

The user interface:
The user can access/change each reporter's parameters and driver-specific
callbacks via devlink, e.g. per error type (per health reporter):
 - Configure the reporter's generic parameters (e.g. disable/enable auto
   recovery)
 - Invoke the recovery procedure
 - Run diagnostics
 - Object dump
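
From user space these operations are typically driven with the iproute2
`devlink health` command. The Python helper below only builds the argument
lists and does not run anything. The helper name `health_cmd`, the device
`pci/0000:82:00.0` and the reporter `tx` are placeholders chosen for
illustration; the subcommands and parameter names are assumed to match the
installed iproute2 version and should be checked against its manual page.

```python
# Map a logical action to the words that follow "devlink health".
ACTIONS = {
    "show": ["show"],
    "set": ["set"],
    "recover": ["recover"],
    "diagnose": ["diagnose"],
    "dump_show": ["dump", "show"],
    "dump_clear": ["dump", "clear"],
}


def health_cmd(action, dev, reporter, **params):
    """Build an argv list for a `devlink health` invocation (not executed)."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action {action!r}")
    if params and action != "set":
        raise ValueError("parameters are only accepted by 'set'")
    argv = ["devlink", "health", *ACTIONS[action], dev, "reporter", reporter]
    for key, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        argv += [key, str(value)]
    return argv


DEV = "pci/0000:82:00.0"  # placeholder device handle

disable_auto = health_cmd("set", DEV, "tx", auto_recover=False)
set_grace = health_cmd("set", DEV, "tx", grace_period=500)
recover = health_cmd("recover", DEV, "tx")
diagnose = health_cmd("diagnose", DEV, "tx")
dump_show = health_cmd("dump_show", DEV, "tx")
```

A list such as `recover` could then be passed to `subprocess.run()` on a
system where the tool and a matching device are present.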

The devlink health interface (via netlink):
DEVLINK_CMD_HEALTH_REPORTER_GET
  Retrieves status and configuration info per DEV and reporter.
DEVLINK_CMD_HEALTH_REPORTER_SET
  Allows reporter-related configuration setting.
DEVLINK_CMD_HEALTH_REPORTER_RECOVER
  Triggers a reporter's recovery procedure.
DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
  Retrieves diagnostics data from a reporter on a device.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
  Retrieves the last stored dump. Devlink health
  saves a single dump. If a dump is not already stored by devlink
  for this reporter, devlink generates a new dump.
  Dump output is defined by the reporter.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
  Clears the last saved dump file for the specified reporter.
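
The single-dump behaviour of DUMP_GET and DUMP_CLEAR can be illustrated with a
short Python model. The `DumpStore` class is hypothetical and is not part of
devlink; the sketch assumes that a dump generated on demand by DUMP_GET is then
kept as the saved dump until it is cleared.

```python
class DumpStore:
    """Hypothetical model of the one-dump-per-reporter storage."""

    def __init__(self, dump_cb):
        self._dump_cb = dump_cb   # reporter-defined dump procedure
        self._saved = None        # at most one dump is stored

    def dump_get(self):
        # Return the stored dump, generating one first if none exists.
        if self._saved is None:
            self._saved = self._dump_cb()
        return self._saved

    def dump_clear(self):
        # Drop the stored dump so the next dump_get() generates a new one.
        self._saved = None


calls = []


def take_dump():
    """Example dump procedure; records how often it was invoked."""
    calls.append(len(calls))
    return {"sequence": len(calls)}


store = DumpStore(take_dump)
d1 = store.dump_get()   # nothing stored yet: a new dump is generated
d2 = store.dump_get()   # the stored dump is returned again
store.dump_clear()
d3 = store.dump_get()   # storage was cleared: a fresh dump is generated
```

The dump procedure is only invoked when no dump is stored, which is why `d1`
and `d2` refer to the same object here.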


                                                   netlink
                                         +--------------------------+
                                         |                          |
                                         |            +             |
                                         |            |             |
                                         +--------------------------+
                                                      |request for ops
                                                      |(diagnose,
    mlx5_core                              devlink    |recover,
                                                      |dump)
    +--------+                           +--------------------------+
    |        |                           |    reporter|             |
    |        |                           |  +---------v----------+  |
    |        |   ops execution           |  |                    |  |
    |    <----------------------------------+                    |  |
    |        |                           |  |                    |  |
    |        |                           |  + ^------------------+  |
    |        |                           |    | request for ops     |
    |        |                           |    | (recover, dump)     |
    |        |                           |    |                     |
    |        |                           |  +-+------------------+  |
    |        |     health report         |  | health handler     |  |
    |        +------------------------------>                    |  |
    |        |                           |  +--------------------+  |
    |        |     health reporter create|                          |
    |        +--------------------------->                          |
    +--------+                           +--------------------------+