Commit Graph

5 Commits

Author SHA1 Message Date
Jack Morgenstein
b3b30f5e8a IB/mthca: Recover from catastrophic errors
Trigger device remove and then add when a catastrophic error is
detected in hardware.  This, in turn, will cause a device reset, which
we hope will recover from the catastrophic condition.

Since this might interefere with debugging the root cause, add a
module option to suppress this behaviour.

Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2006-09-22 15:22:54 -07:00
Linus Torvalds
78b9c0f91c Merge branch 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband 2005-11-10 13:27:06 -08:00
Roland Dreier
0b4ff2c0e6 [IB] mthca: fix typo in catastrophic error polling
Fix a typo in the rearming of the catastrophic error polling timer: we
should rearm the timer as long as the stop flag is _not_ set.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2005-11-10 10:22:50 -08:00
Tim Schmielau
8c65b4a604 [PATCH] fix remaining missing includes
Fix more include file problems that surfaced since I submitted the previous
fix-missing-includes.patch.  This should now allow not to include sched.h
from module.h, which is done by a followup patch.

Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:41 -08:00
Roland Dreier
3d155f8cd0 [IB] mthca: first pass at catastrophic error reporting
Add some initial support for detecting and reporting catastrophic
errors reported by Mellanox HCAs.  We start a periodic timer which
polls the catastrophic error reporting buffer in device memory.  If an
error is detected, we dump the contents of the buffer for port-mortem
debugging, and report a fatal asynchronous error to higher levels.

In the future we can try to recover from these errors by resetting the
device, but this will require some work in higher-level code as well.
Let's get this in now, so that we at least get catastrophic errors
reported in logs.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2005-10-27 11:03:38 -07:00