This is needed if we wish to change the size of the resource structures.
Based on an original patch from Vivek Goyal <vgoyal@in.ibm.com>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.
The filesystem is then required to manually set the superblock and root dentry
pointers. For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).
The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.
This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing. In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.
The patch also makes the following changes:
(*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
pointer argument and return an integer, so most filesystems have to change
very little.
(*) If one of the convenience function is not used, then get_sb() should
normally call simple_set_mnt() to instantiate the vfsmount. This will
always return 0, and so can be tail-called from get_sb().
(*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
dcache upon superblock destruction rather than shrink_dcache_anon().
This is required because the superblock may now have multiple trees that
aren't actually bound to s_root, but that still need to be cleaned up. The
currently called functions assume that the whole tree is rooted at s_root,
and that anonymous dentries are not the roots of trees which results in
dentries being left unculled.
However, with the way NFS superblock sharing are currently set to be
implemented, these assumptions are violated: the root of the filesystem is
simply a dummy dentry and inode (the real inode for '/' may well be
inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
with child trees.
[*] Anonymous until discovered from another tree.
(*) The documentation has been adjusted, including the additional bit of
changing ext2_* into foo_* in the documentation.
[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Documentation/infiniband/core_locking.txt says:
All of the methods in struct ib_device exported by a low-level
driver must be fully reentrant. The low-level driver is required to
perform all synchronization necessary to maintain consistency, even
if multiple function calls using the same object are run
simultaneously.
However, mthca's modify_qp, modify_srq and resize_cq methods are
currently not reentrant. Add a mutex to the QP, SRQ and CQ structures
so that these calls can be properly serialized.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Some error paths after the mthca_alloc_mailbox() call in mthca_modify_qp()
just do a "return -EINVAL" without freeing the mailbox. Convert these
returns to "goto out" to avoid leaking the mailbox storage.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Report the true max_map_per_fmr value from mthca_query_device(),
taking into account the change in FMR remapping introduced by the
Sinai performance optimization.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change the mthca snoop of MADs that set PortInfo to check if the SM
has set the client reregister bit, and if it has, generate a client
reregister event. If the bit is not set, just generate a LID change
event as usual.
Signed-off-by: Leonid Arsh <leonida@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move ipath's struct port_info into <rdma/ib_smi.h>, so that it can be
used by mthca to implement client reregister support.
Remove the __attribute__((packed)) because all the members of the struct
are naturally aligned anyway.
Signed-off-by: Leonid Arsh <leonida@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The kernel has had wait_for_completion_timeout() for a long time now.
mthca should use it to handle FW commands timing out, instead of
implementing the same thing in a much more complicated way by using
wait_for_completion() along with a timer that does complete().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Memfree firmware is in rare cases reporting WQE index == base - 1 in
receive completion with error, instead of (rq size - 1); base is 0 in
mthca. Here is a patch to avoid kernel crash and report a correct WR
id in this case.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mthca does not restore the following PCI-X/PCI Express registers after reset:
PCI-X device: PCI-X command register
PCI-X bridge: upstream and downstream split transaction registers
PCI Express : PCI Express device control and link control registers
This causes instability and/or bad performance on systems where one of
these registers is set to a non-default value by BIOS.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If we post a list of length exactly a multiple of 256, nreq in
doorbell gets set to 256 which is wrong: it should be encoded by 0.
This is because we only zero it out on the next WR, which may not be
there. The solution is to ring the doorbell after posting a WQE, not
before posting the next one.
This is the same bug that we just fixed for QPs with non-shared RQ.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
At this point, the core QP structure hasn't been initialized, so what's
in there isn't valid. Get the same information elsewhere.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The problem was that node A's sending thread, which handles sending RDMA
read response data, would write the trigger word, the last packet would
be sent, node B would send a new RDMA read request, node A's interrupt
handler would initialize s_rdma_sge, then node A's sending thread would
update s_rdma_sge. This didn't happen very often naturally but was more
frequent with 1 byte RDMA reads. Rather than adding more locking or
increasing the QP structure size and copying sge data, I modified the
copy routine to update the pointers before writing the trigger word to
avoid the update race.
Signed-off-by: Ralph Campbell <ralphc@pathscale.com>
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fixed so it works on the PE-800. It had not previously been updated to
match PE-800 receive interrupt differences from HT-400.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This is required for even semi-decent performance on OpenIB.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix NULL deref due to pcidev being clobbered before dd->ipath_f_cleanup()
was called.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix the interface version that gets exported to userspace.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Make sure modify_qp won't modify the QP if any of the changes failed.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The local loopback path for RC can lock the rkey table lock without
blocking interrupts. The receive interrupt path can then call
ipath_rkey_ok() and deadlock. Remove the redundant lock.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If we post a list of length 256 exactly, nreq in doorbell gets set to
256 which is wrong: it should be encoded by 0. This is because we
only zero it out on the next WR, which may not be there. The solution
is to ring the doorbell after posting a WQE, not before posting the
next one.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Setting fw_cmd_doorbell allows FW command to be queued using posted
writes instead of requiring polling on a "go" bit, so it should be a
performance boost. However, the option causes problems with at least
some device/firmware combinations, so set the default to 0 until we
understand what's going on better.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ipath driver's table of PCI IDs needs a { 0, } entry at the end.
This makes all of the device aliases visible to userspace so hotplug
loads the module for all supported devices. Without the patch,
modinfo ipath_core only shows:
alias: pci:v00001FC1d0000000Dsv*sd*bc*sc*i*
instead of the correct:
alias: pci:v00001FC1d00000010sv*sd*bc*sc*i*
alias: pci:v00001FC1d0000000Dsv*sd*bc*sc*i*
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Addresses for ioremap must be calculated off of pci_resource_start;
we can't directly use the bus address as seen by the HCA. Fix the
code that remaps device memory for FMR access.
Based on patch by Klaus Smolin.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix races in in destroying various objects. If a destroy routine
waits for an object to become free by doing
wait_event(&obj->wait, !atomic_read(&obj->refcount));
/* now clean up and destroy the object */
and another place drops a reference to the object by doing
if (atomic_dec_and_test(&obj->refcount))
wake_up(&obj->wait);
then this is susceptible to a race where the wait_event() and final
freeing of the object occur between the atomic_dec_and_test() and the
wake_up(). And this is a use-after-free, since wake_up() will be
called on part of the already-freed object.
Fix this in mthca by replacing the atomic_t refcounts with plain old
integers protected by a spinlock. This makes it possible to do the
decrement of the reference count and the wake_up() so that it appears
as a single atomic operation to the code waiting on the wait queue.
While touching this code, also simplify mthca_cq_clean(): the CQ being
cleaned cannot go away, because it still has a QP attached to it. So
there's no reason to be paranoid and look up the CQ by number; it's
perfectly safe to use the pointer that the callers already have.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Names that are the opposite of their intended meanings are not so helpful.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The reset code now turns off the PRESENT flag during a reset, so that
other code won't attempt to access a device that's in mid-reset.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remember when the verbs layer unregisters from the lower-level code.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Different ipath hardware types have different numbers of buffers
available, so we decide on the counts ourselves unless we are specifically
overridden with a module parameter.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Some systems do not set up 64-bit maps on systems with 2GB or less of
memory installed, so we have to fall back to trying a 32-bit setup.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We were accidentally exposing the "reset" sysfs file more than once
per device.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
GuidInfo records have 8 byte GUIDs in them, so an index should be
multiplied by 8 to get an offset. mthca_query_gid() was incorrectly
multiplying by 16.
Noticed by Leonid Keller <leonid@mellanox.co.il>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch makes the needlessly global mthca_update_rate() static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The driver allocates SRQ WQEs size with a power of 2 size both for
Tavor and for memfree. For Tavor, however, the hardware only requires
the WQE size to be a multiple of 16, not a power of 2, and the max
number of scatter-gather allowed is reported accordingly by the
firmware (and this is the value currently returned by
ib_query_device() and ibv_query_device()).
If the max number of scatter/gather entries reported by the FW is used
when creating an SRQ, the creation will fail for Tavor, since the
required WQE size will be increased to the next power of 2, which
turns out to be larger than the device permitted max WQE size (which
is not a power of 2).
This patch reduces the reported SRQ max wqe size so that it can be used
successfully in creating an SRQ on Tavor HCAs.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The PCI spec recommends against drivers playing with a device's PCI
read burst size, and says that systems software should configure it.
And we actually have users that report that changing it from the
default set by BIOS hurts performance and/or stability for them. On
the other hand, the Mellanox Programmer's Reference Manual recommends
turning it up all the way to the maximum value. Some tests conducted
here in the lab do not show performance improvement from this tuning,
but this might be just me.
As a work-around, make this tuning an option, off by default (safe
value), with an eye towards removing it completely one day if no one
complains.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Push translation of static rate to HCA format into low-level drivers,
where it belongs. For static rate encoding, use encoding of rate
field from IB standard PathRecord, with addition of value 0, for
backwards compatibility with current usage. The changes are:
- Add enum ib_rate to midlayer includes.
- Get rid of static rate translation in IPoIB; just use static rate
directly from Path and MulticastGroup records.
- Update mthca driver to translate absolute static rate into the
format used by hardware. This also fixes mthca's static rate
handling for HCAs that are capable of 4X DDR.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change the mthca debugging trace output code so that it can enabled
and disabled at runtime with the debug_level module parameter in
sysfs. Also, don't allow CONFIG_INFINIBAND_MTHCA_DEBUG to be disabled
unless CONFIG_EMBEDDED is selected. We want users (and especially
distros) to have this turned on unless they really need to save space,
because by the time we want debugging output, it's usually too late to
rebuild a kernel.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Integrate the ipath core and OpenIB drivers into the kernel build
infrastructure. Add entry to MAINTAINERS.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ipath_verbs.c file implements the driver-specific components of the
kernel's Infiniband verbs layer.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>