When ipoib_stop() is called it first calls netif_stop_queue() to stop
the kernel from passing more packets to the network driver. However,
the completion handler may call netif_wake_queue() re-enabling packet
transfer.
This might result in leaks (we see AH leaks which we think can be
attributed to this bug) as new packets get posted while the interface
is going down.
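
One way to close this window is to have the completion handler check
an "interface is up" flag before it wakes the queue. The user-space
sketch below models that idea only; struct sketch_dev and the sketch_*
functions are hypothetical stand-ins and are not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's private state. */
struct sketch_dev {
	bool admin_up;            /* cleared when the interface is stopped */
	bool queue_stopped;       /* models the netif queue state */
	unsigned tx_outstanding;  /* sends posted but not yet completed */
};

/* Models the stop path: mark the interface down, then stop the queue. */
static void sketch_stop(struct sketch_dev *dev)
{
	dev->admin_up = false;
	dev->queue_stopped = true;
}

/*
 * Models a send completion. The queue is only re-enabled while the
 * interface is still administratively up, so a completion that arrives
 * after sketch_stop() cannot restart packet transfer.
 */
static void sketch_tx_complete(struct sketch_dev *dev)
{
	assert(dev->tx_outstanding > 0);
	dev->tx_outstanding--;

	if (dev->queue_stopped && dev->admin_up)
		dev->queue_stopped = false;
}
```

In this model a completion that runs after sketch_stop() still
accounts for the finished send but leaves the queue stopped.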
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When flushing out queued commands after a successful device reset,
make sure that SRP completes the right commands, instead of calling
scsi_done on the command passed into the device reset handler over and
over.
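
The pattern can be sketched in user space as follows. Each queued
request completes its own command and is unlinked from the queue; the
types and the sketch_flush_lun() helper are hypothetical and only
illustrate the loop shape, not the actual SRP data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical command: "done" counts how often it was completed. */
struct sketch_cmd {
	int lun;
	int done;
};

/* Hypothetical request on a singly linked pending queue. */
struct sketch_req {
	struct sketch_cmd *cmd;
	struct sketch_req *next;
};

/*
 * Complete and unlink every queued request for the given LUN.
 * The command completed is always req->cmd, never the command that
 * triggered the reset. Returns the number of requests flushed.
 */
static int sketch_flush_lun(struct sketch_req **queue, int lun)
{
	struct sketch_req **link = queue;
	int completed = 0;

	assert(queue != NULL);

	while (*link) {
		struct sketch_req *req = *link;

		if (req->cmd->lun == lun) {
			req->cmd->done++;    /* complete this request's command */
			*link = req->next;   /* remove it from the queue */
			req->next = NULL;
			completed++;
		} else {
			link = &req->next;
		}
	}
	return completed;
}
```

Requests for other LUNs stay on the queue untouched, and no command is
completed more than once.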
Signed-off-by: Ishai Rabinovitz <ishai@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a reconnection attempt fails, then SRP does two scsi_host_put()s.
This is a historical relic from an earlier version of the driver that
took a reference on the scsi_host before trying to reconnect, so get
rid of the extra scsi_host_put().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Sending a DREQ may fail, for example because the remote target has
already broken the connection. If so, then SRP should not wait for
the disconnection to complete, because it never will.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When deleting a child interface with a non-default P_Key via
/sys/class/net/ibX/delete_child, the interface must be freed with
free_netdev() (rather than kfree() on the private data).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a SCSI abort completes, or the command completes successfully, then
the driver must remove the command from its queue of pending
commands. Similarly, if a device reset succeeds, then all commands
queued for the given device must be removed from the queue.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a SCSI abort succeeds, then the aborted request should be
removed from the list of pending requests. This fixes list corruption
after an abort occurs.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We know ipoib_flush_paths() is called from plain process context with
interrupts enabled, since it does wait_for_completion(). So there's
no need to use spin_lock_irqsave() -- spin_lock_irq() is fine.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ib_sa_cancel_query() must be called with priv->lock held since
a completion might arrive and set path->query to NULL.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Make IPoIB's send and receive queue sizes tunable via module
parameters ("send_queue_size" and "recv_queue_size"). This allows the
queue sizes to be enlarged to fix disastrously bad performance on some
platforms and workloads, without bloating memory usage when large
queues aren't needed.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_mcast_restart_task() might free an mcast object while a join
request is still outstanding, leading to an oops when the query
completes. Fix this by waiting for the query to complete, similar to what
ipoib_stop_thread() is doing. The wait for mcast completion code is
consolidated in wait_for_mcast_join().
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Push translation of static rate to HCA format into low-level drivers,
where it belongs. For static rate encoding, use encoding of rate
field from IB standard PathRecord, with addition of value 0, for
backwards compatibility with current usage. The changes are:
- Add enum ib_rate to midlayer includes.
- Get rid of static rate translation in IPoIB; just use static rate
  directly from Path and MulticastGroup records.
- Update mthca driver to translate absolute static rate into the
  format used by hardware. This also fixes mthca's static rate
  handling for HCAs that are capable of 4X DDR.
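
As a rough illustration of the split, the sketch below uses rate codes
that follow the PathRecord rate encoding (plus 0 for "use the port's
current rate") and converts them to multiples of 2.5 Gb/s. The
sketch_hw_rate() function and its 0..3 output format are invented for
this example and do not describe any real HCA's encoding.

```c
#include <assert.h>

/* Rate codes as in the PathRecord rate field; 0 is the added value. */
enum sketch_rate {
	SKETCH_RATE_PORT_CURRENT = 0,
	SKETCH_RATE_2_5_GBPS     = 2,
	SKETCH_RATE_10_GBPS      = 3,
	SKETCH_RATE_30_GBPS      = 4,
	SKETCH_RATE_5_GBPS       = 5,
	SKETCH_RATE_20_GBPS      = 6
};

/* Midlayer side: rate as a multiple of 2.5 Gb/s, or -1 if unspecified. */
static int sketch_rate_to_mult(enum sketch_rate rate)
{
	switch (rate) {
	case SKETCH_RATE_2_5_GBPS: return 1;
	case SKETCH_RATE_5_GBPS:   return 2;
	case SKETCH_RATE_10_GBPS:  return 4;
	case SKETCH_RATE_20_GBPS:  return 8;
	case SKETCH_RATE_30_GBPS:  return 12;
	default:                   return -1;
	}
}

/*
 * Low-level driver side (hypothetical hardware format):
 * 0 = full port speed, 1 = at least half, 2 = at least a quarter,
 * 3 = slower. port_mult is the port's own rate in 2.5 Gb/s units.
 */
static int sketch_hw_rate(enum sketch_rate rate, int port_mult)
{
	int mult = sketch_rate_to_mult(rate);

	assert(port_mult > 0);

	if (mult < 0 || mult >= port_mult)
		return 0;
	if (mult * 2 >= port_mult)
		return 1;
	if (mult * 4 >= port_mult)
		return 2;
	return 3;
}
```

Because the driver receives an absolute rate and knows its own port
speed, the same request maps to different hardware values on a 4X SDR
port (port_mult 4) and a 4X DDR port (port_mult 8).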
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Consolidate IPoIB's private neighbour data handling into
ipoib_neigh_alloc() and ipoib_neigh_free(). This will make it easier
to keep track of the neighbour structures that IPoIB is handling, and
is a nice cleanup of the code:
add/remove: 2/1 grow/shrink: 1/8 up/down: 100/-178 (-78)
function                      old     new   delta
ipoib_neigh_alloc               -      61     +61
ipoib_neigh_free                -      36     +36
ipoib_mcast_join_finish      1288    1291      +3
path_rec_completion           575     573      -2
ipoib_mcast_join_task         664     660      -4
ipoib_neigh_destructor        101      92      -9
ipoib_neigh_setup_dev          14       3     -11
ipoib_neigh_setup              17       -     -17
path_free                     238     215     -23
ipoib_mcast_free              329     306     -23
ipoib_mcast_send              718     684     -34
neigh_add_path                705     650     -55
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Don't allow CONFIG_INFINIBAND_IPOIB_DEBUG to be disabled unless
CONFIG_EMBEDDED is selected. We want users (and especially distros)
to have this turned on unless they really need to save space, because
by the time we want debugging output, it's usually too late to rebuild
a kernel. The debugging output can be controlled at runtime via the
debug_level module parameter in sysfs.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_hard_header() needs to handle the case that daddr is NULL. This
can happen when packets are injected via a raw socket, and IPoIB
shouldn't oops in this case.
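
A minimal user-space sketch of the defensive check, assuming a
simplified header layout: struct sketch_hdr and sketch_hard_header()
are hypothetical, and only the 20-byte IPoIB hardware address length
is taken from the real protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ALEN 20  /* IPoIB hardware address length in bytes */

/* Hypothetical header; the real driver's layout differs. */
struct sketch_hdr {
	uint16_t proto;
	uint8_t  has_daddr;
	uint8_t  daddr[SKETCH_ALEN];
};

/*
 * Fill in the header. daddr may be NULL (for example for a packet
 * injected through a raw socket); in that case no address is copied.
 * Returns the number of address bytes stored.
 */
static size_t sketch_hard_header(struct sketch_hdr *hdr, uint16_t proto,
				 const uint8_t *daddr)
{
	assert(hdr != NULL);

	memset(hdr, 0, sizeof(*hdr));
	hdr->proto = proto;

	if (daddr == NULL)
		return 0;

	memcpy(hdr->daddr, daddr, SKETCH_ALEN);
	hdr->has_daddr = 1;
	return SKETCH_ALEN;
}
```

The NULL test comes before any dereference of daddr, so the missing
address is reported to the caller instead of crashing.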
Reported by Anton Blanchard <anton@samba.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The recently merged patch to create a fake scatterlist for non-SG SCSI
commands had a bug: the driver ended up calling dma_unmap_sg() on
scmnd->request_buffer rather than on the fake scatterlist it created.
Fix this so that the driver unmaps the same thing it maps.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch causes the network interface to respond to P_Key change
events correctly. As a result, you'll see a child interface in the
"RUNNING" state (netif_carrier_on()) only when the corresponding P_Key
is configured by the SM. When SM removes a P_Key, the "RUNNING" state
will be disabled for the corresponding network interface. To
implement this, I added IB_EVENT_PKEY_CHANGE event handling. To
prevent flushing the device before the device is open by the "delay
open" mechanism, I added an additional device flag called
IPOIB_FLAG_INITIALIZED.
This also prevents the child network interface from trying to join
multicast groups until the PKEY is configured. We used to get error
messages like:
ib0.f2f2: couldn't attach QP to multicast group ff12:401b:f2f2:0:0:0:ffff:ffff
in this case. To fix this, I just check IPOIB_FLAG_OPER_UP flag in
ipoib_set_mcast_list().
Signed-off-by: Leonid Arsh <leonida@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
With the current IPoIB driver, the status of network interfaces stays
"RUNNING" even if the link goes down (for example because a cable is
unplugged). Fix this by flushing the IPoIB interface when the link
goes down.
Signed-off-by: Leonid Arsh <leonida@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Since the SCSI midlayer is moving towards entirely getting rid of
commands with use_sg == 0, we should treat this case as an exception.
Therefore, change the IB SRP initiator to create a fake scatterlist
for these commands with sg_init_one(). This simplifies the flow of
DMA mapping and unmapping, since SRP can just use dma_map_sg() and
dma_unmap_sg() unconditionally, rather than having to choose between
the dma_{map,unmap}_sg() and dma_{map,unmap}_single() variants.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_ib_dev_flush() should get passed cpriv->dev, not &cpriv->dev.
Signed-off-by: Leonid Arsh <leonida@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
struct neigh_ops currently has a destructor field, which no in-kernel
drivers outside of infiniband use. The infiniband/ulp/ipoib in-tree
driver stashes some info in the neighbour structure (the results of
the second-stage lookup from ARP results to real link-level path), and
it uses neigh->ops->destructor to get a callback so it can clean up
this extra info when a neighbour is freed. We've run into problems
with this: since the destructor is in an ops field that is shared
between neighbours that may belong to different net devices, there's
no way to set/clear it safely.
The following patch moves this field to neigh_parms where it can be
safely set, together with its twin neigh_setup. Two additional
patches in the patch series update ipoib to use this new interface.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In neigh_add_path(), the queue of delayed packets can never be full,
because the queue is always freshly created and cannot be found by any
other code path. In fact, the test of the queue length is worse than
useless: if somehow the test ever triggered and path_rec_start() also
failed, then dev_kfree_skb_any() will be called twice on the same skb.
Fix this by deleting the useless test. Pointed out by Michael
S. Tsirkin <mst@mellanox.co.il>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix leak found by Coverity: in the SRP_OPT_DGID case,
srp_parse_options() didn't free the result of match_strdup().
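
The general shape of such a fix is to release the duplicated string on
every exit path. The sketch below shows this in user space with
malloc()/free() standing in for match_strdup()/kfree();
sketch_parse_dgid() is a hypothetical helper, not the driver's parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse 32 hex digits into a 16-byte GID.
 * Returns 0 on success and -1 on malformed input or allocation
 * failure. The temporary copy is freed on every path via "out".
 */
static int sketch_parse_dgid(const char *arg, uint8_t gid[16])
{
	size_t len, i;
	char *p;
	int ret = -1;

	assert(arg != NULL && gid != NULL);

	len = strlen(arg);
	p = malloc(len + 1);     /* stands in for match_strdup() */
	if (p == NULL)
		return -1;
	memcpy(p, arg, len + 1);

	if (len != 32)
		goto out;
	for (i = 0; i < 32; i++)
		if (!isxdigit((unsigned char)p[i]))
			goto out;

	for (i = 0; i < 16; i++) {
		char byte[3] = { p[2 * i], p[2 * i + 1], '\0' };

		gid[i] = (uint8_t)strtoul(byte, NULL, 16);
	}
	ret = 0;
out:
	free(p);                 /* stands in for kfree() */
	return ret;
}
```

Funnelling the error paths through a single label keeps the free next
to the allocation it pairs with.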
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move ipoib_ib_dev_flush() to ipoib's workqueue. This keeps it ordered
with respect to other work scheduled by the ipoib driver. This fixes
problems with races, for example:
- ipoib_ib_dev_flush() has started running because of an IB event
- user does ifconfig ib0 down
- ipoib_mcast_stop_thread() gets called twice and waits for the same
  completion twice
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix the IPoIB build (which is broken in net-2.6.17 because of my
screw-up, which left out this chunk in ipoib_multicast.c).
The neighbour destructor is now in neigh_parms, so we don't
need to clear it in the ops structure.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add SCSI host attributes in sysfs that show the ID extension, IOC
GUID, service ID, P_Key and destination GID for each target port that
the SRP initiator connects to.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_mcast_stop_thread currently tests mcast->query and if it is
NULL, does not perform wait_for_completion on the mcast and frees the
mcast object directly.
However, since both operations are done without locking, it is
possible that ipoib_mcast_join_complete is in progress on this mcast
object and has set mcast->query to NULL already.
Solve this by:
- taking priv->lock before we change mcast->query in
  ipoib_mcast_join_complete, and keeping it until we no longer need
  the mcast object
- taking priv->lock around mcast->query test in ipoib_mcast_stop_thread
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If posting receives in ipoib_ib_dev_open() fails, call
ipoib_ib_dev_stop() to move the device's QP back to the RESET state so
that we can try again later.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_mcast_send() tests mcast->ah twice. If this value is changed
between these two points, we leak an skb. However,
ipoib_mcast_join_finish() sets mcast->ah with no locking, so it could
race against ipoib_mcast_send().
As a solution, take priv->lock around assignment to mcast->ah thus
making sure ipoib_mcast_send() (which also takes priv->lock) is not in
flight.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Cosmetic change: make alignment explicit in to_ipoib_neigh.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Even after the last fix, it's still possible for a send-only join to
start before the join for the broadcast group has finished. This
could cause us to create a multicast group using attributes from the
broadcast group that haven't been initialized yet, so we would use
garbage for the Q_Key, etc. Fix this by waiting until the broadcast
group's attached flag is set before starting send-only joins.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Further, there's an additional issue that I saw in testing:
ipoib_mcast_send may get called when priv->broadcast is NULL (e.g. if
the device was downed and then upped internally because of a port
event).
If this happens and the send-only join request gets completed before
priv->broadcast is set, we get an oops.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix the following race scenario:
- Device is up.
- Port event or set mcast list triggers ipoib_mcast_stop_thread,
  this cancels the query and waits on mcast "done" completion.
- Completion is called and "done" is set.
- Meanwhile, ipoib_mcast_send arrives and starts a new query,
  re-initializing "done".
Fix this by adding a "multicast started" bit and checking it before
starting a send-only join.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Convert srp_host->target_mutex from a semaphore to a mutex.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Avoid corrupting mcast->pkt_queue by serializing access with
priv->tx_lock. Also, update dropped packet statistics to count
multicast packets removed from pkt_queue as dropped.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The SA path record query completion can initialize path->pathrec.dlid
before IPoIB's callback runs and initializes path->ah, so we must test
ah rather than dlid.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Semaphore to mutex conversion by Ingo and Arjan's script.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[ Sanity-checked on real IB hardware ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The current handling of multicast groups in IPoIB ends up never
freeing send-only multicast groups. It turns out the logic was much
more complicated than it needed to be; we can fix this bug and
completely kill ipoib_mcast_dev_down() at the same time.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
dev->mc_list accesses must be protected by dev->xmit_lock.
Found by Eli Cohen <eli@mellanox.co.il>.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Multiple ipoib_neigh structures on mcast->neigh_list may point to the
same ah. This means that ipoib_mcast_free() can't just make a list of
ah structs to free, since this might end up trying to add the same ah
to the list more than once. Handle this in ipoib_multicast.c in the
same way as it is handled in ipoib_main.c for struct ipoib_path.
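
Reference counting is one common way to cope with several owners
pointing at the same object, and the sketch below models it in user
space. Whether this matches the handling in ipoib_main.c in detail is
an assumption; struct sketch_ah, struct sketch_neigh and the sketch_*
helpers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical address handle with a reference count. */
struct sketch_ah {
	int refs;   /* number of holders */
	int freed;  /* counts how often the handle was released */
};

/* Hypothetical neighbour entry that may share its ah with others. */
struct sketch_neigh {
	struct sketch_ah *ah;
};

/* Drop one reference; release the handle only when the last one goes. */
static void sketch_put_ah(struct sketch_ah *ah)
{
	assert(ah->refs > 0);
	if (--ah->refs == 0)
		ah->freed++;  /* real code would free or queue the ah here */
}

/*
 * Tear down all neighbours. Each one drops its own reference, so a
 * shared ah is released exactly once no matter how many neighbours
 * pointed at it, and no "to free" list has to be built.
 */
static void sketch_drop_neighs(struct sketch_neigh *neighs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (neighs[i].ah != NULL) {
			sketch_put_ah(neighs[i].ah);
			neighs[i].ah = NULL;
		}
	}
}
```

With two neighbours sharing one handle that holds two references, the
handle is released once, after the second put.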
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Don't leak memory on allocation failure for broadcast mcast group.
Also, print a warning to match handling for other mcast groups.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a node_guid field to struct ib_device. It is the responsibility
of the low-level driver to initialize this field before registering a
device with the midlayer. Convert everyone to looking at this field
instead of calling ib_query_device() when all they want is the node
GUID, and remove the node_guid field from struct ib_device_attr.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Include fixes for 2.6.14-git11. This should allow removing sched.h from
module.h on i386, x86_64, arm, ia64, ppc, ppc64, and s390. Probably more
to come since I haven't yet checked the other archs.
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
To help in reducing the number of include dependencies, several files were
touched as they were getting needed headers indirectly for stuff they use.
Thanks also to Alan Menegotto for pointing out that net/dccp/proto.c had
linux/dccp.h included twice.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If ipoib_ib_dev_up() fails after ipoib_ib_dev_open() is called, then
ipoib_ib_dev_stop() needs to be called to clean up.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix a race condition: ipoib_ib_dev_flush() accesses the child list
without holding a lock.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>