This patch makes the needlessly global mthca_update_rate() static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The driver allocates SRQ WQEs with a power-of-2 size both for
Tavor and for memfree. For Tavor, however, the hardware only requires
the WQE size to be a multiple of 16, not a power of 2, and the max
number of scatter-gather allowed is reported accordingly by the
firmware (and this is the value currently returned by
ib_query_device() and ibv_query_device()).
If the max number of scatter/gather entries reported by the FW is used
when creating an SRQ, the creation will fail for Tavor, since the
required WQE size will be increased to the next power of 2, which
turns out to be larger than the device permitted max WQE size (which
is not a power of 2).
This patch reduces the reported SRQ max wqe size so that it can be used
successfully in creating an SRQ on Tavor HCAs.
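
The arithmetic can be sketched as follows. This is only an illustration: the 16-byte header and 16-byte scatter/gather entry sizes, and all demo_* names, are assumptions made for the example and are not taken from the driver.

```c
#include <stddef.h>

/* Hypothetical sizes, chosen only to make the arithmetic concrete. */
enum { DEMO_WQE_HDR_SIZE = 16, DEMO_SGE_SIZE = 16 };

/* Largest power of 2 that is <= n (n must be nonzero). */
static size_t demo_rounddown_pow2(size_t n)
{
	size_t p = 1;

	while (p <= n / 2)
		p *= 2;
	return p;
}

/* Smallest power of 2 that is >= n (n must be nonzero). */
static size_t demo_roundup_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p *= 2;
	return p;
}

/*
 * Number of scatter/gather entries that is safe to report when every
 * WQE is rounded up to a power of 2: size the WQE from the largest
 * power of 2 that still fits within the device maximum.
 */
static size_t demo_max_srq_sge(size_t max_desc_size)
{
	size_t wqe_size;

	if (max_desc_size < DEMO_WQE_HDR_SIZE)
		return 0;
	wqe_size = demo_rounddown_pow2(max_desc_size);
	return (wqe_size - DEMO_WQE_HDR_SIZE) / DEMO_SGE_SIZE;
}
```

With a device maximum of 496 bytes, 30 entries would need a 496-byte WQE that rounds up to 512 and is rejected, while the 15 entries computed here need exactly 256 bytes.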
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The PCI spec recommends against drivers playing with a device's PCI
read burst size, and says that systems software should configure it.
And we actually have users that report that changing it from the
default set by BIOS hurts performance and/or stability for them. On
the other hand, the Mellanox Programmer's Reference Manual recommends
turning it up all the way to the maximum value. Some tests conducted
here in the lab do not show performance improvement from this tuning,
but this might be just me.
As a work-around, make this tuning an option, off by default (safe
value), with an eye towards removing it completely one day if no one
complains.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Push translation of static rate to HCA format into low-level drivers,
where it belongs. For static rate encoding, use encoding of rate
field from IB standard PathRecord, with addition of value 0, for
backwards compatibility with current usage. The changes are:
- Add enum ib_rate to midlayer includes.
- Get rid of static rate translation in IPoIB; just use static rate
directly from Path and MulticastGroup records.
- Update mthca driver to translate absolute static rate into the
format used by hardware. This also fixes mthca's static rate
handling for HCAs that are capable of 4X DDR.
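
A rough sketch of such a translation is shown here. The enum values follow the PathRecord rate encoding plus 0 for "use the port's current rate"; the demo_* names and the divider-style hardware format are hypothetical and are not the format mthca actually programs.

```c
/* Absolute rates in PathRecord encoding, plus 0 (illustrative names). */
enum demo_rate {
	DEMO_RATE_PORT_CURRENT = 0,
	DEMO_RATE_2_5_GBPS     = 2,
	DEMO_RATE_10_GBPS      = 3,
	DEMO_RATE_30_GBPS      = 4,
	DEMO_RATE_5_GBPS       = 5,
	DEMO_RATE_20_GBPS      = 6
};

/* Express a rate as a multiple of 2.5 Gb/s; -1 if it is not known. */
static int demo_rate_to_mult(enum demo_rate rate)
{
	switch (rate) {
	case DEMO_RATE_2_5_GBPS: return 1;
	case DEMO_RATE_5_GBPS:   return 2;
	case DEMO_RATE_10_GBPS:  return 4;
	case DEMO_RATE_20_GBPS:  return 8;
	case DEMO_RATE_30_GBPS:  return 12;
	default:                 return -1;
	}
}

/*
 * Hypothetical hardware format: the smallest divider d such that
 * port_rate / d does not exceed the requested rate. A divider of 1
 * means "no throttling".
 */
static int demo_rate_to_hw_divider(enum demo_rate req, enum demo_rate port)
{
	int req_mult = demo_rate_to_mult(req);
	int port_mult = demo_rate_to_mult(port);

	if (req == DEMO_RATE_PORT_CURRENT || req_mult < 0 ||
	    port_mult < 0 || port_mult <= req_mult)
		return 1;
	return (port_mult + req_mult - 1) / req_mult;	/* round up */
}
```

Because the input is an absolute rate, the low-level driver can compare it against whatever the port is actually running at, including 20 Gb/s on a 4X DDR link.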
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change the mthca debugging trace output code so that it can be enabled
and disabled at runtime with the debug_level module parameter in
sysfs. Also, don't allow CONFIG_INFINIBAND_MTHCA_DEBUG to be disabled
unless CONFIG_EMBEDDED is selected. We want users (and especially
distros) to have this turned on unless they really need to save space,
because by the time we want debugging output, it's usually too late to
rebuild a kernel.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Quite a few cleanup functions in mthca were marked as __devexit.
However, they could also be called from error paths during
initialization, so they cannot be marked that way. Just delete all of
the incorrect annotations.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The previous patch for Tavor broke MemFree logic.
The driver should perform limit check only for Tavor. For MemFree,
the check is incorrect, since ds (WQE stride) is always a power-of-2
(although the max_desc_size may not be).
In Tavor, however, WQE stride and desc_size are the same, and are not
necessarily power-of-2. The check was really for the WQE stride (and
in Tavor, we use max_desc_size for the stride).
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If the call to mthca_MODIFY_QP() failed, then mthca_modify_qp() would
still do some things it shouldn't, such as store away attributes for
special QPs. Fix this, and simplify the code, by simply jumping to
the exit path if mthca_MODIFY_QP() fails.
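
The control flow looks roughly like the sketch below. The demo_* types and the stand-in firmware call are hypothetical; only the "jump to the exit path on failure" shape reflects the description above.

```c
#include <errno.h>
#include <stdbool.h>

struct demo_qp {
	int  state;
	bool is_special;
	int  saved_attr;	/* attribute remembered for special QPs */
};

/* Stand-in for the firmware command; fails on request. */
static int demo_fw_modify_qp(int new_state, bool fail)
{
	(void) new_state;
	return fail ? -EIO : 0;
}

static int demo_modify_qp(struct demo_qp *qp, int new_state, int attr,
			  bool fw_fails)
{
	int err;

	err = demo_fw_modify_qp(new_state, fw_fails);
	if (err)
		goto out;	/* skip all bookkeeping below */

	/* Only reached when the firmware accepted the transition. */
	qp->state = new_state;
	if (qp->is_special)
		qp->saved_attr = attr;

out:
	return err;
}
```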
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mthca_alloc_sqp() needs to set qp->transport before calling
mthca_set_qp_size(), since the value is used there.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When setting the shared receive queue (SRQ) watermark in a modify SRQ
operation, make sure that the supplied value is not larger than the
full size of the SRQ.
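
A minimal sketch of the check, using a hypothetical demo_srq structure rather than the driver's own types:

```c
#include <errno.h>
#include <stdint.h>

struct demo_srq {
	uint32_t max;	/* full size of the SRQ */
	uint32_t limit;	/* current watermark */
};

/* Reject a watermark larger than the queue itself. */
static int demo_set_srq_limit(struct demo_srq *srq, uint32_t limit)
{
	if (limit > srq->max)
		return -EINVAL;

	srq->limit = limit;
	return 0;
}
```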
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Guarantee the calculated work queue entry size does not exceed the max
allowable WQE size when creating an SRQ. This is a problem with Arbel
in Tavor-compatibility mode because the current WQE size computation
method rounds up to next power of 2.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a check that the modify QP parameters sgid_index and path_mtu are
valid, since they might come from userspace.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix endianness handling of srq_limit: it is big-endian in the context
structure, so we need to swab it before returning it.
Also add support for srq_limit query for Tavor (non-MemFree) HCAs.
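
For illustration, a portable way to read a big-endian field out of a raw context buffer is sketched here. The 16-bit width and the demo_* name are assumptions; in the kernel the conversion would be done with the usual byte-order helpers instead.

```c
#include <stdint.h>

/*
 * Convert a 16-bit big-endian field to host order. Reading the bytes
 * individually gives the same result on any host endianness.
 */
static uint16_t demo_be16_to_cpu(const uint8_t raw[2])
{
	return (uint16_t) ((raw[0] << 8) | raw[1]);
}
```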
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
MemFree devices need to reserve one shared receive queue (SRQ) work
request for internal use, so the capacity returned from the create_srq
and query_srq methods should be srq->max - 1.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix bug found by coverity: the loop body never executed, because it
was doing for (i = 0; i < MTHCA_EQ_CMD; ++i), but MTHCA_EQ_CMD is 0.
The correct loop bound is MTHCA_NUM_EQ, to loop over all EQs.
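
The bug pattern can be reproduced in a few lines. The enum below is hypothetical apart from the one fact stated above, namely that the command EQ index is 0.

```c
/* Hypothetical EQ indices; only "CMD is 0" comes from the description. */
enum { DEMO_EQ_CMD = 0, DEMO_EQ_ASYNC, DEMO_EQ_COMP, DEMO_NUM_EQ };

/* Release every EQ below 'bound'; returns how many were released. */
static int demo_free_eqs(int in_use[DEMO_NUM_EQ], int bound)
{
	int i, freed = 0;

	for (i = 0; i < bound; ++i) {
		in_use[i] = 0;
		++freed;
	}
	return freed;
}
```

Passing DEMO_EQ_CMD as the bound releases nothing, while DEMO_NUM_EQ visits every entry.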
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Sinai (one-port PCI Express) HCAs get improved throughput for messages
bigger than 80 KB in DDR mode if memory keys are formatted in a
specific way. The enhancement only works if the memory key table is
smaller than 2^24 entries. For larger tables, the enhancement is off
and a warning is printed (to avoid silent performance loss).
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Michael Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use a named enum for the HCA's internal page size, rather than having
magic values of 4096 and shifts by 12 all over the code. Also, fix
one minor bug in EQ handling: only one HCA page is mapped to the HCA
during initialization, but a full kernel page is unmapped during
cleanup. This might cause problems when PAGE_SIZE != 4096.
Signed-off-by: Ishai Rabinovitz <ishai@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Check that the alternate P_Key index is in range when setting the
alternate path for a QP. Also make a cosmetic touch up to the debug
message printed when the main P_Key index is out of range.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for IB_SEND_FENCE flag in post_send methods.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Implement query_ah (except for AVs which are in HCA memory). This is
needed to implement RMPP duplicate session detection on sending side
(extraction of DGID/DLID and GRH flag from address handle).
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch checks whether the HCA supports posting FW commands
through a doorbell page (user access region 0, or "UAR0"). If this is
supported, the driver maps UAR0 and uses it for FW commands. This can
be controlled by the value of a writable module parameter
fw_cmd_doorbell. When the parameter is 0, the commands are posted
through HCR using the old method; otherwise, if the HCA is capable,
commands go through UAR0.
This use of UAR0 to post commands eliminates the need for polling the
"go" bit prior to posting a new command. Since reading from a PCI
device is much more expensive than issuing a posted write, it is
expected that issuing FW commands this way will provide better CPU
utilization.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Have mthca's create_srq method return the actual capacity of the SRQ
that gets created. Also update comments in <rdma/ib_verbs.h> to
clarify that this is what is expected from ib_create_srq().
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use ib_modify_qp_is_ok() in mthca, and delete the big table of
attributes for queue pair state transitions.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add low-level driver support to ib_mthca so that consumers can request
a "send queue drained" event be generated when a transition to the SQD
state completes.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch allows the consumer to set the page size of "pages" mapped
by the pool FMRs, which is a feature already existing in the base
verbs API. On the cosmetic side it changes ib_fmr_attr.page_size field
to be named page_shift.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a modify_device method to mthca, which implements setting the node
description. This makes the writable "node_desc" sysfs attribute work
for Mellanox HCAs.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The might_sleep() annotations in mthca are silly -- they all occur
shortly before calls that will end up in core functions like kmalloc()
that will print the same warning in an unsafe context anyway. In
fact, beyond cluttering the source, we're actually bloating text with
CONFIG_DEBUG_SPINLOCK_SLEEP and/or CONFIG_PREEMPT_VOLUNTARY set.
With both options set, getting rid of the might_sleep()s saves a lot:
add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-171 (-171)
function                 old     new   delta
mthca_pd_alloc           132     109     -23
mthca_init_cq            969     946     -23
mthca_mr_alloc           592     568     -24
mthca_pd_free             67      42     -25
mthca_free_mr            219     194     -25
mthca_free_cq            570     545     -25
mthca_fmr_alloc          742     716     -26
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The function mthca_free_err_wqe() can never fail, so get rid of its
return value. That means handle_error_cqe() doesn't have to check
what mthca_free_err_wqe() returns, which means it can't fail either
and doesn't have to return anything either. All this results in
simpler source code and a slight object code improvement:
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-10 (-10)
function                 old     new   delta
mthca_free_err_wqe        83      81      -2
mthca_poll_cq           1758    1750      -8
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When debugging is enabled, the mthca_QUERY_DEV_LIM() firmware command
function prints out some of the device limits that it queries.
However the debugging prints happen before all of the fields are
extracted from the firmware response, so some of the values that get
printed are uninitialized junk. Move the prints to the end of the
function to fix this.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Convert semaphores to mutexes in mthca. Leave firmware command
interface poll_sem and event_sem as semaphores.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We have run into the following problem: if a task receives a signal
while in the process of e.g. destroying a resource (which could be
because the relevant file was closed) mthca could bail out from trying
to take a command interface semaphore without performing the
appropriate command to tell hardware that the resource is being
destroyed.
As a result we see messages like
ib_mthca 0000:04:00.0: HW2SW_CQ failed (-4)
In this case, hardware could access the resource after the memory has
been freed, possibly causing memory corruption.
A simple solution is to replace down_interruptible() by down() in
command interface activation.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
[ It's also not safe to bail out on multicast table operations, since
they may be invoked on the cleanup path too. So use down() for
mcg_table.sem too. ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There are some cards around that have UAR (user access region) size
different from 8 MB. Relax our sanity check to make sure that the PCI
BAR is big enough to access the UAR size reported by the device
firmware instead.
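
The relaxed check amounts to a simple comparison, sketched below. The exact form of the old check is an assumption (equality with 8 MB), and the demo_* names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed old behaviour: insist on exactly 8 MB. */
static bool demo_uar_bar_ok_old(uint64_t bar_len)
{
	return bar_len == (8ULL << 20);
}

/* Relaxed check: the BAR only has to cover the UAR size from firmware. */
static bool demo_uar_bar_ok(uint64_t bar_len, uint64_t fw_uar_size)
{
	return bar_len >= fw_uar_size;
}
```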
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mthca_create_ah() includes the port number in the GID index. The reverse
needs to be done in mthca_read_ah().
Noted by Hal Rosenstock.
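
As an illustration of the encode/decode pair, assume ports are numbered from 1 and each port owns a contiguous block of gid_table_len entries. Both the layout and the demo_* names are assumptions made for this sketch.

```c
/* Fold the port number into the hardware GID index. */
static unsigned demo_encode_gid_index(unsigned port_num, unsigned sgid_index,
				      unsigned gid_table_len)
{
	return (port_num - 1) * gid_table_len + sgid_index;
}

/* Reverse step: recover the per-port GID index. */
static unsigned demo_decode_sgid_index(unsigned hw_gid_index,
				       unsigned gid_table_len)
{
	return hw_gid_index % gid_table_len;
}

/* Reverse step: recover the 1-based port number. */
static unsigned demo_decode_port(unsigned hw_gid_index,
				 unsigned gid_table_len)
{
	return hw_gid_index / gid_table_len + 1;
}
```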
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
build_mlx_header() was using sqp->ud_header.grh_present before it was
initialized by mthca_read_ah(). Furthermore, header->grh_present is
set by ib_ud_header_init, so there's no need to set it again in
mthca_read_ah().
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use the ALIGN macro to simplify some rounding code.
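
For reference, rounding up to a power-of-2 boundary can be written as a mask operation. The sketch uses hypothetical demo_* functions in place of the kernel macro, next to the kind of open-coded arithmetic it replaces.

```c
/* Round x up to a multiple of a; a must be a power of 2. */
static unsigned long demo_align(unsigned long x, unsigned long a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Equivalent open-coded rounding with a divide and a multiply. */
static unsigned long demo_align_open_coded(unsigned long x, unsigned long a)
{
	return ((x + a - 1) / a) * a;
}
```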
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix memory leaks in mthca_create_qp() and mthca_create_srq()
error handling.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Convert "/ (1 << lg)" to ">> lg" for a slight code size reduction.
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-24 (-24)
function                 old     new   delta
mthca_map_cmd            613     589     -24
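
The two forms compute the same value for non-negative operands, as this small sketch with hypothetical helper names shows. For signed operands the compiler cannot always make the substitution itself, because division truncates toward zero while a right shift of a negative value does not, which may explain the size difference.

```c
/* Division by a power of 2, written as in the old code. */
static unsigned demo_div_pow2(unsigned n, unsigned lg)
{
	return n / (1u << lg);
}

/* The same computation as a plain shift. */
static unsigned demo_shift(unsigned n, unsigned lg)
{
	return n >> lg;
}
```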
Signed-off-by: Ishai Rabinovitz <ishai@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a node_guid field to struct ib_device. It is the responsibility
of the low-level driver to initialize this field before registering a
device with the midlayer. Convert everyone to looking at this field
instead of calling ib_query_device() when all they want is the node
GUID, and remove the node_guid field from struct ib_device_attr.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Factor out common code for initializing MAD packets, which is shared
by many query routines in mthca_provider.c, into init_query_mad().
add/remove: 1/0 grow/shrink: 0/4 up/down: 16/-44 (-28)
function                 old     new   delta
init_query_mad             -      16     +16
mthca_query_port         521     518      -3
mthca_query_pkey         301     294      -7
mthca_query_device       648     641      -7
mthca_query_gid          453     426     -27
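
The shape of such a helper might look like the sketch below. The structure is a cut-down, hypothetical stand-in for a MAD header, and the choice of which fields the helper sets is an assumption; the class and method values are the standard ones for a LID-routed subnet management Get.

```c
#include <stdint.h>
#include <string.h>

/* Cut-down stand-in for a MAD header (illustrative only). */
struct demo_smp {
	uint8_t  base_version;
	uint8_t  mgmt_class;
	uint8_t  class_version;
	uint8_t  method;
	uint16_t attr_id;	/* kept in host order for simplicity */
};

enum {
	DEMO_MGMT_CLASS_SUBN_LID_ROUTED = 0x01,
	DEMO_MGMT_METHOD_GET            = 0x01
};

/* Common initialization shared by every query routine. */
static void demo_init_query_mad(struct demo_smp *mad)
{
	mad->base_version  = 1;
	mad->mgmt_class    = DEMO_MGMT_CLASS_SUBN_LID_ROUTED;
	mad->class_version = 1;
	mad->method        = DEMO_MGMT_METHOD_GET;
}

/* A caller only has to fill in what is specific to its query. */
static void demo_build_query(struct demo_smp *mad, uint16_t attr_id)
{
	memset(mad, 0, sizeof *mad);
	demo_init_query_mad(mad);
	mad->attr_id = attr_id;
}
```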
Signed-off-by: Roland Dreier <rolandd@cisco.com>
I am seeing EQ overruns in SDP stress tests: if the CQ completion
handler arms a CQ, this could generate more EQEs, so that the EQ will
never become empty and the consumer index will never get updated.
This is similar to what we have with the command interface:
/*
* cmd_event() may add more commands.
* The card will think the queue has overflowed if
* we don't tell it we've been processing events.
*/
However, for completion events, we *don't* want to update the consumer
index on each event. So, perform EQ doorbell coalescing: allocate EQs
with some spare EQEs, and update once we run out of them.
The value 0x80 was selected to avoid any performance impact.
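
The coalescing idea can be sketched as a counter that triggers a consumer index update once the spare entries have been used up. The demo_* names and the final update after the loop are assumptions for this example; only the 0x80 threshold comes from the text above.

```c
enum { DEMO_NUM_SPARE_EQE = 0x80 };

struct demo_eq {
	unsigned cons_index;	/* software consumer index */
	unsigned hw_cons_index;	/* last value told to the hardware */
	unsigned doorbells;	/* number of consumer index updates */
};

/* Tell the (simulated) hardware how far we have consumed. */
static void demo_set_eq_ci(struct demo_eq *eq)
{
	eq->hw_cons_index = eq->cons_index;
	++eq->doorbells;
}

/* Consume 'nevents' pending events with doorbell coalescing. */
static void demo_poll_eq(struct demo_eq *eq, unsigned nevents)
{
	unsigned since_update = 0;

	while (nevents--) {
		++eq->cons_index;	/* "handle" one event */

		/* Spare entries exhausted: update before the EQ can overrun. */
		if (++since_update >= DEMO_NUM_SPARE_EQE) {
			demo_set_eq_ci(eq);
			since_update = 0;
		}
	}

	/* Publish whatever is left once the queue has been drained. */
	if (since_update)
		demo_set_eq_ci(eq);
}
```

Processing 300 events this way costs three updates instead of 300.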
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For all pages except possibly the last one, the byte beyond the buffer
end must be page aligned. Therefore, when computing the page shift to
use, OR the end addresses of the buffers as well as the start
addresses into the mask we check.
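
A sketch of the computation follows, with hypothetical names and an assumed upper bound of 31 on the shift. Start addresses of all buffers and end addresses of all buffers except the last are ORed together, and the lowest set bit of the result gives the largest usable page shift.

```c
#include <stddef.h>
#include <stdint.h>

enum { DEMO_MAX_PAGE_SHIFT = 31 };	/* assumed cap, for illustration */

struct demo_buf {
	uint64_t addr;
	uint64_t size;
};

static unsigned demo_page_shift(const struct demo_buf *bufs, size_t n)
{
	uint64_t mask = 0;
	unsigned shift = 0;
	size_t i;

	for (i = 0; i < n; ++i) {
		mask |= bufs[i].addr;
		/* Every buffer but the last must also end on a page boundary. */
		if (i != n - 1)
			mask |= bufs[i].addr + bufs[i].size;
	}

	/* Setting the cap bit bounds the shift and guarantees termination. */
	mask |= 1ULL << DEMO_MAX_PAGE_SHIFT;
	while (!(mask & 1)) {
		mask >>= 1;
		++shift;
	}
	return shift;
}
```

In the first test case below, looking only at start addresses would suggest 64 KB pages, even though the first buffer is just 4 KB long; including its end address brings the shift down to 12.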
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>