After this patch none of the netlink callback support anything
except the initial network namespace but the rtnetlink infrastructure
now handles multiple network namespaces.
Changes from v2:
- IPv6 addrlabel processing
Changes from v1:
- no need for special rtnl_unlock handling
- fixed IPv6 ndisc
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before I can enable rtnetlink to work in all network namespaces I need
to be certain that something won't break. So this patch deliberately
disables all of the rtnletlink methods in everything except the
initial network namespace. After the methods have been audited this
extra check can be disabled.
Changes from v1:
- added IPv6 addrlabel protection
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Qdisc_class_ops are const, and Qdisc_ops are mostly read.
Using "const" and "__read_mostly" qualifiers helps to reduce false
sharing.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv4 and IPv6 hook values are identical, yet some code tries to figure
out the "correct" value by looking at the address family. Introduce NF_INET_*
values for both IPv4 and IPv6. The old values are kept in a #ifndef __KERNEL__
section for userspace compatibility.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many-many code in the kernel initialized the timer->function
and timer->data together with calling init_timer(timer). There
is already a helper for this. Use it for networking code.
The patch is HUGE, but makes the code 130 lines shorter
(98 insertions(+), 228 deletions(-)).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only qdiscs that check subqueue state before dequeue'ing are PRIO
and RR. The other qdiscs, including the default pfifo_fast qdisc,
will allow traffic bound for subqueue 0 through to hard_start_xmit.
The check for netif_queue_stopped() is done above in pkt_sched.h, so
it is unnecessary for qdisc_restart(). However, if the underlying
driver is multiqueue capable, and only sets queue states on subqueues,
this will allow packets to enter the driver when it's currently unable
to process packets, resulting in expensive requeues and driver
entries. This patch re-adds the check for the subqueue status before
calling hard_start_xmit, so we can try and avoid the driver entry when
the queues are stopped.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Computing the rank of the first set bit in the hash mask (for using later
in u32_hash_fold()) was done with plain C code. Using ffs() instead makes
the code more readable and improves performance (since ffs() is better
optimized in assembler).
Using the conditional operator on hash mask before applying ntohl() also
saves one ntohl() call if mask is 0.
Signed-off-by: Radu Rendec <radu.rendec@ines.ro>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
While trying to implement u32 hashes in my shaping machine I ran into
a possible bug in the u32 hash/bucket computing algorithm
(net/sched/cls_u32.c).
The problem occurs only with hash masks that extend over the octet
boundary, on little endian machines (where htonl() actually does
something).
Let's say that I would like to use 0x3fc0 as the hash mask. This means
8 contiguous "1" bits starting at b6. With such a mask, the expected
(and logical) behavior is to hash any address in, for instance,
192.168.0.0/26 in bucket 0, then any address in 192.168.0.64/26 in
bucket 1, then 192.168.0.128/26 in bucket 2 and so on.
This is exactly what would happen on a big endian machine, but on
little endian machines, what would actually happen with current
implementation is 0x3fc0 being reversed (into 0xc03f0000) by htonl()
in the userspace tool and then applied to 192.168.x.x in the u32
classifier. When shifting right by 16 bits (rank of first "1" bit in
the reversed mask) and applying the divisor mask (0xff for divisor
256), what would actually remain is 0x3f applied on the "168" octet of
the address.
One could say is this can be easily worked around by taking endianness
into account in userspace and supplying an appropriate mask (0xfc03)
that would be turned into contiguous "1" bits when reversed
(0x03fc0000). But the actual problem is the network address (inside
the packet) not being converted to host order, but used as a
host-order value when computing the bucket.
Let's say the network address is written as n31 n30 ... n0, with n0
being the least significant bit. When used directly (without any
conversion) on a little endian machine, it becomes n7 ... n0 n8 ..n15
etc in the machine's registers. Thus bits n7 and n8 would no longer be
adjacent and 192.168.64.0/26 and 192.168.128.0/26 would no longer be
consecutive.
The fix is to apply ntohl() on the hmask before computing fshift,
and in u32_hash_fold() convert the packet data to host order before
shifting down by fshift.
With helpful feedback from Jamal Hadi Salim and Jarek Poplawski.
Signed-off-by: David S. Miller <davem@davemloft.net>
tecl_reset() is called from deactivate and qdisc is set to noop already,
but subsequent teql_xmit does not know about it and dereference private
data as teql qdisc and thus oopses.
not catch it first :)
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
clean skb_clone of any signs of CONFIG_NET_CLS_ACT and
have mirred us skb_act_clone()
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix one more user of netiff_subqueue_stopped. To check for the
queue id one must use the __netiff_subqueue_stoped call.
This run out of my sight when I made the:
668f895a85
[NET]: Hide the queue_mapping field inside netif_subqueue_stopped
commit :(
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many places get the queue_mapping field from skb to pass it to the
netif_subqueue_stopped() which will be 0 in any case.
Make the helper that works with sk_buff
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the helper for getting the field, symmetrical to
the "set" one. Return 0 if CONFIG_NETDEVICES_MULTIQUEUE=n
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the various misspellings of "system", controller", "interrupt" and
"[un]necessary".
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
The function dev_deactivate is supposed to only return when
all outstanding transmissions have completed. Unfortunately
it is possible for store operations in the driver's transmit
function to only become visible after dev_deactivate returns.
This patch fixes this by taking the queue lock after we see
the end of the queue run. This ensures that all effects of
any previous transmit calls are visible.
If however we detect that there is another queue run occuring,
then we'll warn about it because this should never happen as
we have pointed dev->qdisc to noop_qdisc within the same queue
lock earlier in the functino.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert "QoS and/or fair queueing" to menuconfig.
This makes it easy for someone to disable all sub-options with
one config symbol.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking at a net driver with the following construct,
if (!netif_carrier_ok(dev))
netif_carrier_on(dev);
it stuck me that the netif_carrier_ok() check was redundant, since
netif_carrier_on() checks bit __LINK_STATE_NOCARRIER anyway. This is
the same reason why netif_queue_stopped() need not be called prior to
netif_wake_queue().
This is true, but there is however an unwanted side effect from assuming
that netif_carrier_on() can be called multiple times: it touches the
watchdog, regardless of pre-existing carrier state.
The fix: move watchdog-up inside the bit-cleared code path.
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With all the users of the double pointers removed, this patch mops up by
finally replacing all occurances of sk_buff ** in the netfilter API by
sk_buff *.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fourth parameter of /proc/net/psched is supposed to show the timer
resultion and is used by HTB userspace to calculate the necessary
burst rate. Currently we show the clock resolution, which results in a
too low burst rate when the two differ.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a bunch of sparse warnings. Mostly about 0 used as
NULL pointer, and shadowed variable declarations.
One notable case was that hash size should have been unsigned.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stateless NAT is useful in controlled environments where restrictions are
placed on through traffic such that we don't need connection tracking to
correctly NAT protocol-specific data.
In particular, this is of interest when the number of flows or the number
of addresses being NATed is large, or if connection tracking information
has to be replicated and where it is not practical to do so.
Previously we had stateless NAT functionality which was integrated into
the IPv4 routing subsystem. This was a great solution as long as the NAT
worked on a subnet to subnet basis such that the number of NAT rules was
relatively small. The reason is that for SNAT the routing based system
had to perform a linear scan through the rules.
If the number of rules is large then major renovations would have take
place in the routing subsystem to make this practical.
For the time being, the least intrusive way of achieving this is to use
the u32 classifier written by Alexey Kuznetsov along with the actions
infrastructure implemented by Jamal Hadi Salim.
The following patch is an attempt at this problem by creating a new nat
action that can be invoked from u32 hash tables which would allow large
number of stateless NAT rules that can be used/updated in constant time.
The actual NAT code is mostly based on the previous stateless NAT code
written by Alexey. In future we might be able to utilise the protocol
NAT code from netfilter to improve support for other protocols.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since hardware header operations are part of the protocol class
not the device instance, make them into a separate object and
save memory.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add inline for common usage of hardware header creation, and
fix bug in IPV6 mcast where the assumption about negative return is
an errno. Negative return from hard_header means not enough space
was available,(ie -N bytes).
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
For N cpus, with full throttle traffic on all N CPUs, funneling traffic
to the same ethernet device, the devices queue lock is contended by all
N CPUs constantly. The TX lock is only contended by a max of 2 CPUS.
In the current mode of operation, after all the work of entering the
dequeue region, we may endup aborting the path if we are unable to get
the tx lock and go back to contend for the queue lock. As N goes up,
this gets worse.
The changes in this patch result in a small increase in performance
with a 4CPU (2xdual-core) with no irq binding. Both e1000 and tg3
showed similar behavior;
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's been a useless no-op for long enough in 2.6 so I figured it's time to
remove it. The number of people that could object because they're
maintaining unified 2.4 and 2.6 drivers is probably rather small.
[ Handled drivers added by netdev tree and some missed IRDA cases... -DaveM ]
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change L2T (length to time) macros, in all rate based schedulers, to
call a common function qdisc_l2t() that does the rate table lookup.
This function handles if the packet size lookup is larger than the
rate table, which often occurs with TSO enabled.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes most of the generic device layer network
namespace safe. This patch makes dev_base_head a
network namespace variable, and then it picks up
a few associated variables. The functions:
dev_getbyhwaddr
dev_getfirsthwbytype
dev_get_by_flags
dev_get_by_name
__dev_get_by_name
dev_get_by_index
__dev_get_by_index
dev_ioctl
dev_ethtool
dev_load
wireless_process_ioctl
were modified to take a network namespace argument, and
deal with it.
vlan_ioctl_set and brioctl_set were modified so their
hooks will receive a network namespace argument.
So basically anthing in the core of the network stack that was
affected to by the change of dev_base was modified to handle
multiple network namespaces. The rest of the network stack was
simply modified to explicitly use &init_net the initial network
namespace. This can be fixed when those components of the network
stack are modified to handle multiple network namespaces.
For now the ifindex generator is left global.
Fundametally ifindex numbers are per namespace, or else
we will have corner case problems with migration when
we get that far.
At the same time there are assumptions in the network stack
that the ifindex of a network device won't change. Making
the ifindex number global seems a good compromise until
the network stack can cope with ifindex changes when
you change namespaces, and the like.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes /proc/net per network namespace. It modifies the global
variables proc_net and proc_net_stat to be per network namespace.
The proc_net file helpers are modified to take a network namespace argument,
and all of their callers are fixed to pass &init_net for that argument.
This ensures that all of the /proc/net files are only visible and
usable in the initial network namespace until the code behind them
has been updated to be handle multiple network namespaces.
Making /proc/net per namespace is necessary as at least some files
in /proc/net depend upon the set of network devices which is per
network namespace, and even more files in /proc/net have contents
that are relevant to a single network namespace.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several devices have multiple independant RX queues per net
device, and some have a single interrupt doorbell for several
queues.
In either case, it's easier to support layouts like that if the
structure representing the poll is independant from the net
device itself.
The signature of the ->poll() call back goes from:
int foo_poll(struct net_device *dev, int *budget)
to
int foo_poll(struct napi_struct *napi, int budget)
The caller is returned the number of RX packets processed (or
the number of "NAPI credits" consumed if you want to get
abstract). The callee no longer messes around bumping
dev->quota, *budget, etc. because that is all handled in the
caller upon return.
The napi_struct is to be embedded in the device driver private data
structures.
Furthermore, it is the driver's responsibility to disable all NAPI
instances in it's ->stop() device close handler. Since the
napi_struct is privatized into the driver's private data structures,
only the driver knows how to get at all of the napi_struct instances
it may have per-device.
With lots of help and suggestions from Rusty Russell, Roland Dreier,
Michael Chan, Jeff Garzik, and Jamal Hadi Salim.
Bug fixes from Thomas Graf, Roland Dreier, Peter Zijlstra,
Joseph Fannin, Scott Wood, Hans J. Koch, and Michael Chan.
[ Ported to current tree and all drivers converted. Integrated
Stephen's follow-on kerneldoc additions, and restored poll_list
handling to the old style to fix mutual exclusion issues. -DaveM ]
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is followup to Patrick's patch. A little optimization to enqueue
routine allows to remove artificial limitation on queue length.
Plus, testing showed that hash function used by SFQ is too bad or even worse.
It does not even sweep the whole range of hash values.
Switched to Jenkins' hash.
Signed-off-by: Alexey Kuznetsov <kaber@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/sched/sch_cbq.c: In function 'cbq_enqueue':
net/sched/sch_cbq.c:383: warning: 'ret' may be used uninitialized in this function
has been verified to be a bogus case. So let's shut it up.
Signed-off-by: Satyam Sharma <satyam@infradead.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(with no apologies to C Heston)
On Mon, 2007-10-09 at 21:00 +0800, Herbert Xu wrote:
On Sun, Sep 02, 2007 at 01:11:29PM +0000, Christian Kujau wrote:
> >
> > after upgrading to 2.6.23-rc5 (and applying davem's fix [0]), lockdep
> > was quite noisy when I tried to shape my external (wireless) interface:
> >
> > [ 6400.534545] FahCore_78.exe/3552 just changed the state of lock:
> > [ 6400.534713] (&dev->ingress_lock){-+..}, at: [<c038d595>]
> > netif_receive_skb+0x2d5/0x3c0
> > [ 6400.534941] but this lock took another, soft-read-irq-unsafe lock in the
> > past:
> > [ 6400.535145] (police_lock){-.--}
>
> This is a genuine dead-lock. The police lock can be taken
> for reading with softirqs on. If a second CPU tries to take
> the police lock for writing, while holding the ingress lock,
> then a softirq on the first CPU can dead-lock when it tries
> to get the ingress lock.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_NET_CLS_ACT is enabled, tc_classify() is called twice in
prio_classify(). This causes "interesting" behaviour: with the setup
below, packets are duplicated, sent twice to ifb0, and then loop in and
out of ifb0.
The patch uses the previously calculated return value in the switch,
which is probably what Patrick had in mind in commit
bdba91ec70 -- maybe Patrick can
double-check this?
-- example setup --
ifconfig ifb0 up
tc qdisc add dev ifb0 root netem delay 2s
tc qdisc add dev $ETH root handle 1: prio
tc filter add dev $ETH parent 1: protocol ip prio 10 u32 \
match ip dst 172.24.110.6/32 flowid 1:1 \
action mirred egress redirect dev ifb0
ping -c1 172.24.110.6
Signed-off-by: Lucas Nussbaum <lucas.nussbaum@imag.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch cleans up duplicate includes in
net/sched/
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the check in prio_tune() to see if sch->parent is TC_H_ROOT instead of
sch->handle to load or reject the qdisc for multiqueue devices.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix sch_api to correctly set sch->parent for both ingress and egress
qdiscs in qdisc_create().
Signed-off-by: Patrick McHardy <trash@kaber.net>
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix handling of empty or completely non-matching filter chains. In
that case -1 is returned and tcf_result is uninitialized, the
qdisc should fall back to default classification in that case.
Noticed by PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: vignesh babu <vignesh.babu@wipro.com>
Signed-off-by: chas williams <chas@cmf.nrl.navy.mil>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NET_CLS_ACT option is now a full replacement for NET_CLS_POLICE,
remove the old code. The config option will be kept around to select
the equivalent NET_CLS_ACT options for a short time to allow easier
upgrades.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The behaviour of NET_CLS_POLICE for TC_POLICE_RECLASSIFY was to return
it to the qdisc, which could handle it internally or ignore it. With
NET_CLS_ACT however, tc_classify starts over at the first classifier
and never returns it to the qdisc. This makes it impossible to support
qdisc-internal reclassification, which in turn makes it impossible to
remove the old NET_CLS_POLICE code without breaking compatibility since
we have two qdiscs (CBQ and ATM) that support this.
This patch adds a tc_classify_compat function that handles
reclassification the old way and changes CBQ and ATM to use it.
This again is of course not fully backwards compatible with the previous
NET_CLS_ACT behaviour. Unfortunately there is no way to fully maintain
compatibility *and* support qdisc internal reclassification with
NET_CLS_ACT, but this seems like the better choice over keeping the two
incompatible options around forever.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Handle act_api classification results.
The ATM scheduler behaves slightly different than other schedulers
in that it only handles policer results for successful classifications,
this behaviour is retained for the act_api case.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
As noticed by Ranko Zivojnovic <ranko@spidernet.net>, calling qdisc_run
from the timer handler can result in deadlock:
> CPU#0
>
> qdisc_watchdog() fires and gets dev->queue_lock
> qdisc_run()...qdisc_restart()...
> -> releases dev->queue_lock and enters dev_hard_start_xmit()
>
> CPU#1
>
> tc del qdisc dev ...
> qdisc_graft()...dev_graft_qdisc()...dev_deactivate()...
> -> grabs dev->queue_lock ...
>
> qdisc_reset()...{cbq,hfsc,htb,netem,tbf}_reset()...qdisc_watchdog_cancel()...
> -> hrtimer_cancel() - waiting for the qdisc_watchdog() to exit, while still
> holding dev->queue_lock
>
> CPU#0
>
> dev_hard_start_xmit() returns ...
> -> wants to get dev->queue_lock(!)
>
> DEADLOCK!
The entire optimization is a bit questionable IMO, it moves potentially
large parts of NET_TX_SOFTIRQ work to TIMER_SOFTIRQ/HRTIMER_SOFTIRQ,
which kind of defeats the separation of them.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Ranko Zivojnovic <ranko@spidernet.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the HTB scheduler does not correctly account for TSO packets
which causes large inaccuracies in the bandwidth control when using TSO.
This patch allows the HTB scheduler to work with TSO enabled devices.
Signed-off-by: Ranjit Manomohan <ranjitm@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the generic estimator instead of reimplementing (parts of) it.
For compatibility always create a default estimator for new classes.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove stats_lock pointers from qdisc-internal structures, in all cases
it points to dev->queue_lock. The only case where it is necessary is for
top-level qdiscs, where it might also point to dev->ingress_lock in case
of the ingress qdisc. Also remove it from actions completely, it always
points to the actions internal lock.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The generic estimator is always built in anways and all the config options
does is prevent including a minimal amount of code for setting it up.
Additionally the option is already automatically selected for most cases.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the new sch_rr qdisc for multiqueue network device support. Allow
sch_prio and sch_rr to be compiled with or without multiqueue hardware
support.
sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS. This
was done since sch_prio and sch_rr only differ in their dequeue
routine.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the multiqueue hardware device support API to the core network
stack. Allow drivers to allocate multiple queues and manage them at
the netdev level if they choose to do so.
Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes :
- netif_queue_stopped need not be called inside qdisc_restart as
it has been called already in qdisc_run() before the first skb
is sent, and in __qdisc_run() after each intermediate skb is
sent (note : we are the only sender, so the queue cannot get
stopped while the tx lock was got in the ~LLTX case).
- BUG_ON((int) q->q.qlen < 0) was a relic from old times when -1
meant more packets are available, and __qdisc_run used to loop
when qdisc_restart() returned -1. During those days, it was
necessary to make sure that qlen is never less than zero, since
__qdisc_run would get into an infinite loop if no packets are on
the queue and this bug in qdisc was there (and worse - no more
skbs could ever get queue'd as we hold the queue lock too). With
Herbert's recent change to return values, this check is not
required. Hopefully Herbert can validate this change. If at all
this is required, it should be added to skb_dequeue (in failure
case), and not to qdisc_qlen.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
New changes :
- Incorporated Peter Waskiewicz's comments.
- Re-added back one warning message (on driver returning wrong value).
Previous changes :
- Converted to use switch/case code which looks neater.
- "if (ret == NETDEV_TX_LOCKED && lockless)" is buggy, and the lockless
check should be removed, since driver will return NETDEV_TX_LOCKED only
if lockless is true and driver has to do the locking. In the original
code as well as the latest code, this code can result in a bug where
if LLTX is not set for a driver (lockless == 0) but the driver is written
wrongly to do a trylock (despite LLTX being set), the driver returns
LOCKED. But since lockless is zero, the packet is requeue'd instead of
calling collision code which will issue warning and free up the skb.
Instead this skb will be retried with this driver next time, and the same
result will ensue. Removing this check will catch these driver bugs instead
of hiding the problem. I am keeping this change to readability section
since :
a. it is confusing to check two things as it is; and
b. it is difficult to keep this check in the changed 'switch' code.
- Changed some names, like try_get_tx_pkt to dev_dequeue_skb (as that is
the work being done and easier to understand) and do_dev_requeue to
dev_requeue_skb, merged handle_dev_cpu_collision and tx_islocked to
dev_handle_collision (handle_dev_cpu_collision is a small routine with only
one caller, so there is no need to have two separate routines which also
results in getting rid of two macros, etc.
- Removed an XXX comment as it should never fail (I suspect this was related
to batch skb WIP, Jamal ?). Converted some functions to original coding
style of having the return values and the function name on same line, eg
prio2list.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Over the years this code has gotten hairier. Resulting in many long
discussions over long summer days and patches that get it wrong.
This patch helps tame that code so normal people will understand it.
Thanks to Thomas Graf, Peter J. waskiewicz Jr, and Patrick McHardy
for their valuable reviews.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
cbq and atm destroy their filters twice when destroying inner classes
during qdisc destruction.
Reported-and-tested-by: Strobl Anton <a.strobl@aws-it.at>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent gcc versions emit warnings when unsigned variables are
compared < 0 or >= 0.
Signed-off-by: Bill Nottingham <notting@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
round_jiffies for net dev watchdog timer.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The event cache time must be an absolute value, when no event exists
it is incorrectly set to 1s instead of 1s in the future.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
My previous patch that changed the return value of qdisc_restart
incorrectly made the case where dequeue returns empty continue
processing packets.
This patch is based on diagnosis and fix by Patrick McHardy.
Reported-and-debugged-by: Anant Nitya <kernel@prachanda.info>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes an out-of-boundary condition when the classified
band equals q->bands. Caught by Alexey
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we relinquish queue_lock in qdisc_restart and then retake it for
requeueing, we might race against dev_deactivate and end up requeueing
onto noop_qdisc. This causes a warning to be printed.
This patch fixes this by checking this before we requeue. As an added
bonus, we can remove the same check in __qdisc_run which was added to
prevent dev->gso_skb from being requeued when we're shutting down.
Even though we've had to add a new conditional in its place, it's better
because it only happens on requeues rather than every single time that
qdisc_run is called.
For this to work we also need to move the clearing of gso_skb up in
dev_deactivate as now qdisc_restart can occur even after we wait for
__LINK_STATE_QDISC_RUNNING to clear (but it won't do anything as long
as the queue and gso_skb is already clear).
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we return the queue length after NETDEV_TX_OK we better
make sure that we have the right queue. Otherwise we can cause a
stall after a really quick dev_deactive/dev_activate.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current return value scheme and associated comment was invented
back in the 20th century when we still had that tbusy flag. Things
have changed quite a bit since then (even Tony Blair is moving on
now, not to mention the new French president).
All we need to indicate now is whether the caller should continue
processing the queue. Therefore it's sufficient if we return 0 if
we want to stop and non-zero otherwise.
This is based on a patch by Krishna Kumar.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When transmit fails with NETDEV_TX_LOCKED the skb is requeued
to dev->qdisc again. The dev->qdisc pointer is protected by
the queue lock which needs to be dropped when attempting to
transmit and acquired again before requeing. The problem is
that qdisc_restart() fetches the dev->qdisc pointer once and
stores it in the `q' variable which is invalidated when
dropping the queue_lock, therefore the variable needs to be
refreshed before requeueing.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Optimize teql_enqueue so that it first checks limits before enqueing.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cleanup of dev_base list use, with the aim to simplify making device
list per-namespace. In almost every occasion, use of dev_base variable
and dev->next pointer could be easily replaced by for_each_netdev
loop. A few most complicated places were converted to using
first_netdev()/next_netdev().
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spring cleaning time...
There seems to be a lot of places in the network code that have
extra bogus semicolons after conditionals. Most commonly is a
bogus semicolon after: switch() { }
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch ingress queueing back to use ingress_lock. qdisc_lock_tree now locks
both the ingress and egress qdiscs on the device. All changes to data that
might be used on both ingress and egress needs to be protected by using
qdisc_lock_tree instead of manually taking dev->queue_lock. Additionally
the qdisc stats_lock needs to be initialized to ingress_lock for ingress
qdiscs.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we're now holding the rtnl during the entire dump operation, we
can remove qdisc_tree_lock, whose only purpose is to protect dump
callbacks from concurrent changes to the qdisc tree.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We're holding dev->queue_lock in qdisc_watchdog_schedule and
qdisc_watchdog_cancel, no need for the barriers.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Uninline tcf_destroy and add a helper function to destroy an entire filter
chain.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
netem checks PSCHED_TLESS(cb->time_to_send, now) to find out whether it is
allowed to send a packet, which is equivalent to cb->time_to_send < now.
Use !PSCHED_TLESS(now, cb->time_to_send) instead to properly handle
cb->time_to_send == now.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of some of my creative spelling.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If possible, avoid having to do a transmit softirq when a qdisc
watchdog decides to re-enable. The watchdog routine runs off
a timer, so it is already in the same effective context as
the softirq.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netem code would call getnstimeofday() and dequeue/requeue after
every packet, even if it was waiting. Avoid this overhead by using
the throttled flag.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In most cases, the next packet will be sent after the
last one. So optimize that case.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The random number generator always generates 32 bit values.
The time values are limited by psched_tdiff_t
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If you setup netem to just delay packets; "tc qdisc ls" will report
the reordering as 100%. Well it's a lie, reorder isn't used unless
gap is set, so just set value to 0 so the output of utility
is correct.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes
on 64bit architectures, allowing us to combine the 4 bytes hole left by the
layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4
64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN...
:-)
Many calculations that previously required that skb->{transport,network,
mac}_header be first converted to a pointer now can be done directly, being
meaningful as offsets or pointers.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/sched/sch_api.c: In function 'psched_show':
net/sched/sch_api.c:1219: warning: format '%08x' expects type 'unsigned int', but argument 6 has type 's64'
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
q->now is increased during dequeue and doesn't contain the current time
afterwards, resulting in a too large timeout value for the qdisc watchdog.
Use "now" instead, which still contains the current time.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The timer resolution exported in /proc/net/psched is used by userspace to
calculate HTB's burst values. Currently it is set to HZ, since we're now
using hrtimers, use KTIME_MONOTONIC_RES, which makes HTB use smaller burst
values.
This patch also affects libnl, which incorrectly uses this value for
the SFQ perturbation parameter, which is always in seconds, and some
routing cache values, which are in USER_HZ, so both cases are broken
anyway.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all packet schedulers have been converted to hrtimers most users
of PSCHED_JIFFIE2US and PSCHED_US2JIFFIE are gone. The remaining users use
it to convert external time units to packet scheduler clock ticks, so use
PSCHED_TICKS_PER_SEC instead.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>