Commit Graph

258 Commits

Author SHA1 Message Date
Ranjit Manomohan
c9726d6890 [NET_SCHED]: Make HTB scheduler work with TSO.
Currently the HTB scheduler does not correctly account for TSO packets
which causes large inaccuracies in the bandwidth control when using TSO.
This patch allows the HTB scheduler to work with TSO enabled devices.

Signed-off-by: Ranjit Manomohan <ranjitm@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:43:16 -07:00
Patrick McHardy
0ba4805383 [NET_SCHED]: Remove unnecessary includes
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:16:41 -07:00
Patrick McHardy
ee39e10c27 [NET_SCHED]: sch_htb: use generic estimator
Use the generic estimator instead of reimplementing (parts of) it.
For compatibility always create a default estimator for new classes.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:16:39 -07:00
Patrick McHardy
4bdf39911e [NET_SCHED]: Remove unnecessary stats_lock pointers
Remove stats_lock pointers from qdisc-internal structures, in all cases
it points to dev->queue_lock. The only case where it is necessary is for
top-level qdiscs, where it might also point to dev->ingress_lock in case
of the ingress qdisc. Also remove it from actions completely, it always
points to the actions internal lock.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:16:38 -07:00
Patrick McHardy
876d48aabf [NET_SCHED]: Remove CONFIG_NET_ESTIMATOR option
The generic estimator is always built in anways and all the config options
does is prevent including a minimal amount of code for setting it up.
Additionally the option is already automatically selected for most cases.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:16:37 -07:00
Peter P Waskiewicz Jr
d62733c8e4 [SCHED]: Qdisc changes and sch_rr added for multiqueue
Add the new sch_rr qdisc for multiqueue network device support.  Allow
sch_prio and sch_rr to be compiled with or without multiqueue hardware
support.

sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS.  This
was done since sch_prio and sch_rr only differ in their dequeue
routine.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:16:22 -07:00
Peter P Waskiewicz Jr
f25f4e4480 [CORE] Stack changes to add multiqueue hardware support API
Add the multiqueue hardware device support API to the core network
stack.  Allow drivers to allocate multiple queues and manage them at
the netdev level if they choose to do so.

Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:16:21 -07:00
Krishna Kumar
e50c41b53d [NET]: qdisc_restart - couple of optimizations.
Changes :

- netif_queue_stopped need not be called inside qdisc_restart as
  it has been called already in qdisc_run() before the first skb
  is sent, and in __qdisc_run() after each intermediate skb is
  sent (note : we are the only sender, so the queue cannot get
  stopped while the tx lock was got in the ~LLTX case).

- BUG_ON((int) q->q.qlen < 0) was a relic from old times when -1
  meant more packets are available, and __qdisc_run used to loop
  when qdisc_restart() returned -1. During those days, it was
  necessary to make sure that qlen is never less than zero, since
  __qdisc_run would get into an infinite loop if no packets are on
  the queue and this bug in qdisc was there (and worse - no more
  skbs could ever get queue'd as we hold the queue lock too). With
  Herbert's recent change to return values, this check is not
  required.  Hopefully Herbert can validate this change. If at all
  this is required, it should be added to skb_dequeue (in failure
  case), and not to qdisc_qlen.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:15:36 -07:00
Krishna Kumar
6c1361a6f2 [NET]: qdisc_restart - readability changes plus one bug fix.
New changes :

- Incorporated Peter Waskiewicz's comments.
- Re-added back one warning message (on driver returning wrong value).

Previous changes :

- Converted to use switch/case code which looks neater.

- "if (ret == NETDEV_TX_LOCKED && lockless)" is buggy, and the lockless
  check should be removed, since driver will return NETDEV_TX_LOCKED only
  if lockless is true and driver has to do the locking. In the original
  code as well as the latest code, this code can result in a bug where
  if LLTX is not set for a driver (lockless == 0) but the driver is written
  wrongly to do a trylock (despite LLTX being set), the driver returns
  LOCKED. But since lockless is zero, the packet is requeue'd instead of
  calling collision code which will issue warning and free up the skb.
  Instead this skb will be retried with this driver next time, and the same
  result will ensue. Removing this check will catch these driver bugs instead
  of hiding the problem. I am keeping this change to readability section
  since :
  	a. it is confusing to check two things as it is; and
  	b. it is difficult to keep this check in the changed 'switch' code.

- Changed some names, like try_get_tx_pkt to dev_dequeue_skb (as that is
  the work being done and easier to understand) and do_dev_requeue to
  dev_requeue_skb, merged handle_dev_cpu_collision and tx_islocked to
  dev_handle_collision (handle_dev_cpu_collision is a small routine with only
  one caller, so there is no need to have two separate routines which also
  results in getting rid of two macros, etc.

- Removed an XXX comment as it should never fail (I suspect this was related
  to batch skb WIP, Jamal ?). Converted some functions to original coding
  style of having the return values and the function name on same line, eg
  prio2list.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:15:35 -07:00
Jamal Hadi Salim
c716a81ab9 [NET_SCHED]: Cleanup readability of qdisc restart
Over the years this code has gotten hairier. Resulting in many long
discussions over long summer days and patches that get it wrong.
This patch helps tame that code so normal people will understand it.

Thanks to Thomas Graf, Peter J. waskiewicz Jr, and Patrick McHardy
for their valuable reviews.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:06:16 -07:00
Patrick McHardy
b00b4bf94e [NET_SCHED]: Fix filter double free
cbq and atm destroy their filters twice when destroying inner classes
during qdisc destruction.

Reported-and-tested-by: Strobl Anton <a.strobl@aws-it.at>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-06-07 13:41:05 -07:00
Bill Nottingham
75202e7689 [NET]: Fix comparisons of unsigned < 0.
Recent gcc versions emit warnings when unsigned variables are
compared < 0 or >= 0.

Signed-off-by: Bill Nottingham <notting@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-06-03 18:08:47 -07:00
Venkatesh Pallipadi
60468d5b5b [NET]: Make net watchdog timers 1 sec jiffy aligned.
round_jiffies for net dev watchdog timer.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-06-03 18:08:46 -07:00
Patrick McHardy
2e4b3b0e87 [NET_SCHED]: sch_htb: fix event cache time calculation
The event cache time must be an absolute value, when no event exists
it is incorrectly set to 1s instead of 1s in the future.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-24 16:36:56 -07:00
Herbert Xu
36247f5421 [NET_SCHED]: Fix qdisc_restart return value when dequeue is empty
My previous patch that changed the return value of qdisc_restart
incorrectly made the case where dequeue returns empty continue
processing packets.

This patch is based on diagnosis and fix by Patrick McHardy.

Reported-and-debugged-by: Anant Nitya <kernel@prachanda.info>

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-24 16:36:43 -07:00
Jamal Hadi Salim
3e5c2d3bdb [NET_SCHED]: prio qdisc boundary condition
This fixes an out-of-boundary condition when the classified
band equals q->bands. Caught by Alexey

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-14 02:57:19 -07:00
Herbert Xu
41a23b0788 [NET_SCHED]: Avoid requeue warning on dev_deactivate
When we relinquish queue_lock in qdisc_restart and then retake it for
requeueing, we might race against dev_deactivate and end up requeueing
onto noop_qdisc.  This causes a warning to be printed.

This patch fixes this by checking this before we requeue.  As an added
bonus, we can remove the same check in __qdisc_run which was added to
prevent dev->gso_skb from being requeued when we're shutting down.

Even though we've had to add a new conditional in its place, it's better
because it only happens on requeues rather than every single time that
qdisc_run is called.

For this to work we also need to move the clearing of gso_skb up in
dev_deactivate as now qdisc_restart can occur even after we wait for
__LINK_STATE_QDISC_RUNNING to clear (but it won't do anything as long
as the queue and gso_skb is already clear).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-10 23:47:42 -07:00
Herbert Xu
cce1fa36a8 [NET_SCHED]: Reread dev->qdisc for NETDEV_TX_OK
Now that we return the queue length after NETDEV_TX_OK we better
make sure that we have the right queue.  Otherwise we can cause a
stall after a really quick dev_deactive/dev_activate.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-10 23:47:41 -07:00
Herbert Xu
d90df3ad07 [NET_SCHED]: Rationalise return value of qdisc_restart
The current return value scheme and associated comment was invented
back in the 20th century when we still had that tbusy flag.  Things
have changed quite a bit since then (even Tony Blair is moving on
now, not to mention the new French president).

All we need to indicate now is whether the caller should continue
processing the queue.  Therefore it's sufficient if we return 0 if
we want to stop and non-zero otherwise.

This is based on a patch by Krishna Kumar.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-10 23:47:40 -07:00
Thomas Graf
5830725f8a [NET]: Fix dev->qdisc race for NETDEV_TX_LOCKED case
When transmit fails with NETDEV_TX_LOCKED the skb is requeued
to dev->qdisc again. The dev->qdisc pointer is protected by
the queue lock which needs to be dropped when attempting to
transmit and acquired again before requeing. The problem is
that qdisc_restart() fetches the dev->qdisc pointer once and
stores it in the `q' variable which is invalidated when
dropping the queue_lock, therefore the variable needs to be
refreshed before requeueing.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-10 23:47:39 -07:00
Krishna Kumar
4cd8c9e87b [NET_SCHED]: teql_enqueue can check limits before skb enqueue
Optimize teql_enqueue so that it first checks limits before enqueing.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-10 23:45:10 -07:00
Pavel Emelianov
7562f876cd [NET]: Rework dev_base via list_head (v3)
Cleanup of dev_base list use, with the aim to simplify making device
list per-namespace. In almost every occasion, use of dev_base variable
and dev->next pointer could be easily replaced by for_each_netdev
loop. A few most complicated places were converted to using
first_netdev()/next_netdev().

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 15:13:45 -07:00
Stephen Hemminger
3ff50b7997 [NET]: cleanup extra semicolons
Spring cleaning time...

There seems to be a lot of places in the network code that have
extra bogus semicolons after conditionals.  Most commonly is a
bogus semicolon after: switch() { }

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:24 -07:00
Patrick McHardy
fd44de7cc1 [NET_SCHED]: ingress: switch back to using ingress_lock
Switch ingress queueing back to use ingress_lock. qdisc_lock_tree now locks
both the ingress and egress qdiscs on the device. All changes to data that
might be used on both ingress and egress needs to be protected by using
qdisc_lock_tree instead of manually taking dev->queue_lock. Additionally
the qdisc stats_lock needs to be initialized to ingress_lock for ingress
qdiscs.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:08 -07:00
Patrick McHardy
0463d4ae25 [NET_SCHED]: Eliminate qdisc_tree_lock
Since we're now holding the rtnl during the entire dump operation, we
can remove qdisc_tree_lock, whose only purpose is to protect dump
callbacks from concurrent changes to the qdisc tree.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:07 -07:00
Patrick McHardy
c95e939508 [NET_SCHED]: qdisc: remove unnecessary memory barriers
We're holding dev->queue_lock in qdisc_watchdog_schedule and
qdisc_watchdog_cancel, no need for the barriers.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:58 -07:00
Patrick McHardy
a48b5a6144 [NET_SCHED]: Unline tcf_destroy
Uninline tcf_destroy and add a helper function to destroy an entire filter
chain.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:56 -07:00
Patrick McHardy
3bebcda280 [NET_SCHED]: turn PSCHED_GET_TIME into inline function
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:55 -07:00
Patrick McHardy
03cc45c0a5 [NET_SCHED]: turn PSCHED_TDIFF_SAFE into inline function
Also rename to psched_tdiff_bounded.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:54 -07:00
Patrick McHardy
8edc0c31d6 [NET_SCHED]: kill PSCHED_TDIFF
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:53 -07:00
Patrick McHardy
a084980dcb [NET_SCHED]: kill PSCHED_SET_PASTPERFECT/PSCHED_IS_PASTPERFECT
Use direct assignment and comparison instead.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:51 -07:00
Patrick McHardy
104e087898 [NET_SCHED]: kill PSCHED_TLESS
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:50 -07:00
Patrick McHardy
7c59e25f31 [NET_SCHED]: kill PSCHED_TADD/PSCHED_TADD2
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:49 -07:00
Patrick McHardy
26e252df1e [NET_SCHED]: kill PSCHED_AUDIT_TDIFF
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:48 -07:00
Patrick McHardy
76d643cd3b [NET_SCHED]: sch_netem: fix off-by-one in send time comparison
netem checks PSCHED_TLESS(cb->time_to_send, now) to find out whether it is
allowed to send a packet, which is equivalent to cb->time_to_send < now.
Use !PSCHED_TLESS(now, cb->time_to_send) instead to properly handle
cb->time_to_send == now.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:47 -07:00
Stephen Hemminger
bb2f8cc0ec [NETEM]: spelling errors
Get rid of some of my creative spelling.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:33 -07:00
Stephen Hemminger
1936502d00 [NET_SCHED] qdisc: avoid transmit softirq on watchdog wakeup
If possible, avoid having to do a transmit softirq when a qdisc
watchdog decides to re-enable.  The watchdog routine runs off
a timer, so it is already in the same effective context as
the softirq.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:23 -07:00
Stephen Hemminger
11274e5a43 [NETEM]: avoid excessive requeues
The netem code would call getnstimeofday() and dequeue/requeue after
every packet, even if it was waiting. Avoid this overhead by using
the throttled flag.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:22 -07:00
Stephen Hemminger
075aa573b7 [NETEM]: Optimize tfifo
In most cases, the next packet will be sent after the
last one. So optimize that case.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:21 -07:00
Stephen Hemminger
b407621c35 [NETEM]: use better types for time values
The random number generator always generates 32 bit values.
The time values are limited by psched_tdiff_t

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:20 -07:00
Stephen Hemminger
a362e0a789 [NETEM]: report reorder percent correctly.
If you setup netem to just delay packets; "tc qdisc ls" will report
the reordering as 100%. Well it's a lie, reorder isn't used unless
gap is set, so just set value to 0 so the output of utility
is correct.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:20 -07:00
Thomas Graf
708914cc5e [PKT_SCHED] act: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:11 -07:00
Thomas Graf
82623c0d73 [PKT_SCHED] cls: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:10 -07:00
Thomas Graf
be577ddc2b [PKT_SCHED] qdisc: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:09 -07:00
Arnaldo Carvalho de Melo
dc5fc579b9 [NETLINK]: Use nlmsg_trim() where appropriate
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:37 -07:00
Arnaldo Carvalho de Melo
27a884dc3c [SK_BUFF]: Convert skb->tail to sk_buff_data_t
So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes
on 64bit architectures, allowing us to combine the 4 bytes hole left by the
layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4
64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN...
:-)

Many calculations that previously required that skb->{transport,network,
mac}_header be first converted to a pointer now can be done directly, being
meaningful as offsets or pointers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:28 -07:00
Patrick McHardy
514bca322c [NET_SCHED]: Fix warning
net/sched/sch_api.c: In function 'psched_show':
net/sched/sch_api.c:1219: warning: format '%08x' expects type 'unsigned int', but argument 6 has type 's64'

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:17 -07:00
Patrick McHardy
bb239acf56 [NET_SCHED]: sch_cbq: fix watchdog scheduled too late
q->now is increased during dequeue and doesn't contain the current time
afterwards, resulting in a too large timeout value for the qdisc watchdog.
Use "now" instead, which still contains the current time.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:16 -07:00
Patrick McHardy
4361cb17f0 [NET_SCHED]: Export real timer resolution in /proc/net/psched
The timer resolution exported in /proc/net/psched is used by userspace to
calculate HTB's burst values. Currently it is set to HZ, since we're now
using hrtimers, use KTIME_MONOTONIC_RES, which makes HTB use smaller burst
values.

This patch also affects libnl, which incorrectly uses this value for
the SFQ perturbation parameter, which is always in seconds, and some
routing cache values, which are in USER_HZ, so both cases are broken
anyway.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:15 -07:00
Patrick McHardy
00c04af9df [NET_SCHED]: kill jiffie conversion macros
Now that all packet schedulers have been converted to hrtimers most users
of PSCHED_JIFFIE2US and PSCHED_US2JIFFIE are gone. The remaining users use
it to convert external time units to packet scheduler clock ticks, so use
PSCHED_TICKS_PER_SEC instead.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:14 -07:00