Currently all qdiscs which allow to create classes uses a fixed sized hash
table with size 16 to hash the classes. This causes a large bottleneck
when using thousands of classes and unbound filters.
Add helpers for dynamically sized class hashes to fix this. The following
patches will convert the qdiscs to use them.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Filters need to be destroyed before beginning to destroy classes
since the destination class needs to still be alive to unbind the
filter.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass double tcf_proto pointers to tcf_destroy_chain() to make it
clear the start of the filter list for more consistency.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit d62733c8e4
([SCHED]: Qdisc changes and sch_rr added for multiqueue)
added a NET_SCH_RR option that was unused since the code
went unconditionally into sch_prio.
Reported-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Note, in the following patch, 'err' is initialized as:
int err = -ENOBUFS;
Signed-off-by: WANG Cong <wcong@critical-links.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a htb_hysteresis parameter to htb_sch.ko and by sysfs magic make
it runtime adjustable via
/sys/module/sch_htb/parameters/htb_hysteresis mode 640.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Acked-by: Martin Devera <devik@cdi.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
The HTB hysteresis mode reduce the CPU load, but at the
cost of scheduling accuracy.
On ADSL links (512 kbit/s upstream), this inaccuracy introduce
significant jitter, enought to disturbe VoIP. For details see my
masters thesis (http://www.adsl-optimizer.dk/thesis/), chapter 7,
section 7.3.1, pp 69-70.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Acked-by: Martin Devera <devik@cdi.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes CVS keywords that weren't updated for a long time
from comments.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make nlmsg_trim(), nlmsg_cancel(), genlmsg_cancel(), and
nla_nest_cancel() void functions.
Return -EMSGSIZE instead of -1 if the provided message buffer is not
big enough.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
cls_api should return ENOENT when the requested classifier doesn't
exist.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
reallocation of the policy data was being ignored. It could fail.
Simplify so that there is no need for reallocating.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert to netlink helpers by using netlink policy validation.
As a side effect fixes a leak.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is lack of removing a class from the event queue while changing
from parent to leaf which can cause corruption of this rb tree. This
patch fixes a bug introduced by my patch: "sch_htb: turn intermediate
classes into leaves" commit: 160d5e10f8.
Many thanks to Jan 'yanek' Bortl for finding a way to reproduce this
rare bug and narrowing the test case, which made possible proper
diagnosing.
This patch is recommended for all kernels starting from 2.6.20.
Reported-and-tested-by: Jan 'yanek' Bortl <yanek@ya.bofh.cz>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WARN_ON_ONCE() gives a stack trace including the full module list.
Having this in the kernel dump for the timeout case in the
generic netdev watchdog will help us see quicker which driver
is involved. It also allows us to collect statistics
and patterns in terms of which drivers have this event occuring.
Suggested by Andrew Morton
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
datalen is unsigned so it can never be less than zero,
but that's ok because the attribute passed to nla_len()
has been validated and therefore a negative return
value is impossible.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
TC_H_MAJ(parentid) for root classes is the same as for ingress, and if
ingress qdisc is created qdisc_lookup() returns its pointer (without
ingress NULL is returned). After this all qdisc_lookups give the same,
and we get endless loop. (I don't know how this could hide for so long
- it should trigger with every leaf class deleted if it's qdisc isn't
empty.)
After this fix qdisc_lookup() is omitted both for ingress and root
parents, but looking for root is only wasting a little time here...
Many thanks to Enrico Demarin for finding a test for catching this
bug, which probably bothered quite a lot of admins.
Reported-by: Enrico Demarin <enrico@superclick.com>,
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Deleting of nonroot hnodes mostly doesn't work in u32_delete():
refcnt == 1 is expected, but such hnodes' refcnts are initialized
with 0 and charged only with "link" nodes. Now they'll start with
1 like usual. Thanks to Patrick McHardy for an improving suggestion.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
The qdisc_run loop is currently unbounded and runs entirely in a
softirq. This is bad as it may create an unbounded softirq run.
This patch fixes this by calling need_resched and breaking out if
necessary.
It also adds a break out if the jiffies value changes since that would
indicate we've been transmitting for too long which starves other
softirqs.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
HTB is event driven algorithm and part of its work is to apply
scheduled events at proper times. It tried to defend itself from
livelock by processing only limited number of events per dequeue.
Because of faster computers some users already hit this hardcoded
limit.
This patch limits processing up to 2 jiffies (why not 1 jiffie ?
because it might stop prematurely when only fraction of jiffie
remains).
Signed-off-by: Martin Devera <devik@cdi.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
(Anonymous) unions can help us to avoid ugly casts.
A common cast it the (struct rtable *)skb->dst one.
Defining an union like :
union {
struct dst_entry *dst;
struct rtable *rtable;
};
permits to use skb->rtable in place.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 954415e33e ("[PKT_SCHED] ematch:
tcf_em_destroy robustness") removed a cast on em->data when
passing it to kfree(), but em->data is an integer type that can
hold pointers as well as other values so the cast is necessary.
Signed-off-by: David S. Miller <davem@davemloft.net>
htb_requeue() enqueues skbs for which htb_classify() returns NULL.
This is wrong because such skbs could be handled by NET_CLS_ACT code,
and the decision could be different than earlier in htb_enqueue().
So htb_requeue() is changed to work and look more like htb_enqueue().
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the code in tcf_em_tree_destroy more robust and cleaner:
* Don't need to cast pointer to kfree() or avoid passing NULL.
* After freeing the tree, clear the pointer to avoid possible problems
from repeated free.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A couple of functions in meta match don't need to be inline.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If userspace passes a unknown match index into em_meta, then
em_meta_change will return an error and the data for the match will
not be set. This then causes an null pointer dereference when the
cleanup is done in the error path via tcf_em_tree_destroy. Since the
tree structure comes kzalloc, it is initialized to NULL.
Discovered when testing a new version of tc command against an
accidental older kernel.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we're using fls(), we need to check whether the value is
non-zero first.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/sched/em_meta.c: In function 'meta_int_vlan_tag':
net/sched/em_meta.c:179: warning: 'tag' may be used uninitialized in this function
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide a way to use tc filters on vlan tag even if tag is buried in
skb due to hardware acceleration.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 2.6 latest git build was broken when using the following
configuration options:
CONFIG_NET_EMATCH=n
CONFIG_NET_CLS_FLOW=y
with the following error:
net/sched/cls_flow.c: In function 'flow_dump':
net/sched/cls_flow.c:598: error: 'struct tcf_ematch_tree' has no
member named 'hdr'
make[2]: *** [net/sched/cls_flow.o] Error 1
make[1]: *** [net/sched] Error 2
make: *** [net] Error 2
see the recent post by Li Zefan:
http://www.spinics.net/lists/netdev/msg54434.html
The reason for this crash is that struct tcf_ematch_tree
(net/pkt_cls.h) is empty when CONFIG_NET_EMATCH is not defined.
When CONFIG_NET_EMATCH is defined, the tcf_ematch_tree structure
indeed holds a struct tcf_ematch_tree_hdr (hdr) as flow_dump()
expects.
This patch adds #ifdef CONFIG_NET_EMATCH in flow_dump to avoid this.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new "flow" classifier, which is meant to extend the SFQ hashing
capabilities without hard-coding new hash functions and also allows
deterministic mappings of keys to classes, replacing some out of tree
iptables patches like IPCLASSIFY (maps IPs to classes), IPMARK (maps
IPs to marks, with fw filters to classes), ...
Some examples:
- Classic SFQ hash:
tc filter add ... flow hash \
keys src,dst,proto,proto-src,proto-dst divisor 1024
- Classic SFQ hash, but using information from conntrack to work properly in
combination with NAT:
tc filter add ... flow hash \
keys nfct-src,nfct-dst,proto,nfct-proto-src,nfct-proto-dst divisor 1024
- Map destination IPs of 192.168.0.0/24 to classids 1-257:
tc filter add ... flow map \
key dst addend -192.168.0.0 divisor 256
- alternatively:
tc filter add ... flow map \
key dst and 0xff
- similar, but reverse ordered:
tc filter add ... flow map \
key dst and 0xff xor 0xff
Perturbation is currently not supported because we can't reliable kill the
timer on destruction.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for dumping statistics and make internal queues visible as
classes.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for external classifiers to allow using different flow
hash functions similar to ESFQ. When no classifier is attached the
built-in hash is used as before.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the old policer code is gone, TC actions are needed for policing.
The ingress qdisc can get packets directly from netif_receive_skb()
in case TC actions are enabled or through netfilter otherwise, but
since without TC actions there is no policer the only thing it actually
does is count packets.
Remove the netfilter support and always require TC actions.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use nla_nest_start/nla_nest_end for dumping nested attributes.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_parse() returns more detailed errno codes, propagate them back on
error.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert open-coded nlmsg_parse to use the real function.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix two invalid attribute accesses, indices start at 1 with the new
netlink API.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace open coded equivalent of nla_parse_nested_compat().
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix format string warning introduces by the netlink API conversion:
net/sched/sch_atm.c:250: warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 'int'.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert packet schedulers to use the netlink API. Unfortunately a gradual
conversion is not possible without breaking compilation in the middle or
adding lots of casts, so this patch converts them all in one step. The
patch has been mostly generated automatically with some minor edits to
at least allow seperate conversion of classifiers and actions.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Additionally remove unnecessary NULL initilizations of the next pointer.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Classifier code cleanup. Get rid of printk wrapper, and fix whitespace
and other style stuff reported by checkpatch
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ATM scheduler clean house:
* get rid of printk and qdisc_priv() wrapper
* split some assignment in if() statements
* whitespace and line breaks.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of all style things checkpatch warns about, indentation and
whitespace.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make dsmark work properly with non-linear and cloned skb's
Before modifying the header, it needs to check that skb header is
writeable.
Note: this makes the assumption, that if it queues a good skb
then a good skb will come out of the embedded qdisc.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove extraneous macro wrappers for printk and qdisc_priv.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code is already gone for about half a year, the config option
has been kept around to select the replacement options for easier
upgrades. This seems long enough, people upgrading from older
kernels will have to reconfigure a lot anyway.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The printk about ingress qdisc registration error can't be triggered
under normal circumstances. Since register_qdisc only fails for two
identical registrations, the only way to trigger it is by loading the
sch_ingress modules multiple times under different names, in which
case we already return -EEXIST to userspace.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the repeating "ifndef CONFIG_NET_CLS_ACT/ifdef CONFIG_NETFILTER"
ifdefs into a single condition.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of complaining at scheduler initialization time, check the
dependencies in Kconfig.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
- ->reset is optional
- sch_api provides identical defaults for ->dequeue/->requeue
- ->drop can't happen since ingress never has a parent qdisc
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove excessive debugging statements and some "future use" stuff.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add whitespace around operators, and add a few blank lines to improve
readability.
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SFQ doesn't need true random numbers, it is only using them to salt a
hash. Therefore it is better to use net_random() and avoid any
possible problems with depleting the entropy pool.
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The perturbation timer used for re-keying can be deferred, it doesn't
need to be deterministic.
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add __acquires() and __releases() annotations to suppress some sparse
warnings.
example of warnings :
net/ipv4/udp.c:1555:14: warning: context imbalance in 'udp_seq_start' - wrong
count at exit
net/ipv4/udp.c:1571:13: warning: context imbalance in 'udp_seq_stop' -
unexpected unlock
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kill the defines again, convert to the new checksum helper names and
remove the dependency of NET_ACT_NAT on NETFILTER.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
After this patch none of the netlink callback support anything
except the initial network namespace but the rtnetlink infrastructure
now handles multiple network namespaces.
Changes from v2:
- IPv6 addrlabel processing
Changes from v1:
- no need for special rtnl_unlock handling
- fixed IPv6 ndisc
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before I can enable rtnetlink to work in all network namespaces I need
to be certain that something won't break. So this patch deliberately
disables all of the rtnletlink methods in everything except the
initial network namespace. After the methods have been audited this
extra check can be disabled.
Changes from v1:
- added IPv6 addrlabel protection
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Qdisc_class_ops are const, and Qdisc_ops are mostly read.
Using "const" and "__read_mostly" qualifiers helps to reduce false
sharing.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv4 and IPv6 hook values are identical, yet some code tries to figure
out the "correct" value by looking at the address family. Introduce NF_INET_*
values for both IPv4 and IPv6. The old values are kept in a #ifndef __KERNEL__
section for userspace compatibility.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many-many code in the kernel initialized the timer->function
and timer->data together with calling init_timer(timer). There
is already a helper for this. Use it for networking code.
The patch is HUGE, but makes the code 130 lines shorter
(98 insertions(+), 228 deletions(-)).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only qdiscs that check subqueue state before dequeue'ing are PRIO
and RR. The other qdiscs, including the default pfifo_fast qdisc,
will allow traffic bound for subqueue 0 through to hard_start_xmit.
The check for netif_queue_stopped() is done above in pkt_sched.h, so
it is unnecessary for qdisc_restart(). However, if the underlying
driver is multiqueue capable, and only sets queue states on subqueues,
this will allow packets to enter the driver when it's currently unable
to process packets, resulting in expensive requeues and driver
entries. This patch re-adds the check for the subqueue status before
calling hard_start_xmit, so we can try and avoid the driver entry when
the queues are stopped.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>