The following patch renames __in_dev_get() to __in_dev_get_rtnl() and
introduces __in_dev_get_rcu() to cover the second case.
1) RCU with refcnt should use in_dev_get().
2) RCU without refcnt should use __in_dev_get_rcu().
3) All others must hold RTNL and use __in_dev_get_rtnl().
There is one exception in net/ipv4/route.c which is in fact a pre-existing
race condition. I've marked it as such so that we remember to fix it.
This patch is based on suggestions and prior work by Suzanne Wood and
Paul McKenney.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnaldo and I agreed it could be applied now, because I have other
pending patches depending on this one (Thank you Arnaldo)
(The other important patch moves skc_refcnt in a separate cache line,
so that the SMP/NUMA performance doesnt suffer from cache line ping pongs)
1) First some performance data :
--------------------------------
tcp_v4_rcv() wastes a *lot* of time in __inet_lookup_established()
The most time critical code is :
sk_for_each(sk, node, &head->chain) {
if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif))
goto hit; /* You sunk my battleship! */
}
The sk_for_each() does use prefetch() hints but only the begining of
"struct sock" is prefetched.
As INET_MATCH first comparison uses inet_sk(__sk)->daddr, wich is far
away from the begining of "struct sock", it has to bring into CPU
cache cold cache line. Each iteration has to use at least 2 cache
lines.
This can be problematic if some chains are very long.
2) The goal
-----------
The idea I had is to change things so that INET_MATCH() may return
FALSE in 99% of cases only using the data already in the CPU cache,
using one cache line per iteration.
3) Description of the patch
---------------------------
Adds a new 'unsigned int skc_hash' field in 'struct sock_common',
filling a 32 bits hole on 64 bits platform.
struct sock_common {
unsigned short skc_family;
volatile unsigned char skc_state;
unsigned char skc_reuse;
int skc_bound_dev_if;
struct hlist_node skc_node;
struct hlist_node skc_bind_node;
atomic_t skc_refcnt;
+ unsigned int skc_hash;
struct proto *skc_prot;
};
Store in this 32 bits field the full hash, not masked by (ehash_size -
1) Using this full hash as the first comparison done in INET_MATCH
permits us immediatly skip the element without touching a second cache
line in case of a miss.
Suppress the sk_hashent/tw_hashent fields since skc_hash (aliased to
sk_hash and tw_hash) already contains the slot number if we mask with
(ehash_size - 1)
File include/net/inet_hashtables.h
64 bits platforms :
#define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
(((__sk)->sk_hash == (__hash))
((*((__u64 *)&(inet_sk(__sk)->daddr)))== (__cookie)) && \
((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports)) && \
(!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
32bits platforms:
#define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
(((__sk)->sk_hash == (__hash)) && \
(inet_sk(__sk)->daddr == (__saddr)) && \
(inet_sk(__sk)->rcv_saddr == (__daddr)) && \
(!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
- Adds a prefetch(head->chain.first) in
__inet_lookup_established()/__tcp_v4_check_established() and
__inet6_lookup_established()/__tcp_v6_check_established() and
__dccp_v4_check_established() to bring into cache the first element of the
list, before the {read|write}_lock(&head->lock);
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
I've found the problem in general. It affects any 64-bit
architecture. The problem occurs when you change the system time.
Suppose that when you boot your system clock is forward by a day.
This gets recorded down in skb_tv_base. You then wind the clock back
by a day. From that point onwards the offset will be negative which
essentially overflows the 32-bit variables they're stored in.
In fact, why don't we just store the real time stamp in those 32-bit
variables? After all, we're not going to overflow for quite a while
yet.
When we do overflow, we'll need a better solution of course.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use '#ifdef' consistently on __KERNEL__. This was reported as bug #5340
(isn't easier to send a fix than report the bug?!)
Signed-off-by: Diego Calleja <diegocg@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Only one of the run or kick path is supposed to put an iocb on the run
list. If both of them do it than one of them can end up referencing a
freed iocb. The kick path could delete the task_list item from the wait
queue before getting the ctx_lock and putting the iocb on the run list.
The run path was testing the task_list item outside the lock so that it
could catch ki_retry methods that return -EIOCBRETRY *without* putting the
iocb on a wait queue and promising to call kick_iocb. This unlocked check
could then race with the kick path to cause both to try and put the iocb on
the run list.
The patch stops the run path from testing task_list by requring that any
ki_retry that returns -EIOCBRETRY *must* guarantee that kick_iocb() will be
called in the future. aio_p{read,write}, the only in-tree -EIOCBRETRY
users, are updated.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Roland points out that the flags end up having non-obvious dependencies
elsewhere, so revert aa55a08687 and add
some comments about why things are as they are.
We'll just have to fix up the broken comparisons. Roland has a patch.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
do_signal_stop:
for_each_thread(t) {
if (t->state < TASK_STOPPED)
++sig->group_stop_count;
}
However, TASK_NONINTERACTIVE > TASK_STOPPED, so this loop will not
count TASK_INTERRUPTIBLE | TASK_NONINTERACTIVE threads.
See also wait_task_stopped(), which checks ->state > TASK_STOPPED.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
[ We really probably should always use the appropriate bitmasks to test
task states, not do it like this. Using something like
#define TASK_RUNNABLE (TASK_RUNNING | TASK_INTERRUPTIBLE | \
TASK_UNINTERRUPTIBLE | TASK_NONINTERACTIVE)
and then doing "if (task->state & TASK_RUNNABLE)" or similar. But the
ordering of the task states is historical, and keeping the ordering
does make sense regardless. ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The attached patch adds extra permission grants to keys for the possessor of a
key in addition to the owner, group and other permissions bits. This makes
SUID binaries easier to support without going as far as labelling keys and key
targets using the LSM facilities.
This patch adds a second "pointer type" to key structures (struct key_ref *)
that can have the bottom bit of the address set to indicate the possession of
a key. This is propagated through searches from the keyring to the discovered
key. It has been made a separate type so that the compiler can spot attempts
to dereference a potentially incorrect pointer.
The "possession" attribute can't be attached to a key structure directly as
it's not an intrinsic property of a key.
Pointers to keys have been replaced with struct key_ref *'s wherever
possession information needs to be passed through.
This does assume that the bottom bit of the pointer will always be zero on
return from kmem_cache_alloc().
The key reference type has been made into a typedef so that at least it can be
located in the sources, even though it's basically a pointer to an undefined
type. I've also renamed the accessor functions to be more useful, and all
reference variables should now end in "_ref".
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following is generated when compiling a
recent (2.6.14-rc2-git5) kernel configured for
ARM, with GCC4.
CC init/main.o
In file included from include/linux/netdevice.h:29,
from include/net/sock.h:48,
from init/main.c:50:
include/linux/if_ether.h:114: error: array type has incomplete element type
It seems that if CONFIG_SYSCTL is not set, then
the compiler will throw an error due to the definition
of the ether_table[] array
Attached is a solution to the problem
Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Written by Adrian Sun (asun@darksunrising.com).
Ported to 2.6.x by Tom 'spot' Callaway <tcallawa@redhat.com>.
Further cleaned up and integrated by David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
Place them on separate cache lines in SMP to lower memory bouncing
between multiple CPU accessing the device.
- One part is mostly used on receive path (including
eth_type_trans()) (poll_list, poll, quota, weight, last_rx,
dev_addr, broadcast)
- One part is mostly used on queue transmit path (qdisc)
(queue_lock, qdisc, qdisc_sleeping, qdisc_list, tx_queue_len)
- One part is mostly used on xmit path (device)
(xmit_lock, xmit_lock_owner, priv, hard_start_xmit, trans_start)
'features' is placed outside of these hot points, in a location that
may be shared by all cpus (because mostly read)
name_hlist is moved close to name[IFNAMSIZ] to speedup __dev_get_by_name()
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When you've enabled conntrack and NAT as a module (standard case in all
distributions), and you've also enabled the new conntrack netlink
interface, loading ip_conntrack_netlink.ko will auto-load iptable_nat.ko.
This causes a huge performance penalty, since for every packet you iterate
the nat code, even if you don't want it.
This patch splits iptable_nat.ko into the NAT core (ip_nat.ko) and the
iptables frontend (iptable_nat.ko). Threfore, ip_conntrack_netlink.ko will
only pull ip_nat.ko, but not the frontend. ip_nat.ko will "only" allocate
some resources, but not affect runtime performance.
This separation is also a nice step in anticipation of new packet filters
(nf-hipac, ipset, pkttables) being able to use the NAT core.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If input message rate from userspace is too high, do not drop them,
but try to deliver using work queue allocation.
Failing there is some kind of congestion control.
It also removes warn_on on this condition, which scares people.
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Added a missing TO_NATIVE call to scripts/mod/file2alias.c:do_pcmcia_entry()
- Add an alignment attribute to struct pcmcia_device_no to solve an alignment
issue seen when cross-compiling on x86 for m68k.
Signed-off-by: Kars de Jong <jongk@linux-m68k.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Support some more TI cardbus bridges. most of them are multifunction
devices which adds 1394 controllers, smartcard readers etc. this could
also help with the various problems with the XX21 controllers seen on the
linux-pcmcia list.
Signed-off-by: Daniel Ritz <daniel.ritz@gmx.ch>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Echo Audio cardbus products are known to be incompatible with EnE bridges.
in order to maybe solve the problem a EnE specific test bit has to be set,
another cleared...but other setups have a good chance to break when just
forcing the bits. so do the whole thingy automatically.
The patch adds a hook in cb_alloc() that allows special tuning for the
different chipsets. for ene just match the Echo products and set/clear the
test bits, defaults to do the same thing as w/o the patch to not break
working setups.
Signed-off-by: Daniel Ritz <daniel.ritz@gmx.ch>
Cc: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
This patch fixes a number of bugs. It cannot be reasonably split up in
multiple fixes, since all bugs interact with each other and affect the same
function:
Bug #1:
The event cache code cannot be called while a lock is held. Therefore, the
call to ip_conntrack_event_cache() within ip_ct_refresh_acct() needs to be
moved outside of the locked section. This fixes a number of 2.6.14-rcX
oops and deadlock reports.
Bug #2:
We used to call ct_add_counters() for unconfirmed connections without
holding a lock. Since the add operations are not atomic, we could race
with another CPU.
Bug #3:
ip_ct_refresh_acct() lost REFRESH events in some cases where refresh
(and the corresponding event) are desired, but no accounting shall be
performed. Both, evenst and accounting implicitly depended on the skb
parameter bein non-null. We now re-introduce a non-accounting
"ip_ct_refresh()" variant to explicitly state the desired behaviour.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the lead up to 2.6.13 I fixed a large number of reboot problems by
making the calling conventions consistent. Despite checking and double
checking my work it appears I missed an obvious one.
This first patch simply refactors the reboot routines so all of the
preparation for various kinds of reboots are in their own functions.
Making it very hard to get the various kinds of reboot out of sync.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
add the helper and use it instead of open coding the klist_node_attached() check
(which is a layering violation IMHO)
idea by Alan Stern.
Signed-off-by: Daniel Ritz <daniel.ritz@gmx.ch>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hugh made me note this line for permission checking in mprotect():
if ((newflags & ~(newflags >> 4)) & 0xf) {
after figuring out what's that about, I decided it's nasty enough. Btw
Hugh itself didn't like the 0xf.
We can safely change it to VM_READ|VM_WRITE|VM_EXEC because we never change
VM_SHARED, so no need to check that.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Update comment for the 2.6.6-rc1 conversion from page->list and
address_space->{clean,dirty,locked}_pages to radix tree tagging and ->lru.
I've mostly avoided to mention page lists (at least I've shortened the
comment).
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch comments the fact that although passing le64_to_cpup et
al. is within the intended use of the byteorder macros, using
get_unaligned is the recommended way to go.
Signed-off-by: Ed L. Cashin <ecashin@coraid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both __ip_conntrack_expect_find and ip_conntrack_expect_find_get take
a reference to the expectation, the difference is that callers of
__ip_conntrack_expect_find must hold ip_conntrack_lock.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some IPv6 matches have very similar loops to find IPv6 extension header
and we can unify them. This patch introduces ipv6_find_hdr() to do it.
I just checked that it can find the target headers in the packet which has
dst,hbh,rt,frag,ah,esp headers.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new "version 3" PPTP conntrack/nat helper is finally ready for
mainline inclusion. Special thanks to lots of last-minute bugfixing
by Patric McHardy.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allocation for the optnames is similar to the DCCP options, with a
range for rx and tx half connection CCIDs.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Moving the TFRC sender and receiver variables to separate structs, so
that we can copy these structs to userspace thru getsockopt,
dccp_diag, etc.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Isolating it, that will be used when we introduce a CCID2 (TCP-Like)
implementation.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix http://bugzilla.kernel.org/show_bug.cgi?id=5241
2.6.13 broke compilation of the xorg tree, which apprarently insists on
including that file.
Cc: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Kill an unused member of the i2c_adapter structure. This additionally
fixes a potential bug, because <linux/i2c.h> doesn't include
<linux/config.h>, so different files including <linux/i2c.h> could see a
different definition of the i2c_adapter structure, depending on them
including <linux/config.h> (or other header files themselves including
<linux/config.h>) before <linux/i2c.h>, or not.
Credits go to Jörn Engel for pointing me to the problem.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As discussed in the dccp@vger mailing list:
Now applications have to use setsockopt(DCCP_SOCKOPT_SERVICE, service[s]),
prior to calling listen() and connect().
An array of unsigned ints can be passed meaning that the listening sock accepts
connection requests for several services.
With this we can ditch struct sockaddr_dccp and use only sockaddr_in (and
sockaddr_in6 in the future).
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>