We can now kill them synchronously with all of the
previous dev_deactivate() cures.
This makes netdev destruction and shutdown saner as
the qdiscs hold references to the device.
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Jarek Poplawski <jarkao2@gmail.com>
When we are destroying non-root qdiscs, we need to lock
the root of the qdisc tree not the the qdisc itself.
Signed-off-by: David S. Miller <davem@davemloft.net>
The condition under which the previous qdisc has no more references
after we've attached &noop_qdisc is that both RUNNING and SCHED
are both seen clear while holding the root lock.
So just make specifically that check in the polling loop, instead
of this overly complex "check without then check with lock held"
sequence.
Signed-off-by: David S. Miller <davem@davemloft.net>
Change handling of the __QDISC_STATE_SCHED flag in net_tx_action() to
enable proper control in dev_deactivate(). Now, if this flag is seen
as unset under root_lock means a qdisc can't be netif_scheduled.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new state lets dev_deactivate() mark a qdisc as having been
deactivated.
dev_queue_xmit() and ing_filter() check for this bit and do not
try to process the qdisc if the bit is set.
dev_deactivate() polls the qdisc after setting the bit, waiting
for both __QDISC_STATE_RUNNING and __QDISC_STATE_SCHED to clear.
This isn't perfect yet, but subsequent changesets will make it so.
This part is just one piece of the puzzle.
Signed-off-by: David S. Miller <davem@davemloft.net>
There's an skb_copy_datagram_iovec() to copy out of a paged skb, but
nothing the other way around (because we don't do that).
We want to allocate big skbs in tun.c, so let's add the function.
It's a carbon copy of skb_copy_datagram_iovec() with enough changes to
be annoying.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_gso_segment didn't preserve some attributes in the original skb
such as the netfilter fields. This was harmless until they were used
which is the case for packets going through lo.
This patch makes it call __copy_skb_header which also picks up some
other missing attributes.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add more ethtool generic operations to dump the bridge offload
settings.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let me first state that disabling the route cache hash rebuild
should not be done without extensive analysis on the risk profile
and careful deliberation.
However, there are times when this can be done safely or for
testing. For example, when you have mechanisms for ensuring
that offending parties do not exist in your network.
This patch lets the user disable the rebuild if the interval is
set to zero. This also incidentally fixes a divide-by-zero error
with name-spaces.
In addition, this patch makes the effect of an interval change
immediate rather than it taking effect at the next rebuild as
is currently the case.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a bug with spin_lock_bh() inserted instead of spin_unlock_bh() by
some recent patch.
Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_dev_get_saddr() blindly de-references dst_dev to get the network
namespace, but some callers might pass NULL. Change callers to pass a
namespace pointer instead.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the multicast socket to be per namespace.
When a network namespace is created, other than the init_net and a
multicast packet is received, the kernel goes to a hang or a kernel panic.
How to reproduce ?
* create a child network namespace
* create a pair virtual device veth
* ip link add type veth
* move one side to the pair network device to the child namespace
* ip link set netns <childpid> dev veth1
* ping -I veth0 224.0.0.1
The bug appears because the function ip_mc_init_dev does not initialize
the different multicast fields as it exits because it is not the init_net.
BUG: soft lockup - CPU#0 stuck for 61s! [avahi-daemon:2695]
Modules linked in:
irq event stamp: 50350
hardirqs last enabled at (50349): [<c03ee949>] _spin_unlock_irqrestore+0x34/0x39
hardirqs last disabled at (50350): [<c03ec639>] schedule+0x9f/0x5ff
softirqs last enabled at (45712): [<c0374d4b>] ip_setsockopt+0x8e7/0x909
softirqs last disabled at (45710): [<c03ee682>] _spin_lock_bh+0x8/0x27
Pid: 2695, comm: avahi-daemon Not tainted (2.6.27-rc2-00029-g0872073 #3)
EIP: 0060:[<c03ee47c>] EFLAGS: 00000297 CPU: 0
EIP is at __read_lock_failed+0x8/0x10
EAX: c4f38810 EBX: c4f38810 ECX: 00000000 EDX: c04cc22e
ESI: fb0000e0 EDI: 00000011 EBP: 0f02000a ESP: c4e3faa0
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: 44618a40 CR3: 04e37000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c02311f8>] ? _raw_read_lock+0x23/0x25
[<c0390666>] ? ip_check_mc+0x1c/0x83
[<c036d478>] ? ip_route_input+0x229/0xe92
[<c022e2e4>] ? trace_hardirqs_on_thunk+0xc/0x10
[<c0104c9c>] ? do_IRQ+0x69/0x7d
[<c0102e64>] ? restore_nocheck_notrace+0x0/0xe
[<c036fdba>] ? ip_rcv+0x227/0x505
[<c0358764>] ? netif_receive_skb+0xfe/0x2b3
[<c03588d2>] ? netif_receive_skb+0x26c/0x2b3
[<c035af31>] ? process_backlog+0x73/0xbd
[<c035a8cd>] ? net_rx_action+0xc1/0x1ae
[<c01218a8>] ? __do_softirq+0x7b/0xef
[<c0121953>] ? do_softirq+0x37/0x4d
[<c035b50d>] ? dev_queue_xmit+0x3d4/0x40b
[<c0122037>] ? local_bh_enable+0x96/0xab
[<c035b50d>] ? dev_queue_xmit+0x3d4/0x40b
[<c012181e>] ? _local_bh_enable+0x79/0x88
[<c035fcb8>] ? neigh_resolve_output+0x20f/0x239
[<c0373118>] ? ip_finish_output+0x1df/0x209
[<c0373364>] ? ip_dev_loopback_xmit+0x62/0x66
[<c0371db5>] ? ip_local_out+0x15/0x17
[<c0372013>] ? ip_push_pending_frames+0x25c/0x2bb
[<c03891b8>] ? udp_push_pending_frames+0x2bb/0x30e
[<c038a189>] ? udp_sendmsg+0x413/0x51d
[<c038a1a9>] ? udp_sendmsg+0x433/0x51d
[<c038f927>] ? inet_sendmsg+0x35/0x3f
[<c034f092>] ? sock_sendmsg+0xb8/0xd1
[<c012d554>] ? autoremove_wake_function+0x0/0x2b
[<c022e6de>] ? copy_from_user+0x32/0x5e
[<c022e6de>] ? copy_from_user+0x32/0x5e
[<c034f238>] ? sys_sendmsg+0x18d/0x1f0
[<c0175e90>] ? pipe_write+0x3cb/0x3d7
[<c0170347>] ? do_sync_write+0xbe/0x105
[<c012d554>] ? autoremove_wake_function+0x0/0x2b
[<c03503b2>] ? sys_socketcall+0x176/0x1b0
[<c01085ea>] ? syscall_trace_enter+0x6c/0x7b
[<c0102e1a>] ? syscall_call+0x7/0xb
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gen_kill_estimator() required rtnl_lock() protection, but since it is
moved to an RCU callback __qdisc_destroy() let's use est_lock instead.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon discussions with Jarek P. and Herbert Xu.
First, we're testing the wrong qdisc. We just reset the device
queue qdiscs to &noop_qdisc and checking it's state is completely
pointless here.
We want to wait until the previous qdisc that was sitting at
the ->qdisc pointer is not busy any more. And that would be
->qdisc_sleeping.
Because of how we propagate the samples qdisc pointer down into
qdisc_run and friends via per-cpu ->output_queue and netif_schedule,
we have to wait also for the __QDISC_STATE_SCHED bit to clear as
well.
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes introduced a bug in htb_delete(): cl->parent->children
counter update misses checking cl->parent for NULL, which is used for
root classes, so deleting them causes an oops.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the new multi-queue transmit code, it is possible to accidentally
make pktgen pick a non-existing tx queue simply by using a stale
script to drive pktgen. Access to this non-existing tx queue will
then trigger a bad memory access and kill the machine.
For example, setting "queue_map_max 2" will cause my machine to die
when accessing a garbage spinlock in the non-existing tx queue:
BUG: spinlock bad magic on CPU#0, kpktgend_0/564
lock: ffff88001ddf6718, .magic: ffffffff, .owner: /-1, .owner_cpu: 0
Pid: 564, comm: kpktgend_0 Not tainted 2.6.27-rc3 #35
Call Trace:
[<ffffffff803a1228>] spin_bug+0xa4/0xac
[<ffffffff803a1253>] _raw_spin_lock+0x23/0x123
[<ffffffff8055b06f>] _spin_lock_bh+0x17/0x1b
[<ffffffff804cb57d>] pktgen_thread_worker+0xa97/0x1002
[<ffffffff8022874d>] ? finish_task_switch+0x38/0x97
[<ffffffff80242077>] ? autoremove_wake_function+0x0/0x36
[<ffffffff80242077>] ? autoremove_wake_function+0x0/0x36
[<ffffffff804caae6>] ? pktgen_thread_worker+0x0/0x1002
[<ffffffff80241a40>] kthread+0x44/0x6d
[<ffffffff8020c399>] child_rip+0xa/0x11
[<ffffffff802419fc>] ? kthread+0x0/0x6d
[<ffffffff8020c38f>] ? child_rip+0x0/0x11
The attached patch adds some sanity checking to prevent
these sorts of configuration errors.
Signed-off-by: Andrew Gallatin <gallatin@myri.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thanks to Eugene Teo for reporting this problem.
Signed-off-by: Eugene Teo <eugenete@kernel.sg>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Small fix removing an unnecessary intermediate variable.
Signed-off-by: Jean-Christophe DUBOIS <jcd@tribudubois.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flushing must consistently return ENOMEM on failure of any allocation
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flushing of actions has been broken since we changed
the semantics of netlink parsed tb[X] to mean X is an attribute type.
This makes the flushing work.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of error, the function rxrpc_get_transport returns an ERR
pointer, but never returns a NULL pointer. So after a call to this
function, a NULL test should be replaced by an IS_ERR test.
A simplified version of the semantic patch that makes this change is
as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@correct_null_test@
expression x,E;
statement S1, S2;
@@
x = rxrpc_get_transport(...)
<... when != x = E
if (
(
- x@p2 != NULL
+ ! IS_ERR ( x )
|
- x@p2 == NULL
+ IS_ERR( x )
)
)
S1
else S2
...>
? x = E;
// </smpl>
Signed-off-by: Julien Brunel <brunel@diku.dk>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the minimal the wireless extensions oughta send at least
the name in addition to the ifindex.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's an internal implementation detail which we _should_ be free to change.
So we did, and it promptly broke.
The compiler shold be able to work out when to use the __constant version
anyway.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey Dobriyan wrote:
> On Thu, Aug 07, 2008 at 07:00:56PM +0200, John Gumb wrote:
>> Scenario: no ipv6 default route set.
>
>> # ip -f inet6 route get fec0::1
>>
>> BUG: unable to handle kernel NULL pointer dereference at 00000000
>> IP: [<c0369b85>] rt6_fill_node+0x175/0x3b0
>> EIP is at rt6_fill_node+0x175/0x3b0
>
> 0xffffffff80424dd3 is in rt6_fill_node (net/ipv6/route.c:2191).
> 2186 } else
> 2187 #endif
> 2188 NLA_PUT_U32(skb, RTA_IIF, iif);
> 2189 } else if (dst) {
> 2190 struct in6_addr saddr_buf;
> 2191 ====> if (ipv6_dev_get_saddr(ip6_dst_idev(&rt->u.dst)->dev,
> ^^^^^^^^^^^^^^^^^^^^^^^^
> NULL
>
> 2192 dst, 0, &saddr_buf) == 0)
> 2193 NLA_PUT(skb, RTA_PREFSRC, 16, &saddr_buf);
> 2194 }
The commit that changed this can't be reverted easily, but the patch
below works for me.
Fix NULL de-reference in rt6_fill_node() when there's no IPv6 input
device present in the dst entry.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since qdisc_stab_lock is used in qdisc_put_stab(), which is called in
BH context from __qdisc_destroy() RCU callback, softirq safe locking
is needed.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to align the coding styles of ip_vs_zero_stats() and
its child-function ip_vs_zero_estimator(), clear ip_vs_stats
members explicitlty rather than doing a limited memset().
This was chosen over modifying ip_vs_zero_estimator() to use
memset() as it is more robust against changes in members
in the relevant structures. memset() would be prefered if
all members of the structure were to be cleared.
Cc: Sven Wegener <sven.wegener@stealer.net>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
It's a global variable and automatically initialized to zero. And now we can
also initialize the lock at compile time.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
There's no reason for dynamically allocating an estimator object for every
stats object. Directly embed an estimator object into every stats object and
switch to using the kernel-provided list implementation. This makes the code
much simpler and faster, as we do not need to traverse the list of all
estimators to find the one belonging to a stats object. There's no need to use
an rwlock, as we only have one reader. Also reorder the members of the
estimator structure slightly to avoid padding overhead. This can't be done
with the stats object as the members are currently copied to our user space
object via memcpy() and changing it would break ABI.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
Being able to discard these functions saves a couple of bytes at runtime. The
cleanup functions can't be annotated with __exit as they are also called from
init functions.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
No need to do it at runtime and this saves a couple of bytes in the text
section.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
There is a slight chance for a deadlock in the estimator code. We can't call
del_timer_sync() while holding our lock, as the timer might be active and
spinning for the lock on another cpu. Work around this issue by using
try_to_del_timer_sync() and releasing the lock. We could actually delete the
timer outside of our lock, as the add and kill functions are only every called
from userspace via [gs]etsockopt() and are serialized by a mutex, but better
make this explicit.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Cc: stable <stable@kernel.org>
Acked-by: Simon Horman <horms@verge.net.au>
Commit 998e7a7680 ("ipvs: Use kthread_run()
instead of doing a double-fork via kernel_thread()") introduced a possible
deadlock in the sync code. We need to use the _bh versions for the lock, as the
lock is also accessed from a bottom half.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
The socket lock is there to protect the normal UDP receive path.
Encapsulation UDP sockets don't need that protection. In fact
the locking is deadly for them as they may contain another UDP
packet within, possibly with the same addresses.
Also the nested bit was copied from TCP. TCP needs it because
of accept(2) spawning sockets. This simply doesn't apply to UDP
so I've removed it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon bug reports by Stephen Hemminger.
We still had some cases using ->qdisc instead of ->qdisc_sleeping.
Also, qdisc_lookup() should return ingress qdiscs.
Signed-off-by: David S. Miller <davem@davemloft.net>
When an action is added several times with the same exact index
it gets deleted on every even-numbered attempt.
This fixes that issue.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
The indentation in part of tcp_minisocks makes it look like one of the if
statements is much more important than it actually is.
Signed-off-by: Adam Langley <agl@imperialviolet.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Bluetooth qualification for PAN demands testing with BNEP header
compression disabled. This is actually pretty stupid and the Linux
implementation outsmarts the test system since it compresses whenever
possible. So to pass qualification two need parameters have been added
to control the compression of source and destination headers.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Currently a mesh node will not forward a multicast frame if it is not subscribed
to the specific multicast address. This patch addresses the issue and fixes mesh
multicast forwarding.
Signed-off-by: Luis Carlos Cobo <luisca@cozybit.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Now we deal with mesh forwarding before the 802.11->802.3 conversion, thus
eliminating a few unnecessary steps. The next hop lookup is called from
ieee80211_master_start_xmit() instead of subif_start_xmit(). Until the next hop
is found, RA in the frame will be all zeroes for frames originating from the
device. For forwarded frames, RA will contain the TA of the received frame,
which will be necessary to send a path error if a next hop is not found.
Signed-off-by: Luis Carlos Cobo <luisca@cozybit.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Sofar far pktgen have had a restriction to only use one device per kernel
thread. With the new multiqueue architecture this is no longer adequate.
The patch below is an effort to remove this by in pktgen configuration
adding a tag to the device name a la eth0@0 etc. The tag is used for
usual device config just as before. Also a new flag is introduced to mirror
queue_map with sending threads smp_processor_id() QUEUE_MAP_CPU.
An example: We use 4 CPU's to send to one 10g interface (eth0)
and we use the new tagging to send a mix of packet sizes, 64, 576 and
1500 bytes. Also we use TX queues according to smp_processor_id()
PGDEV=/proc/net/pktgen/kpktgend_0
pgset "add_device eth0@0"
PGDEV=/proc/net/pktgen/kpktgend_1
pgset "add_device eth0@1"
PGDEV=/proc/net/pktgen/kpktgend_2
pgset "add_device eth0@2"
PGDEV=/proc/net/pktgen/kpktgend_3
pgset "add_device eth0@3"
....
PGDEV=/proc/net/pktgen/eth0@0
pgset "pkt_size 64"
pgset "flag QUEUE_MAP_CPU"
PGDEV=/proc/net/pktgen/eth0@1
pgset "pkt_size 572"
pgset "flag QUEUE_MAP_CPU"
PGDEV=/proc/net/pktgen/eth0@2
pgset "pkt_size 1496"
PGDEV=/proc/net/pktgen/eth0@3
pgset "pkt_size 1496"
pgset "flag QUEUE_MAP_CPU"
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a packet_type specifies an active slave to bonding and not just any
interface, allow it to receive frames that came in on that interface.
Signed-off-by: Joe Eykholt <jre@nuovasystems.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Allow a packet_type that specifies the exact device to receive
even on an inactive bonding slave devices. This is important for some
L2 protocols such as LLDP and FCoE. This can eventually be used
for the bonding special cases as well.
Signed-off-by: Joe Eykholt <jre@nuovasystems.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Otherwise subsequent changes need multiple return values.
Signed-off-by: Joe Eykholt <jre@nuovasystems.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>