Remove the congestion sensing mechanism from netif_rx, and always
return either full or empty. Almost no driver checks the return value
from netif_rx, and those that do only use it for debug messages.
The original design of netif_rx was to do flow control based on the
receive queue, but NAPI has supplanted this and no driver uses the
feedback.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove last vestiages of fastroute code that is no longer used.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements Tom Kelly's Scalable TCP congestion control algorithm
for the modular framework.
The algorithm has some nice scaling properties, and has been used a fair bit
in research, though is known to have significant fairness issues, so it's not
really suitable for general purpose use.
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
H-TCP is a congestion control algorithm developed at the Hamilton Institute, by
Douglas Leith and Robert Shorten. It is extending the standard Reno algorithm
with mode switching is thus a relatively simple modification.
H-TCP is defined in a layered manner as it is still a research platform. The
basic form includes the modification of beta according to the ratio of maxRTT
to min RTT and the alpha=2*factor*(1-beta) relation, where factor is dependant
on the time since last congestion.
The other layers improve convergence by adding appropriate factors to alpha.
The following patch implements the H-TCP algorithm in it's basic form.
Signed-Off-By: Baruch Even <baruch@ev-en.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP Vegas code modified for the new TCP infrastructure.
Vegas now uses microsecond resolution timestamps for
better estimation of performance over higher speed links.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP Hybla congestion avoidance.
- "In heterogeneous networks, TCP connections that incorporate a
terrestrial or satellite radio link are greatly disadvantaged with
respect to entirely wired connections, because of their longer round
trip times (RTTs). To cope with this problem, a new TCP proposal, the
TCP Hybla, is presented and discussed in the paper[1]. It stems from an
analytical evaluation of the congestion window dynamics in the TCP
standard versions (Tahoe, Reno, NewReno), which suggests the necessary
modifications to remove the performance dependence on RTT.[...]"[1]
[1]: Carlo Caini, Rosario Firrincieli, "TCP Hybla: a TCP enhancement for
heterogeneous networks",
International Journal of Satellite Communications and Networking
Volume 22, Issue 5 , Pages 547 - 566. September 2004.
Signed-off-by: Daniele Lacamera (root at danielinux.net)net
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sally Floyd's high speed TCP congestion control.
This is useful for comparison and research.
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the existing 2.6.12 Westwood code moved from tcp_input
to the new congestion framework. A lot of the inline functions
have been eliminated to try and make it clearer.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP BIC congestion control reworked to use the new congestion control
infrastructure. This version is more up to date than the BIC
code in 2.6.12; it incorporates enhancements from BICTCP 1.1,
to handle low latency links.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enhancement to the tcp_diag interface used by the iproute2 ss command
to report the tcp congestion control being used by a socket.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow TCP to have multiple pluggable congestion control algorithms.
Algorithms are defined by a set of operations and can be built in
or modules. The legacy "new RENO" algorithm is used as a starting
point and fallback.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch creates a new kstrdup library function and changes the "local"
implementations in several places to use this function.
Most of the changes come from the sound and net subsystems. The sound part
had already been acknowledged by Takashi Iwai and the net part by David S.
Miller.
I left UML alone for now because I would need more time to read the code
carefully before making changes there.
Signed-off-by: Paulo Marques <pmarques@grupopie.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch is a follow up to patch 1 regarding "Selective Sub Address
matching with call user data". It allows use of the Fast-Select-Acceptance
optional user facility for X.25.
This patch just implements fast select with no restriction on response
(NRR). What this means (according to ITU-T Recomendation 10/96 section
6.16) is that if in an incoming call packet, the relevant facility bits are
set for fast-select-NRR, then the called DTE can issue a direct response to
the incoming packet using a call-accepted packet that contains
call-user-data. This patch allows such a response.
The called DTE can also respond with a clear-request packet that contains
call-user-data. However, this feature is currently not implemented by the
patch.
How is Fast Select Acceptance used?
By default, the system does not allow fast select acceptance (as before).
To enable a response to fast select acceptance,
After a listen socket in created and bound as follows
socket(AF_X25, SOCK_SEQPACKET, 0);
bind(call_soc, (struct sockaddr *)&locl_addr, sizeof(locl_addr));
but before a listen system call is made, the following ioctl should be used.
ioctl(call_soc,SIOCX25CALLACCPTAPPRV);
Now the listen system call can be made
listen(call_soc, 4);
After this, an incoming-call packet will be accepted, but no call-accepted
packet will be sent back until the following system call is made on the socket
that accepts the call
ioctl(vc_soc,SIOCX25SENDCALLACCPT);
The network (or cisco xot router used for testing here) will allow the
application server's call-user-data in the call-accepted packet,
provided the call-request was made with Fast-select NRR.
Signed-off-by: Shaun Pereira <spereira@tusc.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Shaun Pereira <spereira@tusc.com.au>
This is the first (independent of the second) patch of two that I am
working on with x25 on linux (tested with xot on a cisco router). Details
are as follows.
Current state of module:
A server using the current implementation (2.6.11.7) of the x25 module will
accept a call request/ incoming call packet at the listening x.25 address,
from all callers to that address, as long as NO call user data is present
in the packet header.
If the server needs to choose to accept a particular call request/ incoming
call packet arriving at its listening x25 address, then the kernel has to
allow a match of call user data present in the call request packet with its
own. This is required when multiple servers listen at the same x25 address
and device interface. The kernel currently matches ALL call user data, if
present.
Current Changes:
This patch is a follow up to the patch submitted previously by Andrew
Hendry, and allows the user to selectively control the number of octets of
call user data in the call request packet, that the kernel will match. By
default no call user data is matched, even if call user data is present.
To allow call user data matching, a cudmatchlength > 0 has to be passed
into the kernel after which the passed number of octets will be matched.
Otherwise the kernel behavior is exactly as the original implementation.
This patch also ensures that as is normally the case, no call user data
will be present in the Call accepted / call connected packet sent back to
the caller
Future Changes on next patch:
There are cases however when call user data may be present in the call
accepted packet. According to the X.25 recommendation (ITU-T 10/96)
section 5.2.3.2 call user data may be present in the call accepted packet
provided the fast select facility is used. My next patch will include this
fast select utility and the ability to send up to 128 octets call user data
in the call accepted packet provided the fast select facility is used. I
am currently testing this, again with xot on linux and cisco.
Signed-off-by: Shaun Pereira <spereira@tusc.com.au>
(With a fix from Alexey Dobriyan <adobriyan@gmail.com>)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: jlamanna@gmail.com
ebtables.c vfree() checking cleanups.
Signed-off by: James Lamanna <jlamanna@gmail.com>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Nishanth Aravamudan <nacc@us.ibm.com>
Use msleep() instead of schedule_timeout() to guarantee the task
delays as expected. The current code is not wrong, but it does not account for
early return due to signals, so I think msleep() should be appropriate.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch provides support for registering multiple netpoll clients to the
same network device. Only one of these clients may register an rx_hook,
however. In practice, this restriction has not been problematic. It is
worth mentioning, though, that the current design can be easily extended to
allow for the registration of multiple rx_hooks.
The basic idea of the patch is that the rx_np pointer in the netpoll_info
structure points to the struct netpoll that has rx_hook filled in. Aside
from this one case, there is no need for a pointer from the struct
net_device to an individual struct netpoll.
A lock is introduced to protect the setting and clearing of the np_rx
pointer. The pointer will only be cleared upon netpoll client module
removal, and the lock should be uncontested.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a netpoll_info structure, which the struct net_device
will now point to instead of pointing to a struct netpoll. The reason for
this is two-fold: 1) fields such as the rx_flags, poll_owner, and poll_lock
should be maintained per net_device, not per netpoll; and 2) this is a first
step in providing support for multiple netpoll clients to register against the
same net_device.
The struct netpoll is now pointed to by the netpoll_info structure. As
such, the previous behaviour of the code is preserved.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Small patch to save an unecessary call to strlen() : sprintf() gave us
the length, just trust it.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the socket transport kick the event queue to start socket connects
immediately. This should improve responsiveness of applications that are
sensitive to slow mount operations (like automounters).
We are now also careful to cancel the connect worker before destroying
the xprt. This eliminates a race where xprt_destroy can finish before
the connect worker is even allowed to run.
Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP. Hard-code impossibly small connect timeout.
Version: Fri, 29 Apr 2005 15:32:01 -0400
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When the network layer reports a connection close, the RPC task
waiting to reconnect should be notified so it can retry immediately
instead of waiting for the normal connection establishment timeout.
This reverts a change made in 2.6.6 as part of adding client support
for RPC over TCP socket idle timeouts.
Test-plan:
Destructive testing with NFS over TCP mounts.
Version: Fri, 29 Apr 2005 15:31:46 -0400
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cancel autodisconnect requests inside xprt_transmit() in order to avoid
races.
Use more efficient del_singleshot_timer_sync()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For internal purposes, the rpc_clnt_sigmask() call is replaced by
a call to rpc_task_sigmask(), which ensures that the current task
sigmask respects both the client cl_intr flag and the per-task NOINTR flag.
Problem noted by Jiaying Zhang.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFS and NFSACL programs run on the same RPC transport. This patch adds
support for this by converting svc_program into a chained list of programs
(server-side).
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently we return -ENOMEM for every single failure to create a new auth.
This is actually accurate for auth_null and auth_unix, but for auth_gss it's a
bit confusing.
Allow rpcauth_create (and the ->create methods) to return errors. With this
patch, the user may sometimes see an EINVAL instead. Whee.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We shouldn't be silently falling back from krb5p to krb5i.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we don't create an RPC client without checking that the server
does indeed support the RPC program + version that we are trying to set up.
This enables us to immediately return an error to "mount" if it turns out
that the server is only supporting NFSv2, when we requested NFSv3 or NFSv4.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix up call_header() so that it calls xdr_adjust_iovec().
Fix calculation of the scratch buffer length in xdr_init_encode().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the task->tk_exit() wants to restart the RPC call after delaying
then the current RPC code will clobber the timer by calling
rpc_delete_timer() immediately after re-entering the loop in
__rpc_execute().
Problem noticed by Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Kconfig option had an extra double quote at the end of the line
which was causing in warning when building.
Signed-off-by: Kumar Gala <kumar.gala@freescale.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Sit tunnel logging is currently broken:
MAC=01:23:45:67:89:ab->01:23:45:47:89:ac TUNNEL=123.123. 0.123-> 12.123. 6.123
Apart from the broken IP address, MAC addresses are printed differently
for sit tunnels than for everything else.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drop reference before handing the packets to raw_rcv()
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Keir Fraser <Keir.Fraser@xl.cam.ac.uk>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since expectation timeouts were made compulsory [1], there is no need to
check for them in ip_conntrack_expect_insert.
[1] https://lists.netfilter.org/pipermail/netfilter-devel/2005-January/018143.html
Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Netfilter assumes that skb->data == skb->nh.ipv6h
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is a simplified version of the patch to fix a bug in IPv6
multicasting. It:
1) adds existence check & EADDRINUSE error for regular joins
2) adds an exception for EADDRINUSE in the source-specific multicast
join (where a prior join is ok)
3) adds a missing/needed read_lock on sock_mc_list; would've raced
with destroying the socket on interface down without
4) adds a "leave group" in the (INCLUDE, empty) source filter case.
This frees unneeded socket buffer memory, but also prevents
an inappropriate interaction among the 8 socket options that
mess with this. Some would fail as if in the group when you
aren't really.
Item #4 had a locking bug in the last version of this patch; rather than
removing the idev->lock read lock only, I've simplified it to remove
all lock state in the path and treat it as a direct "leave group" call for
the (INCLUDE,empty) case it covers. Tested on an MP machine. :-)
Much thanks to HoerdtMickael <hoerdt@clarinet.u-strasbg.fr> who
reported the original bug.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Essentially netlink at the moment always reports a pid and sequence of 0
always for v6 route activities.
To understand the repurcassions of this look at:
http://lists.quagga.net/pipermail/quagga-dev/2005-June/003507.html
While fixing this, i took the liberty to resolve the outstanding issue
of IPV6 routes inserted via ioctls to have the correct pids as well.
This patch tries to behave as close as possible to the v4 routes i.e
maintains whatever PID the socket issuing the command owns as opposed to
the process. That made the patch a little bulky.
I have tested against both netlink derived utility to add/del routes as
well as ioctl derived one. The Quagga folks have tested against quagga.
This fixes the problem and so far hasnt been detected to introduce any
new issues.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Below is a more generic patch to do fib_lookup via netlink. For others
we should say that we discussed this as a way to verify route selection.
It's also possible there are others uses for this.
In short the fist half of struct fib_result_nl is filled in by caller
and netlink call fills in the other half and returns it.
In case anyone is interested there is a corresponding user app to compare
the full routing table this was used to test implementation of the LC-trie.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the flag XFRM_STATE_NOPMTUDISC for xfrm states. It is
similar to the nopmtudisc on IPIP/GRE tunnels. It only has an effect
on IPv4 tunnel mode states. For these states, it will ensure that the
DF flag is always cleared.
This is primarily useful to work around ICMP blackholes.
In future this flag could also allow a larger MTU to be set within the
tunnel just like IPIP/GRE tunnels. This could be useful for short haul
tunnels where temporary fragmentation outside the tunnel is desired over
smaller fragments inside the tunnel.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jmorris@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the xfrm_state_afinfo->init_flags hook which allows
each address family to perform any common initialisation that does
not require a corresponding destructor call.
It will be used subsequently to set the XFRM_STATE_NOPMTUDISC flag
in IPv4.
It also fixes up the error codes returned by xfrm_init_state.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jmorris@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds xfrm_init_state which is simply a wrapper that calls
xfrm_get_type and subsequently x->type->init_state. It also gets rid
of the unused args argument.
Abstracting it out allows us to add common initialisation code, e.g.,
to set family-specific flags.
The add_time setting in xfrm_user.c was deleted because it's already
set by xfrm_state_alloc.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jmorris@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implements sctp_connectx() as defined in the SCTP sockets API draft by
tunneling the request through a setsockopt().
Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When enabled, this should disable UCOPY prequeue'ing altogether,
but it does not due to a missing test.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the type of the third parameter 'length' of the
raw_send_hdrinc() function from 'int' to 'size_t'.
This makes sense since this function is only ever called from one
location, and the value passed as the third parameter in that location is
itself of type size_t, so this makes the recieving functions parameter
type match. Also, inside raw_send_hdrinc() the 'length' variable is
used in comparisons with unsigned values and passed as parameter to
functions expecting unsigned values (it's used in a single comparison with
a signed value, but that one can never actually be negative so the patch
also casts that one to size_t to stop gcc worrying, and it is passed in a
single instance to memcpy_fromiovecend() which expects a signed int, but
as far as I can see that's not a problem since the value of 'length'
shouldn't ever exceed the value of a signed int).
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the type of the local variable 'i' in
raw_probe_proto_opt() from 'int' to 'unsigned int'. The only use of 'i' in
this function is as a counter in a for() loop and subsequent index into
the msg->msg_iov[] array.
Since 'i' is compared in a loop to the unsigned variable msg->msg_iovlen
gcc -W generates this warning :
net/ipv4/raw.c:340: warning: comparison between signed and unsigned
Changing 'i' to unsigned silences this warning and is safe since the array
index can never be negative anyway, so unsigned int is the logical type to
use for 'i' and also enables a larger msg_iov[] array (but I don't know if
that will ever matter).
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch gets rid of the following gcc -W warning in net/ipv4/raw.c :
net/ipv4/raw.c:387: warning: comparison of unsigned expression < 0 is always false
Since 'len' is of type size_t it is unsigned and can thus never be <0, and
since this is obvious from the function declaration just a few lines above
I think it's ok to remove the pointless check for len<0.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch silences these two gcc -W warnings in net/ipv4/raw.c :
net/ipv4/raw.c:517: warning: signed and unsigned type in conditional expression
net/ipv4/raw.c:613: warning: signed and unsigned type in conditional expression
It doesn't change the behaviour of the code, simply writes the conditional
expression with plain 'if()' syntax instead of '? :' , but since this
breaks it into sepperate statements gcc no longer complains about having
both a signed and unsigned value in the same conditional expression.
Signed-off-by: David S. Miller <davem@davemloft.net>
Removes the skb trimming code which is not needed since we never
touch the skb upon failure. Removes unnecessary initializers,
and simplifies the code a bit.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
prio2list() returns the relevant sk_buff_head for the
band specified by the priority for a given skb.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removes the skb trimming code which is not needed since we never
touch the skb upon failure. Removes unnecessary includes,
initializers, and simplifies the code a bit.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The simplicity of the fifo qdisc allows several qdisc operations to be
redirected to the relevant queue management function directly. Saves
a lot of code lines and gives the pfifo a byte based backlog.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces the spin_lock_irqsave call on the receive queue
lock in SCTP with spin_lock_bh. Despite the proliferation of
spin_lock_irqsave calls in this stack, it is only entered from the
IPv4/IPv6 stack and user space. That is, it is never entered from
hardirq context.
The call in question is only called from recvmsg which means that
IRQs aren't disabled. Therefore it is safe to replace it with
spin_lock_bh.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In light of my recent patch to net/ipv4/udp.c that replaced the
spin_lock_irq calls on the receive queue lock with spin_lock_bh,
here is a similar patch for all other occurences of spin_lock_irq
on receive/error queue locks in IPv4 and IPv6.
In these stacks, we know that they can only be entered from user
or softirq context. Therefore it's safe to disable BH only.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch ensures that netlink events created as a result of programns
using ioctls (such as ifconfig, route etc) contains the correct PID of
those events.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts "unsigned flags" to use more explict types like u16
instead and incrementally introduces NLMSG_NEW().
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch was supposed to be part of the neighbour tables related
patchset but apparently got lost.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the format of the XFRM_MSG_DELSA and
XFRM_MSG_DELPOLICY notification so that the main message
sent is of the same format as that received by the kernel
if the original message was via netlink. This also means
that we won't lose the byid information carried in km_event.
Since this user interface is introduced by Jamal's patch
we can still afford to change it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch rectifies some rtnetlink message builders that derive the
flags from the pid. It is now explicit like the other cases
which get it right. Also fixes half a dozen dumpers which did not
set NLM_F_MULTI at all.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduces a new macro NLMSG_NEW which extends NLMSG_PUT but takes
a flags argument. NLMSG_PUT stays there for compatibility but now
calls NLMSG_NEW with flags == 0. NLMSG_PUT_ANSWER is renamed to
NLMSG_NEW_ANSWER which now also takes a flags argument.
Also converts the users of NLMSG_PUT_ANSWER to use NLMSG_NEW_ANSWER
and fixes the two direct users of __nlmsg_put to either provide
the flags or use NLMSG_NEW(_ANSWER).
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes dsmark to do all configuration sanity checks first and
only apply the changes if all of them can be applied without
any errors. Also fixes the weak sanity checks for DSMARK_VALUE
and DSMASK_MASK.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
To retrieve the neighbour tables send RTM_GETNEIGHTBL with the
NLM_F_DUMP flag set. Every neighbour table configuration is
spread over multiple messages to avoid running into message
size limits on systems with many interfaces. The first message
in the sequence transports all not device specific data such as
statistics, configuration, and the default parameter set.
This message is followed by 0..n messages carrying device
specific parameter sets.
Although the ordering should be sufficient, NDTA_NAME can be
used to identify sequences. The initial message can be identified
by checking for NDTA_CONFIG. The device specific messages do
not contain this TLV but have NDTPA_IFINDEX set to the
corresponding interface index.
To change neighbour table attributes, send RTM_SETNEIGHTBL
with NDTA_NAME set. Changeable attribute include NDTA_THRESH[1-3],
NDTA_GC_INTERVAL, and all TLVs in NDTA_PARMS unless marked
otherwise. Device specific parameter sets can be changed by
setting NDTPA_IFINDEX to the interface index of the corresponding
device.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This chunks out the accept_queue and tcp_listen_opt code and moves
them to net/core/request_sock.c and include/net/request_sock.h, to
make it useful for other transport protocols, DCCP being the first one
to use it.
Next patches will rename tcp_listen_opt to accept_sock and remove the
inline tcp functions that just call a reqsk_queue_ function.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ok, this one just renames some stuff to have a better namespace and to
dissassociate it from TCP:
struct open_request -> struct request_sock
tcp_openreq_alloc -> reqsk_alloc
tcp_openreq_free -> reqsk_free
tcp_openreq_fastfree -> __reqsk_free
With this most of the infrastructure closely resembles a struct
sock methods subset.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kept this first changeset minimal, without changing existing names to
ease peer review.
Basicaly tcp_openreq_alloc now receives the or_calltable, that in turn
has two new members:
->slab, that replaces tcp_openreq_cachep
->obj_size, to inform the size of the openreq descendant for
a specific protocol
The protocol specific fields in struct open_request were moved to a
class hierarchy, with the things that are common to all connection
oriented PF_INET protocols in struct inet_request_sock, the TCP ones
in tcp_request_sock, that is an inet_request_sock, that is an
open_request.
I.e. this uses the same approach used for the struct sock class
hierarchy, with sk_prot indicating if the protocol wants to use the
open_request infrastructure by filling in sk_prot->rsk_prot with an
or_calltable.
Results? Performance is improved and TCP v4 now uses only 64 bytes per
open request minisock, down from 96 without this patch :-)
Next changeset will rename some of the structs, fields and functions
mentioned above, struct or_calltable is way unclear, better name it
struct request_sock_ops, s/struct open_request/struct request_sock/g,
etc.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Small fixup to use netlink macros instead of hardcoding.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu wrote:
> @@ -1254,6 +1326,7 @@ static int pfkey_add(struct sock *sk, st
> if (IS_ERR(x))
> return PTR_ERR(x);
>
> + xfrm_state_hold(x);
This introduces a leak when xfrm_state_add()/xfrm_state_update()
fail. We hold two references (one from xfrm_state_alloc(), one
from xfrm_state_hold()), but only drop one. We need to take the
reference because the reference from xfrm_state_alloc() can
be dropped by __xfrm_state_delete(), so the fix is to drop both
references on error. Same problem in xfrm_user.c.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes XFRM_SAP_* and converts them over to XFRM_MSG_*.
The netlink interface is meant to map directly onto the underlying
xfrm subsystem. Therefore rather than using a new independent
representation for the events we can simply use the existing ones
from xfrm_user.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch fixes policy deletion in xfrm_user so that it sets
km_event.data.byid. This puts xfrm_user on par with what af_key
does in this case.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adjusts the SA state conversion in af_key such that
XFRM_STATE_ERROR/XFRM_STATE_DEAD will be converted to SADB_STATE_DEAD
instead of SADB_STATE_DYING.
According to RFC 2367, SADB_STATE_DYING SAs can be turned into
mature ones through updating their lifetime settings. Since SAs
which are in the states XFRM_STATE_ERROR/XFRM_STATE_DEAD cannot
be resurrected, this value is unsuitable.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch ensures that the hard state/policy expire notifications are
only sent when the state/policy is successfully removed from their
respective tables.
As it is, it's possible for a state/policy to both expire through
reaching a hard limit, as well as being deleted by the user.
Note that this behaviour isn't actually forbidden by RFC 2367.
However, it is a quality of implementation issue.
As an added bonus, the restructuring in this patch will help
eventually in moving the expire notifications from softirq
context into process context, thus improving their reliability.
One important side-effect from this change is that SAs reaching
their hard byte/packet limits are now deleted immediately, just
like SAs that have reached their hard time limits.
Previously they were announced immediately but only deleted after
30 seconds.
This is bad because it prevents the system from issuing an ACQUIRE
command until the existing state was deleted by the user or expires
after the time is up.
In the scenario where the expire notification was lost this introduces
a 30 second delay into the system for no good reason.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Heres the final patch.
What this patch provides
- netlink xfrm events
- ability to have events generated by netlink propagated to pfkey
and vice versa.
- fixes the acquire lets-be-happy-with-one-success issue
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This is a fixed-up version of the broken "upstream-2.6.13" branch, where
I re-did the manual merge of drivers/net/r8169.c by hand, and made sure
the history is all good.
This fixes various crashes on 64-bit when using this module.
Based upon a patch by Juergen Kreileder <jk@blackdown.de>.
Signed-off-by: David S. Miller <davem@davemloft.net>
ACKed-by: Patrick McHardy <kaber@trash.net>
This patch alows you to change the source address of icmp error
messages. It applies cleanly to 2.6.11.11 and retains the default
behaviour.
In the old (default) behaviour icmp error messages are sent with the ip
of the exiting interface.
The new behaviour (when the sysctl variable is toggled on), it will send
the message with the ip of the interface that received the packet that
caused the icmp error. This is the behaviour network administrators will
expect from a router. It makes debugging complicated network layouts
much easier. Also, all 'vendor routers' I know of have the later
behaviour.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Vladislav Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Vladislav Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Userland layer-2 tunneling devices allocated through the TUNTAP driver
(drivers/net/tun.c) have a type of ARPHRD_NONE, and have no link-layer
address. The kernel complains at regular interval when IPv6 Privacy
extension are enabled because it can't find an hardware address :
Dec 29 11:02:04 auguste kernel: __ipv6_regen_rndid(idev=cb3e0c00):
cannot get EUI64 identifier; use random bytes.
IPv6 Privacy extensions should probably be disabled on that sort of
device. They won't work anyway. If userland wants a more usual
Ethernet-ish interface with usual IPv6 autoconfiguration, it will use a
TAP device with an emulated link-layer and a random hardware address
rather than a TUN device.
As far as I could fine, TUN virtual device from TUNTAP is the very only
sort of device using ARPHRD_NONE as kernel device type.
Signed-off-by: Rmi Denis-Courmont <rdenis@simphalempin.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We saw following trace several times:
|BUG: using smp_processor_id() in preemptible [00000001] code: httpd/30137
|caller is icmpv6_send+0x23/0x540
| [<c01ad63b>] smp_processor_id+0x9b/0xb8
| [<c02993e7>] icmpv6_send+0x23/0x540
This is because of icmpv6_socket, which is the only one user of
smp_processor_id() in icmpv6_send(), AFAIK.
Since it should be used in non-preemptive context,
let's defer the dereference after disabling preemption
(by icmpv6_xmit_lock()).
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Ralf Baechle <ralf@linux-mips.org>
There are archives of the old list at http://oss.sgi.com/archives/netdev
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is brought to you by the department of applied stupidity.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds meta collectors for all socket attributes that make sense
to be filtered upon. Some of them are only useful for debugging
but having them doesn't hurt.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing the sysctl net.core.dev_weight has no effect because the weight
of the backlog devices is set during initialization and never changed.
This patch propagates any changes to the global value affected by sysctl
to the per-cpu devices. It is done every time the packet handler
function is run.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simple interface to allow changing network device scheduling weight
with sysfs. Please consider this for 2.6.12, since risk/impact is small.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was checking the "GET" function pointer instead of
the "SET" one. Looks like a cut&paste error :-)
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no usage of this EXPORT_SYMBOL in the kernel.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I cunningly put the audit call immediately after the
copy_from_user().... but used the _userspace_ copy of the args still.
Let's not do that.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Unused indices which are ignored while walking must still
be counted to avoid dumping the same index twice.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steven Hand <Steven.Hand@cl.cam.ac.uk> wrote:
>
> Reconstructed forward trace:
>
> net/ipv4/udp.c:1334 spin_lock_irq()
> net/ipv4/udp.c:1336 udp_checksum_complete()
> net/core/skbuff.c:1069 skb_shinfo(skb)->nr_frags > 1
> net/core/skbuff.c:1086 kunmap_skb_frag()
> net/core/skbuff.h:1087 local_bh_enable()
> kernel/softirq.c:0140 WARN_ON(irqs_disabled());
The receive queue lock is never taken in IRQs (and should never be) so
we can simply substitute bh for irq.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we have ip_queue being used from LOCAL_IN, then we end up with a
situation where the verdicts coming back from userspace traverse the TCP
input path from syscall context. While this seems to work most of the
time, there's an ugly deadlock:
syscall context is interrupted by the timer interrupt. When the timer
interrupt leaves, the timer softirq get's scheduled and calls
tcp_delack_timer() and alike. They themselves do bh_lock_sock(sk),
which is already held from somewhere else -> boom.
I've now tested the suggested solution by Patrick McHardy and Herbert Xu to
simply use local_bh_{en,dis}able().
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are intentionally ignoring the copy_to_user() value,
make it clear to the compiler too.
Noted by Jeff Garzik.
Signed-off-by: David S. Miller <davem@davemloft.net>
It cannot work properly, so just ignore it in drr
and rr multipath algorithms just like the random
multipath algorithm does.
Suggested by Herbert Xu.
Signed-off by: Pravin B. Shelar <pravins@calsoftinc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an option to make secondary IP addresses get promoted
when primary IP addresses are removed from the device.
It defaults to off to preserve existing behavior.
Signed-off-by: Harald Welte <laforge@gnumonks.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This improves the bridge local receive path by avoiding going
through another softirq. The bridge receive path is already being called
from a netif_receive_skb() there is no point in going through another
receiveq round trip.
Recursion is limited because bridge can never be a port of a bridge
so handle_bridge() always returns.
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid poisoning of the bridge forwarding table by frames that have been
dropped by filtering. This prevents spoofed source addresses on hostile
side of bridge from causing packet leakage, a small but possible security
risk.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make features of the bridge pseudo-device be a subset of the underlying
devices. Motivated by Xen and others who use bridging to do failover.
Signed-off-by: Catalin BOIE <catab at umrella.ro>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The features field in netdevice is really a bitmask, and bitmask's should
be unsigned.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resend of earlier patch (no changes) from Catalin used to provide
device feature change notification.
Signed-off-by: Catalin BOIE <catab at umbrella.ro>
Acked-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
"_s" suffix is certainly of hungarian origin.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[XFRM] Call dst_check() with appropriate cookie
This fixes infinite loop issue with IPv6 tunnel mode.
Signed-off-by: Kazunori Miyazawa <kazunori@miyazawa.org>
Signed-off-by: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is a fixed up version of the reorder feature of netem.
It is the same as the earlier patch plus with the bugfix from Julio merged in.
Has expected backwards compatibility behaviour.
Go ahead and merge this one, the TCP strangeness I was seeing was due
to the reordering bug, and previous version of TSO patch.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Netem works better if there if packets are just queued in the inner discipline
rather than having a separate delayed queue. Change to use the dequeue/requeue
to peek like TBF does.
By doing this potential qlen problems with the old method are avoided. The problems
happened when the netem_run that moved packets from the inner discipline to the nested
discipline failed (because inner queue was full). This happened in dequeue, so the
effective qlen of the netem would be decreased (because of the drop), but there was
no way to keep the outer qdisc (caller of netem dequeue) in sync.
The problem window is still there since this patch doesn't address the issue of
requeue failing in netem_dequeue, but that shouldn't happen since the sequence dequeue/requeue
should always work. Long term correct fix is to implement qdisc->peek in all the qdisc's
to allow for this (needed by several other qdisc's as well).
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Handle duplication of packets in netem by re-inserting at top of qdisc tree.
This avoid problems with qlen accounting with nested qdisc. This recursion
requires no additional locking but will potentially increase stack depth.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>