From: Ali Saidi <saidi@engin.umich.edu>
When TCP receive copy offload is enabled it's possible that
tcp_rcv_established() will cause two acks to be sent for a single
packet. In the case that a tcp_dma_early_copy() is successful,
copied_early is set to true which causes tcp_cleanup_rbuf() to be
called early which can send an ack. Further along in
tcp_rcv_established(), __tcp_ack_snd_check() is called and will
schedule a delayed ACK. If no packets are processed before the delayed
ack timer expires the packet will be acked twice.
Signed-off-by: David S. Miller <davem@davemloft.net>
I'm quite sure that if I give this function in its old format
for you to inspect, you start to wonder what is the type of
demanded or if it's a global variable.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
It all started from me noticing that this urgent check in
tcp_clean_rtx_queue is unnecessarily inside the loop. Then
I took a longer look to it and found out that the users of
urg_mode can trivially do without, well almost, there was
one gotcha.
Bonus: those funny people who use urg with >= 2^31 write_seq -
snd_una could now rejoice too (that's the only purpose for the
between being there, otherwise a simple compare would have done
the thing). Not that I assume that the rest of the tcp code
happily lives with such mind-boggling numbers :-). Alas, it
turned out to be impossible to set wmem to such numbers anyway,
yes I really tried a big sendfile after setting some wmem but
nothing happened :-). ...Tcp_wmem is int and so is sk_sndbuf...
So I hacked a bit variable to long and found out that it seems
to work... :-)
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the socket cached in the skb if it's present.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be able to use the cached socket reference in the skb during input
processing we add a new set of lookup functions that receive the skb on
their argument list.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be able to use the cached socket reference in the skb during input
processing we add a new set of lookup functions that receive the skb on
their argument list.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since IPVS now has partial IPv6 support, this patch moves IPVS from
net/ipv4/ipvs to net/netfilter/ipvs. It's a result of:
$ git mv net/ipv4/ipvs net/netfilter
and adapting the relevant Kconfigs/Makefiles to the new path.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
The iptables tproxy code has to be able to do UDP socket hash lookups,
so we have to provide an exported lookup function for this purpose.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current TCP code relies on the local port of the listening socket
being the same as the destination address of the incoming
connection. Port redirection used by many transparent proxying
techniques obviously breaks this, so we have to store the original
destination port address.
This patch extends struct inet_request_sock and stores the incoming
destination port value there. It also modifies the handshake code to
use that value as the source port when sending reply packets.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Netfilter's ip_route_me_harder() tries to re-route packets either
generated or re-routed by Netfilter. This patch changes
ip_route_me_harder() to handle packets from non-locally-bound sockets
with IP_TRANSPARENT set as local and to set the appropriate flowi
flags when re-doing the routing lookup.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TCP stack sends out SYN+ACK/ACK/RST reply packets in response to
incoming packets. The non-local source address check on output bites
us again, as replies for transparently redirected traffic won't have a
chance to leave the node.
This patch selectively sets the FLOWI_FLAG_ANYSRC flag when doing the
route lookup for those replies. Transparent replies are enabled if the
listening socket has the transparent socket flag set.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_iif() in inet_sock.h requires route.h. Since users of inet_iif()
usually require other route.h functionality anyway this patch moves
inet_iif() to route.h.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting IP_TRANSPARENT is not really useful without allowing non-local
binds for the socket. To make user-space code simpler we allow these
binds even if IP_TRANSPARENT is set but IP_FREEBIND is not.
Signed-off-by: Tóth László Attila <panther@balabit.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces the IP_TRANSPARENT socket option: enabling that
will make the IPv4 routing omit the non-local source address check on
output. Setting IP_TRANSPARENT requires NET_ADMIN capability.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_route_output() contains a check to make sure that no flows with
non-local source IP addresses are routed. This obviously makes using
such addresses impossible.
This patch introduces a flowi flag which makes omitting this check
possible. The new flag provides a way of handling transparent and
non-transparent connections differently.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
This minor cleanup simplifies later changes which will convert
struct sk_buff and friends over to using struct list_head.
Signed-off-by: David S. Miller <davem@davemloft.net>
The nr_conns variable in the sync message header is only eight bits wide
and will overflow on interfaces with a large MTU. As a result the backup
won't parse all connections contained in the sync buffer. On regular
ethernet with an MTU of 1500 this isn't a problem, because we can't
overflow the value, but consider jumbo frames being used on a cross-over
connection between both directors.
We now restrict the size of the sync buffer, so that we never put more
than 255 connections into a single sync buffer.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Signed-off-by: Simon Horman <horms@verge.net.au>
I'm trying to use the TCP_MAXSEG option to setsockopt() to set the MSS
for both sides of a bidirectional connection.
man tcp says: "If this option is set before connection establishment, it
also changes the MSS value announced to the other end in the initial
packet."
However, the kernel only uses the MTU/route cache to set the advertised
MSS. That means if I set the MSS to, say, 500 before calling connect(),
I will send at most 500-byte packets, but I will still receive 1500-byte
packets in reply.
This is a bug, either in the kernel or the documentation.
This patch (applies to latest net-2.6) reduces the advertised value to
that requested by the user as long as setsockopt() is called before
connect() or accept(). This seems like the behavior that one would
expect as well as that which is documented.
I've tried to make sure that things that depend on the advertised MSS
are set correctly.
Signed-off-by: Tom Quetchenbach <virtualphtn@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If lost skb is sacked, we might have nothing to retransmit
as high as the retransmit_high is pointing to, so place
it lower to avoid unnecessary walking.
This is mainly for the case where high L'ed skbs gets sacked.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most importantly avoid doing it with cumulative ACK. However,
since we have lost_cnt_hint in the picture as well needing
adjustments, it's not as trivial as dealing with
retransmit_skb_hint (and cannot be done in the all place we
could trivially leave retransmit_skb_hint untouched).
With the previous patch, this should mostly remove O(n^2)
behavior while cumulative ACKs start flowing once rexmit
after a lossy round-trip made it through.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most importantly avoid doing it with cumulative ACK. Not clearing
means that we no longer need n^2 processing in resolution of each
fast recovery.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This doesn't much sense here afaict, probably never has. Since
fragmenting and collapsing deal the hints by themselves, there
should be very little reason for the rexmit loop to do that.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both loops are quite similar, so they can be combined
with little effort. As a result, forward_skb_hint becomes
obsolete as well.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
The validity of the retransmit_high must then be ensured
if no L'ed skb exits!
This makes a minor change to behavior, we now have to
iterate the head to find out that the loop terminates.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because lost counter no longer requires tuning, this is
trivial to remove (the tuning wouldn't have been too
hard either) because no "new" retransmittable skb appeared
below retransmit_skb_hint when SACKing for sure.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
I suspect it might have been related to the changed amount
of lost skbs, which was counted by retransmit_cnt_hint that
got changed.
The place for this clearing was very illogical anyway,
it should have been after the LOST-bit clearing loop to
make any sense.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Main benefit in this is that we can then freely point
the retransmit_skb_hint to anywhere we want to because
there's no longer need to know what would be the count
changes involve, and since this is really used only as a
terminator, unnecessary work is one time walk at most,
and if some retransmissions are necessary after that
point later on, the walk is not full waste of time
anyway.
Since retransmit_high must be kept valid, all lost
markers must ensure that.
Now I also have learned how those "holes" in the
rexmittable skbs can appear, mtu probe does them. So
I removed the misleading comment as well.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This useful because we'd need to verifying soon in many places
which makes things slightly more complex than it used to be.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ie., the difference between partial and all clearing doesn't
exists anymore since the SACK optimizations got dropped by
an sacktag rewrite.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change __contant_htons() to htons() in the IPVS code when not in an
initializer.
-Brian
Signed-off-by: Brian Haley <brian.haley@hp.com>
Acked-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
This teaches sparse that the following are not problems:
make C=1
CHECK net/ipv4/ipvs/ip_vs_ctl.c
net/ipv4/ipvs/ip_vs_ctl.c:1793:14: warning: context imbalance in 'ip_vs_info_seq_start' - wrong count at exit
net/ipv4/ipvs/ip_vs_ctl.c:1842:13: warning: context imbalance in 'ip_vs_info_seq_stop' - unexpected unlock
Acked-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
ip_vs_conn_new expects a union nf_inet_addr as the type for its address
parameters, not a plain integer.
This problem was detected by sparse.
make C=1
CHECK net/ipv4/ipvs/ip_vs_core.c
net/ipv4/ipvs/ip_vs_core.c:469:9: warning: Using plain integer as NULL pointer
Acked-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Jumping to out unlocks __ip_vs_svc_lock, but that lock is not taken until
after code that may jump to out.
This problem was detected by sparse.
make C=1
CHECK net/ipv4/ipvs/ip_vs_ctl.c
net/ipv4/ipvs/ip_vs_ctl.c:1332:2: warning: context imbalance in 'ip_vs_edit_service' - unexpected unlock
Acked-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
The previous patch in response to the recursive locking on IPsec
reception is broken as it tries to drop the BH socket lock while in
user context.
This patch fixes it by shrinking the section protected by the
socket lock to sock_queue_rcv_skb only. The only reason we added
the lock is for the accounting which happens in that function.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of duplicating the fields, integrate a user stats structure into
the kernel stats structure. This is more robust when the members are
changed, because they are now automatically kept in sync.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Reviewed-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Instead of checking the value in include/net/ip_vs.h, we can just
restrict the range in our Kconfig file. This will prevent values outside
of the range early.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Reviewed-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Remove an incorrect ip_route_me_harder() that was probably a result of
merging my IPv6 patches with the local client patches. With this, IPv6+NAT
are working again.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Now that LVS can load balance locally generated traffic, packets may come
from the loopback device and thus may have a partial checksum.
The existing code allows for the case where there is no checksum at all for
TCP, however Herbert Xu has confirmed that this is not legal.
Signed-off-by: Simon Horman <horms@verge.net.au>
Acked-by: Julius Volz <juliusv@google.com>
How to reproduce ?
- create a network namespace
- use tcp protocol and get timewait socket
- exit the network namespace
- after a moment (when the timewait socket is destroyed), the kernel
panics.
# BUG: unable to handle kernel NULL pointer dereference at
0000000000000007
IP: [<ffffffff821e394d>] inet_twdr_do_twkill_work+0x6e/0xb8
PGD 119985067 PUD 11c5c0067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: ipv6 button battery ac loop dm_mod tg3 libphy ext3 jbd
edd fan thermal processor thermal_sys sg sata_svw libata dock serverworks
sd_mod scsi_mod ide_disk ide_core [last unloaded: freq_table]
Pid: 0, comm: swapper Not tainted 2.6.27-rc2 #3
RIP: 0010:[<ffffffff821e394d>] [<ffffffff821e394d>]
inet_twdr_do_twkill_work+0x6e/0xb8
RSP: 0018:ffff88011ff7fed0 EFLAGS: 00010246
RAX: ffffffffffffffff RBX: ffffffff82339420 RCX: ffff88011ff7ff30
RDX: 0000000000000001 RSI: ffff88011a4d03c0 RDI: ffff88011ac2fc00
RBP: ffffffff823392e0 R08: 0000000000000000 R09: ffff88002802a200
R10: ffff8800a5c4b000 R11: ffffffff823e4080 R12: ffff88011ac2fc00
R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
FS: 0000000041cbd940(0000) GS:ffff8800bff839c0(0000)
knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000007 CR3: 00000000bd87c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff8800bff9e000, task
ffff88011ff76690)
Stack: ffffffff823392e0 0000000000000100 ffffffff821e3a3a
0000000000000008
0000000000000000 ffffffff821e3a61 ffff8800bff7c000 ffffffff8203c7e7
ffff88011ff7ff10 ffff88011ff7ff10 0000000000000021 ffffffff82351108
Call Trace:
<IRQ> [<ffffffff821e3a3a>] ? inet_twdr_hangman+0x0/0x9e
[<ffffffff821e3a61>] ? inet_twdr_hangman+0x27/0x9e
[<ffffffff8203c7e7>] ? run_timer_softirq+0x12c/0x193
[<ffffffff820390d1>] ? __do_softirq+0x5e/0xcd
[<ffffffff8200d08c>] ? call_softirq+0x1c/0x28
[<ffffffff8200e611>] ? do_softirq+0x2c/0x68
[<ffffffff8201a055>] ? smp_apic_timer_interrupt+0x8e/0xa9
[<ffffffff8200cad6>] ? apic_timer_interrupt+0x66/0x70
<EOI> [<ffffffff82011f4c>] ? default_idle+0x27/0x3b
[<ffffffff8200abbd>] ? cpu_idle+0x5f/0x7d
Code: e8 01 00 00 4c 89 e7 41 ff c5 e8 8d fd ff ff 49 8b 44 24 38 4c 89 e7
65 8b 14 25 24 00 00 00 89 d2 48 8b 80 e8 00 00 00 48 f7 d0 <48> 8b 04 d0
48 ff 40 58 e8 fc fc ff ff 48 89 df e8 c0 5f 04 00
RIP [<ffffffff821e394d>] inet_twdr_do_twkill_work+0x6e/0xb8
RSP <ffff88011ff7fed0>
CR2: 0000000000000007
This patch provides a function to purge all timewait sockets related
to a network namespace. The timewait sockets life cycle is not tied with
the network namespace, that means the timewait sockets stay alive while
the network namespace dies. The timewait sockets are for avoiding to
receive a duplicate packet from the network, if the network namespace is
freed, the network stack is removed, so no chance to receive any packets
from the outside world. Furthermore, having a pending destruction timer
on these sockets with a network namespace freed is not safe and will lead
to an oops if the timer callback which try to access data belonging to
the namespace like for example in:
inet_twdr_do_twkill_work
-> NET_INC_STATS_BH(twsk_net(tw), LINUX_MIB_TIMEWAITED);
Purging the timewait sockets at the network namespace destruction will:
1) speed up memory freeing for the namespace
2) fix kernel panic on asynchronous timewait destruction
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Denis V. Lunev <den@openvz.org>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is standard to use ipv6_addr_copy() to fill in
the in6 element of a union nf_inet_addr snet.
Thanks to Julius Volz for pointing this out.
Cc: Brian Haley <brian.haley@hp.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Acked-by: Julius Volz <juliusv@google.com>
Sorry, this was my error.
Thanks to Julius Volz for pointing it out.
Signed-off-by: Simon Horman <horms@verge.net.au>
Acked-by: Julius Volz <juliusv@google.com>
We can't use non-local link-local addresses for destinations, without
knowing the interface on which we can reach the address. Reject them for
now.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
They are only used in this file, so they should be static
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Like the other code in this function does.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
We want a pointer to it, not the value casted to a pointer.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
This allows IPVS to load balance IPv6 connections made by a local process.
For example a proxy server running locally.
External client --> pound:443 -> Local:443 --> IPVS:80 --> RealServer
This is an extenstion to the IPv4 work done in this area
by Siim Põder and Malcolm Turnbull.
Cc: Siim Põder <siim@p6drad-teel.net>
Cc: Malcolm Turnbull <malcolm@loadbalancer.org>
Signed-off-by: Simon Horman <horms@verge.net.au>
This allows IPVS to load balance connections made by a local process.
For example a proxy server running locally.
External client --> pound:443 -> Local:443 --> IPVS:80 --> RealServer
Signed-off-by: Siim Põder <siim@p6drad-teel.net>
Signed-off-by: Malcolm Turnbull <malcolm@loadbalancer.org>
Signed-off-by: Simon Horman <horms@verge.net.au>
Allow adding IPv6 services through the genetlink interface and add checks
to see if the chosen scheduler is supported with IPv6 and whether the
supplied prefix length is sane. Make sure the service count exported via
the sockopt interface only counts IPv4 services.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Register the previously defined or adapted netfilter hook functions for
IPv6 as PF_INET6 hooks.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Adjust various debug outputs to use the new *_BUF macro variants for
correct output of v4/v6 addresses.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Add __ip_vs_addr_is_local_v6() to find out if an IPv6 address belongs to a
local interface. Use this function to decide whether to set the
IP_VS_CONN_F_LOCALNODE flag for IPv6 destinations.
Signed-off-by: Vince Busam <vbusam@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Immediately return from FTP application helper and do nothing when dealing
with IPv6 packets. IPv6 is not supported by this helper yet.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Disable the sync daemon for IPv6 connections, works only with IPv4 for now.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Convert functions for looking up destinations (real servers) to support
IPv6 services/dests.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Add Netfilter hook functions or modify existing ones, if possible, to
process IPv6 packets. Some support functions are also added/modified for
this. ip_vs_nat_icmp_v6() was already added in the patch that added the v6
xmit functions, as it is called from one of them.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Convert ip_vs_schedule() and ip_vs_sched_persist() to support scheduling of
IPv6 connections.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Add xmit functions for IPv6. Also add the already needed __ip_vs_get_out_rt_v6()
to ip_vs_core.c. Bind the new xmit functions to v6 connections.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Add IPv6 support to IP_VS_XMIT() and to the xmit routing cache, introducing
a new function __ip_vs_get_out_rt_v6().
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Extend functions for getting/creating connections and connection
templates for IPv6 support and fix the callers.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Extend protocol DNAT/SNAT and state handlers to work with IPv6. Also
change/introduce new checksumming helper functions for this.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Add 'af' arguments to conn_schedule(), conn_in_get(), conn_out_get() and
csum_check() function pointers in struct ip_vs_protocol. Extend the
respective functions for TCP, UDP, AH and ESP and adjust the callers.
The changes in the callers need to be somewhat extensive, since they now
need to pass a filled out struct ip_vs_iphdr * to the modified functions
instead of a struct iphdr *.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Add 'supports_ipv6' flag to struct ip_vs_scheduler to indicate whether a
scheduler supports IPv6. Set the flag to 1 in schedulers that work with
IPv6, 0 otherwise. This flag is checked in a later patch while trying to
add a service with a specific scheduler. Adjust debug in v6-supporting
schedulers to work with both address families.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Add support for selecting services based on their address family to
ip_vs_service_get() and adjust the callers.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Add support for getting services based on their address family to
__ip_vs_service_get(), __ip_vs_fwm_get() and the helper hash function
ip_vs_svc_hashkey(). Adjust the callers.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Add extended internal versions of struct ip_vs_service_user and struct
ip_vs_dest_user (the originals can't be modified as they are part
of the old sockopt interface). Adjust ip_vs_ctl.c to work with the new
data structures and add some minor AF-awareness.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Introduce new 'af' fields into IPVS data structures for specifying an
entry's address family. Convert IP addresses to be of type union
nf_inet_addr.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Add boolean config option CONFIG_IP_VS_IPV6 for enabling experimental IPv6
support in IPVS. Only visible if IPv6 support is set to 'y' or both IPv6
and IPVS are modules.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
This patch consolidates the code common to TCP and CCID-2:
* TCP uses RFC 3390 in a packet-oriented manner (tcp_input.c) and
* CCID-2 uses RFC 3390 in packet-oriented manner (RFC 4341).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Re-enable IP when the MTU gets back to a valid size.
This patch just checks if the in_dev is NULL on a NETDEV_CHANGEMTU event
and if MTU is valid (bigger than 68), then re-enable in_dev.
Also a function that checks valid MTU size was created.
Signed-off-by: Breno Leitao <leitao@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When scanning route cache hash table, we can avoid taking locks for
empty buckets. Both /proc/net/rt_cache and NETLINK RTM_GETROUTE
interface are taken into account.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On most systems most of the TCP established/time-wait hash buckets are empty.
When walking the hash table for /proc/net/tcp their read locks would
always be aquired just to find out they're empty. This patch changes the code
to check first if the buckets have any entries before taking the lock, which
is much cheaper than taking a lock. Since the hash tables are large
this makes a measurable difference on processing /proc/net/tcp,
especially on architectures with slow read_lock (e.g. PPC)
On a 2GB Core2 system time cat /proc/net/tcp > /dev/null (with a mostly
empty hash table) goes from 0.046s to 0.005s.
On systems with slower atomics (like P4 or POWER4) or larger hash tables
(more RAM) the difference is much higher.
This can be noticeable because there are some daemons around who regularly
scan /proc/net/tcp.
Original idea for this patch from Marcus Meissner, but redone by me.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The size of the TCP header is miscalculated when the window scale ends
up being 0. Additionally, this can be induced by sending a SYN to a
passive open port with a window scale option with value 0.
Signed-off-by: Philip Love <love_phil@emc.com>
Signed-off-by: Adam Langley <agl@imperialviolet.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After integrating ESP into ip_vs_proto_ah, rename it (and the references to
it) to ip_vs_proto_ah_esp.c and delete the old ip_vs_proto_esp.c.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Rename all ah_* functions to ah_esp_* (and adjust comments). Move ESP
protocol definition into ip_vs_proto_ah.c and remove all usage of
ip_vs_proto_esp.c.
Make the compilation of ip_vs_proto_ah.c dependent on a new config
variable, IP_VS_PROTO_AH_ESP, which is selected either by
IP_VS_PROTO_ESP or IP_VS_PROTO_AH. Only compile the selected protocols'
structures within this file.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
net.ipv4.neigh should be a part of skeleton to avoid ordering problems
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>