When you've enabled conntrack and NAT as a module (standard case in all
distributions), and you've also enabled the new conntrack netlink
interface, loading ip_conntrack_netlink.ko will auto-load iptable_nat.ko.
This causes a huge performance penalty, since for every packet you iterate
the nat code, even if you don't want it.
This patch splits iptable_nat.ko into the NAT core (ip_nat.ko) and the
iptables frontend (iptable_nat.ko). Threfore, ip_conntrack_netlink.ko will
only pull ip_nat.ko, but not the frontend. ip_nat.ko will "only" allocate
some resources, but not affect runtime performance.
This separation is also a nice step in anticipation of new packet filters
(nf-hipac, ipset, pkttables) being able to use the NAT core.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The GRE, SCTP and TCP protocol helpers did not call
ip_conntrack_event_cache() when updating ct->status. This patch adds
the respective calls.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have to introduce a separate Kconfig menu entry for the NFQUEUE targets.
They cannot "just" depend on nfnetlink_queue, since nfnetlink_queue could
be linked into the kernel, whereas iptables can be a module.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a number of bugs. It cannot be reasonably split up in
multiple fixes, since all bugs interact with each other and affect the same
function:
Bug #1:
The event cache code cannot be called while a lock is held. Therefore, the
call to ip_conntrack_event_cache() within ip_ct_refresh_acct() needs to be
moved outside of the locked section. This fixes a number of 2.6.14-rcX
oops and deadlock reports.
Bug #2:
We used to call ct_add_counters() for unconfirmed connections without
holding a lock. Since the add operations are not atomic, we could race
with another CPU.
Bug #3:
ip_ct_refresh_acct() lost REFRESH events in some cases where refresh
(and the corresponding event) are desired, but no accounting shall be
performed. Both, evenst and accounting implicitly depended on the skb
parameter bein non-null. We now re-introduce a non-accounting
"ip_ct_refresh()" variant to explicitly state the desired behaviour.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As noted by Alexey Dobriyan, the DEBUGP statement prints the wrong
callID.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the introduction of TSO pcount a year ago, it has been possible
for tcp_fragment() to cause packets_out to decrease. Prior to that,
tcp_retrans_try_collapse() was the only way for that to happen on the
retransmission path.
When this happens with Reno, it is possible for sasked_out to become
invalid because it is only an estimate and not tied to any particular
packet on the retransmission queue.
Therefore we need to adjust sacked_out as well as left_out in the Reno
case. The following patch does exactly that.
This bug is pretty difficult to trigger in practice though since you
need a SACKless peer with a retransmission that occurs just as the
cached MTU value expires.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Patch from Joel Sing to fix the default congestion control algorithm
for incoming connections. If a new congestion control handler is added
(via module), it should become the default for new
connections. Instead, the incoming connections use reno. The cause is
incorrect initialisation causes the tcp_init_congestion_control()
function to return after the initial if test fails.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Ian McDonald <imcdnzl@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cleanup the printk's in fib_trie:
* Convert a couple of places in the dump code to BUG_ON
* Put log level's on each message
The version message really needed the message since it leaks out
on the pretty Fedora bootup.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Robert Olsson <Robert.Olsson@data.slu.se>,
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix unchecked __get_user that could be tricked into generating a
memory read on an arbitrary address. The result of the read is not
returned directly but you may be able to divine some information about
it, or use the read to cause a crash on some architectures by reading
hardware state. CAN-2004-2492.
Fix from Al Viro, ack from Dave Miller.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The problem is that we're now calling tcp_fragment() in a context
where the packets might be marked as SACKED_ACKED or SACKED_RETRANS.
This was not possible before as you never retransmitted packets that
are so marked.
Because of this, we need to adjust sacked_out and retrans_out in
tcp_fragment(). This is exactly what the following patch does.
We also need to preserve the SACKED_ACKED/SACKED_RETRANS marking
if they exist.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Those exports are needed by the PPTP helper following in the next
couple of changes.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both __ip_conntrack_expect_find and ip_conntrack_expect_find_get take
a reference to the expectation, the difference is that callers of
__ip_conntrack_expect_find must hold ip_conntrack_lock.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new "version 3" PPTP conntrack/nat helper is finally ready for
mainline inclusion. Special thanks to lots of last-minute bugfixing
by Patric McHardy.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* This patch is from Paul McKenney's RCU reviewing.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
* Prints the route tnode and set the stats level deepth as before.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_ct_refresh_acct() can be called without a valid "skb" pointer.
This used to work, since ct_add_counters() deals with that fact.
However, the recently-added event cache doesn't handle this at all.
This patch is a quick fix that is supposed to be replaced soon by a cleaner
solution during the pending redesign of the event cache.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of maintaining an array containing a list of nodes this instance
is responsible for let's use a simple bitmap. This provides the
following features:
* clusterip_responsible() and the add_node()/delete_node() operations
become very simple and don't need locking
* the config structure is much smaller
In spite of the completely different internal data representation the
user-space interface remains almost unchanged; the only difference is
that the proc file does not list nodes in the order they were added.
(The target info structure remains the same.)
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The CLUSTERIP target creates a procfs entry for all different cluster
IPs. Although more than one rules can refer to a single cluster IP (and
thus a single config structure), removal of the procfs entry is done
unconditionally in destroy(). In more complicated situations involving
deferred dereferencing of the config structure by procfs and creating a
new rule with the same cluster IP it's also possible that no entry will
be created for the new rule.
This patch fixes the problem by counting the number of entries
referencing a given config structure and moving the config list
manipulation and procfs entry deletion parts to the
clusterip_config_entry_put() function.
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_vs_ftp when loaded can create NAT connections with unknown client
port for passive FTP. For such expectations we lookup with cport=0 on
incoming packet but it matches the format of the persistence templates
causing packets to other persistent virtual servers to be forwarded to
real server without creating connection. Later the reply packets are
treated as foreign and not SNAT-ed.
This patch changes the connection lookup for packets from clients:
* introduce IP_VS_CONN_F_TEMPLATE connection flag to mark the
connection as template
* create new connection lookup function just for templates -
ip_vs_ct_in_get
* make sure ip_vs_conn_in_get hits only connections with
IP_VS_CONN_F_NO_CPORT flag set when s_port is 0. By this way
we avoid returning template when looking for cport=0 (ftp)
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Agostino di Salle noticed that persistent templates are not
invalidated due to buggy optimization.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes line dupes at /ipv4/igmp.c and /ipv6/mcast.c in the
2.6 kernel, where MCAST_EXCLUDE is mistakenly used instead of
MCAST_INCLUDE.
Signed-off-by: Denis Lukianov <denis@voxelsoft.com>
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The problem is that the SACK fragmenting code may incorrectly call
tcp_fragment() with a length larger than the skb->len. This happens
when the skb on the transmit queue completely falls to the LHS of the
SACK.
And add a BUG() check to tcp_fragment() so we can spot this kind of
error more quickly in the future.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 2.6.13-rcX the MASQUERADE target was changed not to exclude local
packets for better source address consistency. This breaks DHCP clients
using UDP sockets when the DHCP requests are caught by a MASQUERADE rule
because the MASQUERADE target drops packets when no address is configured
on the outgoing interface. This patch makes it ignore packets with a
source address of 0.
Thanks to Rusty for this suggestion.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't parse the packet, the data is already available in the conntrack
structure.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
With large port numbers the helper_names buffer can overflow.
Noticed by Samir Bellabes <sbellabes@mandriva.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use schedule_timeout_{,un}interruptible() instead of
set_current_state()/schedule_timeout() to reduce kernel size. Also use
human-time conversion functions instead of hard-coded division to avoid
rounding issues.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is an extra left_out/lost_out adjustment in tcp_fragment which
means that the lost_out accounting is always wrong. This patch removes
that chunk of code.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clean up timer initialization by introducing DEFINE_TIMER a'la
DEFINE_SPINLOCK. Build and boot-tested on x86. A similar patch has been
been in the -RT tree for some time.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With the use of RCU in files structure, the look-up of files using fds can now
be lock-free. The lookup is protected by rcu_read_lock()/rcu_read_unlock().
This patch changes the readers to use lock-free lookup.
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Ravikiran Thirumalai <kiran_th@gmail.com>
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Create one iterator for walking over FIB trie, and use it
for all the /proc functions. Add a /proc/net/route
output for backwards compatibility with old applications.
Make initialization of fib_trie same as fib_hash so no #ifdef
is needed in af_inet.c
Fixes: http://bugzilla.kernel.org/show_bug.cgi?id=5209
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
One such place that can damage the dst refcnts is route.c with
CONFIG_IP_ROUTE_MULTIPATH_CACHED enabled, i don't see the user's
.config. In this new code i see that rt_intern_hash is called before
dst->refcnt is set to 1, dst is the 2nd arg to rt_intern_hash.
Arg 2 of rt_intern_hash must come with refcnt 1 as it is added to
table or dropped depending on error/add/update. One such example is
ip_mkroute_input where __mkroute_input return rth with refcnt 0 which
is provided to rt_intern_hash. ip_mkroute_output looks like a 2nd such
place. Appending untested patch for comments and review. The idea is
to put previous reference as we are going to return next result/error.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
A UDP packet may contain extra data that needs to be trimmed off.
But when doing so, UDP forgets to fixup the skb checksum if CHECKSUM_HW
is being used.
I think this explains the case of a NFS receive using skge driver
causing 'udp hw checksum failures' when interacting with a crufty
settop box.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This was found by inspection while looking for checksum problems
with the skge driver that sets CHECKSUM_HW. It did not fix the
problem, but it looks like it is needed.
If IP reassembly is trimming an overlapping fragment, it
should reset (or adjust) the hardware checksum flag on the skb.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following patch kills __ip_ct_expect_unlink_destroy and export
unlink_expect as ip_ct_unlink_expect. As it was discussed [1], the function
__ip_ct_expect_unlink_destroy is a bit confusing so better do the following
sequence: ip_ct_destroy_expect and ip_conntrack_expect_put.
[1] https://lists.netfilter.org/pipermail/netfilter-devel/2005-August/020794.html
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the NAT module is loaded when connections are already confirmed
it must not change their tuples anymore. This is especially important
with CONFIG_NETFILTER_DEBUG, the netfilter listhelp functions will
refuse to remove an entry from a list when it can not be found on
the list, so when a changed tuple hashes to a new bucket the entry
is kept in the list until and after the conntrack is freed.
Allocate the exact conntrack tuple for NAT for already confirmed
connections or drop them if that fails.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Connection mark tracking support is one of the feature in connection
tracking, so IP_NF_CONNTRACK_MARK depends on IP_NF_CONNTRACK.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
A permanent expectation exists until timeing out and can expect
multiple related connections.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TCP_OFF assignment at the bottom of that if block can indeed set
TCP_OFF without setting TCP_PAGE. Since there is not much to be
gained from avoiding this situation, we might as well just zap the
offset. The following patch should fix it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>