Commit Graph

2696 Commits

Author SHA1 Message Date
Ilpo Järvinen
a1c1f281b8 tcp FRTO: Fix fallback to conventional recovery
It seems that commit 009a2e3e4e ("[TCP] FRTO: Improve
interoperability with other undo_marker users") run into
another land-mine which caused fallback to conventional
recovery to break:

1. Cumulative ACK arrives after FRTO retransmission
2. tcp_try_to_open sees zero retrans_out, clears retrans_stamp
   which should be kept like in CA_Loss state it would be
3. undo_marker change allowed tcp_packet_delayed to return
   true because of the cleared retrans_stamp once FRTO is
   terminated causing LossUndo to occur, which means all loss
   markings FRTO made are reverted.

This means that the conventional recovery basically recovered
one loss per RTT, which is not that efficient. It was quite
unobvious that the undo_marker change broken something like
this, I had a quite long session to track it down because of
the non-intuitiviness of the bug (luckily I had a trivial
reproducer at hand and I was also able to learn to use kprobes
in the process as well :-)).

This together with the NewReno+FRTO fix and FRTO in-order
workaround this fixes Damon's problems, this and the first
mentioned are enough to fix Bugzilla #10063.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Reported-by: Damon L. Chesser <damon@damtek.com>
Tested-by: Damon L. Chesser <damon@damtek.com>
Tested-by: Sebastian Hyrwall <zibbe@cisko.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-13 02:53:26 -07:00
Johannes Berg
f5184d267c net: Allow netdevices to specify needed head/tailroom
This patch adds needed_headroom/needed_tailroom members to struct
net_device and updates many places that allocate sbks to use them. Not
all of them can be converted though, and I'm sure I missed some (I
mostly grepped for LL_RESERVED_SPACE)

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-12 20:48:31 -07:00
Linus Torvalds
28a4acb485 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (32 commits)
  net: Added ASSERT_RTNL() to dev_open() and dev_close().
  can: Fix can_send() handling on dev_queue_xmit() failures
  netns: Fix arbitrary net_device-s corruptions on net_ns stop.
  netfilter: Kconfig: default DCCP/SCTP conntrack support to the protocol config values
  netfilter: nf_conntrack_sip: restrict RTP expect flushing on error to last request
  macvlan: Fix memleak on device removal/crash on module removal
  net/ipv4: correct RFC 1122 section reference in comment
  tcp FRTO: SACK variant is errorneously used with NewReno
  e1000e: don't return half-read eeprom on error
  ucc_geth: Don't use RX clock as TX clock.
  cxgb3: Use CAP_SYS_RAWIO for firmware
  pcnet32: delete non NAPI code from driver.
  fs_enet: Fix a memory leak in fs_enet_mdio_probe
  [netdrvr] eexpress: IPv6 fails - multicast problems
  3c59x: use netstats in net_device structure
  3c980-TX needs EXTRA_PREAMBLE
  fix warning in drivers/net/appletalk/cops.c
  e1000e: Add support for BM PHYs on ICH9
  uli526x: fix endianness issues in the setup frame
  uli526x: initialize the hardware prior to requesting interrupts
  ...
2008-05-08 19:03:26 -07:00
J.H.M. Dassen (Ray)
c67fa02799 net/ipv4: correct RFC 1122 section reference in comment
RFC 1122 does not have a section 3.1.2.2. The requirement to silently
discard datagrams with a bad checksum is in section 3.2.1.2 instead.

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10611

Signed-off-by: J.H.M. Dassen (Ray) <jdassen@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-08 01:11:04 -07:00
Ilpo Järvinen
62ab222783 tcp FRTO: SACK variant is errorneously used with NewReno
Note: there's actually another bug in FRTO's SACK variant, which
is the causing failure in NewReno case because of the error
that's fixed here. I'll fix the SACK case separately (it's
a separate bug really, though related, but in order to fix that
I need to audit tp->snd_nxt usage a bit).

There were two places where SACK variant of FRTO is getting
incorrectly used even if SACK wasn't negotiated by the TCP flow.
This leads to incorrect setting of frto_highmark with NewReno
if a previous recovery was interrupted by another RTO.

An eventual fallback to conventional recovery then incorrectly
considers one or couple of segments as forward transmissions
though they weren't, which then are not LOST marked during
fallback making them "non-retransmittable" until the next RTO.
In a bad case, those segments are really lost and are the only
one left in the window. Thus TCP needs another RTO to continue.
The next FRTO, however, could again repeat the same events
making the progress of the TCP flow extremely slow.

In order for these events to occur at all, FRTO must occur
again in FRTOs step 3 while the key segments must be lost as
well, which is not too likely in practice. It seems to most
frequently with some small devices such as network printers
that *seem* to accept TCP segments only in-order. In cases
were key segments weren't lost, things get automatically
resolved because those wrongly marked segments don't need to be
retransmitted in order to continue.

I found a reproducer after digging up relevant reports (few
reports in total, none at netdev or lkml I know of), some
cases seemed to indicate middlebox issues which seems now
to be a false assumption some people had made. Bugzilla
#10063 _might_ be related. Damon L. Chesser <damon@damtek.com>
had a reproducable case and was kind enough to tcpdump it
for me. With the tcpdump log it was quite trivial to figure
out.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-08 01:09:11 -07:00
Linus Torvalds
4880d10927 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  net_cls_act: act_simple dont ignore realloc code
  iwlwifi: make IWLWIFI a tristate
  Revert "atm: Do not free already unregistered net device."
  dccp: return -EINVAL on invalid feature length
  irda: fix !PNP support for drivers/net/irda/smsc-ircc2.c
  irda: fix !PNP support in drivers/net/irda/nsc-ircc.c
  net_cls_act: Make act_simple use of netlink policy.
  ip: Use inline function dst_metric() instead of direct access to dst->metric[]
  ip: Make use of the inline function dst_metric_locked()
  atm: Bad locking on br2684_devs modifications.
  atm: Do not free already unregistered net device.
  mac80211: Do not free net device after it is unregistered.
  bridge: Consolidate error paths in br_add_bridge().
  bridge: Net device leak in br_add_bridge().
  niu: Fix probing regression for maramba on-board chips.
  lapbeth: Release ->ethdev when unregistering device.
  xfrm: convert empty xfrm_audit_* macros to functions
  net: Fix useless comment reference loop.
  sch_htb: remove from event queue in htb_parent_to_leaf()
2008-05-06 07:49:20 -07:00
Satoru SATOH
5ffc02a158 ip: Use inline function dst_metric() instead of direct access to dst->metric[]
There are functions to refer to the value of dst->metric[THE_METRIC-1]
directly without use of a inline function "dst_metric" defined in
net/dst.h.

The following patch changes them to use the inline function
consistently.

Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-04 22:14:42 -07:00
Satoru SATOH
0bbeafd011 ip: Make use of the inline function dst_metric_locked()
Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-04 22:12:43 -07:00
Linus Torvalds
4f9faaace2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits)
  rose: Wrong list_lock argument in rose_node seqops
  netns: Fix reassembly timer to use the right namespace
  netns: Fix device renaming for sysfs
  bnx2: Update version to 1.7.5.
  bnx2: Update RV2P firmware for 5709.
  bnx2: Zero out context memory for 5709.
  bnx2: Fix register test on 5709.
  bnx2: Fix remote PHY initial link state.
  bnx2: Refine remote PHY locking.
  bridge: forwarding table information for >256 devices
  tg3: Update version to 3.92
  tg3: Add link state reporting to UMP firmware
  tg3: Fix ethtool loopback test for 5761 BX devices
  tg3: Fix 5761 NVRAM sizes
  tg3: Use constant 500KHz MI clock on adapters with a CPMU
  hci_usb.h: fix hard-to-trigger race
  dccp: ccid2.c, ccid3.c use clamp(), clamp_t()
  net: remove NR_CPUS arrays in net/core/dev.c
  net: use get/put_unaligned_* helpers
  bluetooth: use get/put_unaligned_* helpers
  ...
2008-05-03 10:18:21 -07:00
Harvey Harrison
d3e2ce3bcd net: use get/put_unaligned_* helpers
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-02 16:26:16 -07:00
Denis V. Lunev
84841c3c6c ipv4: assign PDE->data before gluing PDE into /proc tree
The check for PDE->data != NULL becomes useless after the replacement
of proc_net_fops_create with proc_create_data.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-02 04:10:08 -07:00
Denis V. Lunev
6e79d85d9a netfilter: assign PDE->data before gluing PDE into /proc tree
Simply replace proc_create and further data assigned with proc_create_data.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-02 02:45:42 -07:00
Roman Zippel
6f6d6a1a6a rename div64_64 to div64_u64
Rename div64_64 to div64_u64 to make it consistent with the other divide
functions, so it clearly includes the type of the divide.  Move its definition
to math64.h as currently no architecture overrides the generic implementation.
 They can still override it of course, but the duplicated declarations are
avoided.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01 08:03:58 -07:00
Harvey Harrison
ab59859de1 net: fix returning void-valued expression warnings
drivers/net/8390.c:37:2: warning: returning void-valued expression
drivers/net/bnx2.c:1635:3: warning: returning void-valued expression
drivers/net/xen-netfront.c:1806:2: warning: returning void-valued expression
net/ipv4/tcp_hybla.c:105:3: warning: returning void-valued expression
net/ipv4/tcp_vegas.c:171:3: warning: returning void-valued expression
net/ipv4/tcp_veno.c:123:3: warning: returning void-valued expression
net/sysctl_net.c:85:2: warning: returning void-valued expression

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-01 02:47:38 -07:00
Linus Torvalds
95dfec6ae1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (53 commits)
  tcp: Overflow bug in Vegas
  [IPv4] UFO: prevent generation of chained skb destined to UFO device
  iwlwifi: move the selects to the tristate drivers
  ipv4: annotate a few functions __init in ipconfig.c
  atm: ambassador: vcc_sf semaphore to mutex
  MAINTAINERS: The socketcan-core list is subscribers-only.
  netfilter: nf_conntrack: padding breaks conntrack hash on ARM
  ipv4: Update MTU to all related cache entries in ip_rt_frag_needed()
  sch_sfq: use del_timer_sync() in sfq_destroy()
  net: Add compat support for getsockopt (MCAST_MSFILTER)
  net: Several cleanups for the setsockopt compat support.
  ipvs: fix oops in backup for fwmark conn templates
  bridge: kernel panic when unloading bridge module
  bridge: fix error handling in br_add_if()
  netfilter: {nfnetlink,ip,ip6}_queue: fix skb_over_panic when enlarging packets
  netfilter: x_tables: fix net namespace leak when reading /proc/net/xxx_tables_names
  netfilter: xt_TCPOPTSTRIP: signed tcphoff for ipv6_skip_exthdr() retval
  tcp: Limit cwnd growth when deferring for GSO
  tcp: Allow send-limited cwnd to grow up to max_burst when gso disabled
  [netdrvr] gianfar: Determine TBIPA value dynamically
  ...
2008-04-30 08:45:48 -07:00
Lachlan Andrew
159131149c tcp: Overflow bug in Vegas
From: Lachlan Andrew <lachlan.andrew@gmail.com>

There is an overflow bug in net/ipv4/tcp_vegas.c for large BDPs
(e.g. 400Mbit/s, 400ms).  The multiplication (old_wnd *
vegas->baseRTT) << V_PARAM_SHIFT overflows a u32.

[ Fix tcp_veno.c too, it has similar calculations. -DaveM ]

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-30 01:04:03 -07:00
Kostya B
be9164e769 [IPv4] UFO: prevent generation of chained skb destined to UFO device
Problem: ip_append_data() could wrongly generate a chained skb for
devices which support UFO.  When sk_write_queue is not empty
(e.g. MSG_MORE), __instead__ of appending data into the next nr_frag
of the queued skb, a new chained skb is created.

I would normally assume UFO device should get data in nr_frags and not
in frag_list.  Later the udp4_hwcsum_outgoing() resets csum to NONE
and skb_gso_segment() has oops.

Proposal:
1. Even length is less than mtu, employ ip_ufo_append_data()
and append data to the __existed__ skb in the sk_write_queue.

2. ip_ufo_append_data() is fixed due to a wrong manipulation of
peek-ing and later enqueue-ing of the same skb.  Now, enqueuing is
always performed, because on error the further
ip_flush_pending_frames() would release the queued skb.

Signed-off-by: Kostya B <bkostya@hotmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-29 22:36:30 -07:00
Sam Ravnborg
45e741b890 ipv4: annotate a few functions __init in ipconfig.c
A few functions are only used from __init context.
So annotate these with __init for consistency and silence
the following warnings:

WARNING: net/ipv4/built-in.o(.text+0x2a876): Section mismatch
         in reference from the function ic_bootp_init() to
         the variable .init.data:bootp_packet_type
WARNING: net/ipv4/built-in.o(.text+0x2a907): Section mismatch
         in reference from the function ic_bootp_cleanup() to
         the variable .init.data:bootp_packet_type

Note: The warnings only appear with CONFIG_DEBUG_SECTION_MISMATCH=y

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-29 20:58:15 -07:00
Hirofumi Nakagawa
801678c5a3 Remove duplicated unlikely() in IS_ERR()
Some drivers have duplicated unlikely() macros.  IS_ERR() already has
unlikely() in itself.

This patch cleans up such pointless code.

Signed-off-by: Hirofumi Nakagawa <hnakagawa@miraclelinux.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jeff Garzik <jeff@garzik.org>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: David Brownell <david-b@pacbell.net>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:06:25 -07:00
Philip Craig
443a70d50b netfilter: nf_conntrack: padding breaks conntrack hash on ARM
commit 0794935e "[NETFILTER]: nf_conntrack: optimize hash_conntrack()"
results in ARM platforms hashing uninitialised padding.  This padding
doesn't exist on other architectures.

Fix this by replacing NF_CT_TUPLE_U_BLANK() with memset() to ensure
everything is initialised.  There were only 4 bytes that
NF_CT_TUPLE_U_BLANK() wasn't clearing anyway (or 12 bytes on ARM).

Signed-off-by: Philip Craig <philipc@snapgear.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-29 03:35:10 -07:00
Timo Teras
0010e46577 ipv4: Update MTU to all related cache entries in ip_rt_frag_needed()
Add struct net_device parameter to ip_rt_frag_needed() and update MTU to
cache entries where ifindex is specified. This is similar to what is
already done in ip_rt_redirect().

Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-29 03:32:25 -07:00
David L Stevens
42908c69f6 net: Add compat support for getsockopt (MCAST_MSFILTER)
This patch adds support for getsockopt for MCAST_MSFILTER for
both IPv4 and IPv6. It depends on the previous setsockopt patch,
and uses the same method.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-29 03:23:22 -07:00
Julian Anastasov
2ad17defd5 ipvs: fix oops in backup for fwmark conn templates
Fixes bug http://bugzilla.kernel.org/show_bug.cgi?id=10556
where conn templates with protocol=IPPROTO_IP can oops backup box.

        Result from ip_vs_proto_get() should be checked because
protocol value can be invalid or unsupported in backup. But
for valid message we should not fail for templates which use
IPPROTO_IP. Also, add checks to validate message limits and
connection state. Show state NONE for templates using IPPROTO_IP.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-29 03:21:23 -07:00
Arnaud Ebalard
9a732ed6d0 netfilter: {nfnetlink,ip,ip6}_queue: fix skb_over_panic when enlarging packets
While reinjecting *bigger* modified versions of IPv6 packets using
libnetfilter_queue, things work fine on a 2.6.24 kernel (2.6.22 too)
but I get the following on recents kernels (2.6.25, trace below is
against today's net-2.6 git tree):

skb_over_panic: text:c04fddb0 len:696 put:632 head:f7592c00 data:f7592c00 tail:0xf7592eb8 end:0xf7592e80 dev:eth0
------------[ cut here ]------------
invalid opcode: 0000 [#1] PREEMPT 
Process sendd (pid: 3657, ti=f6014000 task=f77c31d0 task.ti=f6014000)
Stack: c071e638 c04fddb0 000002b8 00000278 f7592c00 f7592c00 f7592eb8 f7592e80 
       f763c000 f6bc5200 f7592c40 f6015c34 c04cdbfc f6bc5200 00000278 f6015c60 
       c04fddb0 00000020 f72a10c0 f751b420 00000001 0000000a 000002b8 c065582c 
Call Trace:
 [<c04fddb0>] ? nfqnl_recv_verdict+0x1c0/0x2e0
 [<c04cdbfc>] ? skb_put+0x3c/0x40
 [<c04fddb0>] ? nfqnl_recv_verdict+0x1c0/0x2e0
 [<c04fd115>] ? nfnetlink_rcv_msg+0xf5/0x160
 [<c04fd03e>] ? nfnetlink_rcv_msg+0x1e/0x160
 [<c04fd020>] ? nfnetlink_rcv_msg+0x0/0x160
 [<c04f8ed7>] ? netlink_rcv_skb+0x77/0xa0
 [<c04fcefc>] ? nfnetlink_rcv+0x1c/0x30
 [<c04f8c73>] ? netlink_unicast+0x243/0x2b0
 [<c04cfaba>] ? memcpy_fromiovec+0x4a/0x70
 [<c04f9406>] ? netlink_sendmsg+0x1c6/0x270
 [<c04c8244>] ? sock_sendmsg+0xc4/0xf0
 [<c011970d>] ? set_next_entity+0x1d/0x50
 [<c0133a80>] ? autoremove_wake_function+0x0/0x40
 [<c0118f9e>] ? __wake_up_common+0x3e/0x70
 [<c0342fbf>] ? n_tty_receive_buf+0x34f/0x1280
 [<c011d308>] ? __wake_up+0x68/0x70
 [<c02cea47>] ? copy_from_user+0x37/0x70
 [<c04cfd7c>] ? verify_iovec+0x2c/0x90
 [<c04c837a>] ? sys_sendmsg+0x10a/0x230
 [<c011967a>] ? __dequeue_entity+0x2a/0xa0
 [<c011970d>] ? set_next_entity+0x1d/0x50
 [<c0345397>] ? pty_write+0x47/0x60
 [<c033d59b>] ? tty_default_put_char+0x1b/0x20
 [<c011d2e9>] ? __wake_up+0x49/0x70
 [<c033df99>] ? tty_ldisc_deref+0x39/0x90
 [<c033ff20>] ? tty_write+0x1a0/0x1b0
 [<c04c93af>] ? sys_socketcall+0x7f/0x260
 [<c0102ff9>] ? sysenter_past_esp+0x6a/0x91
 [<c05f0000>] ? snd_intel8x0m_probe+0x270/0x6e0
 =======================
Code: 00 00 89 5c 24 14 8b 98 9c 00 00 00 89 54 24 0c 89 5c 24 10 8b 40 50 89 4c 24 04 c7 04 24 38 e6 71 c0 89 44 24 08 e8 c4 46 c5 ff <0f> 0b eb fe 55 89 e5 56 89 d6 53 89 c3 83 ec 0c 8b 40 50 39 d0 
EIP: [<c04ccdfc>] skb_over_panic+0x5c/0x60 SS:ESP 0068:f6015bf8


Looking at the code, I ended up in nfq_mangle() function (called by
nfqnl_recv_verdict()) which performs a call to skb_copy_expand() due to
the increased size of data passed to the function. AFAICT, it should ask
for 'diff' instead of 'diff - skb_tailroom(e->skb)'. Because the
resulting sk_buff has not enough space to support the skb_put(skb, diff)
call a few lines later, this results in the call to skb_over_panic().

The patch below asks for allocation of a copy with enough space for
mangled packet and the same amount of headroom as old sk_buff. While
looking at how the regression appeared (e2b58a67), I noticed the same
pattern in ipq_mangle_ipv6() and ipq_mangle_ipv4(). The patch corrects
those locations too.

Tested with bigger reinjected IPv6 packets (nfqnl_mangle() path), things
are ok (2.6.25 and today's net-2.6 git tree).

Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-29 03:16:34 -07:00
John Heffner
246eb2af06 tcp: Limit cwnd growth when deferring for GSO
This fixes inappropriately large cwnd growth on sender-limited flows
when GSO is enabled, limiting cwnd growth to 64k.

Signed-off-by: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-29 03:13:52 -07:00
John Heffner
ce447eb914 tcp: Allow send-limited cwnd to grow up to max_burst when gso disabled
This changes the logic in tcp_is_cwnd_limited() so that cwnd may grow
up to tcp_max_burst() even when sk_can_gso() is false, or when
sysctl_tcp_tso_win_divisor != 0.

Signed-off-by: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-29 03:13:02 -07:00
Linus Torvalds
77a50df2b1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  iwlwifi: Allow building iwl3945 without iwl4965.
  wireless: Fix compile error with wifi & leds
  tcp: Fix slab corruption with ipv6 and tcp6fuzz
  ipv4/ipv6 compat: Fix SSM applications on 64bit kernels.
  [IPSEC]: Use digest_null directly for auth
  sunrpc: fix missing kernel-doc
  can: Fix copy_from_user() results interpretation
  Revert "ipv6: Fix typo in net/ipv6/Kconfig"
  tipc: endianness annotations
  ipv6: result of csum_fold() is already 16bit, no need to cast
  [XFRM] AUDIT: Fix flowlabel text format ambibuity.
2008-04-28 09:44:11 -07:00
Evgeniy Polyakov
9ae27e0adb tcp: Fix slab corruption with ipv6 and tcp6fuzz
From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

This fixes a regression added by ec3c0982a2
("[TCP]: TCP_DEFER_ACCEPT updates - process as established")

tcp_v6_do_rcv()->tcp_rcv_established(), the latter goes to step5, where
eventually skb can be freed via tcp_data_queue() (drop: label), then if
check for tcp_defer_accept_check() returns true and thus
tcp_rcv_established() returns -1, which forces tcp_v6_do_rcv() to jump
to reset: label, which in turn will pass through discard: label and free
the same skb again.

Tested by Eric Sesterhenn.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-By: Patrick McManus <mcmanus@ducksong.com>
2008-04-27 15:27:30 -07:00
David L Stevens
dae5029548 ipv4/ipv6 compat: Fix SSM applications on 64bit kernels.
Add support on 64-bit kernels for seting 32-bit compatible MCAST*
socket options.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-27 14:26:53 -07:00
Linus Torvalds
2e561c7b7e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (48 commits)
  net: Fix wrong interpretation of some copy_to_user() results.
  xfrm: alg_key_len & alg_icv_len should be unsigned
  [netdrvr] tehuti: move ioctl perm check closer to function start
  ipv6: Fix typo in net/ipv6/Kconfig
  via-velocity: fix vlan receipt
  tg3: sparse cleanup
  forcedeth: realtek phy crossover detection
  ibm_newemac: Increase MDIO timeouts
  gianfar: Fix skb allocation strategy
  netxen: reduce stack usage of netxen_nic_flash_print
  smc911x: test after postfix decrement fails in smc911x_{reset,drop_pkt}
  net drivers: fix platform driver hotplug/coldplug
  forcedeth: new backoff implementation
  ehea: make things static
  phylib: Add support for board-level PHY fixups
  [netdrvr] atlx: code movement: move atl1 parameter parsing
  atlx: remove flash vendor parameter
  korina: misc cleanup
  korina: fix misplaced return statement
  WAN: Fix confusing insmod error code for C101 too.
  ...
2008-04-25 12:28:28 -07:00
Tom Quetchenbach
8d390efd90 tcp: tcp_probe buffer overflow and incorrect return value
tcp_probe has a bounds-checking bug that causes many programs (less,
python) to crash reading /proc/net/tcp_probe. When it outputs a log
line to the reader, it only checks if that line alone will fit in the
reader's buffer, rather than that line and all the previous lines it
has already written.

tcpprobe_read also returns the wrong value if copy_to_user fails--it
just passes on the return value of copy_to_user (number of bytes not
copied), which makes a failure look like a success.

This patch fixes the buffer overflow and sets the return value to
-EFAULT if copy_to_user fails.

Patch is against latest net-2.6; tested briefly and seems to fix the
crashes in less and python.

Signed-off-by: Tom Quetchenbach <virtualphtn@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-24 21:11:58 -07:00
Linus Torvalds
d02aacff44 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (22 commits)
  tun: Multicast handling in tun_chr_ioctl() needs proper locking.
  [NET]: Fix heavy stack usage in seq_file output routines.
  [AF_UNIX] Initialise UNIX sockets before general device initcalls
  [RTNETLINK]: Fix bogus ASSERT_RTNL warning
  iwlwifi: Fix built-in compilation of iwlcore (part 2)
  tun: Fix minor race in TUNSETLINK ioctl handling.
  ppp_generic: use stats from net_device structure
  iwlwifi: Don't unlock priv->mutex if it isn't locked
  wireless: rndis_wlan: modparam_workaround_interval is never below 0.
  prism54: prism54_get_encode() test below 0 on unsigned index
  mac80211: update mesh EID values
  b43: Workaround DMA quirks
  mac80211: fix use before check of Qdisc length
  net/mac80211/rx.c: fix off-by-one
  mac80211: Fix race between ieee80211_rx_bss_put and lookup routines.
  ath5k: Fix radio identification on AR5424/2424
  ssb: Fix all-ones boardflags
  b43: Add more btcoexist workarounds
  b43: Fix HostFlags data types
  b43: Workaround invalid bluetooth settings
  ...
2008-04-24 08:40:34 -07:00
Pavel Emelyanov
5e659e4cb0 [NET]: Fix heavy stack usage in seq_file output routines.
Plan C: we can follow the Al Viro's proposal about %n like in this patch.
The same applies to udp, fib (the /proc/net/route file), rt_cache and 
sctp debug. This is minus ~150-200 bytes for each.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-24 01:02:16 -07:00
Linus Torvalds
b0d19a378a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  iwlwifi: Fix built-in compilation of iwlcore
  net: Unexport move_addr_to_{kernel,user}
  rt2x00: Select LEDS_CLASS.
  iwlwifi: Select LEDS_CLASS.
  leds: Do not guard NEW_LEDS with HAS_IOMEM
  [IPSEC]: Fix catch-22 with algorithm IDs above 31
  time: Export set_normalized_timespec.
  tcp: Make use of before macro in tcp_input.c
  hamradio: Remove unneeded and deprecated cli()/sti() calls in dmascc.c
  [NETNS]: Remove empty ->init callback.
  [DCCP]: Convert do_gettimeofday() to getnstimeofday().
  [NETNS]: Don't initialize err variable twice.
  [NETNS]: The ip6_fib_timer can work with garbage on net namespace stop.
  [IPV4]: Convert do_gettimeofday() to getnstimeofday().
  [IPV4]: Make icmp_sk_init() static.
  [IPV6]: Make struct ip6_prohibit_entry_template static.
  tcp: Trivial fix to correct function name in a comment in net/ipv4/tcp.c
  [NET]: Expose netdevice dev_id through sysfs
  skbuff: fix missing kernel-doc notation
  [ROSE]: Fix soft lockup wrt. rose_node_list_lock
2008-04-23 12:23:45 -07:00
Linus Torvalds
529a41e366 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  rose: Socket lock was not released before returning to user space
  hci_usb: remove code obfuscation
  drivers/net/appletalk: use time_before, time_before_eq, etc
  drivers/atm: use time_before, time_before_eq, etc
  hci_usb: do not initialize static variables to 0
  tg3: 5701 DMA corruption fix
  atm nicstar: Removal of debug code containing deprecated calls to cli()/sti()
  iwlwifi: Fix unconditional access to station->tidp[].agg.
  netfilter: Fix SIP conntrack build with NAT disabled.
  netfilter: Fix SCTP nat build.
2008-04-21 15:46:17 -07:00
Arnd Hannemann
d7ee147d4f tcp: Make use of before macro in tcp_input.c
Make use of tcp before macro.

Signed-off-by: Arnd Hannemann <hannemann@nets.rwth-aachen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-21 14:46:22 -07:00
YOSHIFUJI Hideaki
f25c3d613b [IPV4]: Convert do_gettimeofday() to getnstimeofday().
What do_gettimeofday() does is to call getnstimeofday() and
to convert the result from timespec{} to timeval{}.
After that, these callers convert the result again to msec.
Use getnstimeofday() and convert the units at once.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-21 02:34:08 -07:00
Adrian Bunk
263173af5b [IPV4]: Make icmp_sk_init() static.
This patch makes the needlessly global icmp_sk_init() static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-21 02:31:23 -07:00
Satoru SATOH
1f29b0584d tcp: Trivial fix to correct function name in a comment in net/ipv4/tcp.c
This is a trivial fix to correct function name in a comment in
net/ipv4/tcp.c.

Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-21 02:27:58 -07:00
Patrick McHardy
4e9d8a70e4 netfilter: Fix SCTP nat build.
We need to select LIBCRC32C.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-19 17:52:51 -07:00
Matthew Wilcox
5f090dcb4d net: Remove unnecessary inclusions of asm/semaphore.h
None of these files use any of the functionality promised by
asm/semaphore.h.  It's possible that they rely on it dragging in some
unrelated header file, but I can't build all these files, so we'll have
fix any build failures as they come up.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2008-04-18 22:15:50 -04:00
David S. Miller
1e42198609 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 2008-04-17 23:56:30 -07:00
Pavel Emelyanov
53083773dc [INET]: Uninline the __inet_inherit_port call.
This deblats ~200 bytes when ipv6 and dccp are 'y'.

Besides, this will ease compilation issues for patches
I'm working on to make inet hash tables more scalable 
wrt net namespaces.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-17 23:18:15 -07:00
Linus Torvalds
b4b8f57965 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  [TCP]: Add return value indication to tcp_prune_ofo_queue().
  PS3: gelic: fix the oops on the broken IE returned from the hypervisor
  b43legacy: fix DMA mapping leakage
  mac80211: remove message on receiving unexpected unencrypted frames
  Update rt2x00 MAINTAINERS entry
  Add rfkill to MAINTAINERS file
  rfkill: Fix device type check when toggling states
  b43legacy: Fix usage of struct device used for DMAing
  ssb: Fix usage of struct device used for DMAing
  MAINTAINERS: move to generic repository for iwlwifi
  b43legacy: fix initvals loading on bcm4303
  rtl8187: Add missing priv->vif assignments
  netconsole: only set CON_PRINTBUFFER if the user specifies a netconsole
  [CAN]: Update documentation of struct sockaddr_can
  MAINTAINERS: isdn4linux@listserv.isdn4linux.de is subscribers-only
  [TCP]: Fix never pruned tcp out-of-order queue.
  [NET_SCHED] sch_api: fix qdisc_tree_decrease_qlen() loop
2008-04-16 07:44:27 -07:00
Denis V. Lunev
8c5da49a63 [NETNS]: Add netns refcnt debug for inet bind buckets.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 02:01:11 -07:00
Denis V. Lunev
57d7a60092 [NETNS]: Add netns refcnt debug into fib_info.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 02:00:50 -07:00
Denis V. Lunev
cd5342d905 [NETNS]: Add netns refcnt debug for timewait buckets.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 02:00:28 -07:00
Pavel Emelyanov
b0970c428b [SIT]: Allow for IPPROTO_IPV6 protocol in namespaces.
This makes sit-generated traffic enter the namespace.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:17:39 -07:00
Pavel Emelyanov
f96c148fd5 [GRE]: Allow for IPPROTO_GRE protocol in namespaces.
This one was also disabled by default for sanity.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:11:36 -07:00
Pavel Emelyanov
0b67eceb19 [GRE]: Allow to create IPGRE tunnels in net namespaces.
I.e. set the proper net and mark as NETNS_LOCAL.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:11:13 -07:00
Pavel Emelyanov
96635522f7 [GRE]: Use proper net in routing calls.
As for the IPIP tunnel, there are some ip_route_output_key()
calls in there that require a proper net so give one to them.

And a proper net for the __get_dev_by_index hanging around.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:10:44 -07:00
Pavel Emelyanov
eb8ce741a3 [GRE]: Make tunnels hashes per-net.
Very similar to what was done for the IPIP code.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:10:26 -07:00
Pavel Emelyanov
7daa000489 [GRE]: Make the fallback tunnel device per-net.
Everything is prepared for this change now. Create on in
init callback, use it over the code and destroy on net exit.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:10:05 -07:00
Pavel Emelyanov
3b4667f3db [GRE]: Use proper net in hash-lookup functions.
This is the part#2 of the patch #2 - get the proper net for
these functions. This change in a separate patch in order not
to get lost in a large previous patch.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:09:44 -07:00
Pavel Emelyanov
f57e7d5a7b [GRE]: Add net/gre_net argument to some functions.
The fallback device and hashes are to become per-net, but many
code doesn't have anything to get the struct net pointer from.

So pass the proper net there with an extra argument.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:09:22 -07:00
Pavel Emelyanov
59a4c7594b [GRE]: Introduce empty ipgre_net structure and net init/exit ops.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:08:53 -07:00
Pavel Emelyanov
4597a0ce08 [IPIP]: Allow for IPPROTO_IPIP protocol in namespaces.
This one was disabled by default for sanity.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:06:56 -07:00
Pavel Emelyanov
0a826406d4 [IPIP]: Allow to create IPIP tunnels in net namespaces.
Set the proper net before calling register_netdev and disable 
the tunnel device netns changing.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:06:18 -07:00
Pavel Emelyanov
b99f0152e5 [IPIP]: Use proper net in (mostly) routing calls.
There are some ip_route_output_key() calls in there that require
a proper net so give one to them.

Besides - give a proper net to a single __get_dev_by_index call 
in ipip_tunnel_bind_dev().

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:05:57 -07:00
Pavel Emelyanov
44d3c299dc [IPIP]: Make tunnels hashes per net.
Either net or ipip_net already exists in all the required 
places, so just use one.

Besides, tune net_init and net_exit calls to respectively 
initialize the hashes and destroy devices.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:05:32 -07:00
Pavel Emelyanov
cec3ffae1a [IPIP]: Use proper net in hash-lookup functions.
This is the part#2 of the previous patch - get the proper
net for these functions.

I make it in a separate patch, so that this change does not
get lost in a large previous patch.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:05:03 -07:00
Pavel Emelyanov
b9fae5c913 [IPIP]: Add net/ipip_net argument to some functions.
The hashes of tunnels will be per-net too, so prepare all the 
functions that uses them for this change by adding an argument.

Use init_net temporarily in places, where the net does not exist
explicitly yet.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:04:35 -07:00
Pavel Emelyanov
b9855c54da [IPIP]: Make the fallback tunnel device per-net.
Create on in ipip_init_net(), use it all over the code (the
proper place to get the net from already exists) and destroy
in ipip_net_exit().

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:04:13 -07:00
Pavel Emelyanov
10dc4c7bb7 [IPIP]: Introduce empty ipip_net structure and net init/exit ops.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-16 01:03:13 -07:00
Ilpo Järvinen
17515408a1 [TCP]: Remove superflushious skb == write_queue_tail() check
Needed can only be more strict than what was checked by the
earlier common case check for non-tail skbs, thus
cwnd_len <= needed will never match in that case anyway.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-15 20:36:55 -07:00
Vitaliy Gusev
56f367bbfd [TCP]: Add return value indication to tcp_prune_ofo_queue().
Returns non-zero if tp->out_of_order_queue was seen non-empty.
This allows tcp_try_rmem_schedule() to return early.

Signed-off-by: Vitaliy Gusev <vgusev@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-15 20:26:34 -07:00
Vitaliy Gusev
b000cd3707 [TCP]: Fix never pruned tcp out-of-order queue.
tcp_prune_queue() doesn't prune an out-of-order queue at all.
Therefore sk_rmem_schedule() can fail but the out-of-order queue isn't
pruned . This can lead to tcp deadlock state if the next two
conditions are held:

1. There are a sequence hole between last received in
   order segment and segments enqueued to the out-of-order queue.

2. Size of all segments in the out-of-order queue is more than tcp_mem[2].

Signed-off-by: Vitaliy Gusev <vgusev@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-15 00:33:38 -07:00
Linus Torvalds
533bb8a4d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (31 commits)
  [BRIDGE]: Fix crash in __ip_route_output_key with bridge netfilter
  [NETFILTER]: ipt_CLUSTERIP: fix race between clusterip_config_find_get and _entry_put
  [IPV6] ADDRCONF: Don't generate temporary address for ip6-ip6 interface.
  [IPV6] ADDRCONF: Ensure disabling multicast RS even if privacy extensions are disabled.
  [IPV6]: Use appropriate sock tclass setting for routing lookup.
  [IPV6]: IPv6 extension header structures need to be packed.
  [IPV6]: Fix ipv6 address fetching in raw6_icmp_error().
  [NET]: Return more appropriate error from eth_validate_addr().
  [ISDN]: Do not validate ISDN net device address prior to interface-up
  [NET]: Fix kernel-doc for skb_segment
  [SOCK] sk_stamp: should be initialized to ktime_set(-1L, 0)
  net: check for underlength tap writes
  net: make struct tun_struct private to tun.c
  [SCTP]: IPv4 vs IPv6 addresses mess in sctp_inet[6]addr_event.
  [SCTP]: Fix compiler warning about const qualifiers
  [SCTP]: Fix protocol violation when receiving an error lenght INIT-ACK
  [SCTP]: Add check for hmac_algo parameter in sctp_verify_param()
  [NET_SCHED] cls_u32: refcounting fix for u32_delete()
  [DCCP]: Fix skb->cb conflicts with IP
  [AX25]: Potential ax25_uid_assoc-s leaks on module unload.
  ...
2008-04-14 07:56:24 -07:00
YOSHIFUJI Hideaki
569508c964 [TCP]: Format addresses appropriately in debug messages.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-14 04:09:36 -07:00
YOSHIFUJI Hideaki
a7d632b6b4 [IPV4]: Use NIPQUAD_FMT to format ipv4 addresses.
And use %u to format port.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-14 04:09:00 -07:00
David S. Miller
334f8b2afd Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6.26 2008-04-14 03:50:43 -07:00
Pavel Emelyanov
7477fd2e6b [SOCK]: Add some notes about per-bind-bucket sock lookup.
I was asked about "why don't we perform a sk_net filtering in
bind_conflict calls, like we do in other sock lookup places"
for a couple of times.

Can we please add a comment about why we do not need one?

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-14 02:42:27 -07:00
David S. Miller
df39e8ba56 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/ehea/ehea_main.c
	drivers/net/wireless/iwlwifi/Kconfig
	drivers/net/wireless/rt2x00/rt61pci.c
	net/ipv4/inet_timewait_sock.c
	net/ipv6/raw.c
	net/mac80211/ieee80211_sta.c
2008-04-14 02:30:23 -07:00
Jan Engelhardt
3c9fba656a [NETFILTER]: nf_conntrack: replace NF_CT_DUMP_TUPLE macro indrection by function call
Directly call IPv4 and IPv6 variants where the address family is
easily known.

Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:54 +02:00
Jan Engelhardt
12c33aa20e [NETFILTER]: nf_conntrack: const annotations in nf_conntrack_sctp, nf_nat_proto_gre
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:54 +02:00
Jan Engelhardt
f2ea825f48 [NETFILTER]: nf_nat: use bool type in nf_nat_proto
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:53 +02:00
Jan Engelhardt
09f263cd39 [NETFILTER]: nf_conntrack: use bool type in struct nf_conntrack_l4proto
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:53 +02:00
Jan Engelhardt
8ce8439a31 [NETFILTER]: nf_conntrack: use bool type in struct nf_conntrack_l3proto
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:52 +02:00
Patrick McHardy
5e8fbe2ac8 [NETFILTER]: nf_conntrack: add tuplehash l3num/protonum accessors
Add accessors for l3num and protonum and get rid of some overly long
expressions.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:52 +02:00
Patrick McHardy
dd13b01036 [NETFILTER]: nf_nat: kill helper and seq_adjust hooks
Connection tracking helpers (specifically FTP) need to be called
before NAT sequence numbers adjustments are performed to be able
to compare them against previously seen ones. We've introduced
two new hooks around 2.6.11 to maintain this ordering when NAT
modules were changed to get called from conntrack helpers directly.

The cost of netfilter hooks is quite high and sequence number
adjustments are only rarely needed however. Add a RCU-protected
sequence number adjustment function pointer and call it from
IPv4 conntrack after calling the helper.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:52 +02:00
Patrick McHardy
8c87238b72 [NETFILTER]: nf_nat: don't add NAT extension for confirmed conntracks
Adding extensions to confirmed conntracks is not allowed to avoid races
on reallocation. Don't setup NAT for confirmed conntracks in case NAT
module is loaded late.

The has one side-effect, the connections existing before the NAT module
was loaded won't enter the bysource hash. The only case where this actually
makes a difference is in case of SNAT to a multirange where the IP before
NAT is also part of the range. Since old connections don't enter the
bysource hash the first new connection from the IP will have a new address
selected. This shouldn't matter at all.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:51 +02:00
Patrick McHardy
42cf800c24 [NETFILTER]: nf_nat: remove obsolete check for ICMP redirects
Locally generated ICMP packets have a reference to the conntrack entry
of the original packet manually attached by icmp_send(). Therefore the
check for locally originated untracked ICMP redirects can never be
true.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:50 +02:00
Patrick McHardy
9d908a69a3 [NETFILTER]: nf_nat: add SCTP protocol support
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:50 +02:00
Patrick McHardy
4910a08799 [NETFILTER]: nf_nat: add DCCP protocol support
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:50 +02:00
Patrick McHardy
d63a650736 [NETFILTER]: Add partial checksum validation helper
Move the UDP-Lite conntrack checksum validation to a generic helper
similar to nf_checksum() and make it fall back to nf_checksum()
in case the full packet is to be checksummed and hardware checksums
are available. This is to be used by DCCP conntrack, which also
needs to verify partial checksums.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:49 +02:00
Patrick McHardy
6185f870e2 [NETFILTER]: nf_nat: add UDP-Lite support
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:48 +02:00
Patrick McHardy
2d2d84c40e [NETFILTER]: nf_nat: remove unused name from struct nf_nat_protocol
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:48 +02:00
Patrick McHardy
ca6a507490 [NETFILTER]: nf_conntrack_netlink: clean up NAT protocol parsing
Move responsibility for setting the IP_NAT_RANGE_PROTO_SPECIFIED flag
to the NAT protocol, properly propagate errors and get rid of ugly
return value convention.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:47 +02:00
Patrick McHardy
535b57c7c1 [NETFILTER]: nf_nat: move NAT ctnetlink helpers to nf_nat_proto_common
Move to nf_nat_proto_common and rename to nf_nat_proto_... since they're
also used by protocols that don't have port numbers.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:47 +02:00
Patrick McHardy
5abd363f73 [NETFILTER]: nf_nat: fix random mode not to overwrite port rover
The port rover should not get overwritten when using random mode,
otherwise other rules will also use more or less random ports.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:46 +02:00
Patrick McHardy
937e0dfd87 [NETFILTER]: nf_nat: add helpers for common NAT protocol operations
Add generic ->in_range and ->unique_tuple ops to avoid duplicating them
again and again for future NAT modules and save a few bytes of text:

net/ipv4/netfilter/nf_nat_proto_tcp.c:
  tcp_in_range     |  -62 (removed)
  tcp_unique_tuple | -259 # 271 -> 12, # inlines: 1 -> 0, size inlines: 7 -> 0
 2 functions changed, 321 bytes removed

net/ipv4/netfilter/nf_nat_proto_udp.c:
  udp_in_range     |  -62 (removed)
  udp_unique_tuple | -259 # 271 -> 12, # inlines: 1 -> 0, size inlines: 7 -> 0
 2 functions changed, 321 bytes removed

net/ipv4/netfilter/nf_nat_proto_gre.c:
  gre_in_range |  -62 (removed)
 1 function changed, 62 bytes removed

vmlinux:
 5 functions changed, 704 bytes removed

Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:46 +02:00
Patrick McHardy
544473c166 [NETFILTER]: {ip,ip6,arp}_tables: return EAGAIN for invalid SO_GET_ENTRIES size
Rule dumping is performed in two steps: first userspace gets the
ruleset size using getsockopt(SO_GET_INFO) and allocates memory,
then it calls getsockopt(SO_GET_ENTRIES) to actually dump the
ruleset. When another process changes the ruleset in between the
sizes from the first getsockopt call doesn't match anymore and
the kernel aborts. Unfortunately it returns EAGAIN, as for multiple
other possible errors, so userspace can't distinguish this case
from real errors.

Return EAGAIN so userspace can retry the operation.

Fixes (with current iptables SVN version) netfilter bugzilla #104.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:45 +02:00
Jan Engelhardt
c2f9c68398 [NETFILTER]: Explicitly initialize .priority in arptable_filter
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:44 +02:00
Jan Engelhardt
3bb0362d2f [NETFILTER]: remove arpt_(un)register_target indirection macros
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:44 +02:00
Jan Engelhardt
95eea855af [NETFILTER]: remove arpt_target indirection macro
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:43 +02:00
Jan Engelhardt
4abff0775d [NETFILTER]: remove arpt_table indirection macro
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:43 +02:00
Jan Engelhardt
72b72949db [NETFILTER]: annotate rest of nf_nat_* with const
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:42 +02:00
Jan Engelhardt
5452e425ad [NETFILTER]: annotate {arp,ip,ip6,x}tables with const
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 11:15:35 +02:00
Jan Engelhardt
3cf93c96af [NETFILTER]: annotate xtables targets with const and remove casts
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 09:56:05 +02:00
Robert P. J. Day
fdccecd0cc [NETFILTER]: Use non-deprecated __RW_LOCK_UNLOCKED macro
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 09:56:03 +02:00
Alexey Dobriyan
666953df35 [NETFILTER]: ip_tables: per-netns FILTER/MANGLE/RAW tables for real
Commit 9335f047fe aka
"[NETFILTER]: ip_tables: per-netns FILTER, MANGLE, RAW"
added per-netns _view_ of iptables rules. They were shown to user, but
ignored by filtering code. Now that it's possible to at least ping loopback,
per-netns tables can affect filtering decisions.

netns is taken in case of
	PRE_ROUTING, LOCAL_IN -- from in device,
	POST_ROUTING, LOCAL_OUT -- from out device,
	FORWARD -- from in device which should be equal to out device's netns.
		   This code is relatively new, so BUG_ON was plugged.

Wrappers were added to a) keep code the same from CONFIG_NET_NS=n users
(overwhelming majority), b) consolidate code in one place -- similar
changes will be done in ipv6 and arp netfilter code.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 09:56:02 +02:00
Patrick McHardy
36e2a1b0f7 [NETFILTER]: {ip,ip6}t_LOG: print MARK value in log output
Dump the mark value in log messages similar to nfnetlink_log. This
is useful for debugging complex setups where marks are used for
routing or traffic classification.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 09:56:01 +02:00
Pavel Emelyanov
4dee959723 [NETFILTER]: ipt_CLUSTERIP: fix race between clusterip_config_find_get and _entry_put
Consider we are putting a clusterip_config entry with the "entries"
count == 1, and on the other CPU there's a clusterip_config_find_get
in progress:

CPU1:							CPU2:
clusterip_config_entry_put:				clusterip_config_find_get:
if (atomic_dec_and_test(&c->entries)) {
	/* true */
							read_lock_bh(&clusterip_lock);
							c = __clusterip_config_find(clusterip);
							/* found - it's still in list */
							...
							atomic_inc(&c->entries);
							read_unlock_bh(&clusterip_lock);

	write_lock_bh(&clusterip_lock);
	list_del(&c->list);
	write_unlock_bh(&clusterip_lock);
	...
	dev_put(c->dev);

Oops! We have an entry returned by the clusterip_config_find_get,
which is a) not in list b) has a stale dev pointer.

The problems will happen when the CPU2 will release the entry - it
will remove it from the list for the 2nd time, thus spoiling it, and
will put a stale dev pointer.

The fix is to make atomic_dec_and_test under the clusterip_lock.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14 00:44:52 -07:00
Gerrit Renker
7de6c03336 [SKB]: __skb_append = __skb_queue_after
This expresses __skb_append in terms of __skb_queue_after, exploiting that

  __skb_append(old, new, list) = __skb_queue_after(list, old, new).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-14 00:05:09 -07:00
Denis V. Lunev
5f4472c5a6 [TCP]: Remove owner from tcp_seq_afinfo.
Move it to tcp_seq_afinfo->seq_fops as should be.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13 22:13:53 -07:00
Denis V. Lunev
68fcadd16c [TCP]: Place file operations directly into tcp_seq_afinfo.
No need to have separate never-used variable.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13 22:13:30 -07:00
Denis V. Lunev
52d6f3f11b [TCP]: Cleanup /proc/tcp[6] creation/removal.
Replace seq_open with seq_open_net and remove tcp_seq_release
completely.  seq_release_net will do this job just fine.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13 22:12:41 -07:00
Denis V. Lunev
9427c4b36b [TCP]: Move seq_ops from tcp_iter_state to tcp_seq_afinfo.
No need to create seq_operations for each instance of 'netstat'.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13 22:12:13 -07:00
Denis V. Lunev
1abf4fb20d [TCP]: No need to check afinfo != NULL in tcp_proc_(un)register.
tcp_proc_register/tcp_proc_unregister are called with a static pointer only.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13 22:11:46 -07:00
Denis V. Lunev
a4146b1b2c [TCP]: Replace struct net on tcp_iter_state with seq_net_private.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13 22:11:14 -07:00
Gerrit Renker
ac6f781920 [INET]: sk_reuse is valbool
sk_reuse is declared as "unsigned char", but is set as type valbool in net/core/sock.c.
There is no other place in net/ where sk->sk_reuse is set to a value > 1, so the test 
"sk_reuse > 1" can not be true.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13 21:50:08 -07:00
Linus Torvalds
14897e35fd Merge branch 'docs' of git://git.lwn.net/linux-2.6
* 'docs' of git://git.lwn.net/linux-2.6:
  Add additional examples in Documentation/spinlocks.txt
  Move sched-rt-group.txt to scheduler/
  Documentation: move rpc-cache.txt to filesystems/
  Documentation: move nfsroot.txt to filesystems/
  Spell out behavior of atomic_dec_and_lock() in kerneldoc
  Fix a typo in highres.txt
  Fixes to the seq_file document
  Fill out information on patch tags in SubmittingPatches
  Add the seq_file documentation
2008-04-11 13:24:16 -07:00
J. Bruce Fields
6ded55da6b Documentation: move nfsroot.txt to filesystems/
Documentation/ is a little large, and filesystems/ seems an obvious
place for this file.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2008-04-11 13:18:01 -06:00
Daniel Lezcano
7951f0b03a [NETNS][IPV6] tcp - assign the netns for timewait sockets
Copy the network namespace from the socket to the timewait socket.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Mark Lord <mlord@pobox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10 20:53:10 -07:00
Stephen Hemminger
c0b8c32b1c IPV4: use xor rather than multiple ands for route compare
The comparison in ip_route_input is a hot path, by recoding the C
"and" as bit operations, fewer conditional branches get generated
so the code should be faster. Maybe someday Gcc will be smart
enough to do this?

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10 04:00:28 -07:00
Stephen Hemminger
387a5487f5 ipv4: fib_trie leaf free optimization
Avoid unneeded test in the case where object to be freed
has to be a leaf. Don't need to use the generic tnode_free()
function, instead just setup leaf to be freed.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10 03:47:34 -07:00
Stephen Hemminger
ef3660ce06 ipv4: fib_trie remove unused argument
The trie pointer is passed down to flush_list and flush_leaf
but never used.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10 03:46:12 -07:00
Florian Westphal
4dfc281702 [Syncookies]: Add support for TCP options via timestamps.
Allow the use of SACK and window scaling when syncookies are used
and the client supports tcp timestamps. Options are encoded into
the timestamp sent in the syn-ack and restored from the timestamp
echo when the ack is received.

Based on earlier work by Glenn Griffin.
This patch avoids increasing the size of structs by encoding TCP
options into the least significant bits of the timestamp and
by not using any 'timestamp offset'.

The downside is that the timestamp sent in the packet after the synack
will increase by several seconds.

changes since v1:
 don't duplicate timestamp echo decoding function, put it into ipv4/syncookie.c
 and have ipv6/syncookies.c use it.
 Feedback from Glenn Griffin: fix line indented with spaces, kill redundant if ()

Reviewed-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10 03:12:40 -07:00
Stephen Hemminger
15be75cdb5 IPV4: fib_trie use vmalloc for large tnodes
Use vmalloc rather than alloc_pages to avoid wasting memory.
The problem is that tnode structure has a power of 2 sized array,
plus a header. So the current code wastes almost half the memory
allocated because it always needs the next bigger size to hold
that small header.

This is similar to an earlier patch by Eric, but instead of a list
and lock, I used a workqueue to handle the fact that vfree can't
be done in interrupt context.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10 02:56:38 -07:00
Stephen Hemminger
2fa7527ba1 IPV4: route rekey timer can be deferrable
No urgency on the rehash interval timer, so mark it as deferrable.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10 01:55:27 -07:00
Stephen Hemminger
1294fc4a48 IPV4: route use jhash3
Since route hash is a triple, use jhash_3words rather doing the mixing
directly. This should be as fast and give better distribution.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10 01:54:01 -07:00
Stephen Hemminger
5969f71d57 IPV4: route inline changes
Don't mark functions that are large as inline, let compiler decide.
Also, use inline rather than __inline__.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10 01:52:09 -07:00
David S. Miller
951e07c930 [IPV4]: Fix byte value boundary check in do_ip_getsockopt().
This fixes kernel bugzilla 10371.

As reported by M.Piechaczek@osmosys.tv, if we try to grab a
char sized socket option value, as in:

  unsigned char ttl = 255;
  socklen_t     len = sizeof(ttl);
  setsockopt(socket, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, &len);

  getsockopt(socket, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, &len);

The ttl returned will be wrong on big-endian, and on both little-
endian and big-endian the next three bytes in userspace are written
with garbage.

It's because of this test in do_ip_getsockopt():

	if (len < sizeof(int) && len > 0 && val>=0 && val<255) {

It should allow a 'val' of 255 to pass here, but it doesn't so it
copies a full 'int' back to userspace.

On little-endian that will write the correct value into the location
but it spams on the next three bytes in userspace.  On big endian it
writes the wrong value into the location and spams the next three
bytes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10 01:29:36 -07:00
Jan Engelhardt
475959d477 [NETFILTER]: nf_nat: autoload IPv4 connection tracking
Without this patch, the generic L3 tracker would kick in
if nf_conntrack_ipv4 was not loaded before nf_nat, which
would lead to translation problems with ICMP errors.

NAT does not make sense without IPv4 connection tracking
anyway, so just add a call to need_ipv4_conntrack().

Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-09 15:14:58 -07:00
Ilpo Järvinen
6adb4f733e [TCP]: Don't allow FRTO to take place while MTU is being probed
MTU probe can cause some remedies for FRTO because the normal
packet ordering may be violated allowing FRTO to make a wrong
decision (it might not be that serious threat for anything
though). Thus it's safer to not run FRTO while MTU probe is
underway.

It seems that the basic FRTO variant should also look for an
skb at probe_seq.start to check if that's retransmitted one
but I didn't implement it now (plain seqno in window check
isn't robust against wraparounds).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-07 22:33:57 -07:00
Ilpo Järvinen
882bebaaca [TCP]: tcp_simple_retransmit can cause S+L
This fixes Bugzilla #10384

tcp_simple_retransmit does L increment without any checking
whatsoever for overflowing S+L when Reno is in use.

The simplest scenario I can currently think of is rather
complex in practice (there might be some more straightforward
cases though). Ie., if mss is reduced during mtu probing, it
may end up marking everything lost and if some duplicate ACKs
arrived prior to that sacked_out will be non-zero as well,
leading to S+L > packets_out, tcp_clean_rtx_queue on the next
cumulative ACK or tcp_fastretrans_alert on the next duplicate
ACK will fix the S counter.

More straightforward (but questionable) solution would be to
just call tcp_reset_reno_sack() in tcp_simple_retransmit but
it would negatively impact the probe's retransmission, ie.,
the retransmissions would not occur if some duplicate ACKs
had arrived.

So I had to add reno sacked_out reseting to CA_Loss state
when the first cumulative ACK arrives (this stale sacked_out
might actually be the explanation for the reports of left_out
overflows in kernel prior to 2.6.23 and S+L overflow reports
of 2.6.24). However, this alone won't be enough to fix kernel
before 2.6.24 because it is building on top of the commit
1b6d427bb7 ([TCP]: Reduce sacked_out with reno when purging
write_queue) to keep the sacked_out from overflowing.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Reported-by: Alessandro Suardi <alessandro.suardi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-07 22:33:07 -07:00
Ilpo Järvinen
c137f3dda0 [TCP]: Fix NewReno's fast rexmit/recovery problems with GSOed skb
Fixes a long-standing bug which makes NewReno recovery crippled.
With GSO the whole head skb was marked as LOST which is in
violation of NewReno procedure that only wants to mark one packet
and ended up breaking our TCP code by causing counter overflow
because our code was built on top of assumption about valid
NewReno procedure. This manifested as triggering a WARN_ON for
the overflow in a number of places.

It seems relatively safe alternative to just do nothing if
tcp_fragment fails due to oom because another duplicate ACK is
likely to be received soon and the fragmentation will be retried.

Special thanks goes to Soeren Sonnenburg <kernel@nn7.de> who was
lucky enough to be able to reproduce this so that the warning
for the overflow was hit. It's not as easy task as it seems even
if this bug happens quite often because the amount of outstanding
data is pretty significant for the mismarkings to lead to an
overflow.

Because it's very late in 2.6.25-rc cycle (if this even makes in
time), I didn't want to touch anything with SACK enabled here.
Fragmenting might be useful for it as well but it's more or less
a policy decision rather than mandatory fix. Thus there's no need
to rush and we can postpone considering tcp_fragment with SACK
for 2.6.26.

In 2.6.24 and earlier, this very same bug existed but the effect
is slightly different because of a small changes in the if
conditions that fit to the patch's context. With them nothing
got lost marker and thus no retransmissions happened.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-07 22:32:38 -07:00
Ilpo Järvinen
1b69d74539 [TCP]: Restore 2.6.24 mark_head_lost behavior for newreno/fack
The fast retransmission can be forced locally to the rfc3517
branch in tcp_update_scoreboard instead of making such fragile
constructs deeper in tcp_mark_head_lost.

This is necessary for the next patch which must not have
loopholes for cnt > packets check. As one can notice,
readability got some improvements too because of this :-).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-07 22:31:38 -07:00
Denis V. Lunev
7feb49c82a [NETNS]: Use TCP control socket from a correct namespace.
Signed-off-by: Denis V.Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 14:32:00 -07:00
Denis V. Lunev
046ee90235 [NETNS]: Create tcp control socket in the each namespace.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 14:31:33 -07:00
Denis V. Lunev
5616bdd6df [INET]: uc_ttl assignment in inet_ctl_sock_create is redundant.
uc_ttl is initialized in inet(6)_create and never changed except
setsockopt ioctl. Remove this assignment.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 14:30:12 -07:00
Denis V. Lunev
c1e9894d48 [ICMP]: Simplify ICMP control socket creation.
Replace sock_create_kern with inet_ctl_sock_create.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 14:29:00 -07:00
Denis V. Lunev
5677242f43 [NETNS]: Inet control socket should not hold a namespace.
This is a generic requirement, so make inet_ctl_sock_create namespace
aware and create a inet_ctl_sock_destroy wrapper around
sk_release_kernel.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 14:28:30 -07:00
Denis V. Lunev
eee4fe4ded [INET]: Let inet_ctl_sock_create return sock rather than socket.
All upper protocol layers are already use sock internally.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 14:27:58 -07:00
Denis V. Lunev
3d58b5fa8e [INET]: Rename inet_csk_ctl_sock_create to inet_ctl_sock_create.
This call is nothing common with INET connection sockets code. It
simply creates an unhashes kernel sockets for protocol messages.

Move the new call into af_inet.c after the rename.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 14:22:32 -07:00
Denis V. Lunev
14c0c8e8e0 [TCP]: Replace socket with sock for reset sending.
Replace tcp_socket with tcp_sock. This is more effective (less
derefferences on fast paths). Additionally, the approach is unified to
one used in ICMP.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 14:19:38 -07:00
Herbert Xu
af2681828a [ICMP]: Ensure that ICMP relookup maintains status quo
The ICMP relookup path is only meant to modify behaviour when
appropriate IPsec policies are in place and marked as requiring
relookups.  It is certainly not meant to modify behaviour when
IPsec policies don't exist at all.

However, due to an oversight on the error paths existing behaviour
may in fact change should one of the relookup steps fail.

This patch corrects this by redirecting all errors on relookup
failures to the previous code path.  That is, if the initial
xfrm_lookup let the packet pass, we will stand by that decision
should the relookup fail due to an error.

This should be safe from a security point-of-view because compliant
systems must install a default deny policy so the packet would'nt
have passed in that case.

Many thanks to Julian Anastasov for pointing out this error.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-03 12:52:19 -07:00
David S. Miller
e1ec1b8ccd Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/s2io.c
2008-04-02 22:35:23 -07:00
Pavel Emelyanov
fd4e7b5045 [IPV4][NETNS]: Display per-net info in sockstat file.
Besides, now we can see per-net fragments statistics in the
same file, since this stats is already per-net.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-31 19:43:18 -07:00
Pavel Emelyanov
d0538ca355 [SOCK][NETNS]: Register sockstat(6) files in each net.
Currently they live in init_net only, but now almost all the info
they can provide is available per-net.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-31 19:42:37 -07:00
Pavel Emelyanov
c29a0bc4df [SOCK][NETNS]: Add a struct net argument to sock_prot_inuse_add and _get.
This counter is about to become per-proto-and-per-net, so we'll need 
two arguments to determine which cell in this "table" to work with.

All the places, but proc already pass proper net to it - proc will be
tuned a bit later.

Some indentation with spaces in proc files is done to keep the file
coding style consistent.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-31 19:41:46 -07:00
YOSHIFUJI Hideaki
b50660f1fe [IP] UDP: Use SEQ_START_TOKEN.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-31 19:38:15 -07:00
Denis V. Lunev
4ad96d39a2 [UDP]: Remove owner from udp_seq_afinfo.
Move it to udp_seq_afinfo->seq_fops as should be.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28 18:25:53 -07:00
Denis V. Lunev
3ba9441bdf [UDP]: Place file operations directly into udp_seq_afinfo.
No need to have separate never-used variable.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28 18:25:32 -07:00
Denis V. Lunev
a2be75c182 [UDP]: Cleanup /proc/udp[6] creation/removal.
Replace seq_open with seq_open_net and remove udp_seq_release
completely.  seq_release_net will do this job just fine.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28 18:25:06 -07:00
Denis V. Lunev
dda61925f8 [UDP]: Move seq_ops from udp_iter_state to udp_seq_afinfo.
No need to create seq_operations for each instance of 'netstat'.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28 18:24:26 -07:00
Denis V. Lunev
997feb5e7a [UDP]: No need to check afinfo != NULL in udp_proc_(un)register.
udp_proc_register/udp_proc_unregister are called with a static pointer only.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28 18:24:01 -07:00
Denis V. Lunev
6f191efe48 [UDP]: Replace struct net on udp_iter_state with seq_net_private.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28 18:23:33 -07:00
David S. Miller
e8e16b706e [INET]: inet_frag_evictor() must run with BH disabled
Based upon a lockdep trace from Dave Jones.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28 17:30:18 -07:00
Pavel Emelyanov
bdcde3d71a [SOCK]: Drop inuse pcounter from struct proto (v2).
An uppercut - do not use the pcounter on struct proto.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28 16:39:33 -07:00
Joe Perches
bc578a54f0 [NET]: Rename inet_frag.h identifiers COMPLETE, FIRST_IN, LAST_IN to INET_FRAG_*
On Fri, 2008-03-28 at 03:24 -0700, Andrew Morton wrote:
> they should all be renamed.

Done for include/net and net

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28 16:35:27 -07:00
Rusty Russell
32aced7509 [NET]: Don't send ICMP_FRAG_NEEDED for GSO packets
Commit 9af3912ec9 ("[NET] Move DF check
to ip_forward") added a new check to send ICMP fragmentation needed
for large packets.

Unlike the check in ip_finish_output(), it doesn't check for GSO.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-28 16:23:19 -07:00
David S. Miller
8e8e43843b Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/usb/rndis_host.c
	drivers/net/wireless/b43/dma.c
	net/ipv6/ndisc.c
2008-03-27 18:48:56 -07:00
Denis V. Lunev
8eeee8b152 [NETFILTER]: Replate direct proc_fops assignment with proc_create call.
This elliminates infamous race during module loading when one could lookup
proc entry without proc_fops assigned.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-27 16:55:53 -07:00
Thomas Graf
920fc941a9 [ESP]: Ensure IV is in linear part of the skb to avoid BUG() due to OOB access
ESP does not account for the IV size when calling pskb_may_pull() to
ensure everything it accesses directly is within the linear part of a
potential fragment. This results in a BUG() being triggered when the
both the IPv4 and IPv6 ESP stack is fed with an skb where the first
fragment ends between the end of the esp header and the end of the IV.

This bug was found by Dirk Nehring <dnehring@gmx.net> .

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-27 16:08:03 -07:00
Herbert Xu
732c8bd590 [IPSEC]: Fix BEET output
The IPv6 BEET output function is incorrectly including the inner
header in the payload to be protected.  This causes a crash as
the packet doesn't actually have that many bytes for a second
header.

The IPv4 BEET output on the other hand is broken when it comes
to handling an inner IPv6 header since it always assumes an
inner IPv4 header.

This patch fixes both by making sure that neither BEET output
function touches the inner header at all.  All access is now
done through the protocol-independent cb structure.  Two new
attributes are added to make this work, the IP header length
and the IPv4 option length.  They're filled in by the inner
mode's output function.

Thanks to Joakim Koskela for finding this problem.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-26 16:51:09 -07:00
Pavel Emelyanov
7c0ecc4c4f [ICMP]: Dst entry leak in icmp_send host re-lookup code (v2).
Commit 8b7817f3a9 ([IPSEC]: Add ICMP host
relookup support) introduced some dst leaks on error paths: the rt
pointer can be forgotten to be put. Fix it bu going to a proper label.

Found after net namespace's lo refused to unregister :) Many thanks to 
Den for valuable help during debugging.

Herbert pointed out, that xfrm_lookup() will put the rtable in case
of error itself, so the first goto fix is redundant.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-26 02:27:09 -07:00
Pavel Emelyanov
789e41e6f4 [NETNS][ICMP]: Build fix for NET_NS=n case (dev->nd_net is omitted).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-26 02:19:25 -07:00
Pavel Emelyanov
b34a95ee6e [NETNS][ICMP]: Use per-net sysctls in ipv4/icmp.c.
This mostly re-uses the net, used in icmp netnsization patches from Denis.

After this ICMP sysctls are completely virtualized.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-26 02:00:21 -07:00
Pavel Emelyanov
68528f0998 [NETNS][ICMP]: Make ctl tables for ICMP sysctls per-net.
Add some flesh to ipv4_sysctl_init_net and ipv4_sysctl_exit_net,
i.e. copy the table, alter .data pointers and register it per-net.

Other ipv4_table's sysctls are now global, but this is going to
change once sysctl permissions patches migrate from -mm tree to 
mainline in 2.6.26 merge window :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-26 01:56:24 -07:00
Pavel Emelyanov
a24022e188 [NETNS][ICMP]: Move ICMP sysctls on struct net.
Initialization is moved to icmp_sk_init, all the places, that
refer to them use init_net for now.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-26 01:55:37 -07:00
Pavel Emelyanov
1577519d6b [NETNS][ICMP]: Register pernet subsys to make ICMP sysctls per-net.
This includes adding pernet_operations, empty init and exit
hooks and a bit of changes in sysctl_ipv4_init just not to
have this part in next patches.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-26 01:54:18 -07:00
Patrick McHardy
f49e1aa133 [NETFILTER]: nf_conntrack_sip: update copyright
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:27:05 -07:00
Patrick McHardy
c7f485abd6 [NETFILTER]: nf_conntrack_sip: RTP routing optimization
Optimize call routing between NATed endpoints: when an external
registrar sends a media description that contains an existing RTP
expectation from a different SNATed connection, the gatekeeper
is trying to route the call directly between the two endpoints.

We assume both endpoints can reach each other directly and
"un-NAT" the addresses, which makes the media stream go between
the two endpoints directly.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:26:43 -07:00
Patrick McHardy
4ab9e64e5e [NETFILTER]: nf_nat_sip: split up SDP mangling
The SDP connection addresses may be contained in the payload multiple
times (in the session description and/or once per media description),
currently only the session description is properly updated. Split up
SDP mangling so the function setting up expectations only updates the
media port, update connection addresses from media descriptions while
parsing them and at the end update the session description when the
final addresses are known.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:26:08 -07:00
Patrick McHardy
a9c1d35917 [NETFILTER]: nf_conntrack_sip: create RTCP expectations
Create expectations for the RTCP connections in addition to RTP connections.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:25:49 -07:00
Patrick McHardy
0f32a40fc9 [NETFILTER]: nf_conntrack_sip: create signalling expectations
Create expectations for incoming signalling connections when seeing
a REGISTER request. This is needed when the registrar uses a
different source port number for signalling messages and for receiving
incoming calls from other endpoints than the registrar.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:25:13 -07:00
Patrick McHardy
c978cd3a93 [NETFILTER]: nf_nat_sip: translate all Contact headers
The SIP message may contain multiple Contact: addresses referring to
the NATed endpoint, translate all of them.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:24:57 -07:00
Patrick McHardy
720ac7085c [NETFILTER]: nf_nat_sip: translate all Via headers
Update maddr=, received= and rport= Via-header parameters refering to
the signalling connection.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:24:41 -07:00
Patrick McHardy
33cb1e9a93 [NETFILTER]: nf_conntrack_sip: perform NAT after parsing
Perform NAT last after parsing the packet. This makes no difference
currently, but is needed when dealing with registrations to make
sure we seen the unNATed addresses.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:22:37 -07:00
Patrick McHardy
624f8b7bba [NETFILTER]: nf_nat_sip: get rid of text based header translation
Use the URI parsing helper to get the numerical addresses and get rid of the
text based header translation.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:19:30 -07:00
Patrick McHardy
ea45f12a27 [NETFILTER]: nf_conntrack_sip: parse SIP headers properly
Introduce new function for SIP header parsing that properly deals with
continuation lines and whitespace in headers and use it.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:18:57 -07:00
Patrick McHardy
ac3677406d [NETFILTER]: nf_conntrack_sip: kill request URI "header" definitions
The request URI is not a header and needs to be treated differently than
real SIP headers. Add a seperate function for parsing it and get rid of
the POS_REQ_URI/POS_REG_REQ_URI definitions.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:18:40 -07:00
Patrick McHardy
3e9b4600b4 [NETFILTER]: nf_conntrack_sip: add seperate SDP header parsing function
SDP and SIP headers are quite different, SIP can have continuation lines,
leading and trailing whitespace after the colon and is mostly case-insensitive
while SDP headers always begin on a new line and are followed by an equal
sign and the value, without any whitespace.

Introduce new SDP header parsing function and convert all users that used
the SIP header parsing function. This will allow to properly deal with the
special SIP cases in the SIP header parsing function later.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:17:55 -07:00
Patrick McHardy
779382eb32 [NETFILTER]: nf_conntrack_sip: use strlen/strcmp
Replace sizeof/memcmp by strlen/strcmp. Use case-insensitive comparison
for SIP methods and the SIP/2.0 string, as specified in RFC 3261.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:17:36 -07:00
Patrick McHardy
212440a7d0 [NETFILTER]: nf_conntrack_sip: remove redundant function arguments
The conntrack reference and ctinfo can be derived from the packet.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:17:13 -07:00
Patrick McHardy
2a6cfb22ae [NETFILTER]: nf_conntrack_sip: adjust dptr and datalen after packet mangling
After mangling the packet, the pointer to the data and the length of the data
portion may change and need to be adjusted.

Use double data pointers and a pointer to the length everywhere and add a
helper function to the NAT helper for performing the adjustments.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:16:54 -07:00
Patrick McHardy
3d244121d8 [NETFILTER]: nf_nat_sip: fix NAT setup order
We need to set up the destination NAT mapping before the source NAT
mapping, so the NAT core gets to see the final tuple and can decide
whether the source port needs to be remapped.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:09:51 -07:00
Patrick McHardy
6002f266b3 [NETFILTER]: nf_conntrack: introduce expectation classes and policies
Introduce expectation classes and policies. An expectation class
is used to distinguish different types of expectations by the
same helper (for example audio/video/t.120). The expectation
policy is used to hold the maximum number of expectations and
the initial timeout for each class.

The individual classes are isolated from each other, which means
that for example an audio expectation will only evict other audio
expectations.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:09:15 -07:00
Patrick McHardy
30c69fed7d [NETFILTER]: ipt_CLUSTERIP: fix non-existant macro-name
With nf_conntrack DUMP_TUPLE got renamed to NF_CT_DUMP_TUPLE, fix
CLUSTERIP to use the proper macro name.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25 20:06:59 -07:00
YOSHIFUJI Hideaki
878628fbf2 [NET] NETNS: Omit namespace comparision without CONFIG_NET_NS.
Introduce an inline net_eq() to compare two namespaces.
Without CONFIG_NET_NS, since no namespace other than &init_net
exists, it is always 1.

We do not need to convert 1) inline vs inline and
2) inline vs &init_net comparisons.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-26 04:40:00 +09:00
YOSHIFUJI Hideaki
1218854afa [NET] NETNS: Omit seq_net_private->net without CONFIG_NET_NS.
Without CONFIG_NET_NS, no namespace other than &init_net exists,
no need to store net in seq_net_private.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-26 04:39:56 +09:00
YOSHIFUJI Hideaki
3b1e0a655f [NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.
Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-26 04:39:55 +09:00
YOSHIFUJI Hideaki
c346dca108 [NET] NETNS: Omit net_device->nd_net without CONFIG_NET_NS.
Introduce per-net_device inlines: dev_net(), dev_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-26 04:39:53 +09:00
YOSHIFUJI Hideaki
c8cdaf998d [IPV4,IPV6]: Share cork.rt between IPv4 and IPv6.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-25 10:23:59 +09:00
Denis V. Lunev
92f1fecb45 [NETNS]: Enable TCP/UDP/ICMP inside namespace.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 15:34:06 -07:00
Denis V. Lunev
2342fd7e14 [NETNS]: Allow to create sockets in non-initial namespace.
Allow to create sockets in the namespace if the protocol ok with this.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 15:33:42 -07:00
Denis V. Lunev
f145049a06 [NETNS]: Drop packets in the non-initial namespace on the per/protocol basis.
IP layer now can handle multiple namespaces normally. So, process such
packets normally and drop them only if the transport layer is not
aware about namespaces.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 15:33:00 -07:00
Denis V. Lunev
05cf89d40c [NETNS]: Process INET socket layer in the correct namespace.
Replace all the reast of the init_net with a proper net on the socket
layer.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 15:31:35 -07:00
Denis V. Lunev
cb84663e4d [NETNS]: Process IP layer in the context of the correct namespace.
Replace all the rest of the init_net with a proper net on the IP layer.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 15:31:00 -07:00
Denis V. Lunev
7a6adb92fe [NETNS]: Add namespace parameter to ip_cmsg_send.
Pass the init_net there for now.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 15:30:27 -07:00
Denis V. Lunev
f2c4802b3f [NETNS]: Add namespace parameter to ip_options_get(...).
Pass the init_net there for now.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 15:29:55 -07:00
Denis V. Lunev
0e6bd4a1c6 [NETNS]: Add namespace parameter to ip_options_compile.
ip_options_compile uses inet_addr_type which requires a namespace. The
packet argument is optional, so parameter is the only way to obtain
it. Pass the init_net there for now.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 15:29:23 -07:00
Denis V. Lunev
ffc31d3d77 [NETNS]: /proc/net/arp namespacing.
Seqfile operation showing /proc/net/arp are already namespace
aware. All we need is to register this file for each namespace.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 15:28:43 -07:00
Denis V. Lunev
49e8a279a1 [NETNS]: Process ARP in the context of the correct namespace.
Get namespace from a device and pass it to the routing engine. Enable
ARP packet processing and device notifiers after that.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 15:28:12 -07:00
Pavel Emelyanov
84c375af0f [NETNS][UDP-Lite]: Register /proc/net/udplite(6) in a namespace.
UDP-Lite sockets are displayed in another files, rather than
UDP ones, so make the present in namespaces as well.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 14:56:57 -07:00
Pavel Emelyanov
ff2bac6a63 [UDP-Lite]: Clean up proc creation a bit.
Just introduce a helper to remove ifdefs from inside the
udplite4_register function. This will help to make the next patch
nicer.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 14:56:34 -07:00
Pavel Emelyanov
757764f61d [NETNS][TCP]: Register /proc/net/tcp in a namespace.
After the commit f40c8174d3 ([NETNS][IPV4] 
tcp - make proc handle the network namespaces) it is now possible to make
this file present in newly created namespaces.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 14:56:02 -07:00
Pavel Emelyanov
15439febb0 [NETNS][UDP]: Register /proc/net/udp in a namespace.
After the commit a91275eff4 ([NETNS][IPV6]
udp - make proc handle the network namespace) it is now possible to make
this file present in newly created namespaces.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 14:53:49 -07:00
Kazunori MIYAZAWA
df9dcb4588 [IPSEC]: Fix inter address family IPsec tunnel handling.
Signed-off-by: Kazunori MIYAZAWA <kazunori@miyazawa.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-24 14:51:51 -07:00
David S. Miller
06802a819a Merge branch 'master' of ../net-2.6/
Conflicts:

	net/ipv6/ndisc.c
2008-03-23 22:54:03 -07:00
Stephen Hemminger
3d3b2d25a4 fib_trie: print information on all routing tables
Make /proc/net/fib_trie and /proc/net/fib_triestat display
all routing tables, not just local and main.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-23 22:43:56 -07:00
Florian Westphal
2051f11fb8 [TCP]: Shrink syncookie_secret by 8 byte.
the first u32 copied from syncookie_secret is overwritten by the
minute-counter four lines below.  After adjusting the destination
address, the size of syncookie_secret can be reduced accordingly.

AFAICS, the only other user of syncookie_secret[] is the ipv6
syncookie support.  Because ipv6 syncookies only grab 44 bytes from
syncookie_secret[], this shouldn't affect them in any way.

With fixes from Glenn Griffin.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Glenn Griffin <ggriffin.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-23 22:21:28 -07:00
Stephen Hemminger
6440cc9e0f [IPV4] fib_trie: fix warning from rcu_assign_poinger
This gets rid of a warning caused by the test in rcu_assign_pointer.
I tried to fix rcu_assign_pointer, but that devolved into a long set
of discussions about doing it right that came to no real solution.
Since the test in rcu_assign_pointer for constant NULL would never
succeed in fib_trie, just open code instead.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-22 17:59:58 -07:00
Stephen Hemminger
817bc4db77 [IPV4] route: use read_mostly
The route table parameters are set based on system memory and sysctl
values that almost never change. Also the genid only changes every
10 minutes.

RTprint is defined by never used.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-22 17:43:59 -07:00
Denis V. Lunev
ce25999078 [IPV4]: sk parameter is unused in ipv4_dst_blackhole.
Just remove it.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-22 17:42:37 -07:00
Pavel Emelyanov
fc8717baa8 [RAW]: Add raw_hashinfo member on struct proto.
Sorry for the patch sequence confusion :| but I found that the similar
thing can be done for raw sockets easily too late.

Expand the proto.h union with the raw_hashinfo member and use it in
raw_prot and rawv6_prot. This allows to drop the protocol specific
versions of hash and unhash callbacks.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-22 16:56:51 -07:00
Pavel Emelyanov
6ba5a3c52d [UDP]: Make full use of proto.h.udp_hash innovation.
After this we have only udp_lib_get_port to get the port and two 
stubs for ipv4 and ipv6. No difference in udp and udplite except
for initialized h.udp_hash member.

I tried to find a graceful way to drop the only difference between
udp_v4_get_port and udp_v6_get_port (i.e. the rcv_saddr comparison 
routine), but adding one more callback on the struct proto didn't 
appear such :( Maybe later.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-22 16:51:21 -07:00
Pavel Emelyanov
39d8cda76c [SOCK]: Add udp_hash member to struct proto.
Inspired by the commit ab1e0a13 ([SOCK] proto: Add hashinfo member to 
struct proto) from Arnaldo, I made similar thing for UDP/-Lite IPv4 
and -v6 protocols.

The result is not that exciting, but it removes some levels of
indirection in udpxxx_get_port and saves some space in code and text.

The first step is to union existing hashinfo and new udp_hash on the
struct proto and give a name to this union, since future initialization 
of tcpxxx_prot, dccp_vx_protinfo and udpxxx_protinfo will cause gcc 
warning about inability to initialize anonymous member this way.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-22 16:50:58 -07:00
Denis V. Lunev
22aba383ce [IPV4]: Always pass ip_options pointer into ip_options_compile.
This makes code a bit more uniform and straigthforward.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-22 16:36:20 -07:00
Denis V. Lunev
ef722495c8 [IPV4]: Remove unused ip_options->is_data.
ip_options->is_data is assigned only and never checked. The structure is
not a part of kernel interface to the userspace. So, it is safe to remove
this field.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-22 16:35:29 -07:00
Denis V. Lunev
10fe7d85e2 [IPV4]: Remove unnecessary check for opt->is_data in ip_options_compile.
There is the only way to reach ip_options compile with opt != NULL:

ip_options_get_finish
    opt->is_data = 1;
    ip_options_compile(opt, NULL)

So, checking for is_data inside opt != NULL branch is not needed.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-22 16:35:00 -07:00
Herbert Xu
69d1506731 [TCP]: Let skbs grow over a page on fast peers
While testing the virtio-net driver on KVM with TSO I noticed
that TSO performance with a 1500 MTU is significantly worse
compared to the performance of non-TSO with a 16436 MTU.  The
packet dump shows that most of the packets sent are smaller
than a page.

Looking at the code this actually is quite obvious as it always
stop extending the packet if it's the first packet yet to be
sent and if it's larger than the MSS.  Since each extension is
bound by the page size, this means that (given a 1500 MTU) we're
very unlikely to construct packets greater than a page, provided
that the receiver and the path is fast enough so that packets can
always be sent immediately.

The fix is also quite obvious.  The push calls inside the loop
is just an optimisation so that we don't end up doing all the
sending at the end of the loop.  Therefore there is no specific
reason why it has to do so at MSS boundaries.  For TSO, the
most natural extension of this optimisation is to do the pushing
once the skb exceeds the TSO size goal.

This is what the patch does and testing with KVM shows that the
TSO performance with a 1500 MTU easily surpasses that of a 16436
MTU and indeed the packet sizes sent are generally larger than
16436.

I don't see any obvious downsides for slower peers or connections,
but it would be prudent to test this extensively to ensure that
those cases don't regress.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-22 15:47:05 -07:00
Patrick McManus
ec3c0982a2 [TCP]: TCP_DEFER_ACCEPT updates - process as established
Change TCP_DEFER_ACCEPT implementation so that it transitions a
connection to ESTABLISHED after handshake is complete instead of
leaving it in SYN-RECV until some data arrvies. Place connection in
accept queue when first data packet arrives from slow path.

Benefits:
  - established connection is now reset if it never makes it
   to the accept queue

 - diagnostic state of established matches with the packet traces
   showing completed handshake

 - TCP_DEFER_ACCEPT timeouts are expressed in seconds and can now be
   enforced with reasonable accuracy instead of rounding up to next
   exponential back-off of syn-ack retry.

Signed-off-by: Patrick McManus <mcmanus@ducksong.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 16:33:01 -07:00
Patrick McManus
e4c7884028 [TCP]: TCP_DEFER_ACCEPT updates - dont retxmt synack
a socket in LISTEN that had completed its 3 way handshake, but not notified
userspace because of SO_DEFER_ACCEPT, would retransmit the already
acked syn-ack during the time it was waiting for the first data byte
from the peer.

Signed-off-by: Patrick McManus <mcmanus@ducksong.com>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 16:29:22 -07:00
Patrick McManus
539fae89be [TCP]: TCP_DEFER_ACCEPT updates - defer timeout conflicts with max_thresh
timeout associated with SO_DEFER_ACCEPT wasn't being honored if it was
less than the timeout allowed by the maximum syn-recv queue size
algorithm. Fix by using the SO_DEFER_ACCEPT value if the ack has
arrived.

Signed-off-by: Patrick McManus <mcmanus@ducksong.com>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 16:27:38 -07:00
Pavel Emelyanov
28518fc170 [NET]: NULL pointer dereference and other nasty things in /proc/net/(tcp|udp)[6]
Commits f40c81 ([NETNS][IPV4] tcp - make proc handle the network
namespaces) and a91275 ([NETNS][IPV6] udp - make proc handle the
network namespace) both introduced bad checks on sockets and tw
buckets to belong to proper net namespace.

I.e. when checking for socket to belong to given net and family the

	do {
		sk = sk_next(sk);
	} while (sk && sk->sk_net != net && sk->sk_family != family);

constructions were used. This is wrong, since as soon as the
sk->sk_net fits the net the socket is immediately returned, even if it
belongs to other family.

As the result four /proc/net/(udp|tcp)[6] entries show wrong info.
The udp6 entry even oopses when dereferencing inet6_sk(sk) pointer:

static void udp6_sock_seq_show(struct seq_file *seq, struct sock *sp, int bucket)
{
	...
        struct ipv6_pinfo *np = inet6_sk(sp);
	...

        dest  = &np->daddr; /* will be NULL for AF_INET sockets */
	...
	seq_printf(...
	           dest->s6_addr32[0], dest->s6_addr32[1],
                   dest->s6_addr32[2], dest->s6_addr32[3],
	...

Fix it by converting && to ||.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 15:52:00 -07:00
Phil Oester
12b101555f [IPV4]: Fix null dereference in ip_defrag
Been seeing occasional panics in my testing of 2.6.25-rc in ip_defrag.
Offending line in ip_defrag is here:

	net = skb->dev->nd_net

where dev is NULL.  Bisected the problem down to commit
ac18e7509e ([NETNS][FRAGS]: Make the
inet_frag_queue lookup work in namespaces).  

Below patch (idea from Patrick McHardy) fixes the problem for me.

Signed-off-by: Phil Oester <kernel@linuxace.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 15:01:50 -07:00
Daniel Lezcano
6f8b13bcb3 [NETNS][IPV6] tcp6 - make proc per namespace
Make the proc for tcp6 to be per namespace.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 04:14:45 -07:00
Daniel Lezcano
0c96d8c50b [NETNS][IPV6] udp6 - make proc per namespace
The proc init/exit functions take a new network namespace parameter in
order to register/unregister /proc/net/udp6 for a namespace.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 04:14:17 -07:00
Daniel Lezcano
f40c8174d3 [NETNS][IPV4] tcp - make proc handle the network namespaces
This patch, like udp proc, makes the proc functions to take care of
which namespace the socket belongs.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 04:13:54 -07:00
Daniel Lezcano
8d9f1744ca [NETNS][IPV6] tcp - assign the netns for timewait sockets
Copy the network namespace from the socket to the timewait socket.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 04:12:54 -07:00
Daniel Lezcano
a91275eff4 [NETNS][IPV6] udp - make proc handle the network namespace
This patch makes the common udp proc functions to take care of which
socket they should show taking into account the namespace it belongs.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 04:11:58 -07:00
Peter P Waskiewicz Jr
82cc1a7a56 [NET]: Add per-connection option to set max TSO frame size
Update: My mailer ate one of Jarek's feedback mails...  Fixed the
parameter in netif_set_gso_max_size() to be u32, not u16.  Fixed the
whitespace issue due to a patch import botch.  Changed the types from
u32 to unsigned int to be more consistent with other variables in the
area.  Also brought the patch up to the latest net-2.6.26 tree.

Update: Made gso_max_size container 32 bits, not 16.  Moved the
location of gso_max_size within netdev to be less hotpath.  Made more
consistent names between the sock and netdev layers, and added a
define for the max GSO size.

Update: Respun for net-2.6.26 tree.

Update: changed max_gso_frame_size and sk_gso_max_size from signed to
unsigned - thanks Stephen!

This patch adds the ability for device drivers to control the size of
the TSO frames being sent to them, per TCP connection.  By setting the
netdevice's gso_max_size value, the socket layer will set the GSO
frame size based on that value.  This will propogate into the TCP
layer, and send TSO's of that size to the hardware.

This can be desirable to help tune the bursty nature of TSO on a
per-adapter basis, where one may have 1 GbE and 10 GbE devices
coexisting in a system, one running multiqueue and the other not, etc.

This can also be desirable for devices that cannot support full 64 KB
TSO's, but still want to benefit from some level of segmentation
offloading.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 03:43:19 -07:00
David S. Miller
a25606c845 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 2008-03-21 03:42:24 -07:00
Patrick McHardy
607bfbf2d5 [TCP]: Fix shrinking windows with window scaling
When selecting a new window, tcp_select_window() tries not to shrink
the offered window by using the maximum of the remaining offered window
size and the newly calculated window size. The newly calculated window
size is always a multiple of the window scaling factor, the remaining
window size however might not be since it depends on rcv_wup/rcv_nxt.
This means we're effectively shrinking the window when scaling it down.


The dump below shows the problem (scaling factor 2^7):

- Window size of 557 (71296) is advertised, up to 3111907257:

IP 172.2.2.3.33000 > 172.2.2.2.33000: . ack 3111835961 win 557 <...>

- New window size of 514 (65792) is advertised, up to 3111907217, 40 bytes
  below the last end:

IP 172.2.2.3.33000 > 172.2.2.2.33000: . 3113575668:3113577116(1448) ack 3111841425 win 514 <...>

The number 40 results from downscaling the remaining window:

3111907257 - 3111841425 = 65832
65832 / 2^7 = 514
65832 % 2^7 = 40

If the sender uses up the entire window before it is shrunk, this can have
chaotic effects on the connection. When sending ACKs, tcp_acceptable_seq()
will notice that the window has been shrunk since tcp_wnd_end() is before
tp->snd_nxt, which makes it choose tcp_wnd_end() as sequence number.
This will fail the receivers checks in tcp_sequence() however since it
is before it's tp->rcv_wup, making it respond with a dupack.

If both sides are in this condition, this leads to a constant flood of
ACKs until the connection times out.

Make sure the window is never shrunk by aligning the remaining window to
the window scaling factor.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-20 16:11:27 -07:00
Daniel Hokka Zakrisson
d0ebf13359 [NETFILTER]: ipt_recent: sanity check hit count
If a rule using ipt_recent is created with a hit count greater than
ip_pkt_list_tot, the rule will never match as it cannot keep track
of enough timestamps. This patch makes ipt_recent refuse to create such
rules.

With ip_pkt_list_tot's default value of 20, the following can be used
to reproduce the problem.

nc -u -l 0.0.0.0 1234 &
for i in `seq 1 100`; do echo $i | nc -w 1 -u 127.0.0.1 1234; done

This limits it to 20 packets:
iptables -A OUTPUT -p udp --dport 1234 -m recent --set --name test \
         --rsource
iptables -A OUTPUT -p udp --dport 1234 -m recent --update --seconds \
         60 --hitcount 20 --name test --rsource -j DROP

While this is unlimited:
iptables -A OUTPUT -p udp --dport 1234 -m recent --set --name test \
         --rsource
iptables -A OUTPUT -p udp --dport 1234 -m recent --update --seconds \
         60 --hitcount 21 --name test --rsource -j DROP

With the patch the second rule-set will throw an EINVAL.

Reported-by: Sean Kennedy <skennedy@vcn.com>
Signed-off-by: Daniel Hokka Zakrisson <daniel@hozac.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-20 15:07:10 -07:00
Robert P. J. Day
938b93adb2 [NET]: Add debugging names to __RW_LOCK_UNLOCKED macros.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-18 00:59:23 -07:00
David S. Miller
577f99c1d0 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/wireless/rt2x00/rt2x00dev.c
	net/8021q/vlan_dev.c
2008-03-18 00:37:55 -07:00
Al Viro
5e226e4d90 [IPV4]: esp_output() misannotations
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-17 22:50:23 -07:00
Al Viro
e6f1cebf71 [NET] endianness noise: INADDR_ANY
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-17 22:44:53 -07:00
Ilpo Järvinen
5ea3a74806 [TCP]: Prevent sending past receiver window with TSO (at last skb)
With TSO it was possible to send past the receiver window when the skb
to be sent was the last in the write queue while the receiver window
is the limiting factor. One can notice that there's a loophole in the
tcp_mss_split_point that lacked a receiver window check for the
tcp_write_queue_tail() if also cwnd was smaller than the full skb.

Noticed by Thomas Gleixner <tglx@linutronix.de> in form of "Treason
uncloaked! Peer ... shrinks window .... Repaired."  messages (the peer
didn't actually shrink its window as the message suggests, we had just
sent something past it without a permission to do so).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-11 17:55:27 -07:00
David S. Miller
db8dac20d5 [UDP]: Revert udplite and code split.
This reverts commit db1ed684f6 ("[IPV6]
UDP: Rename IPv6 UDP files."), commit
8be8af8fa4 ("[IPV4] UDP: Move
IPv4-specific bits to other file.") and commit
e898d4db27 ("[UDP]: Allow users to
configure UDP-Lite.").

First, udplite is of such small cost, and it is a core protocol just
like TCP and normal UDP are.

We spent enormous amounts of effort to make udplite share as much code
with core UDP as possible.  All of that work is less valuable if we're
just going to slap a config option on udplite support.

It is also causing build failures, as reported on linux-next, showing
that the changeset was not tested very well.  In fact, this is the
second build failure resulting from the udplite change.

Finally, the config options provided was a bool, instead of a modular
option.  Meaning the udplite code does not even get build tested
by allmodconfig builds, and furthermore the user is not presented
with a reasonable modular build option which is particularly needed
by distribution vendors.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-06 16:22:02 -08:00
Harvey Harrison
0dc47877a3 net: replace remaining __FUNCTION__ occurrences
__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-05 20:47:47 -08:00
Eric Dumazet
ee6b967301 [IPV4]: Add 'rtable' field in struct sk_buff to alias 'dst' and avoid casts
(Anonymous) unions can help us to avoid ugly casts.

A common cast it the (struct rtable *)skb->dst one.

Defining an union like  :
union {
     struct dst_entry *dst;
     struct rtable *rtable;
};
permits to use skb->rtable in place.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-05 18:30:47 -08:00
David S. Miller
255333c1db Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	net/mac80211/rc80211_pid_algo.c
2008-03-05 12:26:41 -08:00
Stephen Hemminger
dea75bdfa5 [IPCONFIG]: The kernel gets no IP from some DHCP servers
From: Stephen Hemminger <shemminger@linux-foundation.org>

Based upon a patch by Marcel Wappler:
 
   This patch fixes a DHCP issue of the kernel: some DHCP servers
   (i.e.  in the Linksys WRT54Gv5) are very strict about the contents
   of the DHCPDISCOVER packet they receive from clients.
 
   Table 5 in RFC2131 page 36 requests the fields 'ciaddr' and
   'siaddr' MUST be set to '0'.  These DHCP servers ignore Linux
   kernel's DHCP discovery packets with these two fields set to
   '255.255.255.255' (in contrast to popular DHCP clients, such as
   'dhclient' or 'udhcpc').  This leads to a not booting system.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-04 17:03:49 -08:00
Herbert Xu
ed58dd41f3 [ESP]: Add select on AUTHENC
Now the ESP uses the AEAD interface even for algorithms which are
not combined mode, we need to select CONFIG_CRYPTO_AUTHENC as
otherwise only combined mode algorithms will work.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-04 14:29:21 -08:00
Sangtae Ha
6b3d626321 [TCP]: TCP cubic v2.2
We have updated CUBIC to fix some issues with slow increase in large
BDP networks. We also improved its convergence speed. The fix is in
fact very simple -- the window increase limit of smax during the
window probing phase (i.e., convex growth phase) is removed. We found
that this does not affect TCP friendliness, but only improves its
scalability. We have run some tests in our lab and also over the
Internet path from NCSU to Japan. These results can be seen from the
following page:

http://netsrv.csc.ncsu.edu/wiki/index.php/Intra_protocol_fairness_testing_with_linux-2.6.23.9
http://netsrv.csc.ncsu.edu/wiki/index.php/RTT_fairness_testing_with_linux-2.6.23.9
http://netsrv.csc.ncsu.edu/wiki/index.php/TCP_friendliness_testing_with_linux-2.6.23.9

Signed-off-by: Sangtae Ha <sha2@ncsu.edu>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-04 14:17:41 -08:00
YOSHIFUJI Hideaki
8be8af8fa4 [IPV4] UDP: Move IPv4-specific bits to other file.
Move IPv4-specific UDP bits from net/ipv4/udp.c into (new) net/ipv4/udp_ipv4.c.
Rename net/ipv4/udplite.c to net/ipv4/udplite_ipv4.c.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-04 15:18:22 +09:00
YOSHIFUJI Hideaki
e898d4db27 [UDP]: Allow users to configure UDP-Lite.
Let's give users an option for disabling UDP-Lite (~4K).

old:
|    text	   data	    bss	    dec	    hex	filename
|  286498	  12432	   6072	 305002	  4a76a	net/ipv4/built-in.o
|  193830	   8192	   3204	 205226	  321aa	net/ipv6/ipv6.o

new (without UDP-Lite):
|    text	   data	    bss	    dec	    hex	filename
|  284086	  12136	   5432	 301654	  49a56	net/ipv4/built-in.o
|  191835	   7832	   3076	 202743	  317f7	net/ipv6/ipv6.o

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-04 15:18:22 +09:00
Glenn Griffin
c6aefafb7e [TCP]: Add IPv6 support to TCP SYN cookies
Updated to incorporate Eric's suggestion of using a per cpu buffer
rather than allocating on the stack.  Just a two line change, but will
resend in it's entirety.

Signed-off-by: Glenn Griffin <ggriffin.kernel@gmail.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-04 15:18:21 +09:00
Eric Dumazet
11baab7ac3 [TCP]: lower stack usage in cookie_hash() function
400 bytes allocated on stack might be a litle bit too much. Using a
per_cpu var is more friendly.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-04 15:18:21 +09:00
Pavel Emelyanov
988b705077 [ARP]: Introduce the arp_hdr_len helper.
There are some place, that calculate the ARP header length. These
calculations are correct, but 
 a) some operate with "magic" constants,
 b) enlarge the code length (sometimes at the cost of coding style),
 c) are not informative from the first glance.

The proposal is to introduce a helper, that includes all the good
sides of these calculations.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-03 12:20:57 -08:00
Ilpo Järvinen
d152a7d88a [TCP]: Must count fack_count also when skipping
It makes fackets_out to grow too slowly compared with the
real write queue.

This shouldn't cause those BUG_TRAP(packets <= tp->packets_out)
to trigger but how knows how such inconsistent fackets_out
affects here and there around TCP when everything is nowadays
assuming accurate fackets_out. So lets see if this silences
them all.

Reported by Guillaume Chazarain <guichaz@gmail.com>.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-03 12:10:16 -08:00
Denis V. Lunev
7cd04fa7e3 [TCP]: Merge exit paths in tcp_v4_conn_request.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-03 11:59:32 -08:00
Denis V. Lunev
da7ef338a2 [IPV4]: skb->dst can't be NULL in ip_options_echo.
ip_options_echo is called on the packet input path after the initial
routing. The dst entry on the packet is cleared only in the several
very specific places and immidiately assigned back (may be new).

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-03 11:50:10 -08:00
Denis V. Lunev
1d1c8d13c4 [ICMP]: Section conflict between icmp_sk_init/icmp_sk_exit.
Functions from __exit section should not be called from ones in __init
section. Fix this conflict.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-29 14:15:19 -08:00
Denis V. Lunev
fd80eb942a [INET]: Remove struct dst_entry *dst from request_sock_ops.rtx_syn_ack.
It looks like dst parameter is used in this API due to historical
reasons.  Actually, it is really used in the direct call to
tcp_v4_send_synack only.  So, create a wrapper for tcp_v4_send_synack
and remove dst from rtx_syn_ack.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-29 11:43:03 -08:00
Pavel Emelyanov
665bba1087 [NETFILTER/RXRPC]: Don't use seq_release_private where inappropriate.
Some netfilter code and rxrpc one use seq_open() to open
a proc file, but seq_release_private to release one.

This is harmless, but ambiguous.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-29 11:39:17 -08:00