The btrfs IO submission threads try to service a bunch of devices with a small
number of threads. They do a congestion check to try and avoid waiting
on requests for a busy device.
The checks make sure we've sent a few requests down to a given device just so
that we aren't bouncing between busy devices without actually sending down
any IO. The counter used to decide if we can switch to the next device
is somewhat overloaded. It is also being used to decide if we've done
a good batch of requests between the WRITE_SYNC or regular priority lists.
It may get reset to zero often, leaving us hammering on a busy device
instead of moving on to another disk.
This commit adds a new counter for the number of bios sent while
servicing a device. It doesn't get reset or fiddled with. On
multi-device filesystems, this fixes IO stalls in streaming
write workloads.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs uses dedicated threads to submit bios when checksumming is on,
which allows us to make sure the threads dedicated to checksumming don't get
stuck waiting for requests. For each btrfs device, there are
two lists of bios. One list is for WRITE_SYNC bios and the other
is for regular priority bios.
The IO submission threads used to process all of the WRITE_SYNC bios first and
then switch to the regular bios. This commit makes sure we don't completely
starve the regular bios by rotating between the two lists.
WRITE_SYNC bios are still favored 2:1 over the regular bios, and this tries
to run in batches to avoid seeking. Benchmarking shows this eliminates
stalls during streaming buffered writes on both multi-device and
single device filesystems.
If the regular bios starve, the system can end up with a large amount of ram
pinned down in writeback pages. If we are a little more fair between the two
classes, we're able to keep throughput up and make progress on the bulk of
our dirty ram.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Once a metadata block has been written, it must be recowed, so the
btrfs dirty balancing call has a check to make sure a fair amount of metadata
was actually dirty before it started writing it back to disk.
A previous commit had changed the dirty tracking for metadata without
updating the btrfs dirty balancing checks. This commit switches it
to use the correct counter.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The block allocator in SSD mode will try to find groups of free blocks
that are close together. This commit makes it loop less on a given
group size before bumping it.
The end result is that we are less likely to fill small holes in the
available free space, but we don't waste as much CPU building the
large cluster used by ssd mode.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With the new back reference code, the cost of a balance has gone down
in terms of the number of back reference updates done. This commit
makes us more aggressively balance leaves and nodes as they become
less full.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When the delayed reference code was added, some checks were added
to avoid extra balancing while the delayed references were being flushed.
This made for less efficient btrees, but it reduced the chances of
loops where no forward progress was made because the balances made
more delayed ref updates.
With the new dead root removal code and the mixed back references,
the extent allocation tree is no longer using precise back refs, and
the delayed reference updates don't carry the risk of looping forever
anymore. So, the balance avoidance is no longer required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are some 'start = state->end + 1;' like code in set_extent_bit
and clear_extent_bit. They overflow when end == (u64)-1.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch turns on parallel scanning for the ata_piix driver.
This driver is used on most netbooks (no AHCI for cheap storage it seems).
The scan is the dominating time factor in the kernel boot for these
devices; with this flag it gets cut in half for the device I used
for testing (eeepc).
Alan took a look at the driver source and concluded that it ought to be safe
to do for this driver. Alan has also checked with the hardware team.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
When I thought it was finally defeated, it came back with vengeance.
The failure cases are ever more convoluted. Now there is a single
combination which fails boot probing - MCP5x + Intel SSD and there are
two hotplug failure reports on different flavors where softreset fails
to bring up the device.
Through the many bug reports after the switch to hardreset, the
following patterns emerged.
- Softreset during boot always works.
- Hardreset during boot sometimes fails to bring up the link on
certain comibnations and device signature acquisition is unreliable.
- Hardreset is often necessary after hotplug.
It looks like the old behavior of preferring softreset was somehow
pretty close to the working reset protocol although it could have lost
a device during phy error handling by issuing hardreset.
This patch implements nv_hardreset() which kicks in only for post-boot
(!LOADING) device probing resets. This should be able to work around
all known problem cases. This isn't perfect but given the various
hardreset quirks on these controllers, I think this is as good as it
can get.
Tested on mcp5x (swncq), nf3 and ck804 for all both boot, warm and
hot probing cases.
Kudos to all the bug reporters and their painful hours with these damn
controllers. ;-)
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Robert Hancock <hancockr@shaw.ca>
Reported-by: David Lang <david@lang.hm>
Reported-by: Samo Vodopivec <lament.email.si@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Community reported one SB600 SATA issue(BZ #9412), which led to 64 bit
DMA disablement for all SB600 revisions by driver maintainers with
commits c7a42156d9 and
4cde32fc4b.
But the root cause is ASUS M2A-VM system BIOS bug in old revisions
like 0901, while forcing into 32bit DMA happens to work as workaround.
Now it's time to withdraw 4cde32fc4b
so as to restore the SB600 SATA 64bit DMA capability.
This patch is also adding the workaround for M2A-VM old BIOS revisions,
but users are suggested to upgrade their system BIOS to the latest one
if they meet this issue.
Signed-off-by: Shane Huang <shane.huang@amd.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Currently report and stat catch SIGINT (and others) without altering
their exit state. This means that things like:
while :; do perf stat ./foo ; done
Loops become hard-to-interrupt, because bash never sees perf terminate
due to interruption. Fix this.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Create the counter in a disabled state and only enable it after we
mmap() the buffer, this allows us to see the first few samples (and
observe the frequency ramp).
Furthermore, print the period in the verbose report.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Also employ the overflow handler to adjust the frequency, this results
in a stable frequency in about 40~50 samples, instead of that many ticks.
This also means we can start sampling at a sample period of 1 without
running head-first into the throttle.
It relies on sched_clock() to accurately measure the time difference
between the overflow NMIs.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If userspace specifies a memory slot that is larger than 8 petabytes, it
could overflow the largepages variable.
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
If a slots guest physical address and host virtual address unequal (mod
large page size), then we would erronously try to back guest large pages
with host large pages. Detect this misalignment and diable large page
support for the trouble slot.
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
Combined mode pci quirk hacks went away - so the table to keep in sync
no longer exists.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
We can't do this for the later ones as they have all sorts of magic boot
time stuff that needs reviewing and the like. However we can do it for the
older ones and it turns out we need to as some IBM docking stations have a
second PIIX series device in them and without this change you can't use it
very well
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Make the following EM related cleanups.
* Use msleep(1) instead of udelay(100) and reduce retry count to 5.
* s/MAX_SLOTS/EM_MAX_SLOTS/, s/MAX_RETRY/EM_MAX_RETRY/
* Make EM constants enums as suggested by Jeff.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Milburn <dmilburn@redhat.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
We very rarely (if ever) complete more than one command in the
sactive mask at the time, even for extremely high IO rates. So
looping over the entire range of possible tags is pointless,
instead use __ffs() to just find the completed tags directly.
Updated to clear the tag from the done_mask instead of shifting
done_mask down as suggested by From: Tejun Heo <htejun@gmail.com>
Verified with a user space tester to produce the same results.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
32-bit PIO seems to work fine on sata_sil hardware (tested on SiI3114) and is
listed as OK in the Silicon Image datasheets. Enable it.
Signed-off-by: Robert Hancock <hancockrwd@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
ECC initialization takes too long. It writes zeroes by portions
of 4 byte, it takes more than 6 minutes on my machine to initialize
512Mb ECC DIMM module. Change portion to 128Kb - it significantly
reduces initialization time.
Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Handling of the trailing byte in ata_sff_data_xfer() is suboptimal bacause:
- it always initializes the padding buffer to 0 which is not really needed in
both the read and write cases;
- it has to use memcpy() to transfer a single byte from/to the padding buffer;
- it uses io{read|write}16() accessors which swap bytes on the big endian CPUs
and so have to additionally convert the data from/to the little endian format
instead of using io{read|write}16_rep() accessors which are not supposed to
change the byte ordering.
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Fix the model number of Intel Core2 processors according to the
documentation: Intel Processor Identification with the CPUID
Instruction: http://www.intel.com/support/processors/sb/cs-009861.htm
Signed-off-by: Yong Wang <yong.y.wang@intel.com>
Also-Reported-by: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <20090610090612.GA26580@ywang-moblin2.bj.intel.com>
[ Added two more model numbers suggested by Arnd Bergmann ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Prevent EDAC compilation units from being built by default and let the
user explicitly select the needed modules.
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Tested-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
While at it, fix a link failure when !K8_NB.
Acked-by: Doug Thompson <dougthompson@xmission.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Tested-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Also, link into Kbuild by adding Kconfig and Makefile entries.
Borislav:
- Kconfig/Makefile splitting
- use zero-sized arrays for the sysfs attrs if not enabled
- rename sysfs attrs to more conform values
- shorten CONFIG_ names
- make multiple structure members assignment vertically aligned
- fix/cleanup comments
- fix function return value patterns
- fix err labels
- fix a memleak bug caught by Ingo
- remove the NUMA dependency and use num_k8_northbrides for initializing
a driver instance per NB.
- do not copy the pvt contents into the mci struct in
amd64_init_2nd_stage() and save it in the mci->pvt_info void ptr
instead.
- cleanup debug calls
- simplify amd64_setup_pci_device()
Reviewed-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Borislav:
- convert to the new {rd|wr}msr_on_cpus interfaces.
- convert pvt->old_mcgctl to a bitmask thus saving some bytes
- fix/cleanup comments
- fix function return value patterns
- add a proper bugfix found by Doug to amd64_check_ecc_enabled where we
missed checking for the ECC enabled bit in NB CFG.
- cleanup debug calls
Reviewed-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Borislav:
- compute dct_sel_base_off in f10_match_to_this_node() correctly since
it cannot be assumed that the Reserved bits are zero and they have to be
masked out instead.
- cleanup, remove StinkyIdentifiers, simplify logic
- fix function return value patterns
- cleanup debug calls
Reviewed-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Borislav:
- fix a wrong negation in f10_determine_base_addr_offset()
- fix a wrong mask in f10_determine_base_addr_offset() which should
select DctSelBaseAddr[31:11] and not [31:16] as it was before
- remove StinkyIdentifiers, trivially simplify code.
- fix/cleanup comments
- fix function return value patterns
Reviewed-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Borislav:
Fail f10_early_channel_count() if error encountered while reading a NB
register since those cached register contents are accessed afterwards.
- fix/cleanup comments
- fix function return value patterns
- cleanup debug calls
Reviewed-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Borislav:
- rename sysfs attrs to more conform names
- cleanup/fix comments according to BKDG text
- fix function return value patterns
- cleanup debug calls
Reviewed-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
This is for dumping different registers and testing the address mapping
logic using the ECC syndromes.
Borislav:
- split sysfs attrs per file
- use more conform names for the sysfs attrs
- fix function return value patterns
- cleanup/fix comments
- cleanup debug calls
Reviewed-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>