1. 05 Nov, 2014 40 commits
    • KVM: x86: Check non-canonical addresses upon WRMSR · 76715b56
      Nadav Amit authored
      commit 854e8bb1 upstream.
      
      Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is
      written to certain MSRs. The behavior is "almost" identical for AMD and Intel
      (ignoring MSRs that are not implemented in either architecture since they would
      anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if a
      non-canonical address is written on Intel but not on AMD (which ignores the top
      32 bits).
      
      Accordingly, this patch injects a #GP on the MSRs which behave identically on
      Intel and AMD.  To eliminate the differences between the architectures, the
      value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is converted
      to a canonical value before writing instead of injecting a #GP.
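      
      A minimal sketch of the two behaviours described above, assuming 48
      implemented virtual-address bits (so bits 63:48 must all equal bit 47).
      The names make_canonical(), is_noncanonical(), emulate_wrmsr() and the
      msr_class grouping are illustrative only and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 48 implemented virtual-address bits. */
#define VADDR_BITS 48
#define VADDR_MASK ((1ULL << VADDR_BITS) - 1)

/* Copy bit 47 into bits 63:48, which yields the canonical form. */
uint64_t make_canonical(uint64_t la)
{
	uint64_t low = la & VADDR_MASK;

	if (low & (1ULL << (VADDR_BITS - 1)))
		low |= ~VADDR_MASK;
	return low;
}

bool is_noncanonical(uint64_t la)
{
	return make_canonical(la) != la;
}

/* Hypothetical grouping of the MSRs discussed in the text. */
enum msr_class {
	MSR_STRICT,	/* e.g. FS/GS base, LSTAR: #GP on non-canonical */
	MSR_SYSENTER	/* SYSENTER_ESP/EIP: value is made canonical */
};

/* Returns false when the write should raise #GP, true when *stored was set. */
bool emulate_wrmsr(enum msr_class cls, uint64_t data, uint64_t *stored)
{
	assert(stored);
	if (cls == MSR_SYSENTER)
		data = make_canonical(data);
	else if (is_noncanonical(data))
		return false;
	*stored = data;
	return true;
}
```

      With this model, a write of 0x0000800000000000 is rejected for the strict
      class but stored as 0xffff800000000000 for the SYSENTER class.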
      
      Some references from Intel and AMD manuals:
      
      According to the Intel SDM description of the WRMSR instruction, #GP is expected on
      WRMSR "If the source register contains a non-canonical address and ECX
      specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
      IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP."
      
      According to the AMD instruction manual:
      LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the
      LSTAR and CSTAR registers.  If an RIP written by WRMSR is not in canonical
      form, a general-protection exception (#GP) occurs."
      IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the
      base field must be in canonical form or a #GP fault will occur."
      IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must
      be in canonical form."
      
      This patch fixes CVE-2014-3610.
      Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      [bwh: Backported to 3.2:
       - The various set_msr() functions all separate msr_index and data parameters]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • ipv6: reuse ip6_frag_id from ip6_ufo_append_data · 8db33010
      Hannes Frederic Sowa authored
      commit 916e4cf4 upstream.
      
      Currently we generate a new fragmentation id on UFO segmentation. It
      is pretty hairy to identify the correct net namespace and dst there.
      Especially tunnels use IFF_XMIT_DST_RELEASE and thus have no skb_dst
      available at all.
      
      This causes unreliable or very predictable ipv6 fragmentation id
      generation during segmentation.
      
      Luckily, we have already pregenerated the ip6_frag_id in
      ip6_ufo_append_data and can use it here.
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.2: adjust filename, indentation]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • ext4: fix BUG_ON in mb_free_blocks() · 4c844312
      Theodore Ts'o authored
      commit c99d1e6e upstream.
      
      If we suffer a block allocation failure (for example due to a memory
      allocation failure), it's possible that we will call
      ext4_discard_allocated_blocks() before we've actually allocated any
      blocks.  In that case, fe_len and fe_start in ac->ac_f_ex will still
      be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0)
      triggering the BUG_ON in mb_free_blocks():
      
      	BUG_ON(last >= (sb->s_blocksize << 3));
      
      Fix this by bailing out of ext4_discard_allocated_blocks() if fe_len
      is zero.
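      
      A small sketch of why a (0, 0) request trips the check and how an early
      bail-out avoids it. It assumes that the last block is computed as
      first + count - 1 in a signed int and then compared against an unsigned
      bound; both function names below are hypothetical stand-ins, not the
      kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Models the range check; returns true when the BUG_ON condition would fire.
 * Assumption: 'last' is a signed int compared against an unsigned long bound,
 * so -1 is converted to a huge unsigned value.
 */
bool range_check_fires(int first, int count, unsigned long blocksize)
{
	int last = first + count - 1;	/* first = 0, count = 0 gives -1 */

	return (unsigned long)last >= (blocksize << 3);
}

/*
 * Hypothetical stand-in for the discard path: returns true if the free
 * path was entered, false if it bailed out because nothing was allocated.
 */
bool discard_allocated(int fe_start, int fe_len, unsigned long blocksize)
{
	if (fe_len == 0)
		return false;	/* nothing allocated, nothing to free */

	assert(!range_check_fires(fe_start, fe_len, blocksize));
	return true;
}
```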
      
      Also fix a missing ext4_mb_unload_buddy() call in
      ext4_discard_allocated_blocks().
      
      Google-Bug-Id: 16844242
      
      Fixes: 86f0afd4
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • ipv6: reallocate addrconf router for ipv6 address when lo device up · a0a8667a
      chenweilong authored
      It fixes bug 67951 on bugzilla:
      https://bugzilla.kernel.org/show_bug.cgi?id=67951
      
      The upstream patch can't be applied directly, as it uses ip6_rt_put(), a
      function introduced by commit 94e187c0, and that commit can't be applied
      directly either.
      
      ====================
      
      From: Gao feng <gaofeng@cn.fujitsu.com>
      
      commit 33d99113 upstream.
      
      This commit doesn't have a stable tag, but it fixes the bug of
      no reply after loopback down-up. It is well worth
      applying to stable 3.4 kernels.
      
      The bug is 67951 on bugzilla
      https://bugzilla.kernel.org/show_bug.cgi?id=67951
      
      
      CC: Sabrina Dubroca <sd@queasysnail.net>
      CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Reported-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      [weilong: s/ip6_rt_put/dst_release]
      Signed-off-by: Chen Weilong <chenweilong@huawei.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • ipv4: disable bh while doing route gc · 4715883b
      Marcelo Ricardo Leitner authored
      Further tests revealed that moving the garbage collector to a work
      queue and protecting it with a spinlock may leave the system prone to
      soft lockups if the bottom half gets very busy.
      
      It was reproduced with a set of firewall rules that REJECTed packets. If
      the NIC bottom half handler ends up running on the same CPU that is
      running the garbage collector on a very large cache, the garbage
      collector will not be able to do its job due to the amount of work
      needed for handling the REJECTs and also won't reschedule.
      
      The fix is to disable bottom halves during garbage collection, as was
      already the case in the first place (most calls to it came from softirqs).
      Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • ipv4: avoid parallel route cache gc executions · ad5ca98f
      Marcelo Ricardo Leitner authored
      When rt_intern_hash() has to deal with neighbour cache overflowing,
      it triggers the route cache garbage collector in an attempt to free
      some references on neighbour entries.
      
      Such a call cannot be done asynchronously, but it should also not run in
      parallel with an already-running one, so that they don't collapse fighting
      over the hash lock entries.
      
      This patch thus blocks parallel executions with spinlocks:
      - A call from the worker and a call from rt_intern_hash() are not the same
      and cannot be merged, thus they will wait for each other on rt_gc_lock.
      - Calls to gc from rt_intern_hash() may happen in parallel but we must
      wait for it to finish in order to try again. This dedup and
      synchronization is then performed by the locking just before calling
      __do_rt_garbage_collect().
      Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • ipv4: move route garbage collector to work queue · 6c383b3a
      Marcelo Ricardo Leitner authored
      Currently the route garbage collector gets called by dst_alloc() if it
      has more entries than the threshold. But it's an expensive call that
      doesn't really need to be done at that point.
      
      Another issue with the current approach is that it allows running the garbage
      collector with the same start parameters on multiple CPUs at once, which
      is not optimal. A system may even soft lockup if the cache is big enough
      as the garbage collectors will be fighting over the hash lock entries.
      
      This patch thus moves the garbage collector to run asynchronously on a
      work queue, much like how rt_expire_check runs.
      
      There is one condition left that allows multiple executions, which is
      handled by the next patch.
      Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • MIPS: Fix forgotten preempt_enable() when CPU has inclusive pcaches · d521f4ba
      Yoichi Yuasa authored
      commit 5596b0b2 upstream.
      
      [    1.904000] BUG: scheduling while atomic: swapper/1/0x00000002
      [    1.908000] Modules linked in:
      [    1.916000] CPU: 0 PID: 1 Comm: swapper Not tainted 3.12.0-rc2-lemote-los.git-5318619-dirty #1
      [    1.920000] Stack : 0000000031aac000 ffffffff810d0000 0000000000000052 ffffffff802730a4
                0000000000000000 0000000000000001 ffffffff810cdf90 ffffffff810d0000
                ffffffff8068b968 ffffffff806f5537 ffffffff810cdf90 980000009f0782e8
                0000000000000001 ffffffff80720000 ffffffff806b0000 980000009f078000
                980000009f290000 ffffffff805f312c 980000009f05b5d8 ffffffff80233518
                980000009f05b5e8 ffffffff80274b7c 980000009f078000 ffffffff8068b968
                0000000000000000 0000000000000000 0000000000000000 0000000000000000
                0000000000000000 980000009f05b520 0000000000000000 ffffffff805f2f6c
                0000000000000000 ffffffff80700000 ffffffff80700000 ffffffff806fc758
                ffffffff80700000 ffffffff8020be98 ffffffff806fceb0 ffffffff805f2f6c
                ...
      [    2.028000] Call Trace:
      [    2.032000] [<ffffffff8020be98>] show_stack+0x80/0x98
      [    2.036000] [<ffffffff805f2f6c>] __schedule_bug+0x44/0x6c
      [    2.040000] [<ffffffff805fac58>] __schedule+0x518/0x5b0
      [    2.044000] [<ffffffff805f8a58>] schedule_timeout+0x128/0x1f0
      [    2.048000] [<ffffffff80240314>] msleep+0x3c/0x60
      [    2.052000] [<ffffffff80495400>] do_probe+0x238/0x3a8
      [    2.056000] [<ffffffff804958b0>] ide_probe_port+0x340/0x7e8
      [    2.060000] [<ffffffff80496028>] ide_host_register+0x2d0/0x7a8
      [    2.064000] [<ffffffff8049c65c>] ide_pci_init_two+0x4e4/0x790
      [    2.068000] [<ffffffff8049f9b8>] amd74xx_probe+0x148/0x2c8
      [    2.072000] [<ffffffff803f571c>] pci_device_probe+0xc4/0x130
      [    2.076000] [<ffffffff80478f60>] driver_probe_device+0x98/0x270
      [    2.080000] [<ffffffff80479298>] __driver_attach+0xe0/0xe8
      [    2.084000] [<ffffffff80476ab0>] bus_for_each_dev+0x78/0xe0
      [    2.088000] [<ffffffff80478468>] bus_add_driver+0x230/0x310
      [    2.092000] [<ffffffff80479b44>] driver_register+0x84/0x158
      [    2.096000] [<ffffffff80200504>] do_one_initcall+0x104/0x160
      Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org>
      Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
      Tested-by: Aaro Koskinen <aaro.koskinen@iki.fi>
      Cc: linux-mips@linux-mips.org
      Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
      Patchwork: https://patchwork.linux-mips.org/patch/5941/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • init/Kconfig: Hide printk log config if CONFIG_PRINTK=n · ff872daa
      Josh Triplett authored
      commit 361e9dfb upstream.
      
      The buffers sized by CONFIG_LOG_BUF_SHIFT and
      CONFIG_LOG_CPU_MAX_BUF_SHIFT do not exist if CONFIG_PRINTK=n, so don't
      ask about their size at all.
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      [bwh: Backported to 3.2: drop change to CONFIG_LOG_CPU_MAX_BUF_SHIFT]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • perf: fix perf bug in fork() · 96cb09b8
      Peter Zijlstra authored
      commit 6c72e350 upstream.
      
      Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
      calling perf_event_free_task() when failing sched_fork() we will not yet
      have done the memset() on ->perf_event_ctxp[] and will therefore try and
      'free' the inherited contexts, which are still in use by the parent
      process.  This is bad.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Sylvain 'ythier' Hitier <sylvain.hitier@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • mm: migrate: Close race between migration completion and mprotect · b47191f7
      Mel Gorman authored
      commit d3cb8bf6 upstream.
      
      A migration entry is marked as write if pte_write was true at the time the
      entry was created. The VMA protections are not double checked when migration
      entries are being removed as mprotect marks write-migration-entries as
      read. It means that potentially we take a spurious fault to mark PTEs write
      again but it's straightforward. However, there is a race between write
      migrations being marked read and migrations finishing. This potentially
      allows a PTE to be writable that should have been read-only. Close this race by
      double checking the VMA permissions using maybe_mkwrite when migration
      completes.
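      
      The idea can be modelled in a few lines of userspace C. This is only a
      sketch: the struct and function names ending in _model are hypothetical,
      and the flag value is chosen for illustration. The point is that a write
      migration entry only becomes a writable PTE if the VMA still permits
      writes at completion time.

```c
#include <assert.h>
#include <stdbool.h>

#define VM_WRITE 0x2UL	/* illustrative flag value for this sketch */

struct vma_model { unsigned long vm_flags; };
struct pte_model { bool write; };

/* Grant write access only if the VMA currently allows it. */
struct pte_model maybe_mkwrite_model(struct pte_model pte,
				     const struct vma_model *vma)
{
	assert(vma);
	if (vma->vm_flags & VM_WRITE)
		pte.write = true;
	return pte;
}

/* Rebuild the PTE when migration completes. */
struct pte_model remove_migration_pte_model(bool entry_is_write,
					    const struct vma_model *vma)
{
	struct pte_model pte = { .write = false };

	if (entry_is_write)
		pte = maybe_mkwrite_model(pte, vma);	/* re-check the VMA */
	return pte;
}
```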
      
      [torvalds@linux-foundation.org: use maybe_mkwrite]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • shmem: fix nlink for rename overwrite directory · 6bf3b2e3
      Miklos Szeredi authored
      commit b928095b upstream.
      
      When rename overwrites an empty directory, the extra nlink needs to be
      dropped.
      
      Test prog:
      
      #include <stdio.h>
      #include <fcntl.h>
      #include <err.h>
      #include <sys/stat.h>
      
      int main(void)
      {
      	const char *test_dir1 = "test-dir1";
      	const char *test_dir2 = "test-dir2";
      	int res;
      	int fd;
      	struct stat statbuf;
      
      	res = mkdir(test_dir1, 0777);
      	if (res == -1)
      		err(1, "mkdir(\"%s\")", test_dir1);
      
      	res = mkdir(test_dir2, 0777);
      	if (res == -1)
      		err(1, "mkdir(\"%s\")", test_dir2);
      
      	fd = open(test_dir2, O_RDONLY);
      	if (fd == -1)
      		err(1, "open(\"%s\")", test_dir2);
      
      	res = rename(test_dir1, test_dir2);
      	if (res == -1)
      		err(1, "rename(\"%s\", \"%s\")", test_dir1, test_dir2);
      
      	res = fstat(fd, &statbuf);
      	if (res == -1)
      		err(1, "fstat(%i)", fd);
      
      	if (statbuf.st_nlink != 0) {
      		fprintf(stderr, "nlink is %lu, should be 0\n", (unsigned long)statbuf.st_nlink);
      		return 1;
      	}
      
      	return 0;
      }
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • ocfs2/dlm: do not get resource spinlock if lockres is new · 40d31c49
      Joseph Qi authored
      commit 5760a97c upstream.
      
      There is a deadlock case which was reported by Guozhonghua:
        https://oss.oracle.com/pipermail/ocfs2-devel/2014-September/010079.html
      
      This case is caused by &res->spinlock and &dlm->master_lock
      misordering in different threads.
      
      It was introduced by commit 8d400b81 ("ocfs2/dlm: Clean up refmap
      helpers").  Since lockres is new, it does not require the
      &res->spinlock.  So remove it.
      
      Fixes: 8d400b81 ("ocfs2/dlm: Clean up refmap helpers")
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: joyce.xue <xuejiufei@huawei.com>
      Reported-by: Guozhonghua <guozhonghua@h3c.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • nilfs2: fix data loss with mmap() · a428c0e1
      Andreas Rohner authored
      commit 56d7acc7 upstream.
      
      This bug leads to reproducible silent data loss, despite the use of
      msync(), sync() and a clean unmount of the file system.  It is easily
      reproducible with the following script:
      
        ----------------[BEGIN SCRIPT]--------------------
        mkfs.nilfs2 -f /dev/sdb
        mount /dev/sdb /mnt
      
        dd if=/dev/zero bs=1M count=30 of=/mnt/testfile
      
        umount /mnt
        mount /dev/sdb /mnt
        CHECKSUM_BEFORE="$(md5sum /mnt/testfile)"
      
        /root/mmaptest/mmaptest /mnt/testfile 30 10 5
      
        sync
        CHECKSUM_AFTER="$(md5sum /mnt/testfile)"
        umount /mnt
        mount /dev/sdb /mnt
        CHECKSUM_AFTER_REMOUNT="$(md5sum /mnt/testfile)"
        umount /mnt
      
        echo "BEFORE MMAP:\t$CHECKSUM_BEFORE"
        echo "AFTER MMAP:\t$CHECKSUM_AFTER"
        echo "AFTER REMOUNT:\t$CHECKSUM_AFTER_REMOUNT"
        ----------------[END SCRIPT]--------------------
      
      The mmaptest tool looks something like this (very simplified, with
      error checking removed):
      
        ----------------[BEGIN mmaptest]--------------------
        data = mmap(NULL, file_size - file_offset, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, file_offset);
      
        for (i = 0; i < write_count; ++i) {
              memcpy(data + i * 4096, buf, sizeof(buf));
              msync(data, file_size - file_offset, MS_SYNC);
        }
        ----------------[END mmaptest]--------------------
      
      The output of the script looks something like this:
      
        BEFORE MMAP:    281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile
        AFTER MMAP:     6604a1c31f10780331a6850371b3a313  /mnt/testfile
        AFTER REMOUNT:  281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile
      
      So it is clear that the changes done using mmap() do not survive a
      remount.  This can be reproduced 100% of the time.  The problem was
      introduced in commit 136e8770 ("nilfs2: fix issue of
      nilfs_set_page_dirty() for page at EOF boundary").
      
      If the page was read with mpage_readpage() or mpage_readpages() for
      example, then it has no buffers attached to it.  In that case
      page_has_buffers(page) in nilfs_set_page_dirty() will be false.
      Therefore nilfs_set_file_dirty() is never called and the pages are never
      collected and never written to disk.
      
      This patch fixes the problem by also calling nilfs_set_file_dirty() if the
      page has no buffers attached to it.
      
      [akpm@linux-foundation.org: s/PAGE_SHIFT/PAGE_CACHE_SHIFT/]
      Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
      Tested-by: Andreas Rohner <andreas.rohner@gmx.net>
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • MIPS: mcount: Adjust stack pointer for static trace in MIPS32 · bbc3708a
      Markos Chandras authored
      commit 8a574cfa upstream.
      
      Every mcount() call in the MIPS 32-bit kernel is done as follows:
      
      [...]
      move at, ra
      jal _mcount
      addiu sp, sp, -8
      [...]
      
      but upon returning from the mcount() function, the stack pointer
      is not adjusted properly. This is explained in detail in 58b69401
      (MIPS: Function tracer: Fix broken function tracing).
      
      Commit ad8c3969 ("MIPS: Unbreak function tracer for 64-bit kernel.")
      fixed the stack manipulation for 64-bit but it didn't fix it completely
      for MIPS32.
      Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/7792/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • ARM: 8165/1: alignment: don't break misaligned NEON load/store · edeb8f82
      Robin Murphy authored
      commit 5ca918e5 upstream.
      
      The alignment fixup incorrectly decodes faulting ARM VLDn/VSTn
      instructions (where the optional alignment hint is given but incorrect)
      as LDR/STR, leading to register corruption. Detect these and correctly
      treat them as unhandled, so that userspace gets the fault it expects.
      Reported-by: Simon Hosie <simon.hosie@arm.com>
      Signed-off-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • sched: Fix unreleased llc_shared_mask bit during CPU hotplug · fdb7a047
      Wanpeng Li authored
      commit 03bd4e1f upstream.
      
      The following bug can be triggered by hot adding and removing a large number of
      xen domain0's vcpus repeatedly:
      
      	BUG: unable to handle kernel NULL pointer dereference at 0000000000000004 IP: [..] find_busiest_group
      	PGD 5a9d5067 PUD 13067 PMD 0
      	Oops: 0000 [#3] SMP
      	[...]
      	Call Trace:
      	load_balance
      	? _raw_spin_unlock_irqrestore
      	idle_balance
      	__schedule
      	schedule
      	schedule_timeout
      	? lock_timer_base
      	schedule_timeout_uninterruptible
      	msleep
      	lock_device_hotplug_sysfs
      	online_store
      	dev_attr_store
      	sysfs_write_file
      	vfs_write
      	SyS_write
      	system_call_fastpath
      
      Last level cache shared mask is built during CPU up and the
      build_sched_domain() routine takes advantage of it to setup
      the sched domain CPU topology.
      
      However, llc_shared_mask is not released during CPU disable,
      which leads to an invalid sched domain CPU topology.
      
      This patch fixes it by releasing the llc_shared_mask correctly
      during CPU disable.
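      
      A toy model of the bookkeeping, assuming at most 8 CPUs and one 64-bit
      word per CPU instead of a real cpumask. The array and the two helpers
      are hypothetical and only illustrate why the bits have to be cleared on
      CPU disable: without llc_remove_cpu(), a CPU that returns with different
      siblings would still carry its stale mask.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_MODEL 8

/* One sharing mask per CPU; bit i set means "shares the LLC with CPU i". */
uint64_t llc_mask[NR_CPUS_MODEL];

/* CPU up: link cpu with the online CPUs that share its last-level cache. */
void llc_add_cpu(int cpu, uint64_t llc_siblings_online)
{
	uint64_t group = llc_siblings_online | (1ULL << cpu);

	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	for (int i = 0; i < NR_CPUS_MODEL; i++)
		if (group & (1ULL << i))
			llc_mask[i] |= group;
}

/* CPU disable: drop cpu from every sibling's mask and clear its own mask. */
void llc_remove_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	for (int i = 0; i < NR_CPUS_MODEL; i++)
		llc_mask[i] &= ~(1ULL << cpu);
	llc_mask[cpu] = 0;
}
```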
      
      Yasuaki also reported that this can happen on real hardware:
      
        https://lkml.org/lkml/2014/7/22/1018
      
      His case is here:
      
      	==
      	Here is an example on my system.
      	My system has 4 sockets and each socket has 15 cores and HT is
      	enabled. In this case, each core of sockets is numbered as
      	follows:
      
      		 | CPU#
      	Socket#0 | 0-14 , 60-74
      	Socket#1 | 15-29, 75-89
      	Socket#2 | 30-44, 90-104
      	Socket#3 | 45-59, 105-119
      
      	Then llc_shared_mask of CPU#30 has 0x3fff80000001fffc0000000.
      
      	It means that last level cache of Socket#2 is shared with
      	CPU#30-44 and 90-104.
      
      	When hot-removing socket#2 and #3, each core of sockets is
      	numbered as follows:
      
      		 | CPU#
      	Socket#0 | 0-14 , 60-74
      	Socket#1 | 15-29, 75-89
      
      	But llc_shared_mask is not cleared. So llc_shared_mask of CPU#30
      	still has 0x3fff80000001fffc0000000.
      
      	After that, when hot-adding socket#2 and #3, each core of
      	sockets is numbered as follows:
      
      		 | CPU#
      	Socket#0 | 0-14 , 60-74
      	Socket#1 | 15-29, 75-89
      	Socket#2 | 30-59
      	Socket#3 | 90-119
      
      	Then llc_shared_mask of CPU#30 becomes
      	0x3fff8000fffffffc0000000. It means that last level cache of
      	Socket#2 is shared with CPU#30-59 and 90-104. So the mask has
      	the wrong value.
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      Tested-by: Linn Crosetto <linn@hp.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Toshi Kani <toshi.kani@hp.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1411547885-48165-1-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • parisc: Only use -mfast-indirect-calls option for 32-bit kernel builds · cfda9893
      John David Anglin authored
      commit d26a7730 upstream.
      
      In spite of what the GCC manual says, the -mfast-indirect-calls option has
      never been supported in the 64-bit parisc compiler. Indirect calls have
      always been done using function descriptors irrespective of the
      -mfast-indirect-calls option.
      
      Recently, it was noticed that a function descriptor was always requested
      when the -mfast-indirect-calls option was specified. This caused
      problems when the option was used in application code and doesn't make
      any sense because the whole point of the option is to avoid using a
      function descriptor for indirect calls.
      
      Fixing this broke 64-bit kernel builds.
      
      I will fix GCC but for now we need the attached change. This results in
      the same kernel code as before.
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Fix nasty 32-bit overflow bug in buffer i/o code. · 918b2160
      Anton Altaparmakov authored
      commit f2d5a944 upstream.
      
      On 32-bit architectures, the legacy buffer_head functions are not always
      handling the sector number with the proper 64-bit types, and will thus
      fail on 4TB+ disks.
      
      Any code that uses __getblk() (and thus bread(), breadahead(),
      sb_bread(), sb_breadahead(), sb_getblk()), and calls it using a 64-bit
      block on a 32-bit arch (where "long" is 32-bit) causes an infinite loop
      in __getblk_slow() with an infinite stream of errors logged to dmesg
      like this:
      
        __find_get_block_slow() failed. block=6740375944, b_blocknr=2445408648
        b_state=0x00000020, b_size=512
        device sda1 blocksize: 512
      
      Note how in hex block is 0x191C1F988 and b_blocknr is 0x91C1F988, i.e. the
      top 32 bits are missing (in this case the 0x1 at the top).
      
      This is because grow_dev_page() is broken and has a 32-bit overflow: the
      page index value (a pgoff_t, which is just 32 bits on 32-bit
      architectures) is left-shifted to obtain the block number, and the top
      bits get lost because the pgoff_t is not type cast to sector_t / 64-bit
      before the shift.
      
      This patch fixes this issue by type casting "index" to sector_t before
      doing the left shift.
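      
      The truncation can be reproduced in userspace. The sketch below assumes
      4096-byte pages and 512-byte blocks, so the shift (called sizebits here)
      is 3 and the page index for block 6740375944 is 842546993; uint32_t
      stands in for a 32-bit pgoff_t and uint64_t for sector_t. The function
      names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy form: the shift happens in 32 bits, so bits above bit 31 are lost. */
uint64_t block_from_index_truncated(uint32_t index, unsigned int sizebits)
{
	assert(sizebits < 32);
	uint32_t shifted = index << sizebits;	/* wraps modulo 2^32 */

	return shifted;
}

/* Fixed form: widen to 64 bits first, then shift. */
uint64_t block_from_index_widened(uint32_t index, unsigned int sizebits)
{
	assert(sizebits < 32);
	return (uint64_t)index << sizebits;
}
```

      For index 842546993 and sizebits 3, the truncated form yields 2445408648
      (0x91C1F988), the b_blocknr seen in the log above, while the widened form
      yields the requested block 6740375944 (0x191C1F988).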
      
      Note this is not a theoretical bug but has been seen in the field on a
      4TiB hard drive with logical sector size 512 bytes.
      
      This patch has been verified to fix the infinite loop problem on 3.17-rc5
      kernel using a 4TB disk image mounted using "-o loop".  Without this patch
      doing a "find /nt" where /nt is an NTFS volume causes the infinite loop
      100% reproducibly whilst with the patch it works fine as expected.
      Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • ALSA: pcm: fix fifo_size frame calculation · 53412c3f
      Clemens Ladisch authored
      commit a9960e6a upstream.
      
      The calculated frame size was wrong because snd_pcm_format_physical_width()
      actually returns the number of bits, not bytes.
      
      Use snd_pcm_format_size() instead, which not only returns bytes, but also
      simplifies the calculation.
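      
      The unit mix-up can be illustrated with plain arithmetic. This is a
      sketch under assumptions: the two functions are hypothetical and do not
      reproduce the driver code; they only show what happens when a sample
      width in bits is used as if it were a size in bytes while converting a
      FIFO size from bytes to frames.

```c
#include <assert.h>

/* Wrong: channels * width_bits is a size in bits, used here as bytes. */
unsigned int fifo_frames_bits_as_bytes(unsigned int fifo_bytes,
				       unsigned int channels,
				       unsigned int width_bits)
{
	unsigned int frame_size = channels * width_bits;

	return frame_size ? fifo_bytes / frame_size : fifo_bytes;
}

/* Right: convert the sample width to bytes before dividing. */
unsigned int fifo_frames_in_bytes(unsigned int fifo_bytes,
				  unsigned int channels,
				  unsigned int width_bits)
{
	assert(width_bits % 8 == 0);	/* whole-byte sample formats only */
	unsigned int frame_bytes = channels * (width_bits / 8);

	return frame_bytes ? fifo_bytes / frame_bytes : fifo_bytes;
}
```

      For a 256-byte FIFO with two channels of 16-bit samples, the first form
      reports 8 frames and the second reports 64 frames, a factor of eight
      apart.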
      
      Fixes: 8bea869c ("ALSA: PCM midlevel: improve fifo_size handling")
      Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • can: at91_can: add missing prepare and unprepare of the clock · 51562cd4
      David Dueck authored
      commit e77980e5 upstream.
      
      In order to make the driver work with the common clock framework, this patch
      converts the clk_enable()/clk_disable() to
      clk_prepare_enable()/clk_disable_unprepare(). While there, add the missing
      error handling.
      Signed-off-by: David Dueck <davidcdueck@googlemail.com>
      Signed-off-by: Anthony Harivel <anthony.harivel@emtrion.de>
      Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • can: flexcan: put TX mailbox into TX_INACTIVE mode after tx-complete · 3158bdb2
      Marc Kleine-Budde authored
      commit de594488 upstream.
      
      After sending an RTR frame the TX mailbox becomes an RX_EMPTY mailbox. To avoid
      side effects when the RX-FIFO is full, this patch puts the TX mailbox into
      TX_INACTIVE mode in the transmission complete interrupt handler. This, of
      course, leaves a race window between the actual completion of the transmission
      and the handling of the tx-complete interrupt. However, this is the best we can do
      without busy polling the tx complete interrupt.
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • David Jander's avatar
      can: flexcan: implement workaround for errata ERR005829 · 7056cbe8
      David Jander authored
      commit 25e92445 upstream.
      
      This patch implements the workaround mentioned in ERR005829:
      
          ERR005829: FlexCAN: FlexCAN does not transmit a message that is enabled to
          be transmitted in a specific moment during the arbitration process.
      
      Workaround: The workaround consists of two extra steps after setting up a
      message for transmission:
      
      Step 8: Reserve the first valid mailbox as an inactive mailbox (CODE=0b1000).
      If RX FIFO is disabled, this mailbox must be message buffer 0. Otherwise, the
      first valid mailbox can be found using the "RX FIFO filters" table in the
      FlexCAN chapter of the chip reference manual.
      
      Step 9: Write twice INACTIVE code (0b1000) into the first valid mailbox.
      Signed-off-by: David Jander <david@protonic.nl>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • David Jander's avatar
      can: flexcan: correctly initialize mailboxes · d306d951
      David Jander authored
      commit fc05b884 upstream.
      
      Apparently mailboxes may contain random data at startup, causing some of them
      to be prepared for message reception. This can cause overruns to be missed or
      even confuse the IRQ check for transmitted messages, increasing the transmit
      counter instead of the error counter.
      
      This patch initializes all mailboxes after the FIFO as RX_INACTIVE.
      Signed-off-by: David Jander <david@protonic.nl>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Marc Kleine-Budde's avatar
      can: flexcan: mark TX mailbox as TX_INACTIVE · 1b184fd1
      Marc Kleine-Budde authored
      commit c32fe4ad upstream.
      
      This patch fixes the initialization of the TX mailbox. It is now correctly
      initialized as TX_INACTIVE not RX_EMPTY.
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Johannes Berg's avatar
      nl80211: clear skb cb before passing to netlink · a93db944
      Johannes Berg authored
      commit bd8c78e7 upstream.
      
      In testmode and vendor command reply/event SKBs we use the
      skb cb data to store nl80211 parameters between allocation
      and sending. This causes the code for CONFIG_NETLINK_MMAP
      to get confused, because it takes ownership of the skb cb
      data when the SKB is handed off to netlink, and it doesn't
      explicitly clear it.
      
      Clear the skb cb explicitly when we're done and before it
      gets passed to netlink to avoid this issue.
      Reported-by: Assaf Azulay <assaf.azulay@intel.com>
      Reported-by: David Spinadel <david.spinadel@intel.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Mark's avatar
      USB: storage: Add quirks for Entrega/Xircom USB to SCSI converters · b2d0a271
      Mark authored
      commit c80b4495 upstream.
      
      This patch adds quirks for Entrega Technologies (later Xircom PortGear) USB-
      SCSI converters. They use Shuttle Technology EUSB-01/EUSB-S1 chips. The
      US_FL_SCM_MULT_TARG quirk is needed to allow multiple devices on the SCSI
      chain to be accessed. Without it only the (single) device with SCSI ID 0
      can be used.
      
      The standalone converter sold by Entrega had model number U1-SC25. Xircom
      acquired Entrega and re-branded the product line PortGear. The PortGear USB
      to SCSI Converter (model PGSCSI) is internally identical to the Entrega
      product, but later models may use a different USB ID. The Entrega-branded
      units have USB ID 1645:0007, as does my Xircom PGSCSI, but the Windows and
      Macintosh drivers also support 085A:0028.
      
      Entrega also sold the "Mac USB Dock", which provides two USB ports, a Mac
      (8-pin mini-DIN) serial port and a SCSI port. It appears to the computer as
      a four-port hub, USB-serial, and USB-SCSI converters. The USB-SCSI part may
      have initially used the same ID as the standalone U1-SC25 (1645:0007), but
      later production used 085A:0026.
      
      My Xircom PortGear PGSCSI has bcdDevice=0x0100. Units with bcdDevice=0x0133
      probably also exist.
      
      This patch adds quirks for 1645:0007, 085A:0026 and 085A:0028. The Windows
      driver INF file also mentions 085A:0032 "PortStation SCSI Module", but I
      couldn't find any mention of that actually existing in the wild; perhaps it
      was cancelled before release?
      Signed-off-by: Mark Knibbs <markk@clara.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Mark's avatar
      USB: storage: Add quirk for Ariston Technologies iConnect USB to SCSI adapter · eeaff23a
      Mark authored
      commit b6a3ed67 upstream.
      
      Hi,
      
      The Ariston Technologies iConnect 025 and iConnect 050 (also known as e.g.
      iSCSI-50) are SCSI-USB converters which use Shuttle Technology/SCM
      Microsystems chips. Only the connectors differ; both have the same USB ID.
      The US_FL_SCM_MULT_TARG quirk is required to use SCSI devices with ID other
      than 0.
      
      I don't have one of these, but based on the other entries for Shuttle/
      SCM-based converters this patch is very likely correct. I used 0x0000 and
      0x9999 for bcdDeviceMin and bcdDeviceMax because I'm not sure which
      bcdDevice value the products use.
      Signed-off-by: Mark Knibbs <markk@clara.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Mark's avatar
      USB: storage: Add quirk for Adaptec USBConnect 2000 USB-to-SCSI Adapter · 7d81e603
      Mark authored
      commit 67d365a5 upstream.
      
      The Adaptec USBConnect 2000 is another SCSI-USB converter which uses
      Shuttle Technology/SCM Microsystems chips. The US_FL_SCM_MULT_TARG quirk is
      required to use SCSI devices with ID other than 0.
      
      I don't have a USBConnect 2000, but based on the other entries for Shuttle/
      SCM-based converters this patch is very likely correct. I used 0x0000 and
      0x9999 for bcdDeviceMin and bcdDeviceMax because I'm not sure which
      bcdDevice value the product uses.
      Signed-off-by: Mark Knibbs <markk@clara.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Mike Christie's avatar
      libiscsi: fix potential buffer overrun in __iscsi_conn_send_pdu · 74cb1722
      Mike Christie authored
      commit db9bfd64 upstream.
      
      This patch fixes a potential buffer overrun in __iscsi_conn_send_pdu.
      This function is used by iscsi drivers and userspace to send iscsi PDUs/
      commands. For login commands, we have a set buffer size. For all other
      commands we do not support data buffers.
      
      This was reported by Dan Carpenter here:
      http://www.spinics.net/lists/linux-scsi/msg66838.html
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: James Bottomley <JBottomley@Parallels.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Trond Myklebust's avatar
      NFSv4: Fix another bug in the close/open_downgrade code · 5bd3c047
      Trond Myklebust authored
      commit cd9288ff upstream.
      
      James Drews reports another bug whereby the NFS client is now sending
      an OPEN_DOWNGRADE in a situation where it should really have sent a
      CLOSE: the client is opening the file for O_RDWR, but then trying to
      do a downgrade to O_RDONLY, which is not allowed by the NFSv4 spec.
      Reported-by: James Drews <drews@engr.wisc.edu>
      Link: http://lkml.kernel.org/r/541AD7E5.8020409@engr.wisc.edu
      Fixes: aee7af35 (NFSv4: Fix problems with close in the presence...)
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Joern Engel's avatar
      iscsi-target: avoid NULL pointer in iscsi_copy_param_list failure · 63cb95ba
      Joern Engel authored
      commit 8ae757d0 upstream.
      
      In iscsi_copy_param_list() a failed iscsi_param_list memory allocation
      currently invokes iscsi_release_param_list() to cleanup, and will promptly
      trigger a NULL pointer dereference.
      
      Instead, go ahead and return for the first iscsi_copy_param_list()
      failure case.
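
A minimal userspace sketch of the early-return pattern, assuming hypothetical `param_list`, `release_param_list()` and `copy_param_list()` definitions that only mimic the shape of the code described above. The allocator is passed in as a function pointer solely so that a failing allocation can be simulated.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical list type used only for this illustration. */
struct param_list {
	size_t count;
	int *values;
};

typedef void *(*alloc_fn)(size_t);

/* Cleanup helper that, like many release routines, assumes a valid list. */
static void release_param_list(struct param_list *pl)
{
	assert(pl != NULL); /* passing NULL here is the bug being avoided */
	free(pl->values);
	free(pl);
}

static int copy_param_list(struct param_list **dst,
			   const struct param_list *src, alloc_fn alloc)
{
	struct param_list *pl = alloc(sizeof(*pl));

	if (!pl)
		return -1; /* nothing allocated yet: return, do not release */

	pl->count = src->count;
	pl->values = alloc(src->count * sizeof(*pl->values));
	if (!pl->values) {
		release_param_list(pl); /* safe: pl itself is valid here */
		return -1;
	}
	memcpy(pl->values, src->values, src->count * sizeof(*pl->values));
	*dst = pl;
	return 0;
}

/* Allocator that always fails, to exercise the first failure case. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

Calling `copy_param_list()` with `failing_alloc` returns -1 without touching the release helper, while passing `malloc` produces a copy that the caller later frees with `release_param_list()`.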
      
      Found by coverity.
      Signed-off-by: Joern Engel <joern@logfs.org>
      Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Nicholas Bellinger's avatar
      iscsi-target: Fix memory corruption in iscsit_logout_post_handler_diffcid · 9c0e738c
      Nicholas Bellinger authored
      commit b53b0d99 upstream.
      
      This patch fixes a bug in iscsit_logout_post_handler_diffcid() where
      a pointer used as storage for list_for_each_entry() was incorrectly
      being used to determine if no matching entry had been found.
      
      This patch changes iscsit_logout_post_handler_diffcid() to key off
      bool conn_found to determine if the function needs to exit early.
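
The pitfall can be reproduced outside the kernel with a small sketch. The list helpers below are simplified userspace stand-ins for the kernel macros (they rely on `__typeof__`, a GCC/Clang extension), and `struct conn` with its `cid` field is a hypothetical entry type, not the iscsi-target structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace stand-ins for the kernel's intrusive list helpers. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, member)                          \
	for (pos = container_of((head)->next, __typeof__(*pos), member);   \
	     &pos->member != (head);                                       \
	     pos = container_of(pos->member.next, __typeof__(*pos), member))

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	assert(item != head);
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Hypothetical entry type; only the cid field matters here. */
struct conn {
	int cid;
	struct list_head node;
};

/* Buggy pattern: the cursor is never NULL after the loop, so this check
 * reports a match even when no entry has the requested cid. */
static bool conn_exists_buggy(struct list_head *head, int cid)
{
	struct conn *c = NULL;

	list_for_each_entry(c, head, node)
		if (c->cid == cid)
			break;
	return c != NULL;
}

/* Fixed pattern: record the match explicitly in a flag. */
static bool conn_exists_fixed(struct list_head *head, int cid)
{
	struct conn *c;
	bool conn_found = false;

	list_for_each_entry(c, head, node) {
		if (c->cid == cid) {
			conn_found = true;
			break;
		}
	}
	return conn_found;
}
```

When the loop runs to completion, the cursor ends up pointing at an offset from the list head rather than at a real entry, which is why only the flag is a reliable indicator.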
      Reported-by: Joern Engel <joern@logfs.org>
      Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Al Viro's avatar
      be careful with nd->inode in path_init() and follow_dotdot_rcu() · 1a4ba51a
      Al Viro authored
      commit 4023bfc9 upstream.
      
      in the former we simply check if dentry is still valid after picking
      its ->d_inode; in the latter we fetch ->d_inode in the same places
      where we fetch dentry and its ->d_seq, under the same checks.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Ben Hutchings's avatar
      vfs: Fold follow_mount_rcu() into follow_dotdot_rcu() · a7caf254
      Ben Hutchings authored
      This is needed before commit 4023bfc9 ('be careful with nd->inode
      in path_init() and follow_dotdot_rcu()').  A similar change was made
      upstream as part of commit b37199e6 ('rcuwalk: recheck mount_lock
      after mountpoint crossing attempts').
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Al Viro's avatar
      don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu() · 035cbfd3
      Al Viro authored
      commit 7bd88377 upstream.
      
      return the value instead, and have path_init() do the assignment.  Broken by
      "vfs: Fix absolute RCU path walk failures due to uninitialized seq number",
      which was Cc-stable with 2.6.38+ as destination.  This one should go where
      it went.
      
      To avoid dummy value returned in case when root is already set (it would do
      no harm, actually, since the only caller that doesn't ignore the return value
      is guaranteed to have nd->root *not* set, but it's more obvious that way),
      lift the check into callers.  And do the same to set_root(), to keep them
      in sync.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      [bwh: Backported to 3.2: adjust context, indentation]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Richard Larocque's avatar
      alarmtimer: Lock k_itimer during timer callback · 8601a7ad
      Richard Larocque authored
      commit 474e941b upstream.
      
      Locks the k_itimer's it_lock member when handling the alarm timer's
      expiry callback.
      
      The regular posix timers defined in posix-timers.c have this lock held
      during timeout processing because their callbacks are routed through
      posix_timer_fn().  The alarm timers follow a different path, so they
      ought to grab the lock somewhere else.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Richard Larocque's avatar
      alarmtimer: Do not signal SIGEV_NONE timers · 62bd84fa
      Richard Larocque authored
      commit 265b81d2 upstream.
      
      Avoids sending a signal to alarm timers created with sigev_notify set to
      SIGEV_NONE by checking for that special case in the timeout callback.
      
      The regular posix timers avoid sending signals to SIGEV_NONE timers by
      not scheduling any callbacks for them in the first place.  Although it
      would be possible to do something similar for alarm timers, it's simpler
      to handle this as a special case in the timeout.
      
      Prior to this patch, the alarm timer would ignore the sigev_notify value
      and try to deliver signals to the process anyway.  Even worse, the
      sanity check for the value of sigev_signo is skipped when SIGEV_NONE was
      specified, so the signal number could be bogus.  If sigev_signo was an
      uninitialized value (as it often would be if SIGEV_NONE is used), then
      it's hard to predict which signal will be sent.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Richard Larocque's avatar
      alarmtimer: Return relative times in timer_gettime · a1b01afa
      Richard Larocque authored
      commit e86fea76 upstream.
      
      Returns the time remaining for an alarm timer, rather than the time at
      which it is scheduled to expire.  If the timer has already expired or it
      is not currently scheduled, the it_value's members are set to zero.
      
      This new behavior matches that of the other posix-timers and the POSIX
      specifications.
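
The intended semantics can be summarized in a few lines. This is only a sketch: `remaining_ns()` is a hypothetical helper working on plain nanosecond counts, not the kernel function used by the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: nanoseconds left until expires_ns, or 0 if the
 * timer is not armed or has already expired.
 */
static int64_t remaining_ns(int64_t expires_ns, int64_t now_ns, int armed)
{
	assert(now_ns >= 0);
	if (!armed || expires_ns <= now_ns)
		return 0;
	return expires_ns - now_ns;
}
```

The old behavior corresponds to reporting `expires_ns` itself, which is an absolute time rather than the time remaining.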
      
      This is a change in user-visible behavior, and may break existing
      applications.  Hopefully, few users rely on the old incorrect behavior.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      [jstultz: minor style tweak]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      [bwh: Backported to 3.2: Add definition of alarm_expires_remaining() from
       commit 6cffe00f ('alarmtimer: Add functions for timerfd support')]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • Andrew Hunter's avatar
      jiffies: Fix timeval conversion to jiffies · d8aaaebb
      Andrew Hunter authored
      commit d78c9300 upstream.
      
      timeval_to_jiffies tried to round a timeval up to an integral number
      of jiffies, but the logic for doing so was incorrect: intervals
      corresponding to exactly N jiffies would become N+1. This manifested
      itself particularly when repeatedly stopping/starting an itimer:
      
      setitimer(ITIMER_PROF, &val, NULL);
      setitimer(ITIMER_PROF, NULL, &val);
      
      would add a full tick to val, _even if it was exactly representable in
      terms of jiffies_ (say, the result of a previous rounding.)  Doing
      this repeatedly would cause unbounded growth in val.  So fix the math.
      
      Here's what was wrong with the conversion: we essentially computed
      (eliding seconds)
      
      jiffies = usec  * (NSEC_PER_USEC/TICK_NSEC)
      
      by using scaling arithmetic, which took the best approximation of
      NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
      x/(2^USEC_JIFFIE_SC), and computed:
      
      jiffies = (usec * x) >> USEC_JIFFIE_SC
      
      and rounded this calculation up in the intermediate form (since we
      can't necessarily exactly represent TICK_NSEC in usec.) But the
      scaling arithmetic is a (very slight) *over*approximation of the true
      value; that is, instead of dividing by (1 usec/ 1 jiffie), we
      effectively divided by (1 usec/1 jiffie)-epsilon (rounding
      down). This would normally be fine, but we want to round timeouts up,
      and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
      would be fine if our division was exact, but dividing this by the
      slightly smaller factor was equivalent to adding just _over_ 1 to the
      final result (instead of just _under_ 1, as desired.)
      
      In particular, with HZ=1000, we consistently computed that 10000 usec
      was 11 jiffies; the same was true for any exact multiple of
      TICK_NSEC.
      
      We could possibly still round in the intermediate form, adding
      something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
      convert usec->nsec, round in nanoseconds, and then convert using
      time*spec*_to_jiffies.  This adds one constant multiplication, and is
      not observably slower in microbenchmarks on recent x86 hardware.
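
The arithmetic can be checked with a simplified model. Everything below is an assumption made for illustration: TICK_NSEC is taken as exactly 1000000 (HZ=1000), the shift widths 40 and 50 are example values, seconds are elided, and the function names are hypothetical. It is not the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model parameters, not copied from any kernel configuration. */
#define TICK_NSEC     1000000ULL /* one jiffy at HZ=1000, assumed exact */
#define NSEC_PER_USEC 1000ULL
#define USEC_SHIFT    40         /* stands in for USEC_JIFFIE_SC */
#define NSEC_SHIFT    50         /* assumed shift for the nsec path */

/* Smallest x such that x / 2^shift >= num / TICK_NSEC (rounded up). */
static uint64_t scaled_factor(uint64_t num, unsigned int shift)
{
	return ((num << shift) + TICK_NSEC - 1) / TICK_NSEC;
}

/* Old scheme: round up in the scaled intermediate form. */
static uint64_t usec_to_jiffies_old(uint64_t usec)
{
	uint64_t x = scaled_factor(NSEC_PER_USEC, USEC_SHIFT);
	uint64_t round = (1ULL << USEC_SHIFT) - 1;

	assert(usec < 1000000); /* sub-second part only */
	return (usec * x + round) >> USEC_SHIFT;
}

/* New scheme: round up in nanoseconds, then apply the scaled conversion. */
static uint64_t usec_to_jiffies_new(uint64_t usec)
{
	uint64_t x = scaled_factor(1, NSEC_SHIFT);
	uint64_t nsec = usec * NSEC_PER_USEC + TICK_NSEC - 1;

	assert(usec < 1000000); /* sub-second part only */
	return (nsec * x) >> NSEC_SHIFT;
}
```

Under these assumptions 10000 usec maps to 11 jiffies in the old scheme and to 10 in the new one, while 10001 usec still rounds up to 11 in the new scheme.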
      
      Tested: the following program:
      
      #include <stdio.h>
      #include <sys/time.h>
      
      int main(void) {
        struct itimerval zero = {{0, 0}, {0, 0}};
        /* Initially set to 10 ms. */
        struct itimerval initial = zero;
        initial.it_interval.tv_usec = 10000;
        setitimer(ITIMER_PROF, &initial, NULL);
        /* Save and restore several times. */
        for (size_t i = 0; i < 10; ++i) {
          struct itimerval prev;
          setitimer(ITIMER_PROF, &zero, &prev);
          /* on old kernels, this goes up by TICK_USEC every iteration */
          printf("previous value: %ld %ld %ld %ld\n",
                 (long)prev.it_interval.tv_sec, (long)prev.it_interval.tv_usec,
                 (long)prev.it_value.tv_sec, (long)prev.it_value.tv_usec);
          setitimer(ITIMER_PROF, &prev, NULL);
        }
        return 0;
      }
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Reviewed-by: Paul Turner <pjt@google.com>
      Reported-by: Aaron Jacobs <jacobsa@google.com>
      Signed-off-by: Andrew Hunter <ahh@google.com>
      [jstultz: Tweaked to apply to 3.17-rc]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      [bwh: Backported to 3.2: adjust filename]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>