Commits · 35782b233f37e48ecc469d9c7232f3f6a7fad41a · Kirill Smelkov / linux

23 Nov, 2016 24 commits

f2fs: remove percpu_count due to performance regression · 35782b23

Jaegeuk Kim authored Oct 20, 2016

This patch removes percpu_count usage due to performance regression in iozone.

Fixes: 523be8a6 ("f2fs: use percpu_counter for page counters")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

35782b23

f2fs: make clean inodes when flushing inode page · 18340edc

Jaegeuk Kim authored Oct 19, 2016

This patch tries to make more clean inodes when flushing dirty inodes in
checkpoint.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

18340edc

f2fs: keep dirty inodes selectively for checkpoint · 7c45729a

Jaegeuk Kim authored Oct 14, 2016

This is to avoid no free segment bug during checkpoint caused by a number of
dirty inodes.

The case was reported by Chao like this.
1. mount with lazytime option
2. fill 4k file until disk is full
3. sync filesystem
4. read all files in the image
5. umount

In this case, we actually don't need to flush dirty inode to inode page during
checkpoint.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

7c45729a

f2fs: use BIO_MAX_PAGES for bio allocation · 664ba972

Jaegeuk Kim authored Oct 18, 2016

We don't need to allocate bio partially in order to maximize sequential writes.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

664ba972

f2fs: declare static function for __build_free_nids · 3e7b5bbb
Jaegeuk Kim authored Oct 17, 2016
```
This patch avoids build warning.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
```
3e7b5bbb

f2fs: call f2fs_balance_fs for setattr · 15d04354

Jaegeuk Kim authored Oct 14, 2016

If inode becomes dirty, we need to check the # of dirty inodes whether or not
further checkpoint would be required.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

15d04354

f2fs: count dirty inodes to flush node pages during checkpoint · b9610bdf

Jaegeuk Kim authored Oct 14, 2016

If there are a lot of dirty inodes, we need to flush all of them when doing
checkpoint. So, we need to count this for enough free space.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

b9610bdf

f2fs: avoid casted negative value as shrink count · 02110a4f

Chao Yu authored Oct 11, 2016

This patch makes sure it returns a positive value instead of a probable
casted negative value as shrink count.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

02110a4f

f2fs: don't interrupt free nids building during nid allocation · 3a2ad567

Chao Yu authored Oct 11, 2016

Let build_free_nids support sync/async methods, in allocation flow of nids,
we use synchronuous method, so that we can avoid looping in alloc_nid when
free memory is low; in unblock_operations and f2fs_balance_fs_bg we use
asynchronuous method in where low memory condition can interrupt us.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

3a2ad567

f2fs: clean up free nid list operations · eb0aa4b8

Jaegeuk Kim authored Oct 12, 2016

This patch cleans up to use consistent free nid list ops.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

eb0aa4b8

f2fs: split free nid list · b8559dc2

Chao Yu authored Oct 12, 2016

During free nid allocation, in order to do preallocation, we will tag free
nid entry as allocated one and still leave it in free nid list, for other
allocators who want to grab free nids, it needs to traverse the free nid
list for lookup. It becomes overhead in scenario of allocating free nid
intensively by multithreads.

This patch splits free nid list to two list: {free,alloc}_nid_list, to
keep free nids and preallocated free nids separately, after that, traverse
latency will be gone, besides split nid_cnt for separate statistic.

Additionally, introduce __insert_nid_to_list and __remove_nid_from_list for
cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: modify f2fs_bug_on to avoid needless branches]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

b8559dc2

f2fs: clear nlink if fail to add_link · a11b9f65

Chao Yu authored Oct 11, 2016

We don't need to keep incomplete created inode in cache, so if we fail to
add link into directory during new inode creation, it's better to set
nlink of inode to zero, then we can evict inode immediately. Otherwise
release of nid belong to inode will be delayed until inode cache is being
shrunk, it may cause a seemingly endless loop while allocating free nids
in time of testing generic/269 case of fstest suit.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: add update_inode_page to fix kernel panic]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

a11b9f65

f2fs: fix sparse warnings · 0c0b471e

Eric Biggers authored Oct 11, 2016

f2fs contained a number of endianness conversion bugs.

Also, one function should have been 'static'.

Found with sparse by running 'make C=2 CF=-D__CHECK_ENDIAN__ fs/f2fs/'
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

0c0b471e

f2fs: fix error handling in fsync_node_pages · 9de69279

Chao Yu authored Oct 11, 2016

In fsync_node_pages, if f2fs was taged with CP_ERROR_FLAG, make sure bio
cache was flushed before return.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

9de69279

f2fs: fix to update largest extent under lock · b691d98f

Chao Yu authored Oct 11, 2016

In order to avoid racing problem, make largest extent cache being updated
under lock.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

b691d98f

f2fs: be aware of extent beyond EOF in fiemap · 58736fa6

Chao Yu authored Oct 11, 2016

f2fs can support fallocating blocks beyond file size without changing the
size, but ->fiemap of f2fs was restricted and can't detect these extents
fallocated past EOF, now relieve the restriction.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

58736fa6

f2fs: don't miss any f2fs_balance_fs cases · 6f2d8ed6

Chao Yu authored Oct 11, 2016

In f2fs_map_blocks, let f2fs_balance_fs detects node page modification
with dn.node_changed to avoid miss some corner cases.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

6f2d8ed6

f2fs: add missing f2fs_balance_fs in f2fs_zero_range · 9434fcde

Chao Yu authored Oct 11, 2016

f2fs_balance_fs should be called in between node page updating, otherwise
node page count will exceeded far beyond watermark of triggering
foreground garbage collection, result in facing high risk of hitting LFS
allocation failure.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

9434fcde

f2fs: give a chance to detach from dirty list · 933439c8

Chao Yu authored Oct 11, 2016

If there is no dirty pages in inode, we should give a chance to detach
the inode from global dirty list, otherwise it needs to call another
unnecessary .writepages for detaching.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

933439c8

f2fs: fix to release discard entries during checkpoint · 2dd15654

Chao Yu authored Oct 11, 2016

In f2fs_fill_super, if there is any IO error occurs during recovery,
cached discard entries will be leaked, in order to avoid this, make
write_checkpoint() handle memory release by itself, besides, move
clear_prefree_segments to write_checkpoint for readability.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

2dd15654

f2fs: exclude free nids building and allocation · 2411cf5b

Chao Yu authored Oct 11, 2016

During nid allocation, it needs to exclude building and allocating flow
of free nids, this is because while building free nid cache, there are two
steps: a) load free nids from unused nat entries in NAT pages, b) update
free nid cache by checking nat journal. The two steps should be atomical,
otherwise an used nid can be allocated as free one after a) and before b).

This patch adds missing lock which covers build_free_nids in
unlock_operation and f2fs_balance_fs_bg to avoid that.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

2411cf5b

f2fs: fix overflow due to condition check order · e87f7329

Jaegeuk Kim authored Nov 23, 2016

In the last ilen case, i was already increased, resulting in accessing out-
of-boundary entry of do_replace and blkaddr.
Fix to check ilen first to exit the loop.

Fixes: 2aa8fbb9693020 ("f2fs: refactor __exchange_data_block for speed up")
Cc: stable@vger.kernel.org # 4.8+
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

e87f7329

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ded9b5dd

Linus Torvalds authored Nov 23, 2016

Pull perf fixes from Ingo Molnar:
 "Six fixes for bugs that were found via fuzzing, and a trivial
  hw-enablement patch for AMD Family-17h CPU PMUs"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Allow only a single PMU/box within an events group
  perf/x86/intel: Cure bogus unwind from PEBS entries
  perf/x86: Restore TASK_SIZE check on frame pointer
  perf/core: Fix address filter parser
  perf/x86: Add perf support for AMD family-17h processors
  perf/x86/uncore: Fix crash by removing bogus event_list[] handling for SNB client uncore IMC
  perf/core: Do not set cpuctx->cgrp for unscheduled cgroups

ded9b5dd

Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · 23aabe73

Linus Torvalds authored Nov 23, 2016

Pull crypto fixes from Herbert Xu:
 "The last push broke algif_hash for all shash implementations, so this
  is a follow-up to fix that.

  This also fixes a problem in the crypto scatterwalk that triggers a
  BUG_ON with certain debugging options due to the new vmalloced-stack
  code"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: scatterwalk - Remove unnecessary aliasing check in map_and_copy
  crypto: algif_hash - Fix result clobbering in recvmsg

23aabe73

22 Nov, 2016 12 commits

Merge branch 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux · 23400ac9

Linus Torvalds authored Nov 22, 2016

Pull thermal management fix from Zhang Rui:
 "We only have one urgent fix this time.

  Commit 3105f234 ("thermal/powerclamp: correct cpu support check"),
  which is shipped in 4.9-rc3, fixed a problem introduced by commit
  b721ca0d ("thermal/powerclamp: remove cpu whitelist").

  But unfortunately, it broke intel_powerclamp driver module auto-
  loading at the same time. Thus we need this change to add back module
  auto-loading for 4.9"

* 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
  thermal/powerclamp: add back module device table

23400ac9

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · b66c08ba

Linus Torvalds authored Nov 22, 2016

Pull SCSI fixes from James Bottomley:
 "Two small fixes.

  One prevents timeouts on mpt3sas when trying to use the secure erase
  protocol which causes the erase protocol to be aborted. The second is
  a regression in a prior fix which causes all commands to abort during
  PCI extended error recovery, which is incorrect because PCI EEH is
  independent from what's happening on the FC transport"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: qla2xxx: do not abort all commands in the adapter during EEH recovery
  scsi: mpt3sas: Fix secure erase premature termination

b66c08ba

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux · 57527ed1

Linus Torvalds authored Nov 22, 2016

Pull clk fixes from Stephen Boyd:
 "A handful of driver fixes.

  The sunxi fixes are for an incorrect clk tree configuration and a bad
  frequency calculation. The other two are fixes for passing the wrong
  pointer in drivers recently converted to clk_hw style registration"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: efm32gg: Pass correct type to hw provider registration
  clk: berlin: Pass correct type to hw provider registration
  clk: sunxi: Fix M factor computation for APB1
  clk: sunxi-ng: sun6i-a31: Force AHB1 clock to use PLL6 as parent

57527ed1

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 000b8949

Linus Torvalds authored Nov 22, 2016

Pull scheduler fixes from Ingo Molnar:
 "Two fixes for autogroup scheduling, for races when turning the feature
  on/off via /proc/sys/kernel/sched_autogroup_enabled"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/autogroup: Do not use autogroup->tg in zombie threads
  sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task()

000b8949

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7cfc4317

Linus Torvalds authored Nov 22, 2016

Pull x86 fixes from Ingo Molnar:
 "Misc fixes:
   - two fixes to make (very) old Intel CPUs boot reliably
   - fix the intel-mid driver and rename it
   - two KASAN false positive fixes
   - an FPU fix
   - two sysfb fixes
   - two build fixes related to new toolchain versions"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/platform/intel-mid: Rename platform_wdt to platform_mrfld_wdt
  x86/build: Build compressed x86 kernels as PIE when !CONFIG_RELOCATABLE as well
  x86/platform/intel-mid: Register watchdog device after SCU
  x86/fpu: Fix invalid FPU ptrace state after execve()
  x86/boot: Fail the boot if !M486 and CPUID is missing
  x86/traps: Ignore high word of regs->cs in early_fixup_exception()
  x86/dumpstack: Prevent KASAN false positive warnings
  x86/unwind: Prevent KASAN false positive warnings in guess unwinder
  x86/boot: Avoid warning for zero-filling .bss
  x86/sysfb: Fix lfb_size calculation
  x86/sysfb: Add support for 64bit EFI lfb_base

7cfc4317

perf/x86/intel/uncore: Allow only a single PMU/box within an events group · 033ac60c

Peter Zijlstra authored Nov 18, 2016

Group validation expects all events to be of the same PMU; however
is_uncore_pmu() is too wide, it matches _all_ uncore events, even
across PMUs.

This triggers failure when we group different events from different
uncore PMUs, like:

  perf stat -vv -e '{uncore_cbox_0/config=0x0334/,uncore_qpi_0/event=1/}' -a sleep 1

Fix is_uncore_pmu() by only matching events to the box at hand.

Note that generic code; ran after this step; will disallow this
mixture of PMU events.
Reported-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20161118125354.GQ3117@twins.programming.kicks-ass.netSigned-off-by: Ingo Molnar <mingo@kernel.org>

033ac60c

perf/x86/intel: Cure bogus unwind from PEBS entries · b8000586

Peter Zijlstra authored Nov 17, 2016

Vince Weaver reported that perf_fuzzer + KASAN detects that PEBS event
unwinds sometimes do 'weird' things. In particular, we seemed to be
ending up unwinding from random places on the NMI stack.

While it was somewhat expected that the event record BP,SP would not
match the interrupt BP,SP in that the interrupt is strictly later than
the record event, it was overlooked that it could be on an already
overwritten stack.

Therefore, don't copy the recorded BP,SP over the interrupted BP,SP
when we need stack unwinds.

Note that its still possible the unwind doesn't full match the actual
event, as its entirely possible to have done an (I)RET between record
and interrupt, but on average it should still point in the general
direction of where the event came from. Also, it's the best we can do,
considering.

The particular scenario that triggered the bogus NMI stack unwind was
a PEBS event with very short period, upon enabling the event at the
tail of the PMI handler (FREEZE_ON_PMI is not used), it instantly
triggers a record (while still on the NMI stack) which in turn
triggers the next PMI. This then causes back-to-back NMIs and we'll
try and unwind the stack-frame from the last NMI, which obviously is
now overwritten by our own.
Analyzed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@gmail.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davej@codemonkey.org.uk <davej@codemonkey.org.uk>
Cc: dvyukov@google.com <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: ca037701 ("perf, x86: Add PEBS infrastructure")
Link: http://lkml.kernel.org/r/20161117171731.GV3157@twins.programming.kicks-ass.netSigned-off-by: Ingo Molnar <mingo@kernel.org>

b8000586

perf/x86: Restore TASK_SIZE check on frame pointer · ae31fe51

Johannes Weiner authored Nov 22, 2016

The following commit:

  75925e1a ("perf/x86: Optimize stack walk user accesses")

... switched from copy_from_user_nmi() to __copy_from_user_nmi() with a manual
access_ok() check.

Unfortunately, copy_from_user_nmi() does an explicit check against TASK_SIZE,
whereas the access_ok() uses whatever the current address limit of the task is.

We are getting NMIs when __probe_kernel_read() has switched to KERNEL_DS, and
then see vmalloc faults when we access what looks like pointers into vmalloc
space:

  [] WARNING: CPU: 3 PID: 3685731 at arch/x86/mm/fault.c:435 vmalloc_fault+0x289/0x290
  [] CPU: 3 PID: 3685731 Comm: sh Tainted: G        W       4.6.0-5_fbk1_223_gdbf0f40 #1
  [] Call Trace:
  []  <NMI>  [<ffffffff814717d1>] dump_stack+0x4d/0x6c
  []  [<ffffffff81076e43>] __warn+0xd3/0xf0
  []  [<ffffffff81076f2d>] warn_slowpath_null+0x1d/0x20
  []  [<ffffffff8104a899>] vmalloc_fault+0x289/0x290
  []  [<ffffffff8104b5a0>] __do_page_fault+0x330/0x490
  []  [<ffffffff8104b70c>] do_page_fault+0xc/0x10
  []  [<ffffffff81794e82>] page_fault+0x22/0x30
  []  [<ffffffff81006280>] ? perf_callchain_user+0x100/0x2a0
  []  [<ffffffff8115124f>] get_perf_callchain+0x17f/0x190
  []  [<ffffffff811512c7>] perf_callchain+0x67/0x80
  []  [<ffffffff8114e750>] perf_prepare_sample+0x2a0/0x370
  []  [<ffffffff8114e840>] perf_event_output+0x20/0x60
  []  [<ffffffff8114aee7>] ? perf_event_update_userpage+0xc7/0x130
  []  [<ffffffff8114ea01>] __perf_event_overflow+0x181/0x1d0
  []  [<ffffffff8114f484>] perf_event_overflow+0x14/0x20
  []  [<ffffffff8100a6e3>] intel_pmu_handle_irq+0x1d3/0x490
  []  [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10
  []  [<ffffffff81197191>] ? vunmap_page_range+0x1a1/0x2f0
  []  [<ffffffff811972f1>] ? unmap_kernel_range_noflush+0x11/0x20
  []  [<ffffffff814f2056>] ? ghes_copy_tofrom_phys+0x116/0x1f0
  []  [<ffffffff81040d1d>] ? x2apic_send_IPI_self+0x1d/0x20
  []  [<ffffffff8100411d>] perf_event_nmi_handler+0x2d/0x50
  []  [<ffffffff8101ea31>] nmi_handle+0x61/0x110
  []  [<ffffffff8101ef94>] default_do_nmi+0x44/0x110
  []  [<ffffffff8101f13b>] do_nmi+0xdb/0x150
  []  [<ffffffff81795187>] end_repeat_nmi+0x1a/0x1e
  []  [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10
  []  [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10
  []  [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10
  []  <<EOE>>  <IRQ>  [<ffffffff8115d05e>] ? __probe_kernel_read+0x3e/0xa0

Fix this by moving the valid_user_frame() check to before the uaccess
that loads the return address and the pointer to the next frame.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Fixes: 75925e1a ("perf/x86: Optimize stack walk user accesses")
Signed-off-by: Ingo Molnar <mingo@kernel.org>

ae31fe51

sched/autogroup: Do not use autogroup->tg in zombie threads · 8e5bfa8c

Oleg Nesterov authored Nov 14, 2016

Exactly because for_each_thread() in autogroup_move_group() can't see it
and update its ->sched_task_group before _put() and possibly free().

So the exiting task needs another sched_move_task() before exit_notify()
and we need to re-introduce the PF_EXITING (or similar) check removed by
the previous change for another reason.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hartsjc@redhat.com
Cc: vbendel@redhat.com
Cc: vlovejoy@redhat.com
Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

8e5bfa8c

sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task() · 18f649ef

Oleg Nesterov authored Nov 14, 2016

The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove
it, but see the next patch.

However the comment is correct in that autogroup_move_group() must always
change task_group() for every thread so the sysctl_ check is very wrong;
we can race with cgroups and even sys_setsid() is not safe because a task
running with task_group() == ag->tg must participate in refcounting:

	int main(void)
	{
		int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY);

		assert(sctl > 0);
		if (fork()) {
			wait(NULL); // destroy the child's ag/tg
			pause();
		}

		assert(pwrite(sctl, "1\n", 2, 0) == 2);
		assert(setsid() > 0);
		if (fork())
			pause();

		kill(getppid(), SIGKILL);
		sleep(1);

		// The child has gone, the grandchild runs with kref == 1
		assert(pwrite(sctl, "0\n", 2, 0) == 2);
		assert(setsid() > 0);

		// runs with the freed ag/tg
		for (;;)
			sleep(1);

		return 0;
	}

crashes the kernel. It doesn't really need sleep(1), it doesn't matter if
autogroup_move_group() actually frees the task_group or this happens later.
Reported-by: Vern Lovejoy <vlovejoy@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hartsjc@redhat.com
Cc: vbendel@redhat.com
Link: http://lkml.kernel.org/r/20161114184609.GA15965@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

18f649ef

crypto: scatterwalk - Remove unnecessary aliasing check in map_and_copy · c8467f7a

Herbert Xu authored Nov 21, 2016

The aliasing check in map_and_copy is no longer necessary because
the IPsec ESP code no longer provides an IV that points into the
actual request data.  As this check is now triggering BUG checks
due to the vmalloced stack code, I'm removing it.
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

c8467f7a

crypto: algif_hash - Fix result clobbering in recvmsg · 8acf7a10

Herbert Xu authored Nov 21, 2016

Recently an init call was added to hash_recvmsg so as to reset
the hash state in case a sendmsg call was never made.

Unfortunately this ended up clobbering the result if the previous
sendmsg was done with a MSG_MORE flag.  This patch fixes it by
excluding that case when we make the init call.

Fixes: a8348bca ("algif_hash - Fix NULL hash crash with shash")
Reported-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

8acf7a10

21 Nov, 2016 4 commits

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security · 3b404a51

Linus Torvalds authored Nov 21, 2016

Pull apparmor bugfix from James Morris:
 "This has a fix for a policy replacement bug that is fairly serious for
  apache mod_apparmor users, as it results in the wrong policy being
  applied on an network facing service"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  apparmor: fix change_hat not finding hat after policy replacement

3b404a51

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · 8d1a2408

Linus Torvalds authored Nov 21, 2016

Pull sparc fixes from David Miller:

 1) With modern networking cards we can run out of 32-bit DMA space, so
    support 64-bit DMA addressing when possible on sparc64. From Dave
    Tushar.

 2) Some signal frame validation checks are inverted on sparc32, fix
    from Andreas Larsson.

 3) Lockdep tables can get too large in some circumstances on sparc64,
    add a way to adjust the size a bit. From Babu Moger.

 4) Fix NUMA node probing on some sun4v systems, from Thomas Tai.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc: drop duplicate header scatterlist.h
  lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined
  config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc
  sunbmac: Fix compiler warning
  sunqe: Fix compiler warnings
  sparc64: Enable 64-bit DMA
  sparc64: Enable sun4v dma ops to use IOMMU v2 APIs
  sparc64: Bind PCIe devices to use IOMMU v2 service
  sparc64: Initialize iommu_map_table and iommu_pool
  sparc64: Add ATU (new IOMMU) support
  sparc64: Add FORCE_MAX_ZONEORDER and default to 13
  sparc64: fix compile warning section mismatch in find_node()
  sparc32: Fix inverted invalid_frame_pointer checks on sigreturns
  sparc64: Fix find_node warning if numa node cannot be found

8d1a2408

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 27e7ab99

Linus Torvalds authored Nov 21, 2016

Pull networking fixes from David Miller:

 1) Clear congestion control state when changing algorithms on an
    existing socket, from Florian Westphal.

 2) Fix register bit values in altr_tse_pcs portion of stmmac driver,
    from Jia Jie Ho.

 3) Fix PTP handling in stammc driver for GMAC4, from Giuseppe
    CAVALLARO.

 4) Fix udplite multicast delivery handling, it ignores the udp_table
    parameter passed into the lookups, from Pablo Neira Ayuso.

 5) Synchronize the space estimated by rtnl_vfinfo_size and the space
    actually used by rtnl_fill_vfinfo. From Sabrina Dubroca.

 6) Fix memory leak in fib_info when splitting nodes, from Alexander
    Duyck.

 7) If a driver does a napi_hash_del() explicitily and not via
    netif_napi_del(), it must perform RCU synchronization as needed. Fix
    this in virtio-net and bnxt drivers, from Eric Dumazet.

 8) Likewise, it is not necessary to invoke napi_hash_del() is we are
    also doing neif_napi_del() in the same code path. Remove such calls
    from be2net and cxgb4 drivers, also from Eric Dumazet.

 9) Don't allocate an ID in peernet2id_alloc() if the netns is dead,
    from WANG Cong.

10) Fix OF node and device struct leaks in of_mdio, from Johan Hovold.

11) We cannot cache routes in ip6_tunnel when using inherited traffic
    classes, from Paolo Abeni.

12) Fix several crashes and leaks in cpsw driver, from Johan Hovold.

13) Splice operations cannot use freezable blocking calls in AF_UNIX,
    from WANG Cong.

14) Link dump filtering by master device and kind support added an error
    in loop index updates during the dump if we actually do filter, fix
    from Zhang Shengju.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
  tcp: zero ca_priv area when switching cc algorithms
  net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit
  ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoC
  tipc: eliminate obsolete socket locking policy description
  rtnl: fix the loop index update error in rtnl_dump_ifinfo()
  l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
  net: macb: add check for dma mapping error in start_xmit()
  rtnetlink: fix FDB size computation
  netns: fix get_net_ns_by_fd(int pid) typo
  af_unix: conditionally use freezable blocking calls in read
  net: ethernet: ti: cpsw: fix fixed-link phy probe deferral
  net: ethernet: ti: cpsw: add missing sanity check
  net: ethernet: ti: cpsw: fix secondary-emac probe error path
  net: ethernet: ti: cpsw: fix of_node and phydev leaks
  net: ethernet: ti: cpsw: fix deferred probe
  net: ethernet: ti: cpsw: fix mdio device reference leak
  net: ethernet: ti: cpsw: fix bad register access in probe error path
  net: sky2: Fix shutdown crash
  cfg80211: limit scan results cache size
  net sched filters: pass netlink message flags in event notification
  ...

27e7ab99

tcp: zero ca_priv area when switching cc algorithms · 7082c5c3

Florian Westphal authored Nov 21, 2016

We need to zero out the private data area when application switches
connection to different algorithm (TCP_CONGESTION setsockopt).

When congestion ops get assigned at connect time everything is already
zeroed because sk_alloc uses GFP_ZERO flag.  But in the setsockopt case
this contains whatever previous cc placed there.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

7082c5c3