1. 23 Jan, 2019 26 commits
    • scsi: sd: Fix cache_type_store() · 24c99a92
      Ivan Mironov authored
      commit 44759979 upstream.
      
      Changing the caching mode via /sys/devices/.../scsi_disk/.../cache_type may
      fail if the device responds to the MODE SENSE command with the DPOFUA flag
      set and then requires this flag to be clear in the MODE SELECT command.
      
      In this scenario, when trying to change cache_type, write always fails:
      
      	# echo "none" >cache_type
      	bash: echo: write error: Invalid argument
      
      And the following appears in dmesg:
      
      	[13007.865745] sd 1:0:1:0: [sda] Sense Key : Illegal Request [current]
      	[13007.865753] sd 1:0:1:0: [sda] Add. Sense: Invalid field in parameter list
      
      From SBC-4 r15, 6.5.1 "Mode pages overview", description of DEVICE-SPECIFIC
      PARAMETER field in the mode parameter header:
      	...
      	The write protect (WP) bit for mode data sent with a MODE SELECT
      	command shall be ignored by the device server.
      	...
      	The DPOFUA bit is reserved for mode data sent with a MODE SELECT
      	command.
      	...
      
      The remaining bits in the DEVICE-SPECIFIC PARAMETER byte are also reserved
      and shall be set to zero.
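
      A minimal sketch of the idea, not the actual sd.c change: before mode data
      read with MODE SENSE is sent back with MODE SELECT, the DEVICE-SPECIFIC
      PARAMETER byte of the mode parameter header is cleared. The helper name
      mode_select_sanitize_header() and its raw-buffer interface are hypothetical;
      the byte offsets (2 in the 6-byte header, 3 in the 10-byte header) are taken
      from the standard mode parameter header layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODE_HDR6_LEN   4	/* mode parameter header(6) is 4 bytes  */
#define MODE_HDR10_LEN  8	/* mode parameter header(10) is 8 bytes */

/*
 * Hypothetical helper: zero the DEVICE-SPECIFIC PARAMETER byte (which
 * carries WP and DPOFUA in MODE SENSE data) so that no reserved bit is
 * set when the buffer is reused for MODE SELECT.
 * Returns 0 on success, -1 if the buffer cannot hold the header.
 */
int mode_select_sanitize_header(uint8_t *buf, size_t len, int ten_byte)
{
	size_t hdr_len = ten_byte ? MODE_HDR10_LEN : MODE_HDR6_LEN;
	size_t dsp_off = ten_byte ? 3 : 2;

	assert(dsp_off < hdr_len);
	if (buf == NULL || len < hdr_len)
		return -1;

	buf[dsp_off] = 0;
	return 0;
}
```

      A caller would run this on the buffer returned by MODE SENSE, after editing
      the caching mode page and before issuing MODE SELECT.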
      
      [mkp: shuffled commentary to commit description]
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Ivan Mironov <mironov.ivan@gmail.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • scsi: core: Synchronize request queue PM status only on successful resume · caae28b3
      Stanley Chu authored
      commit 3f7e62bb upstream.
      
      Commit 356fd266 ("scsi: Set request queue runtime PM status back to
      active on resume") fixed the inconsistent RPM status between the request
      queue and the device. However, the request queue RPM status should be
      changed only on a successful resume; otherwise the status may still be
      inconsistent, as below:
      
      Request queue: RPM_ACTIVE
      Device: RPM_SUSPENDED
      
      This ends in a soft lockup because requests can be submitted to underlying
      devices while those devices and their required resources are not resumed.
      
      For example,
      
      Once the inconsistent status above occurs, an I/O request can be submitted
      to the UFS device driver while a required resource (such as a clock) is not
      yet resumed, which leads to a warning with the call stack below:
      
      WARN_ON(hba->clk_gating.state != CLKS_ON);
      ufshcd_queuecommand
      scsi_dispatch_cmd
      scsi_request_fn
      __blk_run_queue
      cfq_insert_request
      __elv_add_request
      blk_flush_plug_list
      blk_finish_plug
      jbd2_journal_commit_transaction
      kjournald2
      
      All subsequent I/O requests may then hang because the storage host or
      device does not respond, and a soft lockup follows. In the end, the system
      may crash in many ways.
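
      The rule can be illustrated with a small userspace model (a sketch only;
      struct model_dev, model_resume() and the resume_err field are hypothetical
      and do not correspond to kernel APIs). The queue status is moved to
      RPM_ACTIVE only when the simulated resume callback succeeds, so the two
      statuses never diverge:

```c
#include <assert.h>
#include <stddef.h>

enum rpm_status { RPM_ACTIVE, RPM_SUSPENDED };

/* Hypothetical model of a device and its request queue. */
struct model_dev {
	enum rpm_status dev_status;
	enum rpm_status queue_status;
	int resume_err;		/* simulated result of the low-level resume */
};

/* Resume the device; touch the queue status only if resume succeeded. */
int model_resume(struct model_dev *d)
{
	int err;

	assert(d != NULL);
	err = d->resume_err;
	if (err == 0) {
		d->dev_status = RPM_ACTIVE;
		d->queue_status = RPM_ACTIVE;
	}
	/* On failure both statuses are left untouched, hence consistent. */
	return err;
}
```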
      
      Fixes: 356fd266 (scsi: Set request queue runtime PM status back to active on resume)
      Cc: stable@vger.kernel.org
      Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Yama: Check for pid death before checking ancestry · 41c13bfc
      Kees Cook authored
      commit 9474f4e7 upstream.
      
      It's possible that a pid has died before we take the rcu lock, in which
      case we can't walk the ancestry list as it may be detached. Instead, check
      for death first before doing the walk.
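
      As a rough userspace model of the ordering described above (struct
      model_task and model_is_descendant() are hypothetical and ignore RCU and
      thread groups entirely), the liveness check comes before any parent
      pointer is followed:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task with a parent link. */
struct model_task {
	int alive;			/* 0 once the task has died */
	struct model_task *parent;	/* NULL at the top of the tree */
};

/* Return 1 if 'ancestor' is 'task' or one of its ancestors, else 0. */
int model_is_descendant(const struct model_task *ancestor,
			const struct model_task *task)
{
	const struct model_task *walker;

	assert(ancestor != NULL && task != NULL);

	/* A dead task may be detached: do not walk its ancestry at all. */
	if (!task->alive)
		return 0;

	for (walker = task; walker != NULL; walker = walker->parent)
		if (walker == ancestor)
			return 1;
	return 0;
}
```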
      
      Reported-by: syzbot+a9ac39bf55329e206219@syzkaller.appspotmail.com
      Fixes: 2d514487 ("security: Yama LSM")
      Cc: stable@vger.kernel.org
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: James Morris <james.morris@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: wait on ordered extents on abort cleanup · f97fd292
      Josef Bacik authored
      commit 74d5d229 upstream.
      
      If we flip read-only before we initiate writeback on all dirty pages for
      ordered extents we've created then we'll have ordered extents left over
      on umount, which results in all sorts of bad things happening.  Fix this
      by making sure we wait on ordered extents if we have to do the aborted
      transaction cleanup stuff.
      
      generic/475 can produce this warning:
      
       [ 8531.177332] WARNING: CPU: 2 PID: 11997 at fs/btrfs/disk-io.c:3856 btrfs_free_fs_root+0x95/0xa0 [btrfs]
       [ 8531.183282] CPU: 2 PID: 11997 Comm: umount Tainted: G        W 5.0.0-rc1-default+ #394
       [ 8531.185164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
       [ 8531.187851] RIP: 0010:btrfs_free_fs_root+0x95/0xa0 [btrfs]
       [ 8531.193082] RSP: 0018:ffffb1ab86163d98 EFLAGS: 00010286
       [ 8531.194198] RAX: ffff9f3449494d18 RBX: ffff9f34a2695000 RCX:0000000000000000
       [ 8531.195629] RDX: 0000000000000002 RSI: 0000000000000001 RDI:0000000000000000
       [ 8531.197315] RBP: ffff9f344e930000 R08: 0000000000000001 R09:0000000000000000
       [ 8531.199095] R10: 0000000000000000 R11: ffff9f34494d4ff8 R12:ffffb1ab86163dc0
       [ 8531.200870] R13: ffff9f344e9300b0 R14: ffffb1ab86163db8 R15:0000000000000000
       [ 8531.202707] FS:  00007fc68e949fc0(0000) GS:ffff9f34bd800000(0000)knlGS:0000000000000000
       [ 8531.204851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [ 8531.205942] CR2: 00007ffde8114dd8 CR3: 000000002dfbd000 CR4:00000000000006e0
       [ 8531.207516] Call Trace:
       [ 8531.208175]  btrfs_free_fs_roots+0xdb/0x170 [btrfs]
       [ 8531.210209]  ? wait_for_completion+0x5b/0x190
       [ 8531.211303]  close_ctree+0x157/0x350 [btrfs]
       [ 8531.212412]  generic_shutdown_super+0x64/0x100
       [ 8531.213485]  kill_anon_super+0x14/0x30
       [ 8531.214430]  btrfs_kill_super+0x12/0xa0 [btrfs]
       [ 8531.215539]  deactivate_locked_super+0x29/0x60
       [ 8531.216633]  cleanup_mnt+0x3b/0x70
       [ 8531.217497]  task_work_run+0x98/0xc0
       [ 8531.218397]  exit_to_usermode_loop+0x83/0x90
       [ 8531.219324]  do_syscall_64+0x15b/0x180
       [ 8531.220192]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [ 8531.221286] RIP: 0033:0x7fc68e5e4d07
       [ 8531.225621] RSP: 002b:00007ffde8116608 EFLAGS: 00000246 ORIG_RAX:00000000000000a6
       [ 8531.227512] RAX: 0000000000000000 RBX: 00005580c2175970 RCX:00007fc68e5e4d07
       [ 8531.229098] RDX: 0000000000000001 RSI: 0000000000000000 RDI:00005580c2175b80
       [ 8531.230730] RBP: 0000000000000000 R08: 00005580c2175ba0 R09:00007ffde8114e80
       [ 8531.232269] R10: 0000000000000000 R11: 0000000000000246 R12:00005580c2175b80
       [ 8531.233839] R13: 00007fc68eac61c4 R14: 00005580c2175a68 R15:0000000000000000
      
      Leaving a tree in the rb-tree:
      
      3853 void btrfs_free_fs_root(struct btrfs_root *root)
      3854 {
      3855         iput(root->ino_cache_inode);
      3856         WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
      
      CC: stable@vger.kernel.org
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      [ add stacktrace ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io" · 0400be16
      David Sterba authored
      commit 77b7aad1 upstream.
      
      This reverts commit e73e81b6.
      
      This patch causes a few problems:
      
      - adds latency to btrfs_finish_ordered_io
      - as btrfs_finish_ordered_io is used for free space cache, generating
        more work from btrfs_btree_balance_dirty_nodelay could end up in the
        same workqueue, effectively deadlocking
      
      12260 kworker/u96:16+btrfs-freespace-write D
      [<0>] balance_dirty_pages+0x6e6/0x7ad
      [<0>] balance_dirty_pages_ratelimited+0x6bb/0xa90
      [<0>] btrfs_finish_ordered_io+0x3da/0x770
      [<0>] normal_work_helper+0x1c5/0x5a0
      [<0>] process_one_work+0x1ee/0x5a0
      [<0>] worker_thread+0x46/0x3d0
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      Transaction commit will wait on the freespace cache:
      
      838 btrfs-transacti D
      [<0>] btrfs_start_ordered_extent+0x154/0x1e0
      [<0>] btrfs_wait_ordered_range+0xbd/0x110
      [<0>] __btrfs_wait_cache_io+0x49/0x1a0
      [<0>] btrfs_write_dirty_block_groups+0x10b/0x3b0
      [<0>] commit_cowonly_roots+0x215/0x2b0
      [<0>] btrfs_commit_transaction+0x37e/0x910
      [<0>] transaction_kthread+0x14d/0x180
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      And then writepages ends up waiting on transaction commit:
      
      9520 kworker/u96:13+flush-btrfs-1 D
      [<0>] wait_current_trans+0xac/0xe0
      [<0>] start_transaction+0x21b/0x4b0
      [<0>] cow_file_range_inline+0x10b/0x6b0
      [<0>] cow_file_range.isra.69+0x329/0x4a0
      [<0>] run_delalloc_range+0x105/0x3c0
      [<0>] writepage_delalloc+0x119/0x180
      [<0>] __extent_writepage+0x10c/0x390
      [<0>] extent_write_cache_pages+0x26f/0x3d0
      [<0>] extent_writepages+0x4f/0x80
      [<0>] do_writepages+0x17/0x60
      [<0>] __writeback_single_inode+0x59/0x690
      [<0>] writeback_sb_inodes+0x291/0x4e0
      [<0>] __writeback_inodes_wb+0x87/0xb0
      [<0>] wb_writeback+0x3bb/0x500
      [<0>] wb_workfn+0x40d/0x610
      [<0>] process_one_work+0x1ee/0x5a0
      [<0>] worker_thread+0x1e0/0x3d0
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      Eventually, we have every process in the system waiting on
      balance_dirty_pages(), and nobody is able to make progress on page
      writeback.
      
      The original patch tried to fix an OOM condition that happened on 4.4, but
      it could not be reproduced on later kernels (4.19 and 4.20). This is more
      likely a problem in OOM itself.
      
      Link: https://lore.kernel.org/linux-btrfs/20180528054821.9092-1-ethanlien@synology.com/
      Reported-by: Chris Mason <clm@fb.com>
      CC: stable@vger.kernel.org # 4.18+
      CC: ethanlien <ethanlien@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • crypto: authenc - fix parsing key with misaligned rta_len · b9119fd2
      Eric Biggers authored
      commit 8f9c4693 upstream.
      
      Keys for "authenc" AEADs are formatted as an rtattr containing a 4-byte
      'enckeylen', followed by an authentication key and an encryption key.
      crypto_authenc_extractkeys() parses the key to find the inner keys.
      
      However, it fails to consider the case where the rtattr's payload is
      longer than 4 bytes but not 4-byte aligned, and where the key ends
      before the next 4-byte aligned boundary.  In this case, 'keylen -=
      RTA_ALIGN(rta->rta_len);' underflows to a value near UINT_MAX.  This
      causes a buffer overread and crash during crypto_ahash_setkey().
      
      Fix it by restricting the rtattr payload to the expected size.
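
      A userspace sketch of the stricter parsing, assuming the key layout
      described above (rtattr header, 4-byte big-endian 'enckeylen', then the
      authentication key and the encryption key). It is not the kernel's
      crypto_authenc_extractkeys(); struct model_rtattr, MODEL_RTA_ALIGN() and
      model_extractkeys() are hypothetical stand-ins. Requiring the payload to be
      exactly 4 bytes means the aligned length is always 8 and can never exceed
      the already validated rta_len, so the subtraction cannot underflow:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for struct rtattr and RTA_ALIGN(); hypothetical names. */
struct model_rtattr {
	uint16_t rta_len;	/* header plus payload, in bytes */
	uint16_t rta_type;
};

#define MODEL_RTA_ALIGNTO 4U
#define MODEL_RTA_ALIGN(len) \
	(((size_t)(len) + MODEL_RTA_ALIGNTO - 1) & ~(size_t)(MODEL_RTA_ALIGNTO - 1))
#define MODEL_KEYA_PARAM 1	/* value used for CRYPTO_AUTHENC_KEYA_PARAM above */
#define MODEL_PARAM_LEN 4U	/* the 4-byte big-endian 'enckeylen' */

struct model_authenc_keys {
	const uint8_t *authkey;
	const uint8_t *enckey;
	size_t authkeylen;
	size_t enckeylen;
};

/* Returns 0 and fills 'keys' on success, -1 on a malformed key blob. */
int model_extractkeys(struct model_authenc_keys *keys,
		      const uint8_t *key, size_t keylen)
{
	struct model_rtattr rta;
	uint32_t enckeylen;
	size_t consumed;

	/* The header and the length it claims must fit in the buffer. */
	if (keylen < sizeof(rta))
		return -1;
	memcpy(&rta, key, sizeof(rta));
	if (rta.rta_len < sizeof(rta) || rta.rta_len > keylen)
		return -1;
	if (rta.rta_type != MODEL_KEYA_PARAM)
		return -1;

	/* Restrict the payload to exactly the expected 4-byte parameter. */
	if (rta.rta_len - sizeof(rta) != MODEL_PARAM_LEN)
		return -1;

	enckeylen = ((uint32_t)key[4] << 24) | ((uint32_t)key[5] << 16) |
		    ((uint32_t)key[6] << 8) | (uint32_t)key[7];

	consumed = MODEL_RTA_ALIGN(rta.rta_len);	/* always 8 here */
	assert(consumed <= keylen);
	key += consumed;
	keylen -= consumed;

	if (keylen < enckeylen)
		return -1;

	keys->authkeylen = keylen - enckeylen;
	keys->enckeylen = enckeylen;
	keys->authkey = key;
	keys->enckey = key + keys->authkeylen;
	return 0;
}
```

      With this check, a 9-byte blob whose rta_len is 9 (as in the reproducer
      below) is rejected instead of being aligned up to 12 bytes.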
      
      Reproducer using AF_ALG:
      
      	#include <linux/if_alg.h>
      	#include <linux/rtnetlink.h>
      	#include <sys/socket.h>
      
      	int main()
      	{
      		int fd;
      		struct sockaddr_alg addr = {
      			.salg_type = "aead",
      			.salg_name = "authenc(hmac(sha256),cbc(aes))",
      		};
      		struct {
      			struct rtattr attr;
      			__be32 enckeylen;
      			char keys[1];
      		} __attribute__((packed)) key = {
      			.attr.rta_len = sizeof(key),
      			.attr.rta_type = 1 /* CRYPTO_AUTHENC_KEYA_PARAM */,
      		};
      
      		fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
      		bind(fd, (void *)&addr, sizeof(addr));
      		setsockopt(fd, SOL_ALG, ALG_SET_KEY, &key, sizeof(key));
      	}
      
      It caused:
      
      	BUG: unable to handle kernel paging request at ffff88007ffdc000
      	PGD 2e01067 P4D 2e01067 PUD 2e04067 PMD 2e05067 PTE 0
      	Oops: 0000 [#1] SMP
      	CPU: 0 PID: 883 Comm: authenc Not tainted 4.20.0-rc1-00108-g00c9fe37 #13
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
      	RIP: 0010:sha256_ni_transform+0xb3/0x330 arch/x86/crypto/sha256_ni_asm.S:155
      	[...]
      	Call Trace:
      	 sha256_ni_finup+0x10/0x20 arch/x86/crypto/sha256_ssse3_glue.c:321
      	 crypto_shash_finup+0x1a/0x30 crypto/shash.c:178
      	 shash_digest_unaligned+0x45/0x60 crypto/shash.c:186
      	 crypto_shash_digest+0x24/0x40 crypto/shash.c:202
      	 hmac_setkey+0x135/0x1e0 crypto/hmac.c:66
      	 crypto_shash_setkey+0x2b/0xb0 crypto/shash.c:66
      	 shash_async_setkey+0x10/0x20 crypto/shash.c:223
      	 crypto_ahash_setkey+0x2d/0xa0 crypto/ahash.c:202
      	 crypto_authenc_setkey+0x68/0x100 crypto/authenc.c:96
      	 crypto_aead_setkey+0x2a/0xc0 crypto/aead.c:62
      	 aead_setkey+0xc/0x10 crypto/algif_aead.c:526
      	 alg_setkey crypto/af_alg.c:223 [inline]
      	 alg_setsockopt+0xfe/0x130 crypto/af_alg.c:256
      	 __sys_setsockopt+0x6d/0xd0 net/socket.c:1902
      	 __do_sys_setsockopt net/socket.c:1913 [inline]
      	 __se_sys_setsockopt net/socket.c:1910 [inline]
      	 __x64_sys_setsockopt+0x1f/0x30 net/socket.c:1910
      	 do_syscall_64+0x4a/0x180 arch/x86/entry/common.c:290
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fixes: e236d4a8 ("[CRYPTO] authenc: Move enckeylen into key itself")
      Cc: <stable@vger.kernel.org> # v2.6.25+
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • crypto: bcm - convert to use crypto_authenc_extractkeys() · 7c5f00e8
      Eric Biggers authored
      commit ab57b335 upstream.
      
      Convert the bcm crypto driver to use crypto_authenc_extractkeys() so
      that it picks up the fix for broken validation of rtattr::rta_len.
      
      This also fixes the DES weak key check to actually be done on the right
      key. (It was checking the authentication key, not the encryption key...)
      
      Fixes: 9d12ba86 ("crypto: brcm - Add Broadcom SPU driver")
      Cc: <stable@vger.kernel.org> # v4.11+
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • crypto: authencesn - Avoid twice completion call in decrypt path · d196d2fd
      Harsh Jain authored
      commit a7773363 upstream.
      
      The authencesn template unconditionally calls aead_request_complete after
      ahash_verify in the decrypt path, which leads to the following kernel panic
      after decryption.
      
      [  338.539800] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
      [  338.548372] PGD 0 P4D 0
      [  338.551157] Oops: 0000 [#1] SMP PTI
      [  338.554919] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: G        W I       4.19.7+ #13
      [  338.564431] Hardware name: Supermicro X8ST3/X8ST3, BIOS 2.0        07/29/10
      [  338.572212] RIP: 0010:esp_input_done2+0x350/0x410 [esp4]
      [  338.578030] Code: ff 0f b6 68 10 48 8b 83 c8 00 00 00 e9 8e fe ff ff 8b 04 25 04 00 00 00 83 e8 01 48 98 48 8b 3c c5 10 00 00 00 e9 f7 fd ff ff <8b> 04 25 04 00 00 00 83 e8 01 48 98 4c 8b 24 c5 10 00 00 00 e9 3b
      [  338.598547] RSP: 0018:ffff911c97803c00 EFLAGS: 00010246
      [  338.604268] RAX: 0000000000000002 RBX: ffff911c4469ee00 RCX: 0000000000000000
      [  338.612090] RDX: 0000000000000000 RSI: 0000000000000130 RDI: ffff911b87c20400
      [  338.619874] RBP: 0000000000000000 R08: ffff911b87c20498 R09: 000000000000000a
      [  338.627610] R10: 0000000000000001 R11: 0000000000000004 R12: 0000000000000000
      [  338.635402] R13: ffff911c89590000 R14: ffff911c91730000 R15: 0000000000000000
      [  338.643234] FS:  0000000000000000(0000) GS:ffff911c97800000(0000) knlGS:0000000000000000
      [  338.652047] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  338.658299] CR2: 0000000000000004 CR3: 00000001ec20a000 CR4: 00000000000006f0
      [  338.666382] Call Trace:
      [  338.669051]  <IRQ>
      [  338.671254]  esp_input_done+0x12/0x20 [esp4]
      [  338.675922]  chcr_handle_resp+0x3b5/0x790 [chcr]
      [  338.680949]  cpl_fw6_pld_handler+0x37/0x60 [chcr]
      [  338.686080]  chcr_uld_rx_handler+0x22/0x50 [chcr]
      [  338.691233]  uldrx_handler+0x8c/0xc0 [cxgb4]
      [  338.695923]  process_responses+0x2f0/0x5d0 [cxgb4]
      [  338.701177]  ? bitmap_find_next_zero_area_off+0x3a/0x90
      [  338.706882]  ? matrix_alloc_area.constprop.7+0x60/0x90
      [  338.712517]  ? apic_update_irq_cfg+0x82/0xf0
      [  338.717177]  napi_rx_handler+0x14/0xe0 [cxgb4]
      [  338.722015]  net_rx_action+0x2aa/0x3e0
      [  338.726136]  __do_softirq+0xcb/0x280
      [  338.730054]  irq_exit+0xde/0xf0
      [  338.733504]  do_IRQ+0x54/0xd0
      [  338.736745]  common_interrupt+0xf/0xf
      
      Fixes: 104880a6 ("crypto: authencesn - Convert to new AEAD...")
      Signed-off-by: Harsh Jain <harsh@chelsio.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • crypto: caam - fix zero-length buffer DMA mapping · 3466b8be
      Aymen Sghaier authored
      commit 04e6d25c upstream.
      
      Recent changes - probably DMA API related (generic and/or arm64-specific) -
      exposed a case where the driver maps a zero-length buffer:
      ahash_init()->ahash_update()->ahash_final() with a zero-length string to
      hash.
      
      kernel BUG at kernel/dma/swiotlb.c:475!
      Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
      Modules linked in:
      CPU: 2 PID: 1823 Comm: cryptomgr_test Not tainted 4.20.0-rc1-00108-g00c9fe37 #1
      Hardware name: LS1046A RDB Board (DT)
      pstate: 80000005 (Nzcv daif -PAN -UAO)
      pc : swiotlb_tbl_map_single+0x170/0x2b8
      lr : swiotlb_map_page+0x134/0x1f8
      sp : ffff00000f79b8f0
      x29: ffff00000f79b8f0 x28: 0000000000000000
      x27: ffff0000093d0000 x26: 0000000000000000
      x25: 00000000001f3ffe x24: 0000000000200000
      x23: 0000000000000000 x22: 00000009f2c538c0
      x21: ffff800970aeb410 x20: 0000000000000001
      x19: ffff800970aeb410 x18: 0000000000000007
      x17: 000000000000000e x16: 0000000000000001
      x15: 0000000000000019 x14: c32cb8218a167fe8
      x13: ffffffff00000000 x12: ffff80097fdae348
      x11: 0000800976bca000 x10: 0000000000000010
      x9 : 0000000000000000 x8 : ffff0000091fd6c8
      x7 : 0000000000000000 x6 : 00000009f2c538bf
      x5 : 0000000000000000 x4 : 0000000000000001
      x3 : 0000000000000000 x2 : 00000009f2c538c0
      x1 : 00000000f9fff000 x0 : 0000000000000000
      Process cryptomgr_test (pid: 1823, stack limit = 0x(____ptrval____))
      Call trace:
       swiotlb_tbl_map_single+0x170/0x2b8
       swiotlb_map_page+0x134/0x1f8
       ahash_final_no_ctx+0xc4/0x6cc
       ahash_final+0x10/0x18
       crypto_ahash_op+0x30/0x84
       crypto_ahash_final+0x14/0x1c
       __test_hash+0x574/0xe0c
       test_hash+0x28/0x80
       __alg_test_hash+0x84/0xd0
       alg_test_hash+0x78/0x144
       alg_test.part.30+0x12c/0x2b4
       alg_test+0x3c/0x68
       cryptomgr_test+0x44/0x4c
       kthread+0xfc/0x128
       ret_from_fork+0x10/0x18
      Code: d34bfc18 2a1a03f7 1a9f8694 35fff89a (d4210000)
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Aymen Sghaier <aymen.sghaier@nxp.com>
      Signed-off-by: Horia Geantă <horia.geanta@nxp.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ip: on queued skb use skb_header_pointer instead of pskb_may_pull · 75664d80
      Willem de Bruijn authored
      [ Upstream commit 4a06fa67 ]
      
      Commit 2efd4fca ("ip: in cmsg IP(V6)_ORIGDSTADDR call
      pskb_may_pull") avoided a read beyond the end of the skb linear
      segment by calling pskb_may_pull.
      
      That function can trigger a BUG_ON in pskb_expand_head if the skb is
      shared, which it is when peeking. It can also return ENOMEM.
      
      Avoid both by switching to safer skb_header_pointer.
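
      The difference can be sketched with a userspace model of a packet split
      into a linear part and one fragment. This is an illustration of the
      "return a pointer or copy into a caller-supplied buffer" idea only, not the
      kernel's skb_header_pointer(); struct model_pkt and model_header_pointer()
      are hypothetical. The packet is taken as const: nothing is reallocated, so
      a shared packet is never modified and no allocation can fail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical packet: a linear area followed by one fragment. */
struct model_pkt {
	const uint8_t *linear;
	size_t linear_len;
	const uint8_t *frag;
	size_t frag_len;
};

/*
 * Return a pointer to 'len' bytes at offset 'off'. If the range lies in
 * the linear area, point straight into it; otherwise copy the bytes into
 * 'scratch' (at least 'len' bytes) and return 'scratch'. Returns NULL if
 * the range is outside the packet. The packet itself is never changed.
 */
const void *model_header_pointer(const struct model_pkt *p, size_t off,
				 size_t len, void *scratch)
{
	size_t total, from_linear;
	uint8_t *out = scratch;

	assert(p != NULL && scratch != NULL);
	total = p->linear_len + p->frag_len;
	if (off > total || len > total - off)
		return NULL;

	if (off + len <= p->linear_len)
		return p->linear + off;

	/* The range crosses into (or lies within) the fragment: copy it. */
	from_linear = off < p->linear_len ? p->linear_len - off : 0;
	if (from_linear > 0)
		memcpy(out, p->linear + off, from_linear);
	memcpy(out + from_linear,
	       p->frag + (off + from_linear - p->linear_len),
	       len - from_linear);
	return out;
}
```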
      
      Fixes: 2efd4fca ("ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bonding: update nest level on unlink · 0bab9990
      Willem de Bruijn authored
      [ Upstream commit 001e465f ]
      
      A network device stack with multiple layers of bonding devices can
      trigger a false positive lockdep warning. Adding lockdep nest levels
      fixes this. Update the level on both enslave and unlink, to avoid the
      following series of events ..
      
          ip netns add test
          ip netns exec test bash
          ip link set dev lo addr 00:11:22:33:44:55
          ip link set dev lo down
      
          ip link add dev bond1 type bond
          ip link add dev bond2 type bond
      
          ip link set dev lo master bond1
          ip link set dev bond1 master bond2
      
          ip link set dev bond1 nomaster
          ip link set dev bond2 master bond1
      
      .. from still generating a splat:
      
          [  193.652127] ======================================================
          [  193.658231] WARNING: possible circular locking dependency detected
          [  193.664350] 4.20.0 #8 Not tainted
          [  193.668310] ------------------------------------------------------
          [  193.674417] ip/15577 is trying to acquire lock:
          [  193.678897] 00000000a40e3b69 (&(&bond->stats_lock)->rlock#3/3){+.+.}, at: bond_get_stats+0x58/0x290
          [  193.687851]
          	       but task is already holding lock:
          [  193.693625] 00000000807b9d9f (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0x58/0x290
      
          [..]
      
          [  193.851092]        lock_acquire+0xa7/0x190
          [  193.855138]        _raw_spin_lock_nested+0x2d/0x40
          [  193.859878]        bond_get_stats+0x58/0x290
          [  193.864093]        dev_get_stats+0x5a/0xc0
          [  193.868140]        bond_get_stats+0x105/0x290
          [  193.872444]        dev_get_stats+0x5a/0xc0
          [  193.876493]        rtnl_fill_stats+0x40/0x130
          [  193.880797]        rtnl_fill_ifinfo+0x6c5/0xdc0
          [  193.885271]        rtmsg_ifinfo_build_skb+0x86/0xe0
          [  193.890091]        rtnetlink_event+0x5b/0xa0
          [  193.894320]        raw_notifier_call_chain+0x43/0x60
          [  193.899225]        netdev_change_features+0x50/0xa0
          [  193.904044]        bond_compute_features.isra.46+0x1ab/0x270
          [  193.909640]        bond_enslave+0x141d/0x15b0
          [  193.913946]        do_set_master+0x89/0xa0
          [  193.918016]        do_setlink+0x37c/0xda0
          [  193.921980]        __rtnl_newlink+0x499/0x890
          [  193.926281]        rtnl_newlink+0x48/0x70
          [  193.930238]        rtnetlink_rcv_msg+0x171/0x4b0
          [  193.934801]        netlink_rcv_skb+0xd1/0x110
          [  193.939103]        rtnetlink_rcv+0x15/0x20
          [  193.943151]        netlink_unicast+0x3b5/0x520
          [  193.947544]        netlink_sendmsg+0x2fd/0x3f0
          [  193.951942]        sock_sendmsg+0x38/0x50
          [  193.955899]        ___sys_sendmsg+0x2ba/0x2d0
          [  193.960205]        __x64_sys_sendmsg+0xad/0x100
          [  193.964687]        do_syscall_64+0x5a/0x460
          [  193.968823]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fixes: 7e2556e4 ("bonding: avoid lockdep confusion in bond_get_stats()")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • packet: Do not leak dev refcounts on error exit · 6740236d
      Jason Gunthorpe authored
      [ Upstream commit d972f3dc ]
      
      'dev' is non-NULL when the addr_len check triggers, so it must goto a label
      that does the dev_put; otherwise dev will have a leaked refcount.

      This bug makes it impossible to unload the ib_ipoib module when using
      systemd-network, as it triggers this check on InfiniBand links.
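
      The error-exit pattern involved can be sketched in userspace as follows.
      This is a model only: struct model_netdev, model_dev_hold(),
      model_dev_put() and model_sendmsg() are hypothetical and merely mimic a
      reference-counted device. Any failure detected after the reference has
      been taken jumps to the label that drops it:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical reference-counted device. */
struct model_netdev {
	int refcnt;
	size_t addr_len;	/* hardware address length of the link */
};

struct model_netdev *model_dev_hold(struct model_netdev *dev)
{
	if (dev != NULL)
		dev->refcnt++;
	return dev;
}

void model_dev_put(struct model_netdev *dev)
{
	assert(dev != NULL && dev->refcnt > 0);
	dev->refcnt--;
}

int model_sendmsg(struct model_netdev *candidate, size_t msg_addr_len)
{
	struct model_netdev *dev;
	int err;

	dev = model_dev_hold(candidate);
	err = -ENODEV;
	if (dev == NULL)
		goto out;		/* no reference taken, nothing to drop */

	err = -EINVAL;
	if (msg_addr_len < dev->addr_len)
		goto out_put;		/* reference held: must be released */

	err = 0;			/* the real send would happen here */
out_put:
	model_dev_put(dev);
out:
	return err;
}
```

      Jumping to 'out' instead of 'out_put' in the address-length check would
      leave refcnt raised after every rejected message, which is the kind of
      leak described above.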
      
      Fixes: 99137b78 ("packet: validate address length")
      Reported-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: bridge: fix a bug on using a neighbour cache entry without checking its state · b4683849
      JianJhen Chen authored
      [ Upstream commit 4c84edc1 ]
      
      When handling DNAT'ed packets on a bridge device, the neighbour cache entry
      from lookup was used without checking its state. It means that a cache entry
      in the NUD_STALE state will be used directly instead of entering the NUD_DELAY
      state to confirm the reachability of the neighbor.
      
      This problem becomes worse after commit 2724680b ("neigh: Keep neighbour
      cache entries if number of them is small enough."), since all neighbour cache
      entries in the NUD_STALE state will be kept in the neighbour table as long as
      the number of cache entries does not exceed the value specified in gc_thresh1.
      
      This commit validates the state of a neighbour cache entry before using
      the entry.
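
      One way such a validation could look, sketched in userspace. The commit
      message does not show the exact condition, so the use of a "connected"
      mask here is an assumption; the MODEL_NUD_* values mirror the kernel's
      NUD_* bit definitions, and model_pick_tx_path() is a hypothetical helper.
      Only an entry in a connected state may use its cached hardware header; a
      stale entry goes through the normal neighbour output path, which is what
      lets it enter NUD_DELAY and be re-confirmed:

```c
#include <assert.h>

/* Values mirror the kernel's NUD_* neighbour state bits. */
#define MODEL_NUD_REACHABLE	0x02
#define MODEL_NUD_STALE		0x04
#define MODEL_NUD_DELAY		0x08
#define MODEL_NUD_NOARP		0x40
#define MODEL_NUD_PERMANENT	0x80
#define MODEL_NUD_CONNECTED \
	(MODEL_NUD_PERMANENT | MODEL_NUD_NOARP | MODEL_NUD_REACHABLE)

static_assert(MODEL_NUD_CONNECTED == 0xc2, "connected mask");

enum model_tx_path {
	MODEL_TX_CACHED_HEADER,	/* fast path: reuse cached link-layer header */
	MODEL_TX_NEIGH_OUTPUT	/* slow path: let the neighbour code decide  */
};

/* Hypothetical helper: choose the transmit path for a looked-up entry. */
enum model_tx_path model_pick_tx_path(unsigned int nud_state,
				      int has_cached_header)
{
	if ((nud_state & MODEL_NUD_CONNECTED) && has_cached_header)
		return MODEL_TX_CACHED_HEADER;
	return MODEL_TX_NEIGH_OUTPUT;
}
```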
      Signed-off-by: JianJhen Chen <kchen@synology.com>
      Reviewed-by: JinLin Chen <jlchen@synology.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipv6: fix kernel-infoleak in ipv6_local_error() · c809028e
      Eric Dumazet authored
      [ Upstream commit 7d033c9f ]
      
      This patch makes sure the flow label in the IPv6 header
      forged in ipv6_local_error() is initialized.
      
      BUG: KMSAN: kernel-infoleak in _copy_to_user+0x16b/0x1f0 lib/usercopy.c:32
      CPU: 1 PID: 24675 Comm: syz-executor1 Not tainted 4.20.0-rc7+ #4
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x173/0x1d0 lib/dump_stack.c:113
       kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
       kmsan_internal_check_memory+0x455/0xb00 mm/kmsan/kmsan.c:675
       kmsan_copy_to_user+0xab/0xc0 mm/kmsan/kmsan_hooks.c:601
       _copy_to_user+0x16b/0x1f0 lib/usercopy.c:32
       copy_to_user include/linux/uaccess.h:177 [inline]
       move_addr_to_user+0x2e9/0x4f0 net/socket.c:227
       ___sys_recvmsg+0x5d7/0x1140 net/socket.c:2284
       __sys_recvmsg net/socket.c:2327 [inline]
       __do_sys_recvmsg net/socket.c:2337 [inline]
       __se_sys_recvmsg+0x2fa/0x450 net/socket.c:2334
       __x64_sys_recvmsg+0x4a/0x70 net/socket.c:2334
       do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
       entry_SYSCALL_64_after_hwframe+0x63/0xe7
      RIP: 0033:0x457ec9
      Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f8750c06c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457ec9
      RDX: 0000000000002000 RSI: 0000000020000400 RDI: 0000000000000005
      RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8750c076d4
      R13: 00000000004c4a60 R14: 00000000004d8140 R15: 00000000ffffffff
      
      Uninit was stored to memory at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline]
       kmsan_save_stack mm/kmsan/kmsan.c:219 [inline]
       kmsan_internal_chain_origin+0x134/0x230 mm/kmsan/kmsan.c:439
       __msan_chain_origin+0x70/0xe0 mm/kmsan/kmsan_instr.c:200
       ipv6_recv_error+0x1e3f/0x1eb0 net/ipv6/datagram.c:475
       udpv6_recvmsg+0x398/0x2ab0 net/ipv6/udp.c:335
       inet_recvmsg+0x4fb/0x600 net/ipv4/af_inet.c:830
       sock_recvmsg_nosec net/socket.c:794 [inline]
       sock_recvmsg+0x1d1/0x230 net/socket.c:801
       ___sys_recvmsg+0x4d5/0x1140 net/socket.c:2278
       __sys_recvmsg net/socket.c:2327 [inline]
       __do_sys_recvmsg net/socket.c:2337 [inline]
       __se_sys_recvmsg+0x2fa/0x450 net/socket.c:2334
       __x64_sys_recvmsg+0x4a/0x70 net/socket.c:2334
       do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
       entry_SYSCALL_64_after_hwframe+0x63/0xe7
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline]
       kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:158
       kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
       kmsan_slab_alloc+0xe/0x10 mm/kmsan/kmsan_hooks.c:185
       slab_post_alloc_hook mm/slab.h:446 [inline]
       slab_alloc_node mm/slub.c:2759 [inline]
       __kmalloc_node_track_caller+0xe18/0x1030 mm/slub.c:4383
       __kmalloc_reserve net/core/skbuff.c:137 [inline]
       __alloc_skb+0x309/0xa20 net/core/skbuff.c:205
       alloc_skb include/linux/skbuff.h:998 [inline]
       ipv6_local_error+0x1a7/0x9e0 net/ipv6/datagram.c:334
       __ip6_append_data+0x129f/0x4fd0 net/ipv6/ip6_output.c:1311
       ip6_make_skb+0x6cc/0xcf0 net/ipv6/ip6_output.c:1775
       udpv6_sendmsg+0x3f8e/0x45d0 net/ipv6/udp.c:1384
       inet_sendmsg+0x54a/0x720 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:621 [inline]
       sock_sendmsg net/socket.c:631 [inline]
       __sys_sendto+0x8c4/0xac0 net/socket.c:1788
       __do_sys_sendto net/socket.c:1800 [inline]
       __se_sys_sendto+0x107/0x130 net/socket.c:1796
       __x64_sys_sendto+0x6e/0x90 net/socket.c:1796
       do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
       entry_SYSCALL_64_after_hwframe+0x63/0xe7
      
      Bytes 4-7 of 28 are uninitialized
      Memory access of size 28 starts at ffff8881937bfce0
      Data copied to user address 0000000020000000
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c809028e
    • Mark Rutland's avatar
      arm64: Don't trap host pointer auth use to EL2 · fea3f83e
      Mark Rutland authored
      [ Backport of upstream commit b3669b1e ]
      
      To allow EL0 (and/or EL1) to use pointer authentication functionality,
      we must ensure that pointer authentication instructions and accesses to
      pointer authentication keys are not trapped to EL2.
      
      This patch ensures that HCR_EL2 is configured appropriately when the
      kernel is booted at EL2. For non-VHE kernels we set HCR_EL2.{API,APK},
      ensuring that EL1 can access keys and permit EL0 use of instructions.
      For VHE kernels host EL0 (TGE && E2H) is unaffected by these settings,
      and it doesn't matter how we configure HCR_EL2.{API,APK}, so we don't
      bother setting them.
      
      This does not enable support for KVM guests, since KVM manages HCR_EL2
      itself when running VMs.
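As a rough illustration of the flag selection described above, the sketch below composes a host HCR_EL2 value for the VHE and non-VHE cases. The bit positions are assumptions taken from the Arm architecture's HCR_EL2 layout rather than from this commit message, and the helper name host_hcr_el2 is hypothetical.

```python
# Bit positions are assumptions based on the architectural HCR_EL2 layout;
# the commit message names the fields but does not list their positions.
HCR_TGE = 1 << 27
HCR_RW = 1 << 31
HCR_E2H = 1 << 34
HCR_APK = 1 << 40  # do not trap pointer authentication key accesses
HCR_API = 1 << 41  # do not trap pointer authentication instructions

# Host flag sets, following the naming used in the kvm code and head.S.
HCR_HOST_VHE_FLAGS = HCR_RW | HCR_TGE | HCR_E2H
HCR_HOST_NVHE_FLAGS = HCR_RW | HCR_API | HCR_APK


def host_hcr_el2(vhe: bool) -> int:
    """Hypothetical helper: pick the host HCR_EL2 value for the boot mode."""
    return HCR_HOST_VHE_FLAGS if vhe else HCR_HOST_NVHE_FLAGS


if __name__ == "__main__":
    print(hex(host_hcr_el2(vhe=False)))
    print(hex(host_hcr_el2(vhe=True)))
```

In this model only the non-VHE value carries API and APK, matching the statement that VHE hosts are unaffected by those bits.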
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Christoffer Dall <christoffer.dall@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: kvmarm@lists.cs.columbia.edu
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      [kristina: backport to 4.14.y: adjust context]
      Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      fea3f83e
    • Mark Rutland's avatar
      arm64/kvm: consistently handle host HCR_EL2 flags · 4ef8d21b
      Mark Rutland authored
      [ Backport of upstream commit 4eaed6aa ]
      
      In KVM we define the configuration of HCR_EL2 for a VHE HOST in
      HCR_HOST_VHE_FLAGS, but we don't have a similar definition for the
      non-VHE host flags, and open-code HCR_RW. Further, in head.S we
      open-code the flags for VHE and non-VHE configurations.
      
      In future, we're going to want to configure more flags for the host, so
      let's add a HCR_HOST_NVHE_FLAGS definition, and consistently use both
      HCR_HOST_VHE_FLAGS and HCR_HOST_NVHE_FLAGS in the kvm code and head.S.
      
      We now use mov_q to generate the HCR_EL2 value, as we use when
      configuring other registers in head.S.
      Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: kvmarm@lists.cs.columbia.edu
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      [kristina: backport to 4.14.y: adjust context]
      Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4ef8d21b
    • Varun Prakash's avatar
      scsi: target: iscsi: cxgbit: fix csk leak · ccc67efc
      Varun Prakash authored
      [ Upstream commit ed076c55 ]
      
      In case of arp failure call cxgbit_put_csk() to free csk.
      Signed-off-by: Varun Prakash <varun@chelsio.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ccc67efc
    • Sasha Levin's avatar
      Revert "scsi: target: iscsi: cxgbit: fix csk leak" · 1c62825e
      Sasha Levin authored
      This reverts commit b8315280.
      
      A wrong commit message was used for the stable commit because of a human
      error (and duplicate commit subject lines).
      
      This patch reverts this error, and the following patches add the two
      upstream commits.
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      1c62825e
    • Xunlei Pang's avatar
      sched/fair: Fix bandwidth timer clock drift condition · d93cef31
      Xunlei Pang authored
      commit 512ac999 upstream.
      
      I noticed that cgroup task groups constantly get throttled even
      if they have low CPU usage. This causes jitter in the response
      time of some of our business containers when CPU quotas are enabled.
      
      It's very simple to reproduce:
      
        mkdir /sys/fs/cgroup/cpu/test
        cd /sys/fs/cgroup/cpu/test
        echo 100000 > cpu.cfs_quota_us
        echo $$ > tasks
      
      then repeat:
      
        cat cpu.stat | grep nr_throttled  # nr_throttled will increase steadily
      
      After some analysis, we found that cfs_rq::runtime_remaining will
      be cleared by expire_cfs_rq_runtime() due to two equal but stale
      "cfs_{b|q}->runtime_expires" after period timer is re-armed.
      
      The current condition to judge clock drift in expire_cfs_rq_runtime()
      is wrong: the two runtime_expires are actually the same when clock
      drift happens, so this condition can never hit. The original design was
      correctly done by this commit:
      
        a9cf55b2 ("sched: Expire invalid runtime")
      
      ... but was changed to be the current implementation due to its locking bug.
      
      This patch introduces another way: it adds a new field in both structures
      cfs_rq and cfs_bandwidth to record the expiration update sequence, and
      uses them to figure out if clock drift happens (true if they are equal).
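The sequence-based check can be modelled roughly as follows. This is a simplified sketch, not the kernel code: the field name expires_seq, the tick length, and what each branch does (extend the local deadline by one tick on drift, otherwise drop the remaining runtime) are assumptions made for illustration.

```python
from dataclasses import dataclass

TICK_NS = 4_000_000  # assumed tick length for this sketch (4 ms)


@dataclass
class CfsBandwidth:
    runtime_expires: int  # global deadline, in ns
    expires_seq: int      # bumped whenever the global deadline is refreshed


@dataclass
class CfsRq:
    runtime_expires: int    # local copy of the deadline
    expires_seq: int        # sequence seen when runtime was last assigned
    runtime_remaining: int  # locally cached quota, in ns


def expire_cfs_rq_runtime(rq: CfsRq, b: CfsBandwidth, now: int) -> None:
    """Decide whether locally cached runtime is still valid at time `now`."""
    if now < rq.runtime_expires:
        return  # local deadline not reached yet
    if rq.runtime_remaining < 0:
        return  # nothing left to expire
    if rq.expires_seq == b.expires_seq:
        # Same update sequence: the deadline only looks expired because of
        # clock drift between CPUs, so extend it instead of dropping quota.
        rq.runtime_expires += TICK_NS
    else:
        # The global side has moved on: the cached runtime is really stale.
        rq.runtime_remaining = 0
```

Comparing sequence numbers rather than the two runtime_expires values avoids the case described above, where both deadlines are equal but stale.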
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      [alakeshh: backport: Fixed merge conflicts:
       - sched.h: Fix the indentation and order in which the variables are
         declared to match with coding style of the existing code in 4.14
         Struct members of same type were declared in separate lines in
         upstream patch which has been changed back to having multiple
         members of same type in the same line.
         e.g. int a; int b; ->  int a, b; ]
      Signed-off-by: Alakesh Haloi <alakeshh@amazon.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org> # 4.14.x
      Fixes: 51f2176d ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
      Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d93cef31
    • Ben Hutchings's avatar
      media: em28xx: Fix misplaced reset of dev->v4l::field_count · 8e643473
      Ben Hutchings authored
      The backport of commit afeaade9 "media: em28xx: make
      v4l2-compliance happier by starting sequence on zero" added a
      reset on em28xx_v4l2::field_count to em28xx_enable_analog_tuner()
      but it should be done in em28xx_start_analog_streaming().
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8e643473
    • Loic Poulain's avatar
      mmc: sdhci-msm: Disable CDR function on TX · 4abb6960
      Loic Poulain authored
      commit a89e7bcb upstream.
      
      The Clock Data Recovery (CDR) circuit automatically adjusts
      the RX sampling-point/phase for high frequency cards (SDR104, HS200...).
      CDR is automatically enabled during DLL configuration.
      However, according to the APQ8016 reference manual, this function
      must be disabled during TX and tuning phase in order to prevent any
      interference during tuning challenges and unexpected phase alteration
      during TX transfers.
      
      This patch enables/disables CDR according to the current transfer mode.
      
      This fixes sporadic write transfer issues observed with some SDR104 and
      HS200 cards.
      
      Inspired by sdhci-msm downstream patch:
      https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/432516/

      Reported-by: Leonid Segal <leonid.s@variscite.com>
      Reported-by: Manabu Igusa <migusa@arrowjapan.com>
      Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
      Acked-by: Adrian Hunter <adrian.hunter@intel.com>
      Acked-by: Georgi Djakov <georgi.djakov@linaro.org>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      [georgi: backport to v4.14]
      Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4abb6960
    • Oliver Hartkopp's avatar
      can: gw: ensure DLC boundaries after CAN frame modification · 39ff087b
      Oliver Hartkopp authored
      commit 0aaa8137 upstream.
      
      Muyu Yu provided a POC where user root with CAP_NET_ADMIN can create a CAN
      frame modification rule that makes the data length code a higher value than
      the available CAN frame data size. In combination with a configured checksum
      calculation where the result is stored relative to the end of the data
      (e.g. cgw_csum_xor_rel) the tail of the skb (e.g. frag_list pointer in
      skb_shared_info) can be rewritten which finally can cause a system crash.
      
      Michal Kubecek suggested dropping frames that have a DLC exceeding the
      available space after the modification process and provided a patch that can
      handle CAN FD frames too. Within this patch we also limit the length for the
      checksum calculations to the maximum of Classic CAN data length (8).
      
      CAN frames that are dropped by these additional checks are counted with the
      CGW_DELETED counter which indicates misconfigurations in can-gw rules.
      
      This fixes CVE-2019-3701.
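A minimal sketch of the bounds check described above: after a modification rule has rewritten the data length code, the frame is dropped and counted if the DLC no longer fits the payload buffer. The class and method names are hypothetical, and deleted_frames only stands in for the CGW_DELETED counter.

```python
from dataclasses import dataclass
from typing import Optional

CAN_MAX_DLEN = 8     # Classic CAN payload size in bytes
CANFD_MAX_DLEN = 64  # CAN FD payload size in bytes


@dataclass
class Gateway:
    deleted_frames: int = 0  # stand-in for the CGW_DELETED counter

    def forward(self, payload: bytearray, dlc: int) -> Optional[bytes]:
        """Return the bytes to forward, or None if the frame is dropped.

        `payload` is the fixed-size data buffer of the frame (8 bytes for
        Classic CAN, 64 for CAN FD); `dlc` is the length after modification.
        """
        if dlc > len(payload):
            # A rule produced a length larger than the available space:
            # drop the frame instead of touching memory past the buffer.
            self.deleted_frames += 1
            return None
        return bytes(payload[:dlc])


if __name__ == "__main__":
    gw = Gateway()
    print(gw.forward(bytearray(CAN_MAX_DLEN), 8))
    print(gw.forward(bytearray(CAN_MAX_DLEN), 15), gw.deleted_frames)
```

Dropping the frame, rather than clamping the length, makes a misconfigured rule visible through the counter.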
      Reported-by: Muyu Yu <ieatmuttonchuan@gmail.com>
      Reported-by: Marcus Meissner <meissner@suse.de>
      Suggested-by: Michal Kubecek <mkubecek@suse.cz>
      Tested-by: Muyu Yu <ieatmuttonchuan@gmail.com>
      Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Cc: linux-stable <stable@vger.kernel.org> # >= v3.2
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      39ff087b
    • Dmitry Safonov's avatar
      tty: Don't hold ldisc lock in tty_reopen() if ldisc present · cb7f9a46
      Dmitry Safonov authored
      commit d3736d82 upstream.
      
      Try to get a reference to the ldisc during tty_reopen().
      If the ldisc is present, we don't need to call tty_ldisc_reinit() or
      take the write side of the line discipline semaphore.
      Effectively, it optimizes the fast path for tty_reopen(), but more
      importantly it won't interrupt ongoing IO on the tty as no ldisc change
      is needed.
      This fixes a user-visible issue where tty_reopen() interrupted the login
      process for a user with a long password, observed and reported by Lukas.
      
      Fixes: c96cf923 ("tty: Don't block on IO when ldisc change is pending")
      Fixes: 83d817f4 ("tty: Hold tty_ldisc_lock() during tty_reopen()")
      Cc: Jiri Slaby <jslaby@suse.com>
      Reported-by: Lukas F. Hartmann <lukas@mntmn.com>
      Tested-by: Lukas F. Hartmann <lukas@mntmn.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cb7f9a46
    • Dmitry Safonov's avatar
      tty: Simplify tty->count math in tty_reopen() · 4086e287
      Dmitry Safonov authored
      commit cf62a1a1 upstream.
      
      As noted by Jiri, tty_ldisc_reinit() shouldn't rely on tty counter.
      Simplify math by increasing the counter after reinit success.
      
      Cc: Jiri Slaby <jslaby@suse.com>
      Link: lkml.kernel.org/r/<20180829022353.23568-2-dima@arista.com>
      Suggested-by: Jiri Slaby <jslaby@suse.com>
      Reviewed-by: Jiri Slaby <jslaby@suse.cz>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4086e287
    • Dmitry Safonov's avatar
      tty: Hold tty_ldisc_lock() during tty_reopen() · 108bf6a2
      Dmitry Safonov authored
      commit 83d817f4 upstream.
      
      tty_ldisc_reinit() races with neither tty_ldisc_hangup()
      nor set_ldisc() nor tty_ldisc_release(), as they use the tty lock.
      But it races with anyone who expects the line discipline to be the same
      after holding the read semaphore in tty_ldisc_ref().
      
      We've seen the following crash on v4.9.108 stable:
      
      BUG: unable to handle kernel paging request at 0000000000002260
      IP: [..] n_tty_receive_buf_common+0x5f/0x86d
      Workqueue: events_unbound flush_to_ldisc
      Call Trace:
       [..] n_tty_receive_buf2
       [..] tty_ldisc_receive_buf
       [..] flush_to_ldisc
       [..] process_one_work
       [..] worker_thread
       [..] kthread
       [..] ret_from_fork
      
      tty_ldisc_reinit() should be called with ldisc_sem held for writing,
      which will protect any reader against line discipline changes.
      
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: stable@vger.kernel.org # b027e229 ("tty: fix data race between tty_init_dev and flush of buf")
      Reviewed-by: Jiri Slaby <jslaby@suse.cz>
      Reported-by: syzbot+3aa9784721dfb90e984d@syzkaller.appspotmail.com
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Tested-by: Tycho Andersen <tycho@tycho.ws>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      108bf6a2
    • Dmitry Safonov's avatar
      tty/ldsem: Wake up readers after timed out down_write() · fe137002
      Dmitry Safonov authored
      commit 231f8fd0 upstream.
      
      ldsem_down_read() will sleep if there is a pending writer in the queue.
      If the writer times out, readers in the queue should be woken up,
      otherwise they may miss a chance to acquire the semaphore until the last
      active reader does ldsem_up_read().
      
      There were a couple of reports where there was one active reader and
      other readers soft locked up:
        Showing all locks held in the system:
        2 locks held by khungtaskd/17:
         #0:  (rcu_read_lock){......}, at: watchdog+0x124/0x6d1
         #1:  (tasklist_lock){.+.+..}, at: debug_show_all_locks+0x72/0x2d3
        2 locks held by askfirst/123:
         #0:  (&tty->ldisc_sem){.+.+.+}, at: ldsem_down_read+0x46/0x58
         #1:  (&ldata->atomic_read_lock){+.+...}, at: n_tty_read+0x115/0xbe4
      
      Prevent readers from waiting for active readers to release the ldisc
      semaphore.
      
      Link: lkml.kernel.org/r/20171121132855.ajdv4k6swzhvktl6@wfg-t540p.sh.intel.com
      Link: lkml.kernel.org/r/20180907045041.GF1110@shao2-debian
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fe137002
  2. 16 Jan, 2019 14 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.14.94 · 8979da25
      Greg Kroah-Hartman authored
      8979da25
    • Christoffer Dall's avatar
      KVM: arm/arm64: Fix VMID alloc race by reverting to lock-less · cb754d67
      Christoffer Dall authored
      commit fb544d1c upstream.
      
      We recently addressed a VMID generation race by introducing a read/write
      lock around accesses and updates to the vmid generation values.
      
      However, kvm_arch_vcpu_ioctl_run() also calls need_new_vmid_gen() but
      does so without taking the read lock.
      
      As far as I can tell, this can lead to the same kind of race:
      
        VM 0, VCPU 0			VM 0, VCPU 1
        ------------			------------
        update_vttbr (vmid 254)
        				update_vttbr (vmid 1) // roll over
      				read_lock(kvm_vmid_lock);
      				force_vm_exit()
        local_irq_disable
        need_new_vmid_gen == false //because vmid gen matches
      
        enter_guest (vmid 254)
        				kvm_arch.vttbr = <PGD>:<VMID 1>
      				read_unlock(kvm_vmid_lock);
      
        				enter_guest (vmid 1)
      
      Which results in running two VCPUs in the same VM with different VMIDs
      and (even worse) other VCPUs from other VMs could now allocate clashing
      VMID 254 from the new generation as long as VCPU 0 is not exiting.
      
      Attempt to solve this by making sure vttbr is updated before another CPU
      can observe the updated VMID generation.
      
      Cc: stable@vger.kernel.org
      Fixes: f0cf47d9 "KVM: arm/arm64: Close VMID generation race"
      Reviewed-by: Julien Thierry <julien.thierry@arm.com>
      Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cb754d67
    • Vasily Averin's avatar
      sunrpc: use-after-free in svc_process_common() · 65dba325
      Vasily Averin authored
      commit d4b09acf upstream.
      
      If a node has NFSv4.1+ mounts inside several net namespaces,
      it can lead to a use-after-free in svc_process_common():
      
      svc_process_common()
              /* Setup reply header */
              rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp); <<< HERE
      
      svc_process_common() can use incorrect rqstp->rq_xprt,
      its caller function bc_svc_process() takes it from serv->sv_bc_xprt.
      The problem is that serv is global structure but sv_bc_xprt
      is assigned per-netnamespace.
      
      According to Trond, the whole "let's set up rqstp->rq_xprt
      for the back channel" is nothing but a giant hack in order
      to work around the fact that svc_process_common() uses it
      to find the xpt_ops, and perform a couple of (meaningless
      for the back channel) tests of xpt_flags.
      
      All we really need in svc_process_common() is to be able to run
      rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr()
      
      Bruce J Fields points out that this xpo_prep_reply_hdr() call
      is an awfully roundabout way just to do "svc_putnl(resv, 0);"
      in the tcp case.
      
      This patch does not initialize rqstp->rq_xprt in bc_svc_process();
      it now calls svc_process_common() with rqstp->rq_xprt = NULL.
      
      To adjust the reply header, svc_process_common() just checks
      rqstp->rq_prot and calls svc_tcp_prep_reply_hdr() for the tcp case.
      
      To handle the rqstp->rq_xprt = NULL case in functions called from
      svc_process_common(), the patch introduces the net namespace pointer
      svc_rqst->rq_bc_net and adjusts the SVC_NET() definition.
      Some other functions were also adapted to properly handle the described case.
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Cc: stable@vger.kernel.org
      Fixes: 23c20ecd ("NFS: callback up - users counting cleanup")
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      v2: - added lost extern svc_tcp_prep_reply_hdr()
          - dropped trace_svc_process() changes
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      65dba325
    • Theodore Ts'o's avatar
      ext4: track writeback errors using the generic tracking infrastructure · 5903fc64
      Theodore Ts'o authored
      commit 95cb6713 upstream.
      
      We are already using mapping_set_error() in fs/ext4/page_io.c, so all we
      need to do is to use file_check_and_advance_wb_err() when handling
      fsync() requests in ext4_sync_file().
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5903fc64
    • Theodore Ts'o's avatar
      ext4: use ext4_write_inode() when fsyncing w/o a journal · 82f71b8b
      Theodore Ts'o authored
      commit ad211f3e upstream.
      
      In no-journal mode, we previously used __generic_file_fsync().
      This triggers a lockdep warning, and in addition,
      it's not safe to depend on the inode writeback mechanism in the case
      of ext4.  We can solve both problems by calling ext4_write_inode()
      directly.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      82f71b8b
    • Theodore Ts'o's avatar
      ext4: avoid kernel warning when writing the superblock to a dead device · b17971ae
      Theodore Ts'o authored
      commit e8680786 upstream.
      
      The xfstests generic/475 test switches the underlying device with
      dm-error while running a stress test.  This results in a large number
      of file system errors, and since we can't lock the buffer head when
      marking the superblock dirty in the ext4_grp_locked_error() case, it's
      possible for the superblock to be !buffer_uptodate() without
      buffer_write_io_error() being true.
      
      We need to set buffer_uptodate() before we call mark_buffer_dirty() or
      this will trigger a WARN_ON.  It's safe to do this since the
      superblock must have been properly read into memory or the mount would
      not have been successful.  So if buffer_uptodate() is not set, we can
      safely assume that this happened due to a failed attempt to write the
      superblock.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b17971ae
    • Theodore Ts'o's avatar
      ext4: fix a potential fiemap/page fault deadlock w/ inline_data · b4727299
      Theodore Ts'o authored
      commit 2b08b1f1 upstream.
      
      The ext4_inline_data_fiemap() function calls fiemap_fill_next_extent()
      while still holding the xattr semaphore.  This is not necessary and it
      triggers a circular lockdep warning.  This is because
      fiemap_fill_next_extent() could trigger a page fault when it writes
      into a page.  If that page is mmaped from
      the inline file in question, this could very well result in a
      deadlock.
      
      This problem can be reproduced using generic/519 with a file system
      configuration which has the inline_data feature enabled.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b4727299
    • Theodore Ts'o's avatar
      ext4: make sure enough credits are reserved for dioread_nolock writes · eb13c60d
      Theodore Ts'o authored
      commit 812c0cab upstream.
      
      There are enough credits reserved for most dioread_nolock writes;
      however, if the extent tree is sufficiently deep, and/or quota is
      enabled, the code was not allowing for all eventualities when
      reserving journal credits for the unwritten extent conversion.
      
      This problem can be seen using xfstests ext4/034:
      
         WARNING: CPU: 1 PID: 257 at fs/ext4/ext4_jbd2.c:271 __ext4_handle_dirty_metadata+0x10c/0x180
         Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
         RIP: 0010:__ext4_handle_dirty_metadata+0x10c/0x180
         	...
         EXT4-fs: ext4_free_blocks:4938: aborting transaction: error 28 in __ext4_handle_dirty_metadata
         EXT4: jbd2_journal_dirty_metadata failed: handle type 11 started at line 4921, credits 4/0, errcode -28
         EXT4-fs error (device dm-1) in ext4_free_blocks:4950: error 28
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      eb13c60d
    • Ilya Dryomov's avatar
      rbd: don't return 0 on unmap if RBD_DEV_FLAG_REMOVING is set · 022ce60c
      Ilya Dryomov authored
      commit 85f5a4d6 upstream.
      
      There is a window between when RBD_DEV_FLAG_REMOVING is set and when
      the device is removed from rbd_dev_list.  During this window, we set
      "already" and return 0.
      
      Returning 0 from write(2) can confuse userspace tools because
      0 indicates that nothing was written.  In particular, "rbd unmap"
      will retry the write multiple times a second:
      
        10:28:05.463299 write(4, "0", 1)        = 0
        10:28:05.463509 write(4, "0", 1)        = 0
        10:28:05.463720 write(4, "0", 1)        = 0
        10:28:05.463942 write(4, "0", 1)        = 0
        10:28:05.464155 write(4, "0", 1)        = 0
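The retry behaviour in the trace can be reproduced with a small userspace-style model: a helper that keeps writing until all bytes are accepted treats a return value of 0 as "try again", whereas an error return stops it at once. The helper names are hypothetical, and the errno used for the failing case is an assumption, since the message above does not say which error is returned.

```python
import errno


def write_all(write_fn, data: bytes, max_attempts: int = 5) -> int:
    """Write `data` via `write_fn`, retrying short writes.

    Returns the number of attempts made. `max_attempts` only exists so the
    sketch terminates; a real tool might keep retrying.
    """
    attempts = 0
    remaining = data
    while remaining and attempts < max_attempts:
        attempts += 1
        written = write_fn(remaining)  # may raise OSError
        remaining = remaining[written:]
    return attempts


def zero_write(buf: bytes) -> int:
    """Models the old behaviour: success reported, but nothing written."""
    return 0


def failing_write(buf: bytes) -> int:
    """Models an error return; EINPROGRESS is an assumed errno."""
    raise OSError(errno.EINPROGRESS, "removal already in progress")


if __name__ == "__main__":
    print(write_all(zero_write, b"0"))  # keeps retrying until the cap
    try:
        write_all(failing_write, b"0")
    except OSError as exc:
        print("stopped after first attempt:", exc)
```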
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Tested-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      022ce60c
    • Ivan Mironov's avatar
      drm/fb-helper: Partially bring back workaround for bugs of SDL 1.2 · 8b99f170
      Ivan Mironov authored
      commit 62d85b3b upstream.
      
      SDL 1.2 sets all fields related to the pixel format to zero in some
      cases[1]. Prior to commit db05c481 ("drm: fb-helper: Reject all
      pixel format changing requests"), there was an unintentional workaround
      for this that existed for more than a decade. First in device-specific DRM
      drivers, then here in drm_fb_helper.c.
      
      Previous code containing this workaround just ignored pixel format fields
      from userspace code. Not a good thing either, as this way, the driver may
      silently use a pixel format different from what the client actually
      requested, and this in turn will lead to displaying garbage on the screen.
      I think that returning EINVAL to userspace in this particular case is the
      right option, so I decided to leave the code from the problematic commit
      untouched instead of just reverting it entirely.
      
      Here are the steps required to reproduce this problem exactly:
      	1) Compile fceux[2] with SDL 1.2.15 and without GTK or OpenGL
      	   support. SDL should be compiled with fbdev support (which is
      	   on by default).
      	2) Create /etc/fb.modes with the following contents (the values
      	   seem to be unused, and are only required to trigger the
      	   problematic code in SDL):
      
      		mode "test"
      		    geometry 1 1 1 1 1
      		    timings 1 1 1 1 1 1 1
      		endmode
      
      	3) Create ~/.fceux/fceux.cfg with following contents:
      
      		SDL.Hotkeys.Quit = 27
      		SDL.DoubleBuffering = 1
      
      	4) Ensure that screen resolution is at least 1280x960 (e.g.
      	   append "video=Virtual-1:1280x960-32" to the kernel cmdline
      	   for qemu/QXL).
      
      	5) Try to run fceux on VT with some ROM file[3]:
      
      		# ./fceux color_test.nes
      
      [1] SDL 1.2.15 source code, src/video/fbcon/SDL_fbvideo.c,
          FB_SetVideoMode()
      [2] http://www.fceux.com
      [3] Example ROM: https://github.com/bokuweb/rustynes/blob/master/roms/color_test.nes

      Reported-by: saahriktu <mail@saahriktu.org>
      Suggested-by: saahriktu <mail@saahriktu.org>
      Cc: stable@vger.kernel.org
      Fixes: db05c481 ("drm: fb-helper: Reject all pixel format changing requests")
      Signed-off-by: Ivan Mironov <mironov.ivan@gmail.com>
      [danvet: Delete misleading comment.]
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190108072353.28078-2-mironov.ivan@gmail.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b99f170
    • Yi Zeng's avatar
      i2c: dev: prevent adapter retries and timeout being set as minus value · 78e5ef1a
      Yi Zeng authored
      commit 6ebec961 upstream.
      
      If adapter->retries is set to a negative value from user space via ioctl,
      it will make __i2c_transfer and __i2c_smbus_xfer skip the call to
      adapter->algo->master_xfer and adapter->algo->smbus_xfer that is
      registered by the underlying bus drivers, and return value 0 to all the
      callers. The bus driver will never be accessed anymore by any user;
      besides, the users may still get a successful return value without any
      error or information log printed out.
      
      If adapter->timeout is set to a negative value from user space via ioctl,
      it will make the retrying loop in __i2c_transfer and __i2c_smbus_xfer
      always break after the first try, because time_after always
      returns true.
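The retries half of the problem can be sketched as follows: an unsigned ioctl argument stored into a signed int turns negative, and a loop of the form "try <= retries" then never calls the bus driver and falls through with 0. The helper names are hypothetical, and the range check shown is an assumption about the shape of the fix, not a quote of it.

```python
import errno

INT_MAX = 2**31 - 1


def as_c_int(value: int) -> int:
    """Reinterpret the low 32 bits of `value` as a signed C int."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def transfer(retries: int, xfer) -> int:
    """Model of a retry loop of the form `for (try = 0; try <= retries; ...)`."""
    ret = 0
    attempt = 0
    while attempt <= retries:
        ret = xfer()
        if ret != -errno.EAGAIN:
            break
        attempt += 1
    return ret  # still 0 if the loop body never ran


def checked_ioctl_arg(arg: int) -> int:
    """Assumed shape of the fix: reject values that do not fit a signed int."""
    if arg > INT_MAX:
        raise OSError(errno.EINVAL, "value does not fit in a signed int")
    return arg
```

With retries equal to as_c_int(0xFFFFFFFF), which is -1, transfer() returns 0 without ever calling xfer, which is the silent "success" described above.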
      Signed-off-by: Yi Zeng <yizeng@asrmicro.com>
      [wsa: minor grammar updates to commit message]
      Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      78e5ef1a
    • Hans de Goede's avatar
      ACPI / PMIC: xpower: Fix TS-pin current-source handling · d3db93d2
      Hans de Goede authored
      commit 2b531d71 upstream.
      
      The current-source used for the battery temp-sensor (TS) is shared with the
      GPADC. For proper fuel-gauge and charger operation the TS current-source
      needs to be permanently on. But to read the GPADC we need to temporarily
      switch the TS current-source to ondemand, so that the GPADC can use it,
      otherwise we will always read an all 0 value.
      
      Switching from on to ondemand is not necessary when the TS
      current-source is off (this happens on devices which do not have a TS).
      
      Prior to this commit there were 2 issues with our handling of the TS
      current-source switching:
      
       1) We were writing hardcoded values to the ADC TS pin-ctrl register,
       overwriting various other unrelated bits. Specifically we were overwriting
       the current-source setting for the TS and GPIO0 pins, forcing it to 80 µA
       independent of its original setting. On a Chuwi Vi10 tablet this caused
       us to read an ADC value that was too high (due to the too-high
       current-source), so acpi_lpat_raw_to_temp() returned -ENOENT, resulting in:
      
      ACPI Error: AE_ERROR, Returned by Handler for [UserDefinedRegion]
      ACPI Error: Method parse/execution failed \_SB.SXP1._TMP, AE_ERROR
      
      This commit fixes this by using regmap_update_bits to change only the
      relevant bits.
      
       2) At the end of intel_xpower_pmic_get_raw_temp() we were unconditionally
       enabling the TS current-source even on devices where the TS-pin is not used
       and the current-source thus was off on entry of the function.
      
      This commit fixes this by checking if the TS current-source is off when
      entering intel_xpower_pmic_get_raw_temp() and if so it is left as is.
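
Both fixes can be sketched as a user-space simulation of a masked read-modify-write, which is the idea behind regmap_update_bits. The bit layout, the `DEMO_*` constants and the `demo_*` helpers below are hypothetical and chosen only for illustration; they are not the real PMIC register map or the driver's code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the low two bits select the TS current-source mode. */
#define DEMO_TS_MODE_MASK	0x03u
#define DEMO_TS_MODE_OFF	0x00u
#define DEMO_TS_MODE_ONDEMAND	0x02u
#define DEMO_TS_MODE_ON		0x03u

/* Read-modify-write: change only the bits selected by mask. */
static uint8_t demo_update_bits(uint8_t reg, uint8_t mask, uint8_t val)
{
	assert((val & ~mask) == 0);	/* val must fit inside mask */
	return (uint8_t)((reg & ~mask) | (val & mask));
}

/*
 * Simulated temperature read. *during receives the register value in
 * effect while the ADC would be sampled; the return value is the
 * register value left behind afterwards.
 */
static uint8_t demo_get_raw_temp(uint8_t reg, uint8_t *during)
{
	uint8_t mode = reg & DEMO_TS_MODE_MASK;

	/* Only switch to on-demand if the current-source was not off. */
	if (mode != DEMO_TS_MODE_OFF)
		reg = demo_update_bits(reg, DEMO_TS_MODE_MASK,
				       DEMO_TS_MODE_ONDEMAND);

	*during = reg;	/* the GPADC would be read here */

	/* Switch back on only if we switched away; leave "off" as is. */
	if (mode != DEMO_TS_MODE_OFF)
		reg = demo_update_bits(reg, DEMO_TS_MODE_MASK,
				       DEMO_TS_MODE_ON);

	return reg;
}
```

The upper bits of the register pass through unchanged in both branches, and a register that starts with the current-source off is returned untouched.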
      
      Fixes: 58eefe2f (ACPI / PMIC: xpower: Do pinswitch ... reading GPADC)
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: 4.14+ <stable@vger.kernel.org> # 4.14+
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d3db93d2
    • Hans de Goede's avatar
      ACPI: power: Skip duplicate power resource references in _PRx · ec697eb3
      Hans de Goede authored
      commit 7d7b467c upstream.
      
      Some ACPI tables contain duplicate power resource references like this:
      
              Name (_PR0, Package (0x04)  // _PR0: Power Resources for D0
              {
                  P28P,
                  P18P,
                  P18P,
                  CLK4
              })
      
      This causes a WARN_ON in sysfs_add_link_to_group() because we end up
      adding a link to the same acpi_device twice:
      
      sysfs: cannot create duplicate filename '/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/808622C1:00/OVTI2680:00/power_resources_D0/LNXPOWER:0a'
      CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.12-301.fc29.x86_64 #1
      Hardware name: Insyde CherryTrail/Type2 - Board Product Name, BIOS jumperx.T87.KFBNEEA02 04/13/2016
      Call Trace:
       dump_stack+0x5c/0x80
       sysfs_warn_dup.cold.3+0x17/0x2a
       sysfs_do_create_link_sd.isra.2+0xa9/0xb0
       sysfs_add_link_to_group+0x30/0x50
       acpi_power_expose_list+0x74/0xa0
       acpi_power_add_remove_device+0x50/0xa0
       acpi_add_single_object+0x26b/0x5f0
       acpi_bus_check_add+0xc4/0x250
       ...
      
      To address this issue, make acpi_extract_power_resources() check for
      duplicates and simply skip them when found.
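
A minimal sketch of the skip-duplicates idea, assuming resources can be compared by name. The function `demo_extract_unique` and its string-based comparison are hypothetical; the code the commit touches works on ACPI objects rather than strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy references from refs[] into out[], skipping any name that is
 * already present in out[]. Returns the number of unique entries.
 * out[] must have room for at least n entries.
 */
static size_t demo_extract_unique(const char *const *refs, size_t n,
				  const char **out, size_t out_cap)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		size_t j;

		/* Linear scan of what has been collected so far. */
		for (j = 0; j < count; j++)
			if (strcmp(out[j], refs[i]) == 0)
				break;

		if (j < count)
			continue;	/* duplicate reference: skip it */

		assert(count < out_cap);
		out[count++] = refs[i];
	}

	return count;
}
```

Applied to the `_PR0` package shown above, the second `P18P` is dropped, so only one link per power resource would be created. A linear scan is adequate here because such packages hold only a handful of entries.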
      
      Cc: All applicable <stable@vger.kernel.org>
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      [ rjw: Subject & changelog, comments ]
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ec697eb3
    • Michal Hocko's avatar
      mm, memcg: fix reclaim deadlock with writeback · 8c4da113
      Michal Hocko authored
      commit 63f3655f upstream.
      
      Liu Bo has experienced a deadlock between memcg (legacy) reclaim and
      ext4 writeback:
      
        task1:
          wait_on_page_bit+0x82/0xa0
          shrink_page_list+0x907/0x960
          shrink_inactive_list+0x2c7/0x680
          shrink_node_memcg+0x404/0x830
          shrink_node+0xd8/0x300
          do_try_to_free_pages+0x10d/0x330
          try_to_free_mem_cgroup_pages+0xd5/0x1b0
          try_charge+0x14d/0x720
          memcg_kmem_charge_memcg+0x3c/0xa0
          memcg_kmem_charge+0x7e/0xd0
          __alloc_pages_nodemask+0x178/0x260
          alloc_pages_current+0x95/0x140
          pte_alloc_one+0x17/0x40
          __pte_alloc+0x1e/0x110
          alloc_set_pte+0x5fe/0xc20
          do_fault+0x103/0x970
          handle_mm_fault+0x61e/0xd10
          __do_page_fault+0x252/0x4d0
          do_page_fault+0x30/0x80
          page_fault+0x28/0x30
      
        task2:
          __lock_page+0x86/0xa0
          mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
          ext4_writepages+0x479/0xd60
          do_writepages+0x1e/0x30
          __writeback_single_inode+0x45/0x320
          writeback_sb_inodes+0x272/0x600
          __writeback_inodes_wb+0x92/0xc0
          wb_writeback+0x268/0x300
          wb_workfn+0xb4/0x390
          process_one_work+0x189/0x420
          worker_thread+0x4e/0x4b0
          kthread+0xe6/0x100
          ret_from_fork+0x41/0x50
      
      He adds:
       "task1 is waiting for the PageWriteback bit of the page that task2 has
        collected in mpd->io_submit->io_bio, and task2 is waiting for the
        LOCKED bit of the page which task1 has locked"
      
      More precisely, task1 is handling a page fault and holds a page lock
      while it charges a new page table to a memcg.  That in turn hits the
      memory limit and triggers reclaim, and memcg reclaim for the legacy
      controller waits on the writeback.  But that writeback is never going to
      finish, because it is itself waiting for the page locked in the #PF path.
      So this is essentially an ABBA deadlock:
      
                                              lock_page(A)
                                              SetPageWriteback(A)
                                              unlock_page(A)
        lock_page(B)
                                              lock_page(B)
        pte_alloc_one
          shrink_page_list
            wait_on_page_writeback(A)
                                              SetPageWriteback(B)
                                              unlock_page(B)
      
                                              # flush A, B to clear the writeback
      
      This accumulation of more pages to flush is used by several filesystems
      to generate more optimal IO patterns.
      
      Waiting for the writeback in the legacy memcg controller is a workaround for
      premature OOM killer invocations because there is no dirty IO
      throttling available for the controller.  There is no easy way around
      that unfortunately.  Therefore fix this specific issue by pre-allocating
      the page table outside of the page lock.  We already have handy
      infrastructure for that, so simply reuse the fault-around pattern,
      which already does this.
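
The ordering change can be sketched as a small user-space model. Everything here is hypothetical (`struct demo_fault`, the `demo_*` functions, the use of `malloc` as a stand-in for a charged page-table allocation); it only illustrates "allocate first, lock second" and is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical fault context; field names are for illustration only. */
struct demo_fault {
	void *prealloc_pte;	/* page table allocated ahead of time */
	void *pte;		/* page table actually installed */
	bool page_locked;	/* models the fs page lock */
	int allocs_under_lock;	/* counts the problematic allocations */
};

/* Stand-in for an allocation that may enter memcg reclaim. */
static void *demo_alloc_pte(struct demo_fault *f)
{
	if (f->page_locked)
		f->allocs_under_lock++;	/* reclaim here could wait on writeback */
	return malloc(64);
}

static int demo_handle_fault(struct demo_fault *f)
{
	/* Step 1: allocate while no page lock is held. */
	if (!f->pte && !f->prealloc_pte) {
		f->prealloc_pte = demo_alloc_pte(f);
		if (!f->prealloc_pte)
			return -1;
	}

	f->page_locked = true;		/* models lock_page() */

	/* Step 2: install the pre-allocated table; no allocation here. */
	if (!f->pte) {
		f->pte = f->prealloc_pte;
		f->prealloc_pte = NULL;
	}
	assert(f->pte != NULL);

	f->page_locked = false;		/* models unlock_page() */
	return 0;
}
```

Because the only allocation happens before `page_locked` is set, reclaim triggered by that allocation can no longer wait on writeback while the faulting task holds the page lock.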
      
      There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
      made while an fs page is locked, but they should be really rare.  I am not
      aware of a better solution unfortunately.
      
      [akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
      [akpm@linux-foundation.org: coding-style fixes]
      [mhocko@kernel.org: enhance comment, per Johannes]
        Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
      Fixes: c3b94f44 ("memcg: further prevent OOM with too many dirty pages")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Liu Bo <bo.liu@linux.alibaba.com>
      Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8c4da113