1. 19 Nov, 2013 14 commits
    • majianpeng's avatar
      md/raid5: Use conf->device_lock protect changing of multi-thread resources. · 60aaf933
      majianpeng authored
      When we change group_thread_cnt from sysfs entry, it can OOPS.
      
      The kernel messages are:
      [  135.299021] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [  135.299073] IP: [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.299107] PGD 0
      [  135.299122] Oops: 0000 [#1] SMP
      [  135.299144] Modules linked in: netconsole e1000e ptp pps_core
      [  135.299188] CPU: 3 PID: 2225 Comm: md0_raid5 Not tainted 3.12.0+ #24
      [  135.299214] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
      [  135.299255] task: ffff8800b9638f80 ti: ffff8800b77a4000 task.ti: ffff8800b77a4000
      [  135.299283] RIP: 0010:[<ffffffff815188ab>]  [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.299323] RSP: 0018:ffff8800b77a5c48  EFLAGS: 00010002
      [  135.299344] RAX: ffff880037bb5c70 RBX: 0000000000000000 RCX: 0000000000000008
      [  135.299371] RDX: ffff880037bb5cb8 RSI: 0000000000000001 RDI: ffff880037bb5c00
      [  135.299398] RBP: ffff8800b77a5d08 R08: 0000000000000001 R09: 0000000000000000
      [  135.299425] R10: ffff8800b77a5c98 R11: 00000000ffffffff R12: ffff880037bb5c00
      [  135.299452] R13: 0000000000000000 R14: 0000000000000000 R15: ffff880037bb5c70
      [  135.299479] FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
      [  135.299510] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  135.299532] CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 00000000000407e0
      [  135.299559] Stack:
      [  135.299570]  ffff8800b77a5c88 ffffffff8107383e ffff8800b77a5c88 ffff880037a64300
      [  135.299611]  000000000000ec08 ffff880037bb5cb8 ffff8800b77a5c98 ffffffffffffffd8
      [  135.299654]  000000000000ec08 ffff880037bb5c60 ffff8800b77a5c98 ffff8800b77a5c98
      [  135.299696] Call Trace:
      [  135.299711]  [<ffffffff8107383e>] ? __wake_up+0x4e/0x70
      [  135.299733]  [<ffffffff81518f88>] raid5d+0x4c8/0x680
      [  135.299756]  [<ffffffff817174ed>] ? schedule_timeout+0x15d/0x1f0
      [  135.299781]  [<ffffffff81524c9f>] md_thread+0x11f/0x170
      [  135.299804]  [<ffffffff81069cd0>] ? wake_up_bit+0x40/0x40
      [  135.299826]  [<ffffffff81524b80>] ? md_rdev_init+0x110/0x110
      [  135.299850]  [<ffffffff81069656>] kthread+0xc6/0xd0
      [  135.299871]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  135.299899]  [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
      [  135.299923]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  135.299951] Code: ff ff ff 0f 84 d7 fe ff ff e9 5c fe ff ff 66 90 41 8b b4 24 d8 01 00 00 45 31 ed 85 f6 0f 8e 7b fd ff ff 49 8b 9c 24 d0 01 00 00 <48> 3b 1b 49 89 dd 0f 85 67 fd ff ff 48 8d 43 28 31 d2 eb 17 90
      [  135.300005] RIP  [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.300005]  RSP <ffff8800b77a5c48>
      [  135.300005] CR2: 0000000000000000
      [  135.300005] ---[ end trace 504854e5bb7562ed ]---
      [  135.300005] Kernel panic - not syncing: Fatal exception
      
      This is because raid5d() can be running when the multi-thread
      resources are changed via sysfs. We need to provide locking.
      
      conf->device_lock is suitable, but we cannot simply call
      alloc_thread_groups() under this lock, as we cannot allocate memory
      while holding a spinlock.
      So change alloc_thread_groups() to allocate and return the data
      structures; raid5_store_group_thread_cnt() can then take the lock
      while updating the pointers to the data structures.
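
      The pattern described above (allocate outside the lock, publish the
      pointers under it) can be sketched in userspace C. This is only an
      illustration, not the kernel patch: struct conf_sketch, alloc_groups()
      and store_group_cnt() are hypothetical names, and a C11 atomic_flag
      stands in for the spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the multi-thread worker resources. */
struct worker_groups {
    int group_cnt;
    int *workers;                 /* contents are illustrative only */
};

struct conf_sketch {
    atomic_flag device_lock;      /* toy spinlock */
    struct worker_groups *groups; /* only touched with the lock held */
};

static void conf_init(struct conf_sketch *c)
{
    atomic_flag_clear(&c->device_lock);
    c->groups = NULL;
}

static void lock(struct conf_sketch *c)
{
    while (atomic_flag_test_and_set_explicit(&c->device_lock, memory_order_acquire))
        ; /* spin */
}

static void unlock(struct conf_sketch *c)
{
    atomic_flag_clear_explicit(&c->device_lock, memory_order_release);
}

/* Allocate and return the new structures; no lock is held here. */
static struct worker_groups *alloc_groups(int cnt)
{
    struct worker_groups *g;

    assert(cnt >= 0);
    g = malloc(sizeof(*g));
    if (!g)
        return NULL;
    g->workers = calloc(cnt ? (size_t)cnt : 1, sizeof(int));
    if (!g->workers) {
        free(g);
        return NULL;
    }
    g->group_cnt = cnt;
    return g;
}

static void free_groups(struct worker_groups *g)
{
    if (g) {
        free(g->workers);
        free(g);
    }
}

/* Returns 0 on success, -1 if allocation failed (old state is kept). */
static int store_group_cnt(struct conf_sketch *c, int cnt)
{
    struct worker_groups *new_g = alloc_groups(cnt); /* outside the lock */
    struct worker_groups *old_g;

    if (!new_g)
        return -1;
    lock(c);
    old_g = c->groups;            /* only the pointer swap is locked */
    c->groups = new_g;
    unlock(c);
    free_groups(old_g);           /* freed after dropping the lock */
    return 0;
}

/* Reader side: what a raid5d-like thread would do. */
static int current_group_cnt(struct conf_sketch *c)
{
    int cnt;

    lock(c);
    cnt = c->groups ? c->groups->group_cnt : 0;
    unlock(c);
    return cnt;
}
```

      The critical section contains only the pointer swap, so nothing that
      might sleep runs while the lock is held.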
      
      This fixes a bug introduced in 3.12 and so is suitable for the 3.12.x
      stable series.
      
      Fixes: b721420e
      Cc: stable@vger.kernel.org (3.12)
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Shaohua Li <shli@kernel.org>
      60aaf933
    • majianpeng's avatar
      md/raid5: Before freeing old multi-thread worker, it should flush them. · d206dcfa
      majianpeng authored
      When changing group_thread_cnt from sysfs entry, the kernel can oops.
      
      The kernel messages are:
      [  740.961389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  740.961444] IP: [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.961476] PGD b9013067 PUD b651e067 PMD 0
      [  740.961503] Oops: 0000 [#1] SMP
      [  740.961525] Modules linked in: netconsole e1000e ptp pps_core
      [  740.961577] CPU: 0 PID: 3683 Comm: kworker/u8:5 Not tainted 3.12.0+ #23
      [  740.961602] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
      [  740.961646] task: ffff88013abe0000 ti: ffff88013a246000 task.ti: ffff88013a246000
      [  740.961673] RIP: 0010:[<ffffffff81062570>]  [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.961708] RSP: 0018:ffff88013a247e08  EFLAGS: 00010086
      [  740.961730] RAX: ffff8800b912b400 RBX: ffff88013a61e680 RCX: ffff8800b912b400
      [  740.961757] RDX: ffff8800b912b600 RSI: ffff8800b912b600 RDI: ffff88013a61e680
      [  740.961782] RBP: ffff88013a247e48 R08: ffff88013a246000 R09: 000000000002c09d
      [  740.961808] R10: 000000000000010f R11: 0000000000000000 R12: ffff88013b00cc00
      [  740.961833] R13: 0000000000000000 R14: ffff88013b00cf80 R15: ffff88013a61e6b0
      [  740.961861] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
      [  740.961893] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  740.962001] CR2: 00000000000000b8 CR3: 00000000b24fe000 CR4: 00000000000407f0
      [  740.962001] Stack:
      [  740.962001]  0000000000000008 ffff8800b912b600 ffff88013b00cc00 ffff88013a61e680
      [  740.962001]  ffff88013b00cc00 ffff88013b00cc18 ffff88013b00cf80 ffff88013a61e6b0
      [  740.962001]  ffff88013a247eb8 ffffffff810639c6 0000000000012a80 ffff88013a247fd8
      [  740.962001] Call Trace:
      [  740.962001]  [<ffffffff810639c6>] worker_thread+0x206/0x3f0
      [  740.962001]  [<ffffffff810637c0>] ? manage_workers+0x2c0/0x2c0
      [  740.962001]  [<ffffffff81069656>] kthread+0xc6/0xd0
      [  740.962001]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  740.962001]  [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
      [  740.962001]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  740.962001] Code: 89 e5 41 57 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 18 48 8b 06 4c 8b 67 48 48 89 c1 30 c9 a8 04 4c 0f 45 e9 80 7f 58 00 <49> 8b 45 08 44 8b b0 00 01 00 00 78 0c 41 f6 44 24 10 04 0f 84
      [  740.962001] RIP  [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.962001]  RSP <ffff88013a247e08>
      [  740.962001] CR2: 0000000000000008
      [  740.962001] ---[ end trace 39181460000748de ]---
      [  740.962001] Kernel panic - not syncing: Fatal exception
      
      This can happen if there are some stripes left, fewer than MAX_STRIPE_BATCH.
      A worker is queued to handle them.
      But before calling raid5_do_work, raid5d handles those
      stripes making conf->active_stripe = 0.
      So mddev_suspend() can return.
      We might then free old worker resources before the queued
      raid5_do_work() handled them.  When it runs, it crashes.
      
      	raid5d()		raid5_store_group_thread_cnt()
      	queue_work		mddev_suspend()
      				handle_strips
      				active_stripe=0
      				free(old worker resources)
      	process_one_work
      	raid5_do_work
      
      To avoid this, we should flush the worker resources before freeing them.
      
      This fixes a bug introduced in 3.12 so is suitable for the 3.12.x
      stable series.
      
      Cc: stable@vger.kernel.org (3.12)
      Fixes: b721420e
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Shaohua Li <shli@kernel.org>
      d206dcfa
    • majianpeng's avatar
      md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE. · e59aa23f
      majianpeng authored
      R5_ReadNoMerge means that this bio must not be merged with other bios
      or requests. REQ_FLUSH was used to achieve this, but REQ_NOMERGE does
      the same job.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      e59aa23f
    • Aurelien Jarno's avatar
      UAPI: include <asm/byteorder.h> in linux/raid/md_p.h · c0f8bd14
      Aurelien Jarno authored
      linux/raid/md_p.h uses conditionals that depend on endianness and fails
      with an error if none of __BIG_ENDIAN, __LITTLE_ENDIAN or
      __BYTE_ORDER is defined, but it doesn't include any header which can
      define these constants. This makes the header unusable on its own.
      
      This patch adds a #include <asm/byteorder.h> at the beginning of this
      header to make it usable alone. This is needed to compile klibc on MIPS.
      Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
      Signed-off-by: NeilBrown <neilb@suse.de>
      c0f8bd14
    • majianpeng's avatar
      raid1: Rewrite the implementation of iobarrier. · 79ef3a8a
      majianpeng authored
      There is an iobarrier in raid1 because of contention between normal IO and
      resync IO.  It suspends all normal IO when resync/recovery happens.
      
      However, if normal IO is outside the resync window, there is no contention.
      So this patch changes the barrier mechanism to only block IO that
      could contend with the resync that is currently happening.
      
      We partition the whole space into five parts.
      |---------|-----------|------------|----------------|-------|
              start   next_resync   start_next_window    end_window
      
      start + RESYNC_WINDOW = next_resync
      next_resync + NEXT_NORMALIO_DISTANCE = start_next_window
      start_next_window + NEXT_NORMALIO_DISTANCE = end_window
      
      Firstly we introduce some concepts:
      
      1 - RESYNC_WINDOW: For resync, there are 32 resync requests at most at the
            same time. A sync request is RESYNC_BLOCK_SIZE(64*1024).
            So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB.
      2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync
            and start_next_window.  It also indicates the distance between
            start_next_window and end_window.
            It is currently 3 * RESYNC_WINDOW but could be tuned if
            this turned out not to be optimal.
      3 - next_resync: the next sector at which we will do sync IO.
      4 - start: a position which is at most RESYNC_WINDOW before
            next_resync.
      5 - start_next_window:  a position which is NEXT_NORMALIO_DISTANCE
            beyond next_resync.  Normal-io after this position doesn't need to
            wait for resync-io to complete.
      6 - end_window:  a position which is 2 * NEXT_NORMALIO_DISTANCE beyond
            next_resync.  This also doesn't need to wait, but is counted
            differently.
      7 - current_window_requests:  the count of normalIO between
            start_next_window and end_window.
      8 - next_window_requests: the count of normalIO after end_window.
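
      The arithmetic behind these positions can be checked with a small
      sketch. The sizes are the ones quoted above; the struct, the helper
      layout_for() and the name RESYNC_DEPTH are chosen here for illustration,
      and offsets are in bytes for readability. "start" is placed exactly
      RESYNC_WINDOW before next_resync, which is the largest distance the
      text allows.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes quoted in the commit message. */
#define RESYNC_BLOCK_SIZE      (64 * 1024)  /* bytes per sync request */
#define RESYNC_DEPTH           32           /* max sync requests in flight */
#define RESYNC_WINDOW          (RESYNC_BLOCK_SIZE * RESYNC_DEPTH)
#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW)

/* Hypothetical helper type: byte offsets of the four boundaries. */
struct resync_layout {
    uint64_t start;
    uint64_t next_resync;
    uint64_t start_next_window;
    uint64_t end_window;
};

/* Derive all boundaries from the next position to be resynced. */
static struct resync_layout layout_for(uint64_t next_resync)
{
    struct resync_layout l;

    assert(next_resync >= RESYNC_WINDOW);
    l.next_resync = next_resync;
    l.start = next_resync - RESYNC_WINDOW;
    l.start_next_window = next_resync + NEXT_NORMALIO_DISTANCE;
    l.end_window = l.start_next_window + NEXT_NORMALIO_DISTANCE;
    return l;
}
```

      With next_resync at 10 MiB, this gives start at 8 MiB,
      start_next_window at 16 MiB and end_window at 22 MiB.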
      
      NormalIO will be partitioned into four types:
      
      NormIO1:  the end sector of the bio is smaller than or equal to start
      NormIO2:  the start sector of the bio is larger than or equal to end_window
      NormIO3:  the start sector of the bio is larger than or equal to
                start_next_window (and smaller than end_window)
      NormIO4:  the bio lies between start and start_next_window
      
      |--------|-----------|--------------------|----------------|-------------|
          | start   |   next_resync   |  start_next_window   |  end_window |
       NormIO1   NormIO4            NormIO4                NormIO3      NormIO2
      
      For NormIO1, we don't need any io barrier.
      For NormIO4, we use a similar approach to the original iobarrier
          mechanism.  The normalIO and resyncIO must be kept separate.
      For NormIO2/3, we add two fields to struct r1conf: "current_window_requests"
          and "next_window_requests". They indicate the count of active
          requests in the two windows.
          For these, we don't wait for resync io to complete.
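
      As a rough sketch of the classification, following the layout diagram
      above: the enum, struct windows and classify() are hypothetical names
      for illustration and are not taken from the patch. A bio is modelled
      as the half-open sector range [bio_start, bio_end).

```c
#include <assert.h>
#include <stdint.h>

enum normio_type { NORMIO1 = 1, NORMIO2, NORMIO3, NORMIO4 };

/* Region boundaries; field names follow the commit message. */
struct windows {
    uint64_t start;
    uint64_t start_next_window;
    uint64_t end_window;
};

/* Classify a bio covering [bio_start, bio_end). */
static enum normio_type classify(const struct windows *w,
                                 uint64_t bio_start, uint64_t bio_end)
{
    assert(bio_start < bio_end);
    if (bio_end <= w->start)
        return NORMIO1;           /* entirely behind the resync window */
    if (bio_start >= w->end_window)
        return NORMIO2;           /* counted in next_window_requests */
    if (bio_start >= w->start_next_window)
        return NORMIO3;           /* counted in current_window_requests */
    return NORMIO4;               /* may contend with resync IO */
}

/* Only NormIO4 has to be kept apart from resync IO. */
static int must_wait_for_resync(enum normio_type t)
{
    return t == NORMIO4;
}
```

      The order of the tests matters: each later test assumes that the
      earlier ones did not match.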
      
      For the resync action, if there are NormIO4s, we must wait for them.
      If not, we can proceed.
      But if the resync action reaches start_next_window and
      current_window_requests > 0 (that is, there are NormIO3s), we must
      wait until current_window_requests becomes zero.
      When current_window_requests becomes zero, start_next_window also
      moves forward. Then current_window_requests will be replaced by
      next_window_requests.
      
      One problem is when and how to change from NormIO2 to
      NormIO3.  Only then can the sync action progress.
      
      We add a field in struct r1conf "start_next_window".
      
      A: if start_next_window == MaxSector, it means there are no NormIO2/3.
         So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
      B: if current_window_requests == 0 && next_window_requests != 0, it
         means start_next_window moves to end_window
      
      Another problem is how to differentiate between an
      old NormIO2 (which is now NormIO3) and a NormIO2.
      For example, there are many bios which are NormIO2 and a bio which is
      NormIO3. The NormIO3 completes first, so the NormIO2 bios become NormIO3.
      
      We add a field in struct r1bio "start_next_window".
      This is used to record the position conf->start_next_window when the call
      to wait_barrier() is made in make_request().
      
      In allow_barrier(), we check conf->start_next_window.
      If r1bio->start_next_window == conf->start_next_window, it means
      there was no transition between NormIO2 and NormIO3.
      If r1bio->start_next_window != conf->start_next_window, it means
      there was a transition between NormIO2 and NormIO3.  There can only
      have been one transition, so the bio must be an old NormIO2.
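
      The completion-side bookkeeping described in this message might look
      roughly like the sketch below. It is written from the description
      only, not copied from the patch: struct conf_sketch and complete_io()
      are hypothetical, NO_WINDOW stands in for MaxSector, and the distance
      uses small illustrative units.

```c
#include <assert.h>
#include <stdint.h>

#define NEXT_NORMALIO_DISTANCE 6u        /* illustrative units only */
#define NO_WINDOW              UINT64_MAX /* stands in for MaxSector */

struct conf_sketch {
    uint64_t start_next_window;
    int current_window_requests;
    int next_window_requests;
};

/*
 * Account for a completed NormIO2/3 request. recorded_snw is the value
 * of conf->start_next_window that was saved when the request was admitted.
 */
static void complete_io(struct conf_sketch *c, uint64_t recorded_snw,
                        uint64_t bio_start)
{
    if (recorded_snw == c->start_next_window &&
        bio_start >= c->start_next_window + NEXT_NORMALIO_DISTANCE) {
        /* No transition, and the bio is beyond end_window: NormIO2. */
        assert(c->next_window_requests > 0);
        c->next_window_requests--;
    } else {
        /* A NormIO3, or an old NormIO2 that has become NormIO3. */
        assert(c->current_window_requests > 0);
        c->current_window_requests--;
    }

    if (c->current_window_requests == 0) {
        if (c->next_window_requests > 0) {
            /* Case B: the window moves forward to end_window. */
            c->current_window_requests = c->next_window_requests;
            c->next_window_requests = 0;
            c->start_next_window += NEXT_NORMALIO_DISTANCE;
        } else {
            /* Case A: no NormIO2/3 requests remain. */
            c->start_next_window = NO_WINDOW;
        }
    }
}
```

      A request admitted as NormIO2 before the window moved carries the old
      start_next_window value, so the mismatch sends it down the
      current_window_requests branch.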
      
      For one bio, there may be many r1bios. So we make sure
      all the r1bio->start_next_window values are the same.
      If we meet a blocked_dev in make_request(), it must call allow_barrier()
      and wait_barrier(), so the earlier and the later values of
      conf->start_next_window may differ.
      If there are many r1bios with different start_next_window values,
      the accounting for the relevant bio depends on the last r1bio's value,
      which causes errors. To avoid this, we must wait for previous r1bios
      to complete.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      79ef3a8a
    • majianpeng's avatar
      raid1: Add some macros to make code clearly. · 8e005f7c
      majianpeng authored
      In a subsequent patch, we'll use some constant parameters.
      Using macros will make the code clearer.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      8e005f7c
    • majianpeng's avatar
      raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array... · 07169fd4
      majianpeng authored
      raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
      
      We used to use raise_barrier to suspend normal IO while we reconfigure
      the array.  However raise_barrier will soon only suspend some normal
      IO, not all.  So we need something else.
      Change it to use freeze_array.
      But freeze_array not only suspends normal io, it also suspends
      resync io.
      For the places that call raise_barrier() for reconfiguration, this
      isn't a problem.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      07169fd4
    • majianpeng's avatar
      raid1: Add a field array_frozen to indicate whether raid in freeze state. · b364e3d0
      majianpeng authored
      Because the following patch will rewrite the handling of contention
      between normal IO and resync IO, we use a field to indicate whether the
      array is in the frozen state.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      b364e3d0
    • Joe Perches's avatar
      md: Convert use of typedef ctl_table to struct ctl_table · 82592c38
      Joe Perches authored
      This typedef is unnecessary and should just be removed.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      82592c38
    • NeilBrown's avatar
      md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes. · 30b8feb7
      NeilBrown authored
      When raid5 recovery hits a fresh badblock, this badblock will be flagged
      as an unacknowledged (unack) badblock until md_update_sb() is called.
      But md_stop will take the reconfig lock, which means raid5d can't call
      md_update_sb() in md_check_recovery(). The badblock will always
      be unack, so the raid5d thread enters an infinite loop and
      md_stop_writes() can never stop sync_thread. This causes a deadlock.
      
      To solve this, when the STOP_ARRAY ioctl is issued and sync_thread is
      running, we need to set the FROZEN and INTR flags in md->recovery and
      wait for sync_thread to stop before we (re)take the reconfig lock.
      
      This requires that raid5 reshape_request notices MD_RECOVERY_INTR
      (which it probably should have noticed anyway) and stops waiting for a
      metadata update in that case.
      Reported-by: Jianpeng Ma <majianpeng@gmail.com>
      Reported-by: Bian Yu <bianyu@kedacom.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      30b8feb7
    • NeilBrown's avatar
      md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread. · c91abf5a
      NeilBrown authored
      We currently use kthread_should_stop() in various places in the
      sync/reshape code to abort early.
      However some places set MD_RECOVERY_INTR but don't immediately call
      md_reap_sync_thread() (and we will shortly get another one).
      When this happens we are relying on md_check_recovery() to reap the
      thread and that only happens when it finishes normally.
      So MD_RECOVERY_INTR must lead to a normal finish without the
      kthread_should_stop() test.
      
      So replace all relevant tests, and be more careful when the thread is
      interrupted not to acknowledge that latest step in a reshape as it may
      not be fully committed yet.
      
      Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
      so we don't have to wait for the speed to drop before we can abort.
      Signed-off-by: NeilBrown <neilb@suse.de>
      c91abf5a
    • NeilBrown's avatar
      md: fix some places where mddev_lock return value is not checked. · 29f097c4
      NeilBrown authored
      Sometimes we need to lock an mddev and cannot cope with
      failure due to interrupt.
      In these cases we should use mutex_lock, not mutex_lock_interruptible.
      Signed-off-by: NeilBrown <neilb@suse.de>
      29f097c4
    • Bian Yu's avatar
      raid5: Retry R5_ReadNoMerge flag when hit a read error. · edfa1f65
      Bian Yu authored
      Because of block layer merging, the failure of one bio causes the other
      bios that belong to the same request to fail too, so
      raid5_end_read_request will record all these bios as badblocks.
      If we retry the request with the R5_ReadNoMerge flag to avoid merging
      bios, badblocks records only the sectors that are actually bad.
      
      test:
      hdparm --yes-i-know-what-i-am-doing --make-bad-sector 300000 /dev/sdb
      mdadm -C /dev/md0 -l5 -n3 /dev/sd[bcd] --assume-clean
      mdadm /dev/md0 -f /dev/sdd
      mdadm /dev/md0 -r /dev/sdd
      mdadm --zero-superblock /dev/sdd
      mdadm /dev/md0 -a /dev/sdd
      
      1. Without this patch:
      cat /sys/block/md0/md/rd*/bad_blocks
      299776 256
      299776 256
      
      2. With this patch:
      cat /sys/block/md0/md/rd*/bad_blocks
      300000 8
      300000 8
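
      Reading each bad_blocks entry as "first sector, length in sectors",
      the two results can be compared with a little arithmetic. The helpers
      below are hypothetical and only check that both reported ranges cover
      the injected bad sector 300000; they make no claim about why the
      larger range has the size it has.

```c
#include <assert.h>
#include <stdint.h>

/* Round sector down to a multiple of len. */
static uint64_t range_start(uint64_t sector, uint64_t len)
{
    assert(len > 0);
    return sector - (sector % len);
}

/* True if sector lies in [first, first + len). */
static int range_contains(uint64_t first, uint64_t len, uint64_t sector)
{
    return sector >= first && sector - first < len;
}
```

      Without the patch the recorded range is 256 sectors (128 KiB with
      512-byte sectors) starting at 299776, which is 300000 rounded down to a
      multiple of 256. With the patch it is 8 sectors (4 KiB) starting at
      the bad sector itself.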
      Signed-off-by: Bian Yu <bianyu@kedacom.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      edfa1f65
    • Shaohua Li's avatar
      raid5: relieve lock contention in get_active_stripe() · 4bda556a
      Shaohua Li authored
      Track the empty inactive list count, so md_raid5_congested() can use it
      to make a decision.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      4bda556a
  2. 14 Nov, 2013 5 commits
    • Shaohua Li's avatar
      raid5: relieve lock contention in get_active_stripe() · 566c09c5
      Shaohua Li authored
      get_active_stripe() is the last place we have lock contention. It has two
      paths. In the first, the stripe isn't found and a new stripe is allocated;
      in the second, the stripe is found.
      
      The first path basically calls __find_stripe and init_stripe. It accesses
      conf->generation, conf->previous_raid_disks, conf->raid_disks,
      conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
      conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
      for stripe_hashtbl and inactive_list, these fields change very rarely.
      
      With this patch, we split inactive_list and add new hash locks. Each free
      stripe belongs to a specific inactive list, which is determined by the
      stripe's lock_hash. Note that even a stripe without a sector assigned has
      a lock_hash assigned. A stripe's inactive list is protected by a hash lock,
      which is also determined by its lock_hash. The lock_hash is derived from the
      current stripe_hashtbl hash, which guarantees that any stripe_hashtbl list
      will be assigned to a specific lock_hash, so we can use the new hash lock to
      protect the stripe_hashtbl list too. The goal of the new hash locks is that
      only the new locks are needed in the first path of get_active_stripe().
      Since we have several hash locks, lock contention is relieved significantly.
      
      The first path of get_active_stripe() accesses other fields too. Since they
      change rarely, changing them now requires taking conf->device_lock and all
      hash locks. For a slow path, this isn't a problem.
      
      If we need to take both device_lock and a hash lock, we always take the hash
      lock first. The tricky part is release_stripe and friends, which need to take
      device_lock first. Neil's suggestion is to put inactive stripes on a temporary
      list and re-add them to inactive_list after device_lock is released. In this
      way, we add stripes to the temporary list with device_lock held and remove
      stripes from the list with the hash lock held. So we don't allow concurrent
      access to the temporary list, which means we need to allocate a temporary
      list for every participant of release_stripe.
      
      One downside is that free stripes are maintained in their own inactive list
      and can't move between the lists. By default, we have 256 stripes in total
      and 8 lists, so each list will have 32 stripes. It's possible that one list
      has a free stripe while another list has none. This should be rare because
      stripe allocation is evenly distributed. And we can always allocate more
      stripes for the cache; several megabytes of memory isn't a big deal.
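
      One way to picture how a lock index can be derived from the hash table
      index is the sketch below. Only the 8 lists and 256 stripes come from
      the text; STRIPE_SHIFT, NR_HASH and the helper names are assumptions
      made for illustration. The lock index uses the low bits of the same
      value that selects the hash bucket, so every bucket maps to exactly one
      lock.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed, illustrative constants (not taken from the commit message). */
#define STRIPE_SHIFT    3                 /* 4 KiB stripes, 512-byte sectors */
#define NR_HASH         512               /* hash table size, a power of two */
#define HASH_MASK       (NR_HASH - 1)

/* Values quoted in the commit message. */
#define NR_HASH_LOCKS   8
#define NR_STRIPES      256
#define HASH_LOCKS_MASK (NR_HASH_LOCKS - 1)

/* Bucket in the stripe hash table for a sector. */
static unsigned bucket_of(uint64_t sector)
{
    return (unsigned)((sector >> STRIPE_SHIFT) & HASH_MASK);
}

/* Lock (and inactive list) index derived from the same hash value. */
static unsigned lock_hash_of(uint64_t sector)
{
    return (unsigned)((sector >> STRIPE_SHIFT) & HASH_LOCKS_MASK);
}

/* The lock mask is a subset of the bucket mask, so this is well defined. */
static unsigned lock_of_bucket(unsigned bucket)
{
    return bucket & HASH_LOCKS_MASK;
}
```

      Under these assumptions, two sectors that land in the same bucket
      always share a lock, and 256 stripes spread over 8 lists give 32
      stripes per list.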
      
      This completely removes the lock contention of the first path of
      get_active_stripe(). It slows down the second code path a little bit though,
      because we now need to take two locks, but since the hash lock isn't
      contended, the overhead should be quite small (several atomic instructions).
      The second path of get_active_stripe() (basically sequential writes or
      random writes with a big request size) still has lock contention.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      566c09c5
    • Shaohua Li's avatar
      wait: add wait_event_cmd() · 82e06c81
      Shaohua Li authored
      Add a new API wait_event_cmd(). It's a variant of wait_event() with two
      commands executed. One is executed before sleep, another after sleep.
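
      The control flow can be illustrated with a single-threaded toy
      analogue. This is not the kernel macro: toy_wait_event_cmd(), struct
      toy_state and toy_sleep() are hypothetical, and "sleeping" is modelled
      by a plain function call. A typical use of such an interface is
      dropping a lock before sleeping and retaking it after waking.

```c
#include <assert.h>

/* Loop until condition holds, running cmd1 before and cmd2 after each sleep. */
#define toy_wait_event_cmd(condition, cmd1, cmd2, sleep_step) \
    do {                                                      \
        while (!(condition)) {                                \
            cmd1;                                             \
            sleep_step;                                       \
            cmd2;                                             \
        }                                                     \
    } while (0)

struct toy_state {
    int locked;        /* 1 while the caller holds the "lock" */
    int free_stripes;  /* the resource being waited for */
    int sleeps;        /* how many times we slept */
};

/* Stand-in for sleeping; another thread frees a stripe on the second sleep. */
static void toy_sleep(struct toy_state *s)
{
    assert(!s->locked);          /* the lock must not be held while asleep */
    s->sleeps++;
    if (s->sleeps >= 2)
        s->free_stripes++;
}

/* Wait for a free stripe; returns the number of sleeps that were needed. */
static int wait_for_free_stripe(struct toy_state *s)
{
    assert(s->locked);
    toy_wait_event_cmd(s->free_stripes > 0,
                       s->locked = 0,   /* cmd1: drop the lock */
                       s->locked = 1,   /* cmd2: retake the lock */
                       toy_sleep(s));
    return s->sleeps;
}
```

      The condition is always re-checked with the lock held, because cmd2
      runs before the loop tests the condition again.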
      
      Modified to match the wait.h approach, based on a suggestion by
      Peter Zijlstra <peterz@infradead.org> - neilb
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      82e06c81
    • NeilBrown's avatar
      md/raid5.c: add proper locking to error path of raid5_start_reshape. · ba8805b9
      NeilBrown authored
      If raid5_start_reshape errors out, we need to reset all the fields
      that were updated (not just some), and need to use the seq_counter
      to ensure make_request() doesn't use an inconsistent state.
      Signed-off-by: NeilBrown <neilb@suse.de>
      ba8805b9
    • NeilBrown's avatar
      md: fix calculation of stacking limits on level change. · 02e5f5c0
      NeilBrown authored
      The various ->run routines of md personalities assume that the 'queue'
      has been initialised by the blk_set_stacking_limits() call in
      md_alloc().
      
      However when the level is changed (by level_store()) the ->run routine
      for the new level is called for an array which has already had the
      stacking limits modified.  This can result in incorrect final
      settings.
      
      So call blk_set_stacking_limits() before ->run in level_store().
      
      A specific consequence of this bug is that it causes
      discard_granularity to be set incorrectly when reshaping a RAID4 to a
      RAID0.
      
      This is suitable for any -stable kernel since 3.3 in which
      blk_set_stacking_limits() was introduced.
      
      Cc: stable@vger.kernel.org (3.3+)
      Reported-and-tested-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      02e5f5c0
    • majianpeng's avatar
      raid5: Use slow_path to release stripe when mddev->thread is null · ad4068de
      majianpeng authored
      When release_stripe() is called in grow_one_stripe(),
      mddev->thread is NULL, so the wakeup of this thread that would
      release the stripe is skipped.
      In this case, use the slow path to release the stripe.
      
      Bug was introduced in 3.12
      
      Cc: stable@vger.kernel.org (3.12+)
      Fixes: 773ca82f
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      ad4068de
  3. 12 Nov, 2013 5 commits
    • Linus Torvalds's avatar
      Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 39cf275a
      Linus Torvalds authored
      Pull scheduler changes from Ingo Molnar:
       "The main changes in this cycle are:
      
         - (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik
           van Riel, Peter Zijlstra et al.  Yay!
      
         - optimize preemption counter handling: merge the NEED_RESCHED flag
           into the preempt_count variable, by Peter Zijlstra.
      
         - wait.h fixes and code reorganization from Peter Zijlstra
      
         - cfs_bandwidth fixes from Ben Segall
      
         - SMP load-balancer cleanups from Peter Zijlstra
      
         - idle balancer improvements from Jason Low
      
         - other fixes and cleanups"
      
      * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
        ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED
        stop_machine: Fix race between stop_two_cpus() and stop_cpus()
        sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus
        sched: Fix asymmetric scheduling for POWER7
        sched: Move completion code from core.c to completion.c
        sched: Move wait code from core.c to wait.c
        sched: Move wait.c into kernel/sched/
        sched/wait: Fix __wait_event_interruptible_lock_irq_timeout()
        sched: Avoid throttle_cfs_rq() racing with period_timer stopping
        sched: Guarantee new group-entities always have weight
        sched: Fix hrtimer_cancel()/rq->lock deadlock
        sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
        sched: Fix race on toggling cfs_bandwidth_used
        sched: Remove extra put_online_cpus() inside sched_setaffinity()
        sched/rt: Fix task_tick_rt() comment
        sched/wait: Fix build breakage
        sched/wait: Introduce prepare_to_wait_event()
        sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too
        sched: Remove get_online_cpus() usage
        sched: Fix race in migrate_swap_stop()
        ...
      39cf275a
    • Linus Torvalds's avatar
      Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ad5d6989
      Linus Torvalds authored
      Pull perf updates from Ingo Molnar:
       "As a first remark I'd like to note that the way to build perf tooling
        has been simplified and sped up, in the future it should be enough for
        you to build perf via:
      
              cd tools/perf/
              make install
      
        (ie without the -j option.) The build system will figure out the
        number of CPUs and will do a parallel build+install.
      
        The various build system inefficiencies and breakages Linus reported
        against the v3.12 pull request should now be resolved - please
        (re-)report any remaining annoyances or bugs.
      
        Main changes on the perf kernel side:
      
         * Performance optimizations:
            . perf ring-buffer code optimizations,          by Peter Zijlstra
            . perf ring-buffer code optimizations,          by Oleg Nesterov
            . x86 NMI call-stack processing optimizations,  by Peter Zijlstra
            . perf context-switch optimizations,            by Peter Zijlstra
            . perf sampling speedups,                       by Peter Zijlstra
            . x86 Intel PEBS processing speedups,           by Peter Zijlstra
      
         * Enhanced hardware support:
            . for Intel Ivy Bridge-EP uncore PMUs,          by Zheng Yan
            . for Haswell transactions,                     by Andi Kleen, Peter Zijlstra
      
         * Core perf events code enhancements and fixes by Oleg Nesterov:
            . for uprobes, if fork() is called with pending ret-probes
            . for uprobes platform support code
      
         * New ABI details by Andi Kleen:
            . Report x86 Haswell TSX transaction abort cost as weight
      
        Main changes on the perf tooling side (some of these tooling changes
        utilize the above kernel side changes):
      
         * 'perf report/top' enhancements:
      
            . Convert callchain children list to rbtree, greatly reducing the
              time taken for callchain processing, from Namhyung Kim.
      
            . Add new COMM infrastructure, further improving histogram
              processing, from Frédéric Weisbecker, one fix from Namhyung Kim.
      
            . Add /proc/kcore based live-annotation improvements, including
              build-id cache support, multi map 'call' instruction navigation
              fixes, kcore address validation, objdump workarounds.  From
              Adrian Hunter.
      
            . Show progress on histogram collapsing, that can take a long
              time, from Namhyung Kim.
      
            . Add --max-stack option to limit callchain stack scan in 'top'
              and 'report', improving callchain processing when reducing the
              stack depth is an option, from Waiman Long.
      
            . Add new option --ignore-vmlinux for perf top, from Willy
              Tarreau.
      
         * 'perf trace' enhancements:
      
            . 'perf trace' can now use a 'perf probe' dynamic tracepoint
              to hook into the userspace -> kernel pathname copy so that it
              can map fds to pathnames without reading /proc/pid/fd/ symlinks.
              From Arnaldo Carvalho de Melo.
      
            . Show VFS path associated with fd in live sessions, using a
              'vfs_getname' 'perf probe' created dynamic tracepoint or by
              looking at /proc/pid/fd, from Arnaldo Carvalho de Melo.
      
            . Add 'trace' beautifiers for lots of syscall arguments, from
              Arnaldo Carvalho de Melo.
      
            . Implement more compact 'trace' output by suppressing zeroed
              args, from Arnaldo Carvalho de Melo.
      
            . Show thread COMM by default in 'trace', from Arnaldo Carvalho de
              Melo.
      
            . Add option to show full timestamp in 'trace', from David Ahern.
      
            . Add 'record' command in 'trace', to record raw_syscalls:*, from
              David Ahern.
      
            . Add summary option to dump syscall statistics in 'trace', from
              David Ahern.
      
            . Improve error messages in 'trace', providing hints about system
              configuration steps needed for using it, from Ramkumar
              Ramachandra.
      
            . 'perf trace' now emits hints as to why tracing is not possible,
              helping the user to set up the system to allow tracing in the
              desired permission granularity, telling if the problem is due to
              debugfs not being mounted, insufficient permissions for
              !root, the /proc/sys/kernel/perf_event_paranoid value, etc.  From
              Arnaldo Carvalho de Melo.
      
         * 'perf record' enhancements:
      
            . Check maximum frequency rate for record/top, emitting better
              error messages, from Jiri Olsa.
      
            . 'perf record' code cleanups, from David Ahern.
      
            . Improve write_output error message in 'perf record', from Adrian
              Hunter.
      
            . Allow specifying B/K/M/G unit to the --mmap-pages arguments,
              from Jiri Olsa.
      
            . Fix command line callchain attribute tests to handle the new
              -g/--call-chain semantics, from Arnaldo Carvalho de Melo.
      
         * 'perf kvm' enhancements:
      
            . Disable live kvm command if timerfd is not supported, from David
              Ahern.
      
            . Fix detection of non-core features, from David Ahern.
      
         * 'perf list' enhancements:
      
            . Add usage to 'perf list', from David Ahern.
      
            . Show error in 'perf list' if tracepoints not available, from
              Pekka Enberg.
      
         * 'perf probe' enhancements:
      
            . Support "$vars" meta argument syntax for local variables,
              allowing asking for all possible variables at a given probe
              point to be collected when it hits, from Masami Hiramatsu.
      
         * 'perf sched' enhancements:
      
            . Address the root cause of that 'perf sched' stack initialization
              build slowdown, by programmatically setting a big array after
              moving the global variable back to the stack.  Fix from Adrian
              Hunter.
      
         * 'perf script' enhancements:
      
            . Set up output options for in-stream attributes, from Adrian
              Hunter.
      
            . Print addr by default for BTS in 'perf script', from Adrian
              Hunter.
      
         * 'perf stat' enhancements:
      
            . Improved messages when doing profiling in all or a subset of
              CPUs using a workload as the session delimiter, as in:
      
               'perf stat --cpu 0,2 sleep 10s'
      
              from Arnaldo Carvalho de Melo.
      
            . Add units to nanosec-based counters in 'perf stat', from David
              Ahern.
      
            . Remove bogus info when using 'perf stat' -e cycles/instructions,
              from Ramkumar Ramachandra.
      
         * 'perf lock' enhancements:
      
            . 'perf lock' fixes and cleanups, from Davidlohr Bueso.
      
         * 'perf test' enhancements:
      
            . Fixup PERF_SAMPLE_TRANSACTION handling in sample synthesizing
              and 'perf test', from Adrian Hunter.
      
            . Clarify the "sample parsing" test entry, from Arnaldo Carvalho
              de Melo.
      
            . Consider PERF_SAMPLE_TRANSACTION in the "sample parsing" test,
              from Arnaldo Carvalho de Melo.
      
            . Memory leak fixes in 'perf test', from Felipe Pena.
      
         * 'perf bench' enhancements:
      
            . Change the procps visible command-name of individual benchmark
              tests plus cleanups, from Ingo Molnar.
      
         * Generic perf tooling infrastructure/plumbing changes:
      
            . Separating data file properties from session, code
              reorganization from Jiri Olsa.
      
            . Fix version when building out of tree, as when using one of
              these:
      
              $ make help | grep perf
                perf-tar-src-pkg    - Build perf-3.12.0.tar source tarball
                perf-targz-src-pkg  - Build perf-3.12.0.tar.gz source tarball
                perf-tarbz2-src-pkg - Build perf-3.12.0.tar.bz2 source tarball
                perf-tarxz-src-pkg  - Build perf-3.12.0.tar.xz source tarball
              $
      
              from David Ahern.
      
            . Enhance option parse error message, showing just the help lines
              of the options affected, from Namhyung Kim.
      
            . libtraceevent updates from upstream trace-cmd repo, from Steven
              Rostedt.
      
            . Always use perf_evsel__set_sample_bit to set sample_type, from
              Adrian Hunter.
      
            . Memory and mmap leak fixes from Chenggang Qin.
      
            . Assorted build fixes from David Ahern and Jiri Olsa.
      
            . Speed up and prettify the build system, from Ingo Molnar.
      
            . Implement addr2line directly using libbfd, from Roberto Vitillo.
      
            . Move the GTK support into a separate libperf-gtk.so DSO that
              is only loaded when --gtk is specified, from Namhyung Kim.
      
            . perf bash completion fixes and improvements from Ramkumar
              Ramachandra.
      
            . Support for Openembedded/Yocto -dbg packages, from Ricardo
              Ribalda Delgado.
      
        And lots and lots of other fixes and code reorganizations that did not
        make it into the list, see the shortlog, diffstat and the Git log for
        details!"
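
The B/K/M/G unit support for --mmap-pages mentioned in the 'perf record' list above can be pictured with a small sketch. This is not perf's actual parser: the function name `mmap_pages`, the 4096-byte default page size and the round-up-to-whole-pages rule are assumptions made purely for illustration, and perf's own rounding rules may differ.

```python
# Hypothetical illustration only; not taken from the perf source tree.
UNITS = {"B": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def mmap_pages(arg: str, page_size: int = 4096) -> int:
    """Turn an --mmap-pages style argument into a page count.

    A bare number is taken as a page count. A number followed by
    B, K, M or G is taken as a size in bytes and rounded up to whole
    pages (assumed rounding rule).
    """
    arg = arg.strip()
    suffix = arg[-1:].upper()
    if suffix in UNITS:
        nbytes = int(arg[:-1]) * UNITS[suffix]
        return -(-nbytes // page_size)  # ceiling division
    return int(arg)
```

Under these assumptions `mmap_pages("512K")` yields 128 pages and `mmap_pages("16")` is passed through as 16 pages.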
      
      * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (300 commits)
        uprobes: Fix the memory out of bound overwrite in copy_insn()
        uprobes: Fix the wrong usage of current->utask in uprobe_copy_process()
        perf tools: Remove unneeded include
        perf record: Remove post_processing_offset variable
        perf record: Remove advance_output function
        perf record: Refactor feature handling into a separate function
        perf trace: Don't relookup fields by name in each sample
        perf tools: Fix version when building out of tree
        perf evsel: Ditch evsel->handler.data field
        uprobes: Export write_opcode() as uprobe_write_opcode()
        uprobes: Introduce arch_uprobe->ixol
        uprobes: Kill module_init() and module_exit()
        uprobes: Move function declarations out of arch
        perf/x86/intel: Add Ivy Bridge-EP uncore IRP box support
        perf/x86/intel/uncore: Add filter support for IvyBridge-EP QPI boxes
        perf: Factor out strncpy() in perf_event_mmap_event()
        tools/perf: Add required memory barriers
        perf: Fix arch_perf_out_copy_user default
        perf: Update a stale comment
        perf: Optimize perf_output_begin() -- address calculation
        ...
      ad5d6989
    • Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ef1417a5
      Linus Torvalds authored
      Pull leftover IRQ fixes from Ingo Molnar:
       "Two (minor) fixlets that missed v3.12"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        genirq: Set the irq thread policy without checking CAP_SYS_NICE
        irq: DocBook/genericirq.tmpl: Correct various typos
      ef1417a5
    • Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1006fae3
      Linus Torvalds authored
      Pull IRQ changes from Ingo Molnar:
       "The biggest changes this cycle are the softirq/hardirq stack
        interaction and nesting fixes, cleanups and reorganizations from
        Frederic.  This is the longer followup story to the softirq nesting
        fix that is already upstream (commit ded79754: "irq: Force hardirq
        exit's softirq processing on its own stack")"
      
      * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip: bcm2835: Convert to use IRQCHIP_DECLARE macro
        powerpc: Tell about irq stack coverage
        x86: Tell about irq stack coverage
        irq: Optimize softirq stack selection in irq exit
        irq: Justify the various softirq stack choices
        irq: Improve a bit softirq debugging
        irq: Optimize call to softirq on hardirq exit
        irq: Consolidate do_softirq() arch overriden implementations
        x86/irq: Correct comment about i8259 initialization
      1006fae3
    • Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 70fdcb83
      Linus Torvalds authored
      Pull RCU updates from Ingo Molnar:
       "The main RCU changes in this cycle are:
      
         - Idle entry/exit changes, to throttle callback execution and other
           refinements to speed up kbuild, primarily to address performance
           issues located by Tibor Billes.
      
         - Grace-period related changes, primarily to aid in debugging,
           inspired by an -rt debugging session.
      
         - Code reorganization moving RCU's source files into its own
           kernel/rcu/ directory.
      
         - RCU documentation updates
      
         - Miscellaneous fixes.
      
        Note, the following commit:
      
          5c889690 mm: Place preemption point in do_mlockall() loop
      
        is identical to the commit already in your tree via email:
      
          22356f44 mm: Place preemption point in do_mlockall() loop
      
        [ Your version of the changelog nicely demonstrates how kernel oops
          messages should be trimmed properly :-/ ]"
      
      * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
        rcu: Move RCU-related source code to kernel/rcu directory
        rcu: Fix occurrence of "the the" in checklist.txt
        kthread: Add pointer to vmstat-avoidance patch
        rcu: Update stall-warning documentation
        rcu: Consistent rcu_is_watching() naming
        rcu: Change EXPORT_SYMBOL() to EXPORT_SYMBOL_GPL()
        rcu: Is it safe to enter an RCU read-side critical section?
        rcu: Throttle invoke_rcu_core() invocations due to non-lazy callbacks
        rcu: Throttle rcu_try_advance_all_cbs() execution
        rcu: Remove redundant code from rcu_cleanup_after_idle()
        rcu: Fix CONFIG_RCU_NOCB_CPU_ALL panic on machines with sparse CPU mask
        rcu: Avoid sparse warnings in rcu_nocb_wake trace event
        rcu: Track rcu_nocb_kthread()'s sleeping and awakening
        rcu: Distinguish between NOCB and non-NOCB rcu_callback trace events
        rcu: Add tracing for rcuo no-CBs CPU wakeup handshake
        rcu: Add tracing of normal (non-NOCB) grace-period requests
        rcu: Add tracing to rcu_gp_kthread()
        rcu: Flag lockless access to ->gp_flags with ACCESS_ONCE()
        rcu: Prevent spurious-wakeup DoS attack on rcu_gp_kthread()
        rcu: Improve grace-period start logic
        ...
      70fdcb83
  4. 11 Nov, 2013 13 commits
    • ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED · e5137b50
      Peter Zijlstra authored
      Since the introduction of PREEMPT_NEED_RESCHED in:
      
        f27dde8d ("sched: Add NEED_RESCHED to the preempt_count")
      
      we need to be able to look at both TIF_NEED_RESCHED and
      PREEMPT_NEED_RESCHED to understand the full preemption behaviour.
      
      Add it to the trace output.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Link: http://lkml.kernel.org/r/20131004152826.GP3081@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e5137b50
    • stop_machine: Fix race between stop_two_cpus() and stop_cpus() · 7053ea1a
      Rik van Riel authored
      There is a race between stop_two_cpus, and the global stop_cpus.
      
      It is possible for two CPUs to get their stopper functions queued
      "backwards" from one another, resulting in the stopper threads
      getting stuck, and the system hanging. This can happen because
      queuing up stoppers is not synchronized.
      
      This patch adds synchronization between stop_cpus (a rare operation),
      and stop_two_cpus.
      Reported-and-Tested-by: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Link: http://lkml.kernel.org/r/20131101104146.03d1e043@annuminas.surriel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7053ea1a
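
The race described in the stop_machine commit above comes from two requesters queuing their stopper work on several CPUs without synchronization, so two CPUs can see the requests in opposite order. The sketch below models only that ordering property in user space. It is not the kernel's code or its locking primitive: `queue_lock`, `queue_on_all_cpus` and the two-CPU setup are hypothetical names and simplifications chosen for illustration.

```python
import threading

NR_CPUS = 2
# One work queue per CPU, standing in for the per-CPU stopper queues.
queues = {cpu: [] for cpu in range(NR_CPUS)}
# Hypothetical lock that serializes the queuing of multi-CPU requests.
queue_lock = threading.Lock()


def queue_on_all_cpus(request: str) -> None:
    """Queue one multi-CPU stop request on every CPU as one atomic step."""
    with queue_lock:
        for cpu in range(NR_CPUS):
            queues[cpu].append(request)


def run_requesters(names) -> None:
    """Run each requester in its own thread and wait for all of them."""
    threads = [threading.Thread(target=queue_on_all_cpus, args=(name,))
               for name in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


run_requesters(["stop_two_cpus", "stop_cpus"])
# With the lock held across the whole queuing loop, every CPU sees the
# requests in the same order, whichever requester happened to go first.
same_order = all(queues[cpu] == queues[0] for cpu in range(NR_CPUS))
```

Without the lock, one requester could queue on CPU 0 first while the other queues on CPU 1 first, which is the "backwards" ordering the commit message describes.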
    • Merge tag 'arc-v3.13-rc1-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · edae583a
      Linus Torvalds authored
      Pull ARC changes from Vineet Gupta:
       - Towards a working SMP setup (ASID allocation, TLB Flush,...)
       - Support for TRACE_IRQFLAGS, LOCKDEP
       - cacheflush backend consolidation for I/D
       - Lots of allmodconfig fixlets from Chen
       - Other improvements/fixes
      
      * tag 'arc-v3.13-rc1-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: (25 commits)
        ARC: [plat-arcfpga] defconfig update
        smp, ARC: kill SMP single function call interrupt
        ARC: [SMP] Disallow RTSC
        ARC: [SMP] Fix build failures for large NR_CPUS
        ARC: [SMP] enlarge possible NR_CPUS
        ARC: [SMP] TLB flush
        ARC: [SMP] ASID allocation
        arc: export symbol for pm_power_off in reset.c
        arc: export symbol for save_stack_trace() in stacktrace.c
        arc: remove '__init' for get_hw_config_num_irq()
        arc: remove '__init' for first_lines_of_secondary()
        arc: remove '__init' for setup_processor() and arc_init_IRQ()
        arc: kgdb: add default implementation for kgdb_roundup_cpus()
        ARC: Fix bogus gcc warning and micro-optimise TLB iteration loop
        ARC: Add support for irqflags tracing and lockdep
        ARC: Reset the value of Interrupt Priority Register
        ARC: Reduce #ifdef'ery for unaligned access emulation
        ARC: Change calling convention of do_page_fault()
        ARC: cacheflush optim - PTAG can be loop invariant if V-P is const
        ARC: cacheflush refactor #3: Unify the {d,i}cache flush leaf helpers
        ...
      edae583a
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k · 0a759b24
      Linus Torvalds authored
      Pull m68k updates from Geert Uytterhoeven:
       "Summary:
         - __put_user_unaligned may/will be used by btrfs
         - m68k part of a global cleanup"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
        m68k: Remove deprecated IRQF_DISABLED
        m68k/m68knommu: Implement __get_user_unaligned/__put_user_unaligned()
      0a759b24
    • Merge branch 'parisc-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 78d4a420
      Linus Torvalds authored
      Pull parisc update from Helge Deller:
       - a bugfix for sticon (parisc text console driver) to not crash the
         64bit kernel on machines with more than 4GB RAM
       - added kernel audit support
       - made udelay() implementation SMP-safe
       - "make install" now does not depend on vmlinux
       - added defconfigs for 32- and 64-bit kernels
      
      * 'parisc-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: add generic 32- and 64-bit defconfigs
        parisc: sticon - unbreak on 64bit kernel
        parisc: signal fixup - SIGBUS vs. SIGSEGV
        parisc: implement full version of access_ok()
        parisc: correctly display number of active CPUs
        parisc: do not count IPI calls twice
        parisc: make udelay() SMP-safe
        parisc: remove duplicate define
        parisc: make "make install" not depend on vmlinux
        parisc: add kernel audit feature
        parisc: provide macro to create exception table entries
      78d4a420
    • Merge branch 'uprobes/core' of... · caea6cf5
      Ingo Molnar authored
      Merge branch 'uprobes/core' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc into perf/core
      
      Pull uprobes fixes from Oleg Nesterov.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      caea6cf5
    • Merge tag 'dt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · f9efbce6
      Linus Torvalds authored
      Pull ARM SoC DT updates from Olof Johansson:
       "Most of this branch consists of updates, additions and general churn
        of the device tree source files in the kernel (arch/arm/boot/dts).
        Besides that, there are a few things to point out:
      
         - Lots of platform conversion on OMAP2+, with removal of old board
           files for various platforms.
         - Final conversion of a bunch of ux500 (ST-Ericsson) platforms as
           well
         - Some updates to pinctrl and other subsystems.  Most of these are
           for DT-enablement of the various platforms and acks have been
           collected"
      
      * tag 'dt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (385 commits)
        ARM: dts: bcm11351: Use GIC/IRQ defines for sdio interrupts
        ARM: dts: bcm: Add missing UARTs for bcm11351 (bcm281xx)
        ARM: dts: bcm281xx: Add card detect GPIO
        ARM: dts: rename ARCH_BCM to ARCH_BCM_MOBILE (dt)
        ARM: bcm281xx: Add device node for the GPIO controller
        ARM: mvebu: Add Netgear ReadyNAS 104 board
        ARM: tegra: fix Tegra114 IOMMU register address
        ARM: kirkwood: add support for OpenBlocks A7 platform
        ARM: dts: omap4-panda: add DPI pinmuxing
        ARM: dts: AM33xx: Add RNG node
        ARM: dts: AM33XX: Add hwspinlock node
        ARM: dts: OMAP5: Add hwspinlock node
        ARM: dts: OMAP4: Add hwspinlock node
        ARM: dts: use 'status' property for PCIe nodes
        ARM: dts: sirf: add missed address-cells and size-cells for prima2 I2C
        ARM: dts: sirf: add missed cell, cs and dma channel for SPI nodes
        ARM: dts: sirf: add missed graphics2d iobg in atlas6 dts
        ARM: dts: sirf: add missed chhifbg node in prima2 and atlas6 dts
        ARM: dts: sirf: add missed memcontrol-monitor node in prima2 and atlas6 dts
        ARM: mvebu: Add the core-divider clock to Armada 370/XP
        ...
      f9efbce6
    • Merge tag 'drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 53575aa9
      Linus Torvalds authored
      Pull ARM driver updates from Olof Johansson:
       "Updates of SoC-near drivers and other driver updates that make more
        sense to take through our tree.  In this case it involves:
      
         - Some Davinci driver updates that have required corresponding
           platform code changes (gpio mostly)
         - CCI bindings and a few driver updates
         - Marvell mvebu patches for PCI MSI support (could have gone through
           the PCI tree for this release, but they were acked by Bjorn for
           3.12 so we kept them through arm-soc).
         - Marvell dove switch-over to DT-based PCIe configuration
         - Misc updates for Samsung platform dmaengine drivers"
      
      * tag 'drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (32 commits)
        ARM: S3C24XX: add dma pdata for s3c2410, s3c2440 and s3c2442
        dmaengine: s3c24xx-dma: add support for the s3c2410 type of controller
        ARM: S3C24XX: Fix possible dma selection warning
        PCI: mvebu: make local functions static
        PCI: mvebu: add I/O access wrappers
        PCI: mvebu: Dynamically detect if the PEX link is up to enable hot plug
        ARM: mvebu: fix gated clock documentation
        ARM: dove: remove legacy pcie and clock init
        ARM: dove: switch to DT probed mbus address windows
        ARM: SAMSUNG: set s3c24xx_dma_filter for s3c64xx-spi0 device
        ARM: S3C24XX: add platform-devices for new dma driver for s3c2412 and s3c2443
        dmaengine: add driver for Samsung s3c24xx SoCs
        ARM: S3C24XX: number the dma clocks
        PCI: mvebu: add support for Marvell Dove SoCs
        PCI: mvebu: add support for reset on GPIO
        PCI: mvebu: remove subsys_initcall
        PCI: mvebu: increment nports only for registered ports
        PCI: mvebu: move clock enable before register access
        PCI: mvebu: add support for MSI
        irqchip: armada-370-xp: implement MSI support
        ...
      53575aa9
    • Merge tag 'boards-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · d5aabbca
      Linus Torvalds authored
      Pull ARM SoC board updates from Olof Johansson:
       "Board-related updates.  This branch is getting smaller and smaller,
        which is the whole idea so that's reassuring.
      
        Right now by far most of the code is related to shmobile updates, and
        they are now switching over to removal of board code and migration to
        multiplatform, so we'll see their board code base shrink in the near
        future too, I hope.
      
        In addition to that are some defconfig updates, some display updates
        for OMAP and a bit of new board support for Rockchip boards"
      
      * tag 'boards-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (56 commits)
        ARM: rockchip: add support for rk3188 and Radxa Rock board
        ARM: rockchip: add dts for bqcurie2 tablet
        ARM: rockchip: enable arm-global-timer
        ARM: rockchip: move shared dt properties to common source file
        ARM: OMAP2+: display: Create omap_vout device inside omap_display_init
        ARM: OMAP2+: display: Create omapvrfb and omapfb devices inside omap_display_init
        ARM: OMAP2+: display: Create omapdrm device inside omap_display_init
        ARM: OMAP2+: drm: Don't build device for DMM
        ARM: tegra: defconfig updates
        RX-51: Add support for OMAP3 ROM Random Number Generator
        ARM: OMAP3: RX-51: ARM errata 430973 workaround
        ARM: OMAP3: Add secure function omap_smc3() which calling instruction smc #1
        ARM: shmobile: marzen: enable INTC IRQ
        ARM: shmobile: bockw: add SMSC support on reference
        ARM: shmobile: Use SMP on Koelsch
        ARM: shmobile: Remove KZM9D reference DTS
        ARM: shmobile: Let KZM9D multiplatform boot with KZM9D DTB
        ARM: shmobile: Remove non-multiplatform KZM9D reference support
        ARM: shmobile: Use KZM9D without reference for multiplatform
        ARM: shmobile: Sync KZM9D DTS with KZM9D reference DTS
        ...
      d5aabbca
    • Merge tag 'soc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · aac59e3e
      Linus Torvalds authored
      Pull ARM SoC platform changes from Olof Johansson:
       "New and updated SoC support.  Among the things new for this release
        are:
      
         - More support for the AM33xx platforms from TI
         - Tegra 124 support, and some updates to older tegra families as well
         - imx cleanups and updates across the board
         - A rename of Broadcom's Mobile platforms which were introduced as
           ARCH_BCM, and turned out to be too broad a name.  New name is
           ARCH_BCM_MOBILE.
         - A whole bunch of updates and fixes for integrator, making the
           platform code more modern and switching over to DT-only booting.
         - Support for two new Renesas shmobile chipsets.  Next up for them is
           more work on consolidation instead of introduction of new
           non-multiplatform SoCs, we're all looking forward to that!
         - Misc cleanups for older Samsung platforms, some Allwinner updates,
           etc"
      
      * tag 'soc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (159 commits)
        ARM: bcm281xx: Add ARCH_BCM_MOBILE to bcm config
        ARM: bcm_defconfig: Run "make savedefconfig"
        ARM: bcm281xx: Add ARCH Timers to config
        rename ARCH_BCM to ARCH_BCM_MOBILE (mach-bcm)
        ARM: vexpress: Enable platform-specific options in defconfig
        ARM: vexpress: Make defconfig work again
        ARM: sunxi: remove .init_time hooks
        ARM: imx: enable suspend for imx6sl
        ARM: imx: ensure dsm_request signal is not asserted when setting LPM
        ARM: imx6q: call WB and RBC configuration from imx6q_pm_enter()
        ARM: imx6q: move low-power code out of clock driver
        ARM: imx: drop extern with function prototypes in common.h
        ARM: imx: reset core along with enable/disable operation
        ARM: imx: do not return from imx_cpu_die() call
        ARM: imx_v6_v7_defconfig: Select CONFIG_PROVE_LOCKING
        ARM: imx_v6_v7_defconfig: Enable LEDS_GPIO related options
        ARM: mxs_defconfig: Turn off CONFIG_DEBUG_GPIO
        ARM: imx: replace imx6q_restart() with mxc_restart()
        ARM: mach-imx: mm-imx5: Retrieve iomuxc base address from dt
        ARM: mach-imx: mm-imx5: Retrieve tzic base address from dt
        ...
      aac59e3e
    • Merge tag 'cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 21604cdc
      Linus Torvalds authored
      Pull ARM SoC cleanups from Olof Johansson:
       "This branch contains code cleanups, moves and removals for 3.13.
      
        Qualcomm msm targets had a bunch of code removal for legacy non-DT
        platforms.  Nomadik saw more device tree conversions and cleanup of
        old code.  Tegra has some code refactoring, etc.
      
        One longish patch series from Sebastian Hasselbarth changes the
        init_time hooks and tries to use a generic implementation for most
        platforms, since they were all doing more or less the same things.
      
        Finally the "shark" platform is removed in this release.  It's been
        abandoned for a while and nobody seems to care enough to keep it
        around.  If someone comes along and wants to resurrect it, the removal
        can easily be reverted and code brought back.
      
        Beyond this, mostly a bunch of removals of stale content across the
        board, etc"
      
      * tag 'cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (79 commits)
        ARM: gemini: convert to GENERIC_CLOCKEVENTS
        ARM: EXYNOS: remove CONFIG_MACH_EXYNOS[4, 5]_DT config options
        ARM: OMAP3: control: add API for setting IVA bootmode
        ARM: OMAP3: CM/control: move CM scratchpad save to CM driver
        ARM: OMAP3: McBSP: do not access CM register directly
        ARM: OMAP3: clock: add API to enable/disable autoidle for a single clock
        ARM: OMAP2: CM/PM: remove direct register accesses outside CM code
        MAINTAINERS: Add patterns for DTS files for AT91
        ARM: at91: remove init_machine() as default is suitable
        ARM: at91/dt: split sama5d3 peripheral definitions
        ARM: at91/dt: split sam9x5 peripheral definitions
        ARM: Remove temporary sched_clock.h header
        ARM: clps711x: Use linux/sched_clock.h
        MAINTAINERS: Add DTS files to patterns for Samsung platform
        ARM: EXYNOS: remove unnecessary header inclusions from exynos4/5 dt machine file
        ARM: tegra: fix ARCH_TEGRA_114_SOC select sort order
        clk: nomadik: fix missing __init on nomadik_src_init
        ARM: drop explicit selection of HAVE_CLK and CLKDEV_LOOKUP
        ARM: S3C64XX: Kill CONFIG_PLAT_S3C64XX
        ASoC: samsung: Use CONFIG_ARCH_S3C64XX to check for S3C64XX support
        ...
      21604cdc
    • Merge tag 'fixes-nc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · beb5bfe4
      Linus Torvalds authored
      Pull ARM SoC low-priority fixes from Olof Johansson:
       "A set of fixes for various platforms that weren't considered bad
        enough to include in 3.12 (nor -stable).  Mostly simple typo fixes,
        etc"
      
      * tag 'fixes-nc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        ARM: OMAP2+: irq, AM33XX add missing register check
        ARM: OMAP2+: wakeupgen: AM43x adaptation
        ARM: OMAP1: Fix a bunch of GPIO related section warnings after initdata got corrected
        ARM: dts: fix PL330 MDMA1 address in DT for Universal C210 board
        ARM: dts: Work around lack of cpufreq regulator lookup for exynos4210-origen and trats boards
        ARM: dts: Fix typo earlyprintk in exynos5440-sd5v1 and ssdk5440 boards
        ARM: dts: Correct typo in use of samsung,pin-drv for exynos5250
        ARM: rockchip: remove obsolete rockchip,config properties
        ARM: rockchip: fix wrong use of non-existent CONFIG_LOCAL_TIMERS
        ARM: mach-omap1: Fix omap1510_fpga_init_irq() implicit declarations.
        ARM: OMAP1: fix incorrect placement of __initdata tag
        ARM: OMAP: remove deprecated IRQF_DISABLED
        ARM: OMAP2+: throw the die id into the entropy pool
      beb5bfe4
    • Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64 · 05ad391d
      Linus Torvalds authored
      Pull ARM64 update from Catalin Marinas:
       "Main features:
         - Ticket-based spinlock implementation and lockless lockref support
         - Big endian support
         - CPU hotplug support, currently for PSCI (Power State Coordination
           Interface) capable firmware
         - Virtual address space extended to 42-bit in the 64K page
           configuration (maximum VA space with 2 levels of page tables)
         - Compat (AArch32) kuser helpers updated to ARMv8 (make use of
           load-acquire/store-release instructions)
         - Code cleanup, defconfig update and minor fixes"
      
      * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (43 commits)
        ARM64: /proc/interrupts: display IPIs of online CPUs only
        arm64: locks: Remove CONFIG_GENERIC_LOCKBREAK
        arm64: KVM: vgic: byteswap GICv2 access on world switch if BE
        arm64: KVM: initialize HYP mode following the kernel endianness
        arm64: compat: Clear the IT state independent of the 32-bit ARM or Thumb-2 mode
        arm64: Use 42-bit address space with 64K pages
        arm64: module: ensure instruction is little-endian before manipulation
        arm64: defconfig: Enable CONFIG_PREEMPT by default
        arm64: fix access to preempt_count from assembly code
        arm64: move enabling of GIC before CPUs are set online
        arm64: use generic RW_DATA_SECTION macro in linker script
        arm64: Slightly improve the warning on CPU0 enable-method
        ARM64: simplify cpu_read_bootcpu_ops using OF/DT helper
        ARM64: DT: define ARM64 specific arch_match_cpu_phys_id
        arm64: allow ioremap_cache() to use existing RAM mappings
        arm64: update 32-bit kuser helpers to ARMv8
        arm64: perf: fix event number mask
        arm64: kconfig: allow CPU_BIG_ENDIAN to be selected
        arm64: Fix the endianness of arch_spinlock_t
        arm64: big-endian: write CPU holding pen address as LE
        ...
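The 42-bit figure in the arm64 pull above follows from page-table arithmetic. The sketch below works it out, assuming 8-byte page-table descriptors (an assumption not stated in the commit message, though it matches the ARMv8 64-bit descriptor format). The function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Number of bits needed to index n items; n is assumed to be a power of two. */
static unsigned sketch_ilog2(unsigned long n)
{
    unsigned bits = 0;

    assert(n > 0);
    while (n > 1) {
        n >>= 1;
        bits++;
    }
    return bits;
}

/*
 * Virtual address width = bits for the offset inside a page
 *                       + bits for the table index at every level.
 * One table occupies exactly one page, so it holds
 * page_size / desc_size descriptors.
 */
static unsigned sketch_va_bits(unsigned long page_size,
                               unsigned long desc_size,
                               unsigned levels)
{
    unsigned offset_bits = sketch_ilog2(page_size);
    unsigned index_bits = sketch_ilog2(page_size / desc_size);

    return offset_bits + levels * index_bits;
}
```

With 64K pages, each table holds 8192 descriptors (13 index bits), so two levels plus the 16-bit page offset give 13 + 13 + 16 = 42 bits, which is what `sketch_va_bits(65536, 8, 2)` returns.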
  5. 10 Nov, 2013 1 commit
    • Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw · 8b5baa46
      Linus Torvalds authored
      Pull gfs2 updates from Steven Whitehouse:
       "The main feature of interest this time is quota updates.  There are
        some clean ups and some patches to use the new generic lru list code.
      
        There is still plenty of scope for some further changes in due course -
        faster lookups of quota structures is very much on the todo list.
        Also, a start has been made towards the more tricky issue of using the
        generic lru code with glocks, but that will have to be completed in a
        subsequent merge window.
      
        The other, more minor feature, is that there have been a number of
        performance patches which relate to block allocation.  In particular
        they will improve performance when the disk is nearly full"
      
      * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
        GFS2: Use generic list_lru for quota
        GFS2: Rename quota qd_lru_lock qd_lock
        GFS2: Use reflink for quota data cache
        GFS2: Use lockref for glocks
        GFS2: Protect quota sync generation
        GFS2: Inline qd_trylock into gfs2_quota_unlock
        GFS2: Make two similar quota code fragments into a function
        GFS2: Remove obsolete quota tunable
        GFS2: Move gfs2_icbit_munge into quota.c
        GFS2: Speed up starting point selection for block allocation
        GFS2: Add allocation parameters structure
        GFS2: Clean up reservation removal
        GFS2: fix dentry leaks
        GFS2: new function gfs2_rbm_incr
        GFS2: Introduce rbm field bii
        GFS2: Do not reset flags on active reservations
        GFS2: introduce bi_blocks for optimization
        GFS2: optimize rbm_from_block wrt bi_start
        GFS2: d_splice_alias() can't return error
  6. 09 Nov, 2013 2 commits
    • uprobes: Fix the memory out of bound overwrite in copy_insn() · 2ded0980
      Oleg Nesterov authored
      1. copy_insn() doesn't look very nice: all the calculations are
         confusing, and it is not immediately clear why we read
         the 2nd page first.
      
      2. The usage of inode->i_size is wrong on 32-bit machines.
      
      3. "Instruction at end of binary" logic is simply wrong, it
         doesn't handle the case when uprobe->offset > inode->i_size.
      
         In this case "bytes" overflows, and __copy_insn() writes to
         the memory outside of uprobe->arch.insn.
      
         Yes, uprobe_register() checks i_size_read(), but this file
         can be truncated after that. All i_size checks are racy, we
         do this only to catch the obvious mistakes.
      
      Change copy_insn() to call __copy_insn() in a loop, simplify
      and fix the bytes/nbytes calculations.
      
      Note: we do not care if we read extra bytes after inode->i_size
      if we got the valid page. This is fine because the task gets the
      same page after page-fault, and arch_uprobe_analyze_insn() can't
      know how many bytes were actually read anyway.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
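The loop structure described in the commit above can be sketched in user space as follows. This is not the kernel code: a byte array stands in for the file, `sketch_copy_page()` is a hypothetical stand-in for __copy_insn(), and the page size and buffer size are assumed constants. It only illustrates copying page by page, stopping once the offset is past the end of the file, and never writing more than the instruction buffer can hold.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size */
#define SKETCH_MAX_INSN  16u   /* assumed size of the instruction buffer */

/*
 * Hypothetical stand-in for __copy_insn(): copies a chunk that lies
 * entirely inside one page. Bytes past the end of the file read as
 * zero, mimicking the tail of a valid page.
 */
static int sketch_copy_page(const unsigned char *file, size_t file_size,
                            unsigned char *dst, size_t nbytes, size_t offset)
{
    for (size_t i = 0; i < nbytes; i++)
        dst[i] = (offset + i < file_size) ? file[offset + i] : 0;
    return 0;
}

/* Returns the number of bytes copied, or -1 if nothing could be read. */
static int sketch_copy_insn(const unsigned char *file, size_t file_size,
                            unsigned char *insn, size_t offset)
{
    size_t remaining = SKETCH_MAX_INSN;
    size_t copied = 0;

    assert(file != NULL && insn != NULL);

    while (remaining > 0) {
        /* Stop once the current page starts beyond the file. */
        if (offset >= file_size)
            break;

        /* Never cross a page boundary in a single copy. */
        size_t room = SKETCH_PAGE_SIZE - (offset % SKETCH_PAGE_SIZE);
        size_t len = remaining < room ? remaining : room;

        if (sketch_copy_page(file, file_size, insn + copied, len, offset) != 0)
            break;

        copied += len;
        offset += len;
        remaining -= len;
    }

    return copied ? (int)copied : -1;
}
```

Because `len` is bounded by `remaining` on every iteration, an offset beyond the file size can only end the loop early; it cannot turn into an oversized length, which is the overflow the commit message describes.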
    • uprobes: Fix the wrong usage of current->utask in uprobe_copy_process() · 70d7f987
      Oleg Nesterov authored
      Commit aa59c53f "uprobes: Change uprobe_copy_process() to dup
      xol_area" has a stupid typo: we need to set up t->utask->vaddr, but
      the code wrongly uses current->utask.
      
      Even with this bug dup_xol_work() works "in practice", but only
      because get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE) likely
      returns the same address every time.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>