1. 25 Jul, 2012 40 commits
    • time: Move common updates to a function · b19f4db4
      Thomas Gleixner authored
      This is a backport of cc06268c
      
      [John Stultz: While not a bugfix itself, it allows following fixes
       to backport in a more straightforward manner.]
      
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linux Kernel <linux-kernel@vger.kernel.org>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond · 09e66e8d
      John Stultz authored
      This is a backport of fad0c66c
      which resolves a bug introduced by the previous commit.
      
      Commit 6b43ae8a (ntp: Fix leap-second hrtimer livelock) broke the
      leapsecond update of CLOCK_MONOTONIC. The missing leapsecond update to
      wall_to_monotonic causes discontinuities in CLOCK_MONOTONIC.
      
      Adjust wall_to_monotonic when NTP inserted a leapsecond.
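
      As a rough userspace illustration (not the kernel code): if CLOCK_MONOTONIC
      is derived as wall time plus wall_to_monotonic, then stepping the wall clock
      back by one second for an inserted leap second requires growing the offset
      by one second, or the sum jumps. The struct and function names below are
      hypothetical, and times are simplified to whole seconds.

```c
#include <stdint.h>

/* Hypothetical, simplified clock state; times are whole seconds. */
struct clock_state {
    int64_t wall_sec;           /* UTC wall-clock seconds */
    int64_t wall_to_monotonic;  /* offset added to wall time */
};

/* CLOCK_MONOTONIC as derived from the wall clock in this sketch. */
int64_t monotonic_sec(const struct clock_state *cs)
{
    return cs->wall_sec + cs->wall_to_monotonic;
}

/*
 * Insert a leap second: the wall clock repeats one second, so the
 * offset is bumped by the same amount to keep the sum continuous.
 */
void insert_leap_second(struct clock_state *cs)
{
    cs->wall_sec -= 1;
    cs->wall_to_monotonic += 1;
}
```

      Dropping the second assignment in insert_leap_second() reproduces the
      one-second discontinuity described above.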
      Reported-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Tested-by: Richard Cochran <richardcochran@gmail.com>
      Link: http://lkml.kernel.org/r/1338400497-12420-1-git-send-email-john.stultz@linaro.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linux Kernel <linux-kernel@vger.kernel.org>
      Signed-off-by: John Stultz <johnstul@us.ibm.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • ntp: Correct TAI offset during leap second · 76117661
      Richard Cochran authored
      commit dd48d708 upstream.
      
      When repeating a UTC time value during a leap second (when the UTC
      time should be 23:59:60), the TAI timescale should not stop. The kernel
      NTP code increments the TAI offset one second too late. This patch fixes
      the issue by incrementing the offset during the leap second itself.
      Signed-off-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • ntp: Fix leap-second hrtimer livelock · a57ccabe
      John Stultz authored
      This is a backport of 6b43ae8a
      
      This should have been backported when it was committed, but I
      mistook the problem as requiring the ntp_lock changes
      that landed in 3.4 in order for it to occur.
      
      Unfortunately the same issue can happen (with only one cpu)
      as follows:
      do_adjtimex()
       write_seqlock_irq(&xtime_lock);
        process_adjtimex_modes()
         process_adj_status()
          ntp_start_leap_timer()
           hrtimer_start()
            hrtimer_reprogram()
             tick_program_event()
              clockevents_program_event()
               ktime_get()
                seq = read_seqbegin(&xtime_lock); [DEADLOCK]
      
      This deadlock will not always occur, as it requires the
      leap_timer to force a hrtimer_reprogram which only happens
      if it's set and there's no sooner timer to expire.
      
      NOTE: This patch, being faithful to the original commit,
      introduces a bug (we don't update wall_to_monotonic),
      which will be resolved by backporting a following fix.
      
      Original commit message below:
      
      Since commit 7dffa3c6 the ntp
      subsystem has used an hrtimer for triggering the leapsecond
      adjustment. However, this can cause a potential livelock.
      
      Thomas diagnosed this as the following pattern:
      CPU 0                                  CPU 1
      do_adjtimex()
        spin_lock_irq(&ntp_lock);
          process_adjtimex_modes();          timer_interrupt()
            process_adj_status();              do_timer()
              ntp_start_leap_timer();            write_lock(&xtime_lock);
                hrtimer_start();                   update_wall_time();
                  hrtimer_reprogram();               ntp_tick_length()
                    tick_program_event()               spin_lock(&ntp_lock);
                      clockevents_program_event()
                        ktime_get()
                          seq = read_seqbegin(&xtime_lock);
      
      This patch tries to avoid the problem by reverting back to not using
      an hrtimer to inject leapseconds, and instead we handle the leapsecond
      processing in the second_overflow() function.
      
      The downside to this change is that on systems that support highres
      timers, the leap second processing will occur on a HZ tick boundary,
      (ie: ~1-10ms, depending on HZ)  after the leap second instead of
      possibly sooner (~34us in my tests w/ x86_64 lapic).
      
      This patch applies on top of tip/timers/core.
      
      CC: Sasha Levin <levinsasha928@gmail.com>
      CC: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Sasha Levin <levinsasha928@gmail.com>
      Diagnosed-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Sasha Levin <levinsasha928@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linux Kernel <linux-kernel@vger.kernel.org>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • dm raid1: set discard_zeroes_data_unsupported · 9f1e3e0f
      Mikulas Patocka authored
      commit 7c8d3a42 upstream.
      
      We can't guarantee that REQ_DISCARD on dm-mirror zeroes the data even if
      the underlying disks support zero on discard.  So this patch sets
      ti->discard_zeroes_data_unsupported.
      
      For example, if the mirror is in the process of resynchronizing, it may
      happen that kcopyd reads a piece of data, then discard is sent on the
      same area and then kcopyd writes the piece of data to another leg.
      Consequently, the data is not zeroed.
      
      The flag was made available by commit 983c7db3
      (dm crypt: always disable discard_zeroes_data).
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • dm raid1: fix crash with mirror recovery and discard · 0409d163
      Mikulas Patocka authored
      commit 751f188d upstream.
      
      This patch fixes a crash when a discard request is sent during mirror
      recovery.
      
      Firstly, some background.  Generally, the following sequence happens during
      mirror synchronization:
      - function do_recovery is called
      - do_recovery calls dm_rh_recovery_prepare
      - dm_rh_recovery_prepare uses a semaphore to limit the number of
        simultaneously recovered regions (by default the semaphore value is 1,
        so only one region at a time is recovered)
      - dm_rh_recovery_prepare calls __rh_recovery_prepare,
        __rh_recovery_prepare asks the log driver for the next region to
        recover. Then, it sets the region state to DM_RH_RECOVERING. If there
        are no pending I/Os on this region, the region is added to
        quiesced_regions list. If there are pending I/Os, the region is not
        added to any list. It is added to the quiesced_regions list later (by
        dm_rh_dec function) when all I/Os finish.
      - when the region is on quiesced_regions list, there are no I/Os in
        flight on this region. The region is popped from the list in
        dm_rh_recovery_start function. Then, a kcopyd job is started in the
        recover function.
      - when the kcopyd job finishes, recovery_complete is called. It calls
        dm_rh_recovery_end. dm_rh_recovery_end adds the region to
        recovered_regions or failed_recovered_regions list (depending on
        whether the copy operation was successful or not).
      
      The above mechanism assumes that if the region is in DM_RH_RECOVERING
      state, no new I/Os are started on this region. When I/O is started,
      dm_rh_inc_pending is called, which increases reg->pending count. When
      I/O is finished, dm_rh_dec is called. It decreases reg->pending count.
      If the count is zero and the region was in DM_RH_RECOVERING state,
      dm_rh_dec adds it to the quiesced_regions list.
      
      Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is
      in DM_RH_RECOVERING state, it could be added to quiesced_regions list
      multiple times or it could be added to this list when kcopyd is copying
      data (it is assumed that the region is not on any list while kcopyd does
      its jobs). This results in memory corruption and crash.
      
      There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests
      do not belong to any region, so they are always added to the sync list
      in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH
      requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH
      requests. These bypasses avoid the crash possibility described above.
      
      These bypasses were improperly implemented for REQ_DISCARD when
      the mirror target gained discard support in commit
      5fc2ffea (dm raid1: support discard).
      
      In do_writes, REQ_DISCARD requests are always added to the sync queue and
      immediately dispatched (even if the region is in DM_RH_RECOVERING).  However,
      dm_rh_inc and dm_rh_dec are called for REQ_DISCARD requests.  So it violates the
      rule that no I/Os are started on DM_RH_RECOVERING regions, and causes the list
      corruption described above.
      
      This patch changes it so that REQ_DISCARD requests follow the same path
      as REQ_FLUSH. This avoids the crash.
      
      Reference: https://bugzilla.redhat.com/837607
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • pnfs-obj: Fix __r4w_get_page when offset is beyond i_size · 40606252
      Boaz Harrosh authored
      commit c999ff68 upstream.
      
      It is very common for the end of the file to be unaligned on
      stripe size. But since we know it's beyond the file's end, the
      XOR should be performed with all zeros.
      
      Old code used to just read zeros out of the OSD devices, which is a great
      waste. But what scares me more about this situation is that we now have
      pages attached to the file's mapping that are beyond i_size. I don't
      like the kind of bugs this calls for.
      
      Fix both birds, by returning a global zero_page, if offset is beyond
      i_size.
      
      TODO:
      	Change the API to ->__r4w_get_page() so a NULL can be
      	returned without being considered as error, since XOR API
      	treats NULL entries as zero_pages.
      
      [Bug since 3.2. Should apply the same way to all Kernels since]
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      [bwh: Backported to 3.2: adjust for lack of wdata->header]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • pnfs-obj: don't leak objio_state if ore_write/read fails · 0ab51e8b
      Boaz Harrosh authored
      commit 9909d45a upstream.
      
      [Bug since 3.2 Kernel]
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • ore: Remove support of partial IO request (NFS crash) · 3e17c16b
      Boaz Harrosh authored
      commit 62b62ad8 upstream.
      
      Due to OOM situations the ore might fail to allocate all resources
      needed for IO of the full request. If some progress was possible
      it would proceed with a partial/short request, for the sake of
      forward progress.
      
      Since this crashes NFS-core and exofs is just fine without it just
      remove this contraption, and fail.
      
      TODO:
      	Support real forward progress with some reserved allocations
      	of resources, such as mem pools and/or bio_sets
      
      [Bug since 3.2 Kernel]
      CC: Benny Halevy <bhalevy@tonian.com>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • ore: Fix NFS crash by supporting any unaligned RAID IO · c7003a9e
      Boaz Harrosh authored
      commit 9ff19309 upstream.
      
      In RAID_5/6 we used to not permit an IO whose end
      byte is not stripe_size aligned and that spans more than one stripe,
      i.e. the caller must check if after submission the actual
      transferred bytes is shorter, and would need to resubmit
      a new IO with the remainder.
      
      Exofs supports this, and NFS was supposed to support this
      as well with its short write mechanism. But late testing has
      exposed a CRASH when this is used with non-RPC layout-drivers.
      
      The change at NFS is deep and risky; in its place the fix
      at ORE to lift the limitation is actually clean and simple.
      So here it is below.
      
      The principle here is that in the case of unaligned IO on
      both ends, beginning and end, we will send two read requests:
      one like old code, before the calculation of the first stripe,
      and also a new site, before the calculation of the last stripe.
      If any "boundary" is aligned or the complete IO is within a single
      stripe, we do a single read like before.
      
      The code is clean and simple by splitting the old _read_4_write
      into 3 even parts:
      1. _read_4_write_first_stripe
      2. _read_4_write_last_stripe
      3. _read_4_write_execute
      
      And calling 1+3 at the same place as before. 2+3 before last
      stripe, and in the case of all in a single stripe then 1+2+3
      is performed additively.
      
      Why did I not think of it before. Well I had a strike of
      genius because I have stared at this code for 2 years, and did
      not find this simple solution, til today. Not that I did not try.
      
      This solution is much better for NFS than the previous supposed
      solution because the short write was dealt with out-of-band after
      IO_done, which would cause a seeky IO pattern, whereas here
      we execute in order. In both solutions we do 2 separate reads, only
      here we do it within a single IO request. (And actually combine two
      writes into a single submission)
      
      NFS/exofs code need not change since the ORE API communicates the new
      shorter length on return, what will happen is that this case would not
      occur anymore.
      
      hurray!!
      
      [Stable this is an NFS bug since 3.2 Kernel should apply cleanly]
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • UBIFS: fix a bug in empty space fix-up · 10f26f99
      Artem Bityutskiy authored
      commit c6727932 upstream.
      
      UBIFS has a feature called "empty space fix-up" which is a quirk to work-around
      limitations of dumb flasher programs. Namely, of those flashers that are unable
      to skip NAND pages full of 0xFFs while flashing, resulting in empty space at
      the end of half-filled eraseblocks to be unusable for UBIFS. This feature is
      relatively new (introduced in v3.0).
      
      The fix-up routine (fixup_free_space()) is executed only once at the very first
      mount if the superblock has the 'space_fixup' flag set (can be done with -F
      option of mkfs.ubifs). It basically reads all the UBIFS data and metadata and
      writes it back to the same LEB. The routine assumes the image is pristine and
      does not have anything in the journal.
      
      There was a bug in 'fixup_free_space()' where it fixed up the log incorrectly.
      All but one LEB of the log of a pristine file-system are empty. And one
      contains just a commit start node. And 'fixup_free_space()' just unmapped this
      LEB, which resulted in wiping the commit start node. As a result, some users
      were unable to mount the file-system next time with the following symptom:
      
      UBIFS error (pid 1): replay_log_leb: first log node at LEB 3:0 is not CS node
      UBIFS error (pid 1): replay_log_leb: log error detected while replaying the log at LEB 3:0
      
      The root-cause of this bug was that 'fixup_free_space()' wrongly assumed
      that the beginning of empty space in the log head (c->lhead_offs) was known
      on mount. However, it is not the case - it was always 0. UBIFS does not store
      it in the master node and finds it out by scanning the log on every mount.
      
      The fix is simple - just pass commit start node size instead of 0 to
      'fixup_leb()'.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@linux.intel.com>
      Reported-by: Iwo Mergler <Iwo.Mergler@netcommwireless.com>
      Tested-by: Iwo Mergler <Iwo.Mergler@netcommwireless.com>
      Reported-by: James Nute <newten82@gmail.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • MIPS: Properly align the .data..init_task section. · dc939195
      David Daney authored
      commit 7b1c0d26 upstream.
      
      Improper alignment can lead to unbootable systems and/or random
      crashes.
      
      [ralf@linux-mips.org: This is a long-standing bug since
      6eb10bc9 (kernel.org) resp.
      c422a10917f75fd19fa7fe070aaaa23e384dae6f (lmo) [MIPS: Clean up linker script
      using new linker script macros.] so dates back to 2.6.32.]
      Signed-off-by: David Daney <david.daney@cavium.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/3881/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • md/raid1: close some possible races on write errors during resync · 987e8454
      NeilBrown authored
      commit 58e94ae1 upstream.
      
      commit 4367af55
         md/raid1: clear bad-block record when write succeeds.
      
      Added a 'reschedule_retry' call possibility at the end of
      end_sync_write, but didn't add matching code at the end of
      sync_request_write.  So if the writes complete very quickly, or
      scheduling makes it seem that way, then we can miss rescheduling
      the request and the resync could hang.
      
      Also commit 73d5c38a
          md: avoid races when stopping resync.
      
      Fixed a race condition in this same code in end_sync_write but didn't
      make the change in sync_request_write.
      
      This patch updates sync_request_write to fix both of those.
      Patch is suitable for 3.1 and later kernels.
      Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Original-version-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • md: avoid crash when stopping md array races with closing other open fds. · 068bd5de
      NeilBrown authored
      commit a05b7ea0 upstream.
      
      md will refuse to stop an array if any other fd (or mounted fs) is
      using it.
      When any fs is unmounted or when the last open fd is closed all
      pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put)
      so there will be no pending IO to worry about when the array is
      stopped.
      
      However in order to send the STOP_ARRAY ioctl to stop the array one
      must first get an open fd on the block device.
      If some fd is being used to write to the block device and it is closed
      after mdadm opens the block device, but before mdadm issues the
      STOP_ARRAY ioctl, then there will be no last-close on the md device so
      __blkdev_put will not call sync_blockdev.
      
      If this happens, then IO can still be in-flight while md tears down
      the array and bad things can happen (use-after-free and subsequent
      havoc).
      
      So in the case where do_md_stop is being called from an open file
      descriptor, call sync_blockdev after taking the mutex to ensure there
      will be no new openers.
      
      This is needed when setting a read-write device to read-only too.
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • mm: fix lost kswapd wakeup in kswapd_stop() · e23f4552
      Aaditya Kumar authored
      commit 1c7e7f6c upstream.
      
      Offlining memory may block forever, waiting for kswapd() to wake up
      because kswapd() does not check the event kthread->should_stop before
      sleeping.
      
      The proper pattern, from Documentation/memory-barriers.txt, is:
      
         ---  waker  ---
         event_indicated = 1;
         wake_up_process(event_daemon);
      
         ---  sleeper  ---
         for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (event_indicated)
               break;
            schedule();
         }
      
         set_current_state() may be wrapped by:
            prepare_to_wait();
      
      In the kswapd() case, event_indicated is kthread->should_stop.
      
        === offlining memory (waker) ===
         kswapd_stop()
            kthread_stop()
               kthread->should_stop = 1
               wake_up_process()
               wait_for_completion()
      
        ===  kswapd_try_to_sleep (sleeper) ===
         kswapd_try_to_sleep()
            prepare_to_wait()
                 .
                 .
            schedule()
                 .
                 .
            finish_wait()
      
      The schedule() needs to be protected by a test of kthread->should_stop,
      which is wrapped by kthread_should_stop().
      
      Reproducer:
         Do heavy file I/O in background.
         Do a memory offline/online in a tight loop
      Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • cifs: always update the inode cache with the results from a FIND_* · 61c0f234
      Jeff Layton authored
      commit cd60042c upstream.
      
      When we get back a FIND_FIRST/NEXT result, we have some info about the
      dentry that we use to instantiate a new inode. We were ignoring and
      discarding that info when we had an existing dentry in the cache.
      
      Fix this by updating the inode in place when we find an existing dentry
      and the uniqueid is the same.
      Reported-and-Tested-by: Andrew Bartlett <abartlet@samba.org>
      Reported-by: Bill Robertson <bill_robertson@debortoli.com.au>
      Reported-by: Dion Edwards <dion_edwards@debortoli.com.au>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • cifs: on CONFIG_HIGHMEM machines, limit the rsize/wsize to the kmap space · 8e1e19fe
      Jeff Layton authored
      commit 3ae629d9 upstream.
      
      We currently rely on being able to kmap all of the pages in an async
      read or write request. If you're on a machine that has CONFIG_HIGHMEM
      set then that kmap space is limited, sometimes to as low as 512 slots.
      
      With 512 slots, we can only support up to a 2M r/wsize, and that's
      assuming that we can get our greedy little hands on all of them. There
      are other users however, so it's possible we'll end up stuck with a
      size that large.
      
      Since we can't handle a rsize or wsize larger than that currently, cap
      those options at the number of kmap slots we have. We could consider
      capping it even lower, but we currently default to a max of 1M. Might as
      well allow those luddites on 32 bit arches enough rope to hang
      themselves.
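
      The 2M figure follows from one kmap slot per page. A minimal sketch of the
      arithmetic and of the resulting cap, assuming 4 KiB pages (the message does
      not state the page size) and using hypothetical helper names rather than
      the actual cifs code:

```c
#include <stddef.h>

/*
 * Upper bound on rsize/wsize when every page of a request must be
 * kmapped at once: one kmap slot per page.
 */
size_t kmap_size_limit(size_t kmap_slots, size_t page_size)
{
    return kmap_slots * page_size;
}

/* Clamp a requested rsize/wsize to that bound. */
size_t clamp_rwsize(size_t requested, size_t kmap_slots, size_t page_size)
{
    size_t limit = kmap_size_limit(kmap_slots, page_size);

    return requested > limit ? limit : requested;
}
```

      With 512 slots and 4096-byte pages the limit is 2097152 bytes, so a
      requested 4M rsize would be cut to 2M while the 1M default is untouched.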
      
      A more robust fix would be to teach the send and receive routines how
      to contend with an array of pages so we don't need to marshal up a kvec
      array at all. That's a fairly significant overhaul though, so we'll need
      this limit in place until that's ready.
      Reported-by: Jian Li <jiali@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • target: Fix range calculation in WRITE SAME emulation when num blocks == 0 · 1edae5d5
      Roland Dreier authored
      commit 1765fe5e upstream.
      
      When NUMBER OF LOGICAL BLOCKS is 0, WRITE SAME is supposed to write
      all the blocks from the specified LBA through the end of the device.
      However, dev->transport->get_blocks(dev) (perhaps confusingly) returns
      the last valid LBA rather than the number of blocks, so the correct
      number of blocks to write starting with lba is
      
      dev->transport->get_blocks(dev) - lba + 1
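
      A small standalone sketch of that calculation, assuming (as stated above)
      that the device reports its last valid LBA rather than a block count. The
      function name and parameters are hypothetical, not the target core's API:

```c
#include <stdint.h>

/*
 * Number of blocks a WRITE SAME should cover.
 * last_lba:   last valid LBA of the device (not the block count)
 * lba:        starting LBA from the command
 * num_blocks: NUMBER OF LOGICAL BLOCKS from the command; 0 means
 *             "through the end of the device"
 */
uint64_t write_same_range(uint64_t last_lba, uint64_t lba, uint32_t num_blocks)
{
    if (num_blocks == 0)
        return last_lba - lba + 1;  /* inclusive of both ends */
    return num_blocks;
}
```

      For a 100-block device (last LBA 99), starting at LBA 90 with a zero count
      covers 10 blocks; omitting the "+ 1" would leave the final block unwritten.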
      
      (nab: Backport roland's for-3.6 patch to for-3.5)
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • target: Clean up returning errors in PR handling code · 6154c5bc
      Roland Dreier authored
      commit d35212f3 upstream.
      
       - instead of (PTR_ERR(file) < 0) just use IS_ERR(file)
       - return -EINVAL instead of EINVAL
       - all other error returns in target_scsi3_emulate_pr_out() use
         "goto out" -- get rid of the one remaining straight "return."
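
      For readers unfamiliar with the idiom, here is a userspace sketch of
      error-pointer helpers modeled on the kernel's ERR_PTR/PTR_ERR/IS_ERR. The
      sketch_ names are stand-ins written for this illustration, and the 4095
      bound mirrors the kernel's MAX_ERRNO. It shows a caller returning a
      negative errno and testing the pointer with the IS_ERR-style check.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
void *sketch_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the encoded value (only meaningful if sketch_is_err() is true). */
long sketch_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True only for the top SKETCH_MAX_ERRNO values of the address space. */
int sketch_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)(-SKETCH_MAX_ERRNO);
}

/* Caller following the cleaned-up convention: 0 or a negative errno. */
long sketch_check_file(const void *file)
{
    if (sketch_is_err(file))
        return sketch_ptr_err(file);  /* already negative, e.g. -EINVAL */
    return 0;
}
```

      Testing the range explicitly avoids relying on the sign of a pointer cast
      to long, which says nothing about whether the value encodes an errno.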
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • fifo: Do not restart open() if it already found a partner · 9729de79
      Anders Kaseorg authored
      commit 05d290d6 upstream.
      
      If a parent and child process open the two ends of a fifo, and the
      child immediately exits, the parent may receive a SIGCHLD before its
      open() returns.  In that case, we need to make sure that open() will
      return successfully after the SIGCHLD handler returns, instead of
      throwing EINTR or being restarted.  Otherwise, the restarted open()
      would incorrectly wait for a second partner on the other end.
      
      The following test demonstrates the EINTR that was wrongly thrown from
      the parent’s open().  Change .sa_flags = 0 to .sa_flags = SA_RESTART
      to see a deadlock instead, in which the restarted open() waits for a
      second reader that will never come.  (On my systems, this happens
      pretty reliably within about 5 to 500 iterations.  Others report that
      it manages to loop ~forever sometimes; YMMV.)
      
        #include <sys/stat.h>
        #include <sys/types.h>
        #include <sys/wait.h>
        #include <fcntl.h>
        #include <signal.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
      
        #define CHECK(x) do if ((x) == -1) {perror(#x); abort();} while(0)
      
        void handler(int signum) {}
      
        int main()
        {
            struct sigaction act = {.sa_handler = handler, .sa_flags = 0};
            CHECK(sigaction(SIGCHLD, &act, NULL));
            CHECK(mknod("fifo", S_IFIFO | S_IRWXU, 0));
            for (;;) {
                int fd;
                pid_t pid;
                putc('.', stderr);
                CHECK(pid = fork());
                if (pid == 0) {
                    CHECK(fd = open("fifo", O_RDONLY));
                    _exit(0);
                }
                CHECK(fd = open("fifo", O_WRONLY));
                CHECK(close(fd));
                CHECK(waitpid(pid, NULL, 0));
            }
        }
      
      This is what I suspect was causing the Git test suite to fail in
      t9010-svn-fe.sh:
      
      	http://bugs.debian.org/678852
      Signed-off-by: Anders Kaseorg <andersk@mit.edu>
      Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • tcm_fc: Fix crash seen with aborts and large reads · 60d44861
      Mark Rustad authored
      commit 3cc5d2a6 upstream.
      
      This patch fixes a crash seen when large reads have their exchange
      aborted by either timing out or being reset. Because the exchange
      abort results in the seq pointer being set to NULL, because the
      sequence is no longer valid, it must not be dereferenced. This
      patch changes the function ft_get_task_tag to return ~0 if it is
      unable to get the tag for this reason. Because the get_task_tag
      interface provides no means of returning an error, this seems
      like the best way to fix this issue at the moment.
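
      The shape of that guard can be sketched as follows. The structs and the
      function are hypothetical simplifications for illustration, not the tcm_fc
      definitions; only the "return ~0 when the sequence pointer is NULL" rule
      comes from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the exchange sequence and the command. */
struct sketch_seq {
    uint32_t tag;
};

struct sketch_cmd {
    struct sketch_seq *seq;  /* set to NULL once the exchange is aborted */
};

/*
 * Return the task tag, or ~0 when the sequence is gone. The interface
 * has no error channel, so an all-ones tag acts as the sentinel.
 */
uint32_t sketch_get_task_tag(const struct sketch_cmd *cmd)
{
    if (cmd->seq == NULL)
        return ~0u;
    return cmd->seq->tag;
}
```

      A sentinel of this kind is only safe if callers treat ~0 as "no tag",
      which is presumably why the message calls it the best fix "at the moment".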
      Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
      Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • e1000e: Correct link check logic for 82571 serdes · df437286
      Tushar Dave authored
      commit d0efa8f2 upstream.
      
      SYNCH bit and IV bit of RXCW register are sticky. Before examining these bits,
      RXCW should be read twice to filter out one-time false events and have correct
      values for these bits. Incorrect values of these bits in link check logic can
      cause weird link stability issues if auto-negotiation fails.
      Reported-by: Dean Nelson <dnelson@redhat.com>
      Signed-off-by: Tushar Dave <tushar.n.dave@intel.com>
      Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
      Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      df437286
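The double read described in the e1000e commit above can be illustrated with a toy register model. This is only a sketch under stated assumptions: the bit positions, the `demo_reg` structure and the clear-on-read behaviour are invented for illustration and do not describe the real RXCW layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary bit positions for illustration only (assumption). */
#define DEMO_RXCW_IV    (1u << 0)
#define DEMO_RXCW_SYNCH (1u << 1)

/* Toy model: sticky bits stay latched until the register is read once. */
struct demo_reg {
    uint32_t latched;   /* stale events remembered since the last read */
    uint32_t live;      /* what the hardware currently reports */
};

/* A single read returns stale and live bits mixed, then drops the stale ones. */
uint32_t demo_read_rxcw(struct demo_reg *r)
{
    uint32_t v = r->latched | r->live;

    r->latched = 0;
    return v;
}

/* Read twice: the first read flushes one-time events, the second is trusted. */
uint32_t demo_read_rxcw_settled(struct demo_reg *r)
{
    assert(r != NULL);
    (void)demo_read_rxcw(r);
    return demo_read_rxcw(r);
}
```

In this model a link check based on a single read would see a stale IV bit, while the settled read only reports the bits that are still asserted.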
    • Emmanuel Grumbach's avatar
      iwlegacy: don't mess up the SCD when removing a key · 154f4399
      Emmanuel Grumbach authored
      commit b48d9665 upstream.
      
      When we remove a key, we put a key index which was supposed
      to tell the fw that we are actually removing the key. But
      instead the fw took that index as a valid index and messed
      up the SRAM of the device.
      
      This memory corruption on the device mangled the data of
      the SCD. The impact on the user is that SCD queue 2 got
      stuck after having removed keys.
      Reported-by: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
      [bwh: Backported to 3.2: adjust filename, context and variable name]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      154f4399
    • Stanislaw Gruszka's avatar
      iwlegacy: always monitor for stuck queues · c9d907de
      Stanislaw Gruszka authored
      commit c2ca7d92 upstream.
      
      This is iwlegacy version of:
      
      commit 342bbf3f
      Author: Johannes Berg <johannes.berg@intel.com>
      Date:   Sun Mar 4 08:50:46 2012 -0800
      
          iwlwifi: always monitor for stuck queues
      
          If we only monitor while associated, the following
          can happen:
           - we're associated, and the queue stuck check
             runs, setting the queue "touch" time to X
           - we disassociate, stopping the monitoring,
             which leaves the time set to X
           - almost 2s later, we associate, and enqueue
             a frame
           - before the frame is transmitted, we monitor
             for stuck queues, and find the time set to
             X, although it is now later than X + 2000ms,
             so we decide that the queue is stuck and
             erroneously restart the device
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
      [bwh: Backported to 3.2: adjust filename, function and variable names]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      c9d907de
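The false restart described in the stuck-queue commit above can be reproduced with a small model of the check. This is a sketch, assuming the check refreshes the "touch" time whenever the queue is empty; `demo_queue`, `demo_check_stuck` and the millisecond clock are hypothetical names, not the driver's code.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_STUCK_TIMEOUT_MS 2000UL

/* Hypothetical queue state (assumption, not the driver's structure). */
struct demo_queue {
    unsigned long touch_ms; /* last time the queue was seen making progress */
    int pending;            /* number of frames waiting for transmission */
};

/*
 * One run of the periodic check. An empty queue refreshes the touch time;
 * a non-empty queue is reported stuck once the touch time is too old.
 * Returns 1 if the queue looks stuck, 0 otherwise.
 */
int demo_check_stuck(struct demo_queue *q, unsigned long now_ms)
{
    assert(q != NULL);
    if (q->pending == 0) {
        q->touch_ms = now_ms;
        return 0;
    }
    return (now_ms - q->touch_ms) > DEMO_STUCK_TIMEOUT_MS;
}
```

If the check runs at 1000 ms, is then suspended while disassociated, and runs again at 3500 ms right after a frame is enqueued, it reports a stuck queue. If it also runs at 3000 ms while the queue is still empty, the touch time is refreshed and the later check passes, which is the effect of always monitoring.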
    • Stanislaw Gruszka's avatar
      rt2x00usb: fix indexes ordering on RX queue kick · 04208303
      Stanislaw Gruszka authored
      commit efd82118 upstream.
      
      On rt2x00_dmastart() we increase index specified by Q_INDEX and on
      rt2x00_dmadone() we increase index specified by Q_INDEX_DONE. So entries
      between Q_INDEX_DONE and Q_INDEX are those we currently process in the
      hardware. Entries between Q_INDEX and Q_INDEX_DONE are those we can
      submit to the hardware.
      
      Fix rt2x00usb_kick_queue() accordingly, as we need to submit RX
      entries that are not being processed by the hardware. It worked before
      only for an empty queue; otherwise it was broken.
      
      Note that for TX queues the index ordering is OK. We need to kick entries
      that have a filled skb but were not submitted to the hardware, i.e.
      those starting from Q_INDEX_DONE that have the ENTRY_DATA_PENDING bit set.
      
      From a practical standpoint this fixes an RX queue stall, usually reproducible
      in AP mode, as reported for example here:
      https://bugzilla.redhat.com/show_bug.cgi?id=828824

      Reported-and-tested-by: Franco Miceli <fmiceli@plan.ceibal.edu.uy>
      Reported-and-tested-by: Tom Horsley <horsley1953@gmail.com>
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      04208303
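The index ordering in the rt2x00usb commit above matters because the two indexes delimit two complementary ranges of a ring. A minimal sketch of counting the entries in such a range follows; `demo_entries_between` is a hypothetical helper, and treating equal indexes as the whole ring is an assumption of this sketch.

```c
#include <assert.h>

/*
 * Number of ring entries visited when walking from 'start' up to, but not
 * including, 'end' in a ring of 'limit' entries, wrapping at the end.
 * Assumption: start == end means the whole ring, not an empty range.
 */
unsigned int demo_entries_between(unsigned int start, unsigned int end,
                                  unsigned int limit)
{
    assert(limit > 0 && start < limit && end < limit);
    if (start < end)
        return end - start;
    return (limit - start) + end;
}
```

With an 8-entry ring, Q_INDEX_DONE at 2 and Q_INDEX at 5, walking from Q_INDEX_DONE to Q_INDEX covers the 3 entries owned by the hardware, while walking from Q_INDEX to Q_INDEX_DONE covers the 5 entries that can still be submitted. Swapping the two arguments therefore selects the wrong set of entries.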
    • Cloud Ren's avatar
      atl1c: fix issue of transmit queue 0 timed out · e0dc11cd
      Cloud Ren authored
      commit b94e52f6 upstream.
      
      Some people report that atl1c can hang the system with the following
      kernel trace:
      ---------------------------------------
      WARNING: at.../net/sched/sch_generic.c:258 dev_watchdog+0x1db/0x1d0()
      ...
      NETDEV WATCHDOG: eth0 (atl1c): transmit queue 0 timed out
      ...
      ---------------------------------------
      This is caused by netif_stop_queue being called when the cable link is down.
      So remove the netif_stop_queue call, because link_watch will take over.
      Signed-off-by: xiong <xiong@qca.qualcomm.com>
      Signed-off-by: Cloud Ren <cjren@qca.qualcomm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      e0dc11cd
    • Takashi Iwai's avatar
      intel_ips: blacklist HP ProBook laptops · 023f9dff
      Takashi Iwai authored
      commit 88ca518b upstream.
      
      The intel_ips driver endlessly spews the warning message
        "ME failed to update for more than 1s, likely hung"
      every second on HP ProBook laptops with IronLake.
      
      As this has never worked, better to blacklist the driver for now.
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Matthew Garrett <mjg@redhat.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      023f9dff
    • Michal Kazior's avatar
      cfg80211: check iface combinations only when iface is running · 1885f653
      Michal Kazior authored
      commit f8cdddb8 upstream.
      
      Don't validate interface combinations on a stopped
      interface. Otherwise we might end up being able to
      create a new interface with a certain type, but
      won't be able to change an existing interface
      into that type.
      
      This also skips some other functions when the
      interface is stopped and its type is being changed.
      Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      1885f653
    • Bojan Smojver's avatar
      PM / Hibernate: Hibernate/thaw fixes/improvements · 6ed6791a
      Bojan Smojver authored
      commit 5a21d489 upstream.
      
       1. Do not allocate memory for buffers from emergency pools, unless
          absolutely required. Do not warn about and do not retry non-essential
          failed allocations.
      
       2. Do not check the amount of free pages left on every single page
          write, but wait until one map is completely populated and then check.
      
       3. Set maximum number of pages for read buffering consistently, instead
          of inadvertently depending on the size of the sector type.
      
       4. Fix copyright line, which I missed when I submitted the hibernation
          threading patch.
      
       5. Dispense with bit shifting arithmetic to improve readability.
      
       6. Really recalculate the number of pages required to be free after all
          allocations have been done.
      
       7. Fix calculation of pages required for read buffering. Only count in
          pages that do not belong to high memory.
      Signed-off-by: Bojan Smojver <bojan@rexursive.com>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      6ed6791a
    • Samuel Ortiz's avatar
      NFC: Export nfc.h to userland · bce3ff49
      Samuel Ortiz authored
      commit dbd4fcaf upstream.
      
      The netlink commands and attributes, along with the socket structure
      definitions need to be exported.
      Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      bce3ff49
    • Dave Jones's avatar
      Remove easily user-triggerable BUG from generic_setlease · 8f2c5a74
      Dave Jones authored
      commit 8d657eb3 upstream.
      
      This can be trivially triggered from userspace by passing in something unexpected.
      
          kernel BUG at fs/locks.c:1468!
          invalid opcode: 0000 [#1] SMP
          RIP: 0010:generic_setlease+0xc2/0x100
          Call Trace:
            __vfs_setlease+0x35/0x40
            fcntl_setlease+0x76/0x150
            sys_fcntl+0x1c6/0x810
            system_call_fastpath+0x1a/0x1f
      Signed-off-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      8f2c5a74
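The generic_setlease commit above removes an assertion that user space could hit with an unexpected argument. A userspace sketch of the general idea, rejecting unknown values with an error code instead of crashing, might look like this; `demo_setlease` is a hypothetical function and the exact upstream change is not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/*
 * Validate a lease type coming from user space. Known values are accepted;
 * anything else is rejected with -EINVAL rather than treated as a bug,
 * because the caller fully controls the argument.
 */
int demo_setlease(long arg)
{
    switch (arg) {
    case F_UNLCK:
        return 0;           /* would remove an existing lease */
    case F_RDLCK:
    case F_WRLCK:
        return 0;           /* would install a read or write lease */
    default:
        return -EINVAL;     /* unexpected input is an error, not a crash */
    }
}
```

For example, `demo_setlease(F_RDLCK)` succeeds while `demo_setlease(12345)` returns `-EINVAL`.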
    • Jeff Moyer's avatar
      block: fix infinite loop in __getblk_slow · 631a86fc
      Jeff Moyer authored
      commit 91f68c89 upstream.
      
      Commit 080399aa ("block: don't mark buffers beyond end of disk as
      mapped") exposed a bug in __getblk_slow that causes mount to hang as it
      loops infinitely waiting for a buffer that lies beyond the end of the
      disk to become uptodate.
      
      The problem was initially reported by Torsten Hilbrich here:
      
          https://lkml.org/lkml/2012/6/18/54
      
      and also reported independently here:
      
          http://www.sysresccd.org/forums/viewtopic.php?f=13&t=4511
      
      and then Richard W.M.  Jones and Marcos Mello noted a few separate
      bugzillas also associated with the same issue.  This patch has been
      confirmed to fix:
      
          https://bugzilla.redhat.com/show_bug.cgi?id=835019
      
      The main problem is here, in __getblk_slow:
      
              for (;;) {
                      struct buffer_head * bh;
                      int ret;
      
                      bh = __find_get_block(bdev, block, size);
                      if (bh)
                              return bh;
      
                      ret = grow_buffers(bdev, block, size);
                      if (ret < 0)
                              return NULL;
                      if (ret == 0)
                              free_more_memory();
              }
      
      __find_get_block does not find the block, since it will not be marked as
      mapped, and so grow_buffers is called to fill in the buffers for the
      associated page.  I believe the for (;;) loop is there primarily to
      retry in the case of memory pressure keeping grow_buffers from
      succeeding.  However, we also continue to loop for other cases, like the
      block lying beyond the end of the disk.  So, the fix I came up with is to
      only loop when grow_buffers fails due to memory allocation issues
      (return value of 0).
      
      The attached patch was tested by myself, Torsten, and Rich, and was
      found to resolve the problem in all cases.
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Reported-and-Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
      Tested-by: Richard W.M. Jones <rjones@redhat.com>
      Reviewed-by: Josh Boyer <jwboyer@redhat.com>
      [ Jens is on vacation, taking this directly  - Linus ]
      --
      Stable Notes: this patch requires backport to 3.0, 3.2 and 3.3.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      631a86fc
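The retry rule from the __getblk_slow commit above (loop only when the failure is a memory allocation issue) can be sketched in user space as follows. Everything here is a simplified model: `demo_dev`, `demo_grow_buffers` and `demo_getblk` are hypothetical, and the use of `-ENXIO` for a block beyond the end of the device is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy device model (assumption, not kernel structures). */
struct demo_dev {
    long nr_blocks;             /* device size in blocks */
    int alloc_failures_left;    /* simulated transient allocation failures */
    long cached_block;          /* block currently cached, -1 if none */
};

/* Returns <0 on a permanent error, 0 on a transient one, 1 on success. */
int demo_grow_buffers(struct demo_dev *d, long block)
{
    if (block >= d->nr_blocks)
        return -ENXIO;          /* beyond end of disk: retrying cannot help */
    if (d->alloc_failures_left > 0) {
        d->alloc_failures_left--;
        return 0;               /* memory pressure: worth retrying */
    }
    d->cached_block = block;
    return 1;
}

/* Returns 0 once the block is cached, or a negative error code. */
int demo_getblk(struct demo_dev *d, long block)
{
    assert(d != NULL && block >= 0);
    for (;;) {
        int ret;

        if (d->cached_block == block)
            return 0;
        ret = demo_grow_buffers(d, block);
        if (ret < 0)
            return ret;         /* permanent failure ends the loop */
        /* ret == 0: the kernel would free memory here, then retry */
    }
}
```

The loop terminates for an out-of-range block because the permanent error is propagated instead of being retried forever.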
    • Todd Poynor's avatar
      ARM: SAMSUNG: fix race in s3c_adc_start for ADC · 3bbc9e19
      Todd Poynor authored
      commit 8265981b upstream.
      
      Checking for adc->ts_pend already claimed should be done with the
      lock held.
      Signed-off-by: Todd Poynor <toddpoynor@google.com>
      Acked-by: Ben Dooks <ben-linux@fluff.org>
      Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      3bbc9e19
    • Jean Delvare's avatar
      hwmon: (it87) Preserve configuration register bits on init · eb42f93d
      Jean Delvare authored
      commit 41002f8d upstream.
      
      We were accidentally losing one bit in the configuration register on
      device initialization. It was reported to freeze one specific system
      right away. Properly preserve all bits we don't explicitly want to
      change in order to prevent that.
      Reported-by: Stevie Trujillo <stevie.trujillo@gmail.com>
      Signed-off-by: Jean Delvare <khali@linux-fr.org>
      Reviewed-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      eb42f93d
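The it87 commit above is an instance of the read-modify-write rule: only the bits that are deliberately changed may differ after the update. A generic sketch follows; `demo_update_config` and the mask values in the usage note are illustrative and are not the chip's actual register layout.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute a new register value from the old one: clear the bits in
 * 'clear_mask', set the bits in 'set_mask', and keep every other bit
 * exactly as it was read from the device.
 */
uint8_t demo_update_config(uint8_t old, uint8_t clear_mask, uint8_t set_mask)
{
    return (uint8_t)((old & (uint8_t)~clear_mask) | set_mask);
}
```

For example, `demo_update_config(0xB8, 0x41, 0x01)` yields `0xB9`: bit 0 is set, bit 6 stays clear, and the remaining bits of `0xB8` are untouched. Losing a bit happens when the keep mask is written out by hand and one position is forgotten.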
    • Thomas Renninger's avatar
      cpufreq / ACPI: Fix not loading acpi-cpufreq driver regression · 51954298
      Thomas Renninger authored
      commit c4686c71 upstream.
      
      Commit d640113f introduced a regression on SMP
      systems where the processor core with ACPI id zero is disabled
      (typically should be the case because of hyperthreading).
      The regression got spread through stable kernels.
      On 3.0.X it got introduced via 3.0.18.
      
      Such platforms may be rare, but do exist.
      Look out for a disabled processor with acpi_id 0 in dmesg:
      ACPI: LAPIC (acpi_id[0x00] lapic_id[0x10] disabled)
      
      This problem has been observed on a:
      HP Proliant BL280c G6 blade
      
      This patch restricts the introduced workaround to platforms
      with nr_cpu_ids <= 1.
      Signed-off-by: Thomas Renninger <trenn@suse.de>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      51954298
    • Bob Liu's avatar
      fs: ramfs: file-nommu: add SetPageUptodate() · 3501ec35
      Bob Liu authored
      commit fea9f718 upstream.
      
      There is a bug in the below scenario for !CONFIG_MMU:
      
       1. create a new file
       2. mmap the file and write to it
       3. read the file; the read does not return the written value
      
      Because
      
        sys_read() -> generic_file_aio_read() -> simple_readpage() -> clear_page()
      
      which causes the page to be zeroed.
      
      Add SetPageUptodate() to ramfs_nommu_expand_for_mapping() so that
      generic_file_aio_read() does not call simple_readpage().
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      3501ec35
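The ramfs commit above hinges on an "up to date" flag: a read fills the page only when the flag is clear, and for this filesystem filling means zeroing. A toy model of that interaction follows; `demo_page`, `demo_readpage` and `demo_read_byte` are hypothetical and only mimic the call chain named in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy page with an "up to date" flag (assumption, not struct page). */
struct demo_page {
    int uptodate;
    char data[16];
};

/* Stand-in for a readpage that has no backing store: it zeroes the page. */
void demo_readpage(struct demo_page *p)
{
    memset(p->data, 0, sizeof p->data);
    p->uptodate = 1;
}

/* Read path: a page that is not marked up to date is filled first. */
char demo_read_byte(struct demo_page *p, size_t off)
{
    assert(p != NULL && off < sizeof p->data);
    if (!p->uptodate)
        demo_readpage(p);
    return p->data[off];
}
```

A page written through a mapping but left with `uptodate == 0` loses its contents on the first read, while the same page with the flag set is returned as written.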
    • Benoît Thébaudeau's avatar
      drivers/rtc/rtc-mxc.c: fix irq enabled interrupts warning · d1d5b31c
      Benoît Thébaudeau authored
      commit b59f6d1f upstream.
      
      Fixes
      
        WARNING: at irq/handle.c:146 handle_irq_event_percpu+0x19c/0x1b8()
        irq 25 handler mxc_rtc_interrupt+0x0/0xac enabled interrupts
        Modules linked in:
         (unwind_backtrace+0x0/0xf0) from (warn_slowpath_common+0x4c/0x64)
         (warn_slowpath_common+0x4c/0x64) from (warn_slowpath_fmt+0x30/0x40)
         (warn_slowpath_fmt+0x30/0x40) from (handle_irq_event_percpu+0x19c/0x1b8)
         (handle_irq_event_percpu+0x19c/0x1b8) from (handle_irq_event+0x28/0x38)
         (handle_irq_event+0x28/0x38) from (handle_level_irq+0x80/0xc4)
         (handle_level_irq+0x80/0xc4) from (generic_handle_irq+0x24/0x38)
         (generic_handle_irq+0x24/0x38) from (handle_IRQ+0x30/0x84)
         (handle_IRQ+0x30/0x84) from (avic_handle_irq+0x2c/0x4c)
         (avic_handle_irq+0x2c/0x4c) from (__irq_svc+0x40/0x60)
        Exception stack(0xc050bf60 to 0xc050bfa8)
        bf60: 00000001 00000000 003c4208 c0018e20 c050a000 c050a000 c054a4c8 c050a000
        bf80: c05157a8 4117b363 80503bb4 00000000 01000000 c050bfa8 c0018e2c c000e808
        bfa0: 60000013 ffffffff
         (__irq_svc+0x40/0x60) from (default_idle+0x1c/0x30)
         (default_idle+0x1c/0x30) from (cpu_idle+0x68/0xa8)
         (cpu_idle+0x68/0xa8) from (start_kernel+0x22c/0x26c)
      Signed-off-by: Benoît Thébaudeau <benoit.thebaudeau@advansee.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Sascha Hauer <kernel@pengutronix.de>
      Acked-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      d1d5b31c
    • David Rientjes's avatar
      mm, thp: abort compaction if migration page cannot be charged to memcg · e13d6560
      David Rientjes authored
      commit 4bf2bba3 upstream.
      
      If page migration cannot charge the temporary page to the memcg,
      migrate_pages() will return -ENOMEM.  This isn't considered in memory
      compaction however, and the loop continues to iterate over all
      pageblocks trying to isolate and migrate pages.  If a small number of
      very large memcgs happen to be oom, however, these attempts will mostly
      be futile, leading to an enormous amount of cpu consumption due to the
      page migration failures.
      
      This patch will short circuit and fail memory compaction if
      migrate_pages() returns -ENOMEM.  COMPACT_PARTIAL is returned in case
      some migrations were successful so that the page allocator will retry.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      e13d6560
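The short circuit described in the compaction commit above can be sketched as a loop that stops at the first -ENOMEM instead of visiting every remaining block. The names `demo_compact`, `DEMO_COMPACT_PARTIAL` and `DEMO_COMPACT_COMPLETE` are hypothetical, and the per-block migration results are passed in as an array purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum demo_result {
    DEMO_COMPACT_COMPLETE,  /* every block was visited */
    DEMO_COMPACT_PARTIAL    /* stopped early; earlier work may have helped */
};

/*
 * Walk the simulated migration results for each block. '*attempted' counts
 * how many blocks were actually tried, to make the early exit visible.
 */
enum demo_result demo_compact(const int *results, size_t n, size_t *attempted)
{
    size_t i;

    assert(results != NULL && attempted != NULL);
    for (i = 0; i < n; i++) {
        (*attempted)++;
        if (results[i] == -ENOMEM)
            return DEMO_COMPACT_PARTIAL;    /* further attempts are futile */
    }
    return DEMO_COMPACT_COMPLETE;
}
```

With results `{0, -ENOMEM, 0, 0}` only two blocks are attempted, which is the CPU saving the commit is after.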
    • Luis Henriques's avatar
      ocfs2: fix NULL pointer dereference in __ocfs2_change_file_space() · 5eb695db
      Luis Henriques authored
      commit a4e08d00 upstream.
      
      As ocfs2_fallocate() will invoke __ocfs2_change_file_space() with a NULL
      as the first parameter (file), it may trigger a NULL pointer dereference
      due to a missing check.
      
      Addresses http://bugs.launchpad.net/bugs/1006012

      Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
      Reported-by: Bret Towe <magnade@gmail.com>
      Tested-by: Bret Towe <magnade@gmail.com>
      Cc: Sunil Mushran <sunil.mushran@oracle.com>
      Acked-by: Joel Becker <jlbec@evilplan.org>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      5eb695db
    • Jiang Liu's avatar
      memory hotplug: fix invalid memory access caused by stale kswapd pointer · 0906d248
      Jiang Liu authored
      commit d8adde17 upstream.
      
      kswapd_stop() is called to destroy the kswapd work thread when all memory
      of a NUMA node has been offlined.  But kswapd_stop() only terminates the
      work thread without resetting NODE_DATA(nid)->kswapd to NULL.  The stale
      pointer will prevent kswapd_run() from creating a new work thread when
      adding memory to the memory-less NUMA node again.  Eventually the stale
      pointer may cause invalid memory access.
      
      An example stack dump is below. It was reproduced with 2.6.32, but the
      latest kernel has the same issue.
      
        BUG: unable to handle kernel NULL pointer dereference at (null)
        IP: [<ffffffff81051a94>] exit_creds+0x12/0x78
        PGD 0
        Oops: 0000 [#1] SMP
        last sysfs file: /sys/devices/system/memory/memory391/state
        CPU 11
        Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
        Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
        RIP: 0010:exit_creds+0x12/0x78
        RSP: 0018:ffff8806044f1d78  EFLAGS: 00010202
        RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
        RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
        RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
        R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
        R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
        FS:  00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
        Stack:
         ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
         ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
         0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
        Call Trace:
          __put_task_struct+0x5d/0x97
          kthread_stop+0x50/0x58
          offline_pages+0x324/0x3da
          memory_block_change_state+0x179/0x1db
          store_mem_state+0x9e/0xbb
          sysfs_write_file+0xd0/0x107
          vfs_write+0xad/0x169
          sys_write+0x45/0x6e
          system_call_fastpath+0x16/0x1b
        Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
        RIP  exit_creds+0x12/0x78
         RSP <ffff8806044f1d78>
        CR2: 0000000000000000
      
      [akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      0906d248
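The stale pointer problem in the memory hotplug commit above comes down to a start/stop pair that must leave the handle in a state the next start can trust. A userspace sketch follows, with a heap allocation standing in for the worker thread; `demo_node`, `demo_kswapd_run` and `demo_kswapd_stop` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy per-node state; the int allocation stands in for a worker thread. */
struct demo_node {
    int *kswapd;
};

/* Returns 1 if a worker was created, 0 if one already exists, -1 on failure. */
int demo_kswapd_run(struct demo_node *node)
{
    assert(node != NULL);
    if (node->kswapd != NULL)
        return 0;               /* a non-NULL handle is assumed to be live */
    node->kswapd = malloc(sizeof *node->kswapd);
    return node->kswapd != NULL ? 1 : -1;
}

/* Destroy the worker and reset the handle so run() can create a new one. */
void demo_kswapd_stop(struct demo_node *node)
{
    assert(node != NULL);
    free(node->kswapd);
    node->kswapd = NULL;        /* without this, the handle would dangle */
}
```

Because `demo_kswapd_run` treats any non-NULL handle as a running worker, omitting the reset in `demo_kswapd_stop` would both block re-creation and leave a pointer to freed memory behind, which mirrors the two symptoms in the commit message.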