Skip to content
Snippets Groups Projects
  1. Jun 03, 2019
    • Linus Torvalds's avatar
      rcu: locking and unlocking need to always be at least barriers · 66be4e66
      Linus Torvalds authored
      
      Herbert Xu pointed out that commit bb73c52b ("rcu: Don't disable
      preemption for Tiny and Tree RCU readers") was incorrect in making the
      preempt_disable/enable() be conditional on CONFIG_PREEMPT_COUNT.
      
      If CONFIG_PREEMPT_COUNT isn't enabled, the preemption enable/disable is
      a no-op, but still is a compiler barrier.
      
      And RCU locking still _needs_ that compiler barrier.
      
      It is simply fundamentally not true that RCU locking would be a complete
      no-op: we still need to guarantee (for example) that things that can
      trap and cause preemption cannot migrate into the RCU locked region.
      
      The way we do that is by making it a barrier.
      
      See for example commit 386afc91 ("spinlocks and preemption points
      need to be at least compiler barriers") from back in 2013 that had
      similar issues with spinlocks that become no-ops on UP: they must still
      constrain the compiler from moving other operations into the critical
      region.
      
      Now, it is true that a lot of RCU operations already use READ_ONCE() and
      WRITE_ONCE() (which in practice likely would never be re-ordered wrt
      anything remotely interesting), but it is also true that that is not
      globally the case, and that it's not even necessarily always possible
      (ie bitfields etc).
      
      Reported-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Fixes: bb73c52b ("rcu: Don't disable preemption for Tiny and Tree RCU readers")
      Cc: stable@kernel.org
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      66be4e66
  2. Jun 01, 2019
    • Jonathan Corbet's avatar
      include/linux/generic-radix-tree.h: fix kerneldoc comment · 590ba22b
      Jonathan Corbet authored
      The DOC comment block section in include/linux/generic-radix-tree.h
      contained a spurious colon, causing this warning in the documentation
      build:
      
        include/linux/generic-radix-tree.h:1: warning: no structured comments found
      
      Remove the colon and make the docs build happy.
      
      Link: http://lkml.kernel.org/r/20190524141933.74ae9050@lwn.net
      
      
      Signed-off-by: default avatarJonathan Corbet <corbet@lwn.net>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      590ba22b
    • Jiri Slaby's avatar
      memcg: make it work on sparse non-0-node systems · 3e858996
      Jiri Slaby authored
      We have a single node system with node 0 disabled:
        Scanning NUMA topology in Northbridge 24
        Number of physical nodes 2
        Skipping disabled node 0
        Node 1 MemBase 0000000000000000 Limit 00000000fbff0000
        NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfeffff]
      
      This causes crashes in memcg when system boots:
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        #PF error: [normal kernel read fault]
      ...
        RIP: 0010:list_lru_add+0x94/0x170
      ...
        Call Trace:
         d_lru_add+0x44/0x50
         dput.part.34+0xfc/0x110
         __fput+0x108/0x230
         task_work_run+0x9f/0xc0
         exit_to_usermode_loop+0xf5/0x100
      
      It is reproducible as far as 4.12.  I did not try older kernels.  You have
      to have a new enough systemd, e.g.  241 (the reason is unknown -- was not
      investigated).  Cannot be reproduced with systemd 234.
      
      The system crashes because the size of lru array is never updated in
      memcg_update_all_list_lrus and the reads are past the zero-sized array,
      causing dereferences of random memory.
      
      The root cause are list_lru_memcg_aware checks in the list_lru code.  The
      test in list_lru_memcg_aware is broken: it assumes node 0 is always
      present, but it is not true on some systems as can be seen above.
      
      So fix this by avoiding checks on node 0.  Remember the memcg-awareness by
      a bool flag in struct list_lru.
      
      Link: http://lkml.kernel.org/r/20190522091940.3615-1-jslaby@suse.cz
      
      
      Fixes: 60d3fd32 ("list_lru: introduce per-memcg lists")
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Suggested-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3e858996
    • Chris Down's avatar
      mm, memcg: consider subtrees in memory.events · 9852ae3f
      Chris Down authored
      memory.stat and other files already consider subtrees in their output, and
      we should too in order to not present an inconsistent interface.
      
      The current situation is fairly confusing, because people interacting with
      cgroups expect hierarchical behaviour in the vein of memory.stat,
      cgroup.events, and other files.  For example, this causes confusion when
      debugging reclaim events under low, as currently these always read "0" at
      non-leaf memcg nodes, which frequently causes people to misdiagnose breach
      behaviour.  The same confusion applies to other counters in this file when
      debugging issues.
      
      Aggregation is done at write time instead of at read-time since these
      counters aren't hot (unlike memory.stat which is per-page, so it does it
      at read time), and it makes sense to bundle this with the file
      notifications.
      
      After this patch, events are propagated up the hierarchy:
      
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 0
          oom 0
          oom_kill 0
          [root@ktst ~]# systemd-run -p MemoryMax=1 true
          Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 7
          oom 1
          oom_kill 1
      
      As this is a change in behaviour, this can be reverted to the old
      behaviour by mounting with the `memory_localevents' flag set.  However, we
      use the new behaviour by default as there's a lack of evidence that there
      are any current users of memory.events that would find this change
      undesirable.
      
      akpm: this is a behaviour change, so Cc:stable.  THis is so that
      forthcoming distros which use cgroup v2 are more likely to pick up the
      revised behaviour.
      
      Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
      
      
      Signed-off-by: default avatarChris Down <chris@chrisdown.name>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9852ae3f
  3. May 31, 2019
  4. May 30, 2019
  5. May 27, 2019
    • Miklos Szeredi's avatar
      fuse: add FUSE_WRITE_KILL_PRIV · 4a2abf99
      Miklos Szeredi authored
      
      In the FOPEN_DIRECT_IO case the write path doesn't call file_remove_privs()
      and that means setuid bit is not cleared if unpriviliged user writes to a
      file with setuid bit set.
      
      pjdfstest chmod test 12.t tests this and fails.
      
      Fix this by adding a flag to the FUSE_WRITE message that requests clearing
      privileges on the given file.  This needs 
      
      This better than just calling fuse_remove_privs(), because the attributes
      may not be up to date, so in that case a write may miss clearing the
      privileges.
      
      Test case:
      
        $ passthrough_ll /mnt/pasthrough-mnt -o default_permissions,allow_other,cache=never
        $ mkdir /mnt/pasthrough-mnt/testdir
        $ cd /mnt/pasthrough-mnt/testdir
        $ prove -rv pjdfstests/tests/chmod/12.t
      
      Reported-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Tested-by: default avatarVivek Goyal <vgoyal@redhat.com>
      4a2abf99
    • Rafael J. Wysocki's avatar
      PCI: PM: Avoid possible suspend-to-idle issue · d491f2b7
      Rafael J. Wysocki authored
      
      If a PCI driver leaves the device handled by it in D0 and calls
      pci_save_state() on the device in its ->suspend() or ->suspend_late()
      callback, it can expect the device to stay in D0 over the whole
      s2idle cycle.  However, that may not be the case if there is a
      spurious wakeup while the system is suspended, because in that case
      pci_pm_suspend_noirq() will run again after pci_pm_resume_noirq()
      which calls pci_restore_state(), via pci_pm_default_resume_early(),
      so state_saved is cleared and the second iteration of
      pci_pm_suspend_noirq() will invoke pci_prepare_to_sleep() which
      may change the power state of the device.
      
      To avoid that, add a new internal flag, skip_bus_pm, that will be set
      by pci_pm_suspend_noirq() when it runs for the first time during the
      given system suspend-resume cycle if the state of the device has
      been saved already and the device is still in D0.  Setting that flag
      will cause the next iterations of pci_pm_suspend_noirq() to set
      state_saved for pci_pm_resume_noirq(), so that it always restores the
      device state from the originally saved data, and avoid calling
      pci_prepare_to_sleep() for the device.
      
      Fixes: 33e4f80e ("ACPI / PM: Ignore spurious SCI wakeups from suspend-to-idle")
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: default avatarKeith Busch <keith.busch@intel.com>
      Reviewed-by: default avatarMika Westerberg <mika.westerberg@linux.intel.com>
      d491f2b7
    • Rafael J. Wysocki's avatar
      ACPI: PM: Call pm_set_suspend_via_firmware() during hibernation · bb186901
      Rafael J. Wysocki authored
      
      On systems with ACPI platform firmware the last stage of hibernation
      is analogous to system suspend to S3 (suspend-to-RAM), so it should
      be handled analogously.  In particular, pm_suspend_via_firmware()
      should return 'true' in that stage to let the callers of it know that
      control will be passed to the platform firmware going forward, so
      pm_set_suspend_via_firmware() needs to be called then in analogy with
      acpi_suspend_begin().
      
      However, the platform hibernation ->begin() callback is invoked
      during the "freeze" transition (before creating a snapshot image of
      system memory) as well as during the "hibernate" transition which is
      the last stage of it and pm_set_suspend_via_firmware() should be
      invoked by that callback in the latter stage only.
      
      In order to implement that redefine the hibernation ->begin()
      callback to take a pm_message_t argument to indicate which stage
      of hibernation is taking place and rework acpi_hibernation_begin()
      and acpi_hibernation_begin_old() to take it into account as needed.
      
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      bb186901
  6. May 25, 2019
  7. May 24, 2019
Loading