  1. Mar 08, 2019
    • epoll: use rwlock in order to reduce ep_poll_callback() contention · a218cc49
      Roman Penyaev authored
      The goal of this patch is to reduce contention in ep_poll_callback(),
      which can be called concurrently from different CPUs in the case of high
      event rates and many fds per epoll.  The problem is easily reproduced by
      generating events (writes to a pipe or eventfd) from many threads while a
      consumer thread does the polling.  In other words, this patch increases
      the bandwidth of events that can be delivered from sources to the poller
      by adding poll items to the list in a lockless way.
      
      The main change is the replacement of the spinlock with an rwlock, which
      is taken for read in ep_poll_callback(); poll items are then added to the
      tail of the list using the xchg atomic instruction.  The write lock is
      taken everywhere else in order to stop list modifications and to
      guarantee that list updates are fully completed (the assumption is that
      the write side of an rwlock does not starve; the qrwlock implementation
      appears to provide that guarantee).
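
      Conceptually, the producer side boils down to the following minimal
      userspace sketch (C11 atomics); it only illustrates the xchg-based
      tail-append idea, and the in-kernel helper added by this commit, which
      operates on struct list_head under the read-held rwlock, differs in
      detail:

        #include <stdatomic.h>
        #include <stddef.h>

        struct node {
            _Atomic(struct node *) next;
        };

        struct ready_list {
            struct node head;               /* sentinel; head.next is the first element */
            _Atomic(struct node *) tail;    /* last element, or &head when empty */
        };

        /* Many CPUs may run this concurrently (the analogue of holding the
         * epoll rwlock for read); each call appends exactly one node. */
        static void add_tail_lockless(struct ready_list *l, struct node *n)
        {
            atomic_store(&n->next, NULL);
            /* Atomically become the new tail; prev is whoever was tail before us. */
            struct node *prev = atomic_exchange(&l->tail, n);
            /* Publish the link.  Once all producers have finished (which the
             * write-lock side waits for), head.next .. tail is a fully linked
             * chain. */
            atomic_store(&prev->next, n);
        }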
      
      The following are some microbenchmark results based on the test [1],
      which starts threads that each generate N events.  The test ends when all
      events have been successfully fetched by the poller thread:
      
       spinlock
       ========
      
       threads  events/ms  run-time ms
             8       6402        12495
            16       7045        22709
            32       7395        43268
      
       rwlock + xchg
       =============
      
       threads  events/ms  run-time ms
             8      10038         7969
            16      12178        13138
            32      13223        24199
      
      According to the results, the bandwidth of delivered events is
      significantly increased and execution time is reduced accordingly.
      
      This patch was tested with different sorts of microbenchmarks and with
      artificial delays (e.g. "udelay(get_random_int() & 0xff)") introduced in
      the kernel on paths where items are added to lists.
      
      [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
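
      For reference, the kind of stress pattern described above (many producer
      threads writing an eventfd while a single thread polls) looks roughly
      like the sketch below; it is loosely modeled on [1], with error handling
      omitted and the thread/event counts chosen arbitrarily:

        #include <pthread.h>
        #include <sys/epoll.h>
        #include <sys/eventfd.h>
        #include <unistd.h>

        #define NTHREADS 8
        #define NEVENTS  100000

        static void *producer(void *arg)
        {
            int fd = *(int *)arg;

            for (long i = 0; i < NEVENTS; i++)
                eventfd_write(fd, 1);   /* each write may run ep_poll_callback() */
            return NULL;
        }

        int main(void)
        {
            pthread_t tid[NTHREADS];
            int efd[NTHREADS];
            int epfd = epoll_create1(0);

            for (int i = 0; i < NTHREADS; i++) {
                efd[i] = eventfd(0, EFD_NONBLOCK);
                struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd[i] };
                epoll_ctl(epfd, EPOLL_CTL_ADD, efd[i], &ev);
                pthread_create(&tid[i], NULL, producer, &efd[i]);
            }

            /* Consumer: drain until nothing has been ready for a while. */
            for (;;) {
                struct epoll_event out[NTHREADS];
                int n = epoll_wait(epfd, out, NTHREADS, 100);

                if (n <= 0)
                    break;
                for (int i = 0; i < n; i++) {
                    eventfd_t val;
                    eventfd_read(out[i].data.fd, &val);   /* consume the counter */
                }
            }

            for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
            close(epfd);
            return 0;
        }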
      
      Link: http://lkml.kernel.org/r/20190103150104.17128-5-rpenyaev@suse.de
      
      
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: unify awaking of wakeup source on ep_poll_callback() path · c3e320b6
      Roman Penyaev authored
      Original comment "Activate ep->ws since epi->ws may get deactivated at
      any time" indeed sounds loud, but it is incorrect, because the path
      where we check epi->ws is a path where insert to ovflist happens, i.e.
      ep_scan_ready_list() has taken ep->mtx and waits for this callback to
      finish, thus ep_modify() (which unregisters wakeup source) waits for
      ep_scan_ready_list().
      
      In this patch I simply call ep_pm_stay_awake_rcu(), which is slightly
      more than this path needs (it is indirectly protected by the main
      ep->mtx, so even rcu is not required), but I do not want to create
      another naked __ep_pm_stay_awake() variant only for this particular case,
      so the rcu variant is simply better for all cases.
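
      For reference, the rcu variant referred to here follows roughly the
      pattern below (a simplified sketch of the fs/eventpoll.c helper, not a
      verbatim copy): the wakeup source is looked up under rcu_read_lock() so
      a concurrent ep_modify() cannot free it while it is being used.

        static void ep_pm_stay_awake_rcu(struct epitem *epi)
        {
            struct wakeup_source *ws;

            rcu_read_lock();
            ws = rcu_dereference(epi->ws);
            if (ws)
                __pm_stay_awake(ws);
            rcu_read_unlock();
        }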
      
      Link: http://lkml.kernel.org/r/20190103150104.17128-4-rpenyaev@suse.de
      
      
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: make sure all elements in ready list are in FIFO order · c141175d
      Roman Penyaev authored
      Patch series "use rwlock in order to reduce ep_poll_callback()
      contention", v3.
      
      The last patch targets the contention problem in ep_poll_callback(),
      which is easily reproduced by generating events (writes to a pipe or
      eventfd) from many threads while a consumer thread does the polling.
      
      The following are some microbenchmark results based on the test [1],
      which starts threads that each generate N events.  The test ends when all
      events have been successfully fetched by the poller thread:
      
       spinlock
       ========
      
       threads  events/ms  run-time ms
             8       6402        12495
            16       7045        22709
            32       7395        43268
      
       rwlock + xchg
       =============
      
       threads  events/ms  run-time ms
             8      10038         7969
            16      12178        13138
            32      13223        24199
      
      According to the results, the bandwidth of delivered events is
      significantly increased and execution time is reduced accordingly.
      
      This patch (of 4):
      
      All incoming events are stored in FIFO order, and this should also apply
      to ->ovflist, which is originally a stack, i.e. LIFO.

      Thus, to keep correct FIFO order, ->ovflist should be reversed by adding
      elements to the head of the ready list rather than to the tail.
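
      As a plain-C illustration of the reversal (not the kernel list API):
      ->ovflist is built by pushing each new event at its head, i.e. LIFO, so
      walking it and inserting every element at the head of the ready list
      reverses it again and the poller ends up consuming events in FIFO order.

        struct item {
            struct item *next;
        };

        /* ovflist: newest element first (LIFO).  Returns the new ready-list
         * head; the oldest ovflist event ends up first, i.e. FIFO. */
        static struct item *splice_ovflist_fifo(struct item *ovflist,
                                                struct item *ready)
        {
            while (ovflist) {
                struct item *e = ovflist;

                ovflist = e->next;
                e->next = ready;    /* head insert undoes the LIFO order */
                ready = e;
            }
            return ready;
        }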
      
      Link: http://lkml.kernel.org/r/20190103150104.17128-2-rpenyaev@suse.de
      
      
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Dec 06, 2018
    • signal: Add restore_user_sigmask() · 854a6ed5
      Deepa Dinamani authored
      
      Refactor the logic that restores the sigmask before the syscall returns
      into an API.

      This is useful for versions of syscalls that pass in a sigmask and expect
      current->sigmask to be changed during execution of the syscall and
      restored afterwards.

      With the advent of the new y2038 syscalls in the subsequent patches, we
      add two more new versions of these syscalls (for pselect, ppoll and
      io_pgetevents) in addition to the existing native and compat versions.
      Adding such an API reduces the logic that would otherwise need to be
      replicated.
      
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    • signal: Add set_user_sigmask() · ded653cc
      Deepa Dinamani authored
      
      Refactor reading a sigset from userspace and updating the sigmask into an
      API.

      This is useful for versions of syscalls that pass in a sigmask and expect
      current->sigmask to be changed during, and restored after, the execution
      of the syscall.

      With the advent of the new y2038 syscalls in the subsequent patches, we
      add two more new versions of these syscalls (for pselect, ppoll, and
      io_pgetevents) in addition to the existing native and compat versions.
      Adding such an API reduces the logic that would otherwise need to be
      replicated.

      Note that the calls to sigprocmask() ignored its return value, as the
      function only returns an error for an invalid first argument, which is
      hardcoded at these call sites.  The updated logic uses
      set_current_blocked() instead.
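
      As an illustration of the set/restore pattern that this helper and the
      restore_user_sigmask() from the previous entry factor out, a hedged
      userspace analogue of a ppoll-style call is sketched below; the in-kernel
      helpers have different signatures, and the kernel additionally makes the
      mask swap atomic with respect to the wait itself.

        #include <poll.h>
        #include <signal.h>

        /* Userspace sketch only: install the caller's sigmask around the wait
         * and restore the old one afterwards. */
        static int ppoll_like(struct pollfd *fds, nfds_t nfds, int timeout_ms,
                              const sigset_t *sigmask)
        {
            sigset_t saved;
            int ret;

            if (sigmask)
                sigprocmask(SIG_SETMASK, sigmask, &saved);  /* "set_user_sigmask" */

            ret = poll(fds, nfds, timeout_ms);              /* the wait itself */

            if (sigmask)
                sigprocmask(SIG_SETMASK, &saved, NULL);     /* "restore_user_sigmask" */
            return ret;
        }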
      
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  3. Jun 28, 2018
    • Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds authored
      
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. Feb 11, 2018
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds authored
      
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. Nov 18, 2017
    • epoll: remove ep_call_nested() from ep_eventpoll_poll() · 37b5e521
      Jason Baron authored
      ep_call_nested() is used in ep_eventpoll_poll(), the .poll routine for an
      epoll fd, to prevent excessively deep epoll nesting and to prevent
      circular paths.

      However, we already prevent these conditions during EPOLL_CTL_ADD.  In
      terms of overly deep epoll chains, we do in fact allow deep nesting of
      the epoll fds themselves (deeper than EP_MAX_NESTS), but we don't allow
      more than EP_MAX_NESTS once an epoll file descriptor is actually
      connected to a wakeup source.  Thus we do not require ep_call_nested()
      here, since ep_eventpoll_poll(), which is called via
      ep_scan_ready_list(), only continues nesting if there are events
      available.
      
      Since ep_call_nested() is implemented using a global lock, applications
      that make use of nested epoll can see large performance improvements
      with this change.
      
      Davidlohr said:
      
      : Improvements are quite obscene actually, such as for the following
      : epoll_wait() benchmark with 2 level nesting on a 80 core IvyBridge:
      :
      : ncpus  vanilla     dirty     delta
      : 1      2447092     3028315   +23.75%
      : 4      231265      2986954   +1191.57%
      : 8      121631      2898796   +2283.27%
      : 16     59749       2902056   +4757.07%
      : 32     26837       2326314   +8568.30%
      : 64     12926       1341281   +10276.61%
      :
      : (http://linux-scalability.org/epoll/epoll-test.c)
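
      For context, "nested epoll" here means registering one epoll file
      descriptor inside another, so that the outer instance polls the inner
      one.  A minimal (hypothetical) userspace example of the pattern:

        #include <sys/epoll.h>
        #include <sys/eventfd.h>
        #include <unistd.h>

        int main(void)
        {
            int inner = epoll_create1(0);
            int outer = epoll_create1(0);
            int efd   = eventfd(0, 0);
            struct epoll_event ev = { .events = EPOLLIN };

            ev.data.fd = efd;
            epoll_ctl(inner, EPOLL_CTL_ADD, efd, &ev);    /* real wakeup source */

            ev.data.fd = inner;
            epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);  /* epoll inside epoll */

            eventfd_write(efd, 1);                        /* make 'inner' readable */

            struct epoll_event out;
            int n = epoll_wait(outer, &out, 1, 1000);     /* outer reports inner ready */

            close(efd);
            close(inner);
            close(outer);
            return n == 1 ? 0 : 1;
        }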
      
      Link: http://lkml.kernel.org/r/1509430214-5599-1-git-send-email-jbaron@akamai.com
      
      
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Salman Qazi <sqazi@google.com>
      Cc: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: avoid calling ep_call_nested() from ep_poll_safewake() · 57a173bd
      Jason Baron authored
      ep_poll_safewake() is used to wakeup potentially nested epoll file
      descriptors.  The function uses ep_call_nested() to prevent entering the
      same wake up queue more than once, and to prevent excessively deep
      wakeup paths (deeper than EP_MAX_NESTS).  However, this is not necessary
      since we are already preventing these conditions during EPOLL_CTL_ADD.
      This saves extra function calls, and avoids taking a global lock during
      the ep_call_nested() calls.
      
      I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
      case, since ep_call_nested() keeps track of the nesting level, and this
      is required by the call to spin_lock_irqsave_nested().  It would be nice
      to remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC
      case as well; however, it's not clear how to simply pass the nesting level
      through multiple wake_up() levels without more surgery.  In any case, I
      don't think CONFIG_DEBUG_LOCK_ALLOC is generally used for production.
      This patch also apparently fixes a workload at Google that Salman Qazi
      reported, by completely removing the poll_safewake_ncalls->lock from
      wakeup paths.
      
      Link: http://lkml.kernel.org/r/1507920533-8812-1-git-send-email-jbaron@akamai.com
      
      
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Salman Qazi <sqazi@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: account epitem and eppoll_entry to kmemcg · 2ae928a9
      Shakeel Butt authored
      A userspace application can directly trigger the allocations from
      eventpoll_epi and eventpoll_pwq slabs.  A buggy or malicious application
      can consume a significant amount of system memory by triggering such
      allocations.  Indeed we have seen in production where a buggy
      application was leaking the epoll references and causing a burst of
      eventpoll_epi and eventpoll_pwq slab allocations.  This patch opts in
      the charging of the eventpoll_epi and eventpoll_pwq slabs.
      
      There is a per-user limit (~4% of total memory if no highmem) on these
      caches.  I think it is too generous particularly in the scenario where
      jobs of multiple users are running on the system and the administrator
      is reducing cost by overcommitting the memory.  This is unaccounted
      kernel memory and will not be considered by the oom-killer.  I think by
      accounting it to kmemcg, for systems with kmem accounting enabled, we
      can provide better isolation between jobs of different users.
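
      Concretely, the opt-in amounts to creating the two caches with the
      SLAB_ACCOUNT flag so that their allocations are charged to the
      allocating task's memcg.  A simplified sketch of what the cache-creation
      calls look like with that flag (kernel-code fragment, not the verbatim
      fs/eventpoll.c init path):

        epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
                        0, SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);

        pwq_cache = kmem_cache_create("eventpoll_pwq", sizeof(struct eppoll_entry),
                        0, SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);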
      
      Link: http://lkml.kernel.org/r/20171003021519.23907-1-shakeelb@google.com
      
      
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. Sep 01, 2017
    • epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove() · 138e4ad6
      Oleg Nesterov authored
      
      The race was introduced by me in commit 971316f0 ("epoll:
      ep_unregister_pollwait() can use the freed pwq->whead").  I did not
      realize that nothing can protect eventpoll after ep_poll_callback() sets
      ->whead = NULL; only whead->lock can save us from the race with
      ep_free() or ep_remove().
      
      Move ->whead = NULL to the end of ep_poll_callback() and add the
      necessary barriers.
      
      TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even
      before this patch.
      
      Hopefully this explains the use-after-free reported by syzkaller:
      
      	BUG: KASAN: use-after-free in debug_spin_lock_before
      	...
      	 _raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
      	 ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148
      
      this is spin_lock(eventpoll->lock),
      
      	...
      	Freed by task 17774:
      	...
      	 kfree+0xe8/0x2c0 mm/slub.c:3883
      	 ep_free+0x22c/0x2a0 fs/eventpoll.c:865
      
      Fixes: 971316f0 ("epoll: ep_unregister_pollwait() can use the freed pwq->whead")
      Reported-by: 范龙飞 <long7573@126.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>