  1. Jul 13, 2019
    • locking/lockdep: Fix lock used or unused stats error · 68d41d8c
      Yuyang Du authored
      
      The stats variable nr_unused_locks is incremented every time a new lock
      class is registered and decremented when the lock is first used in
      __lock_acquire(). Eventually, it is shown and checked in lockdep_stats.
      
      However, under configurations where either CONFIG_TRACE_IRQFLAGS or
      CONFIG_PROVE_LOCKING is not defined:
      
      The commit:
      
        09180651 ("locking/lockdep: Consolidate lock usage bit initialization")
      
      missed marking the LOCK_USED flag at IRQ usage initialization because
      mark_usage() is not called. And the commit:
      
        886532ae ("locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")
      
      further left mark_lock() undefined, so that LOCK_USED cannot be marked
      at all when the lock is first acquired.
      
      Fix this by neither showing nor checking the stat in lockdep_stats
      under such configurations.
      
      Reported-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Yuyang Du <duyuyang@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: arnd@arndb.de
      Cc: frederic@kernel.org
      Link: https://lkml.kernel.org/r/20190709101522.9117-1-duyuyang@gmail.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix preempt warning in ttwu · e3d85487
      Peter Zijlstra authored
      
      John reported a DEBUG_PREEMPT warning caused by commit:
      
        aacedf26 ("sched/core: Optimize try_to_wake_up() for local wakeups")
      
      I overlooked that ttwu_stat() requires preemption disabled.
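
      The shape of the fix, as a hedged sketch (the exact placement inside
      try_to_wake_up() is an assumption based on the description above):
      ttwu_stat() uses per-CPU data, so the local-wakeup fast path must keep
      preemption disabled around it:

        preempt_disable();
        if (p == current) {
                /* Local wakeup: smp_processor_id() is only stable because
                 * preemption is disabled here. */
                ttwu_stat(p, smp_processor_id(), wake_flags);
        }
        preempt_enable();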
      
      Reported-by: John Stultz <john.stultz@linaro.org>
      Tested-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: aacedf26 ("sched/core: Optimize try_to_wake_up() for local wakeups")
      Link: https://lkml.kernel.org/r/20190710105736.GK3402@hirez.programming.kicks-ass.net
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix exclusive events' grouping · 8a58ddae
      Alexander Shishkin authored
      
      So far, we have tried to disallow grouping exclusive events, for fear of
      the complications they would cause when moving between contexts.
      Specifically, moving a software group to a hardware context would
      violate the exclusivity rules if both groups contain matching exclusive
      events.
      
      This attempt was, however, unsuccessful: the check in the
      perf_event_open() syscall is both wrong (it looks at the wrong PMU) and
      insufficient (the group leader may still be exclusive), as can be
      illustrated by running:
      
        $ perf record -e '{intel_pt//,cycles}' uname
        $ perf record -e '{cycles,intel_pt//}' uname
      
      both of which ultimately succeed.
      
      Furthermore, we are completely free to trigger the exclusivity violation
      by:
      
         perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'
      
      even though the helpful perf record will not allow that, the ABI will.
      
      The warning later in the perf_event_open() path will also not trigger, because
      it's also wrong.
      
      Fix all this by validating the original group before moving, getting rid
      of the broken safeguards and placing a useful one in
      perf_install_in_context().
      
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: mathieu.poirier@linaro.org
      Cc: will.deacon@arm.com
      Fixes: bed5b25a ("perf: Add a pmu capability for "exclusive" events")
      Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix race between close() and fork() · 1cf8dfe8
      Peter Zijlstra authored
      
      Syzkaller reported the following use-after-free bug:
      
      	close()						clone()
      
      							  copy_process()
      							    perf_event_init_task()
      							      perf_event_init_context()
      							        mutex_lock(parent_ctx->mutex)
      								inherit_task_group()
      								  inherit_group()
      								    inherit_event()
      								      mutex_lock(event->child_mutex)
      								      // expose event on child list
      								      list_add_tail()
      								      mutex_unlock(event->child_mutex)
      							        mutex_unlock(parent_ctx->mutex)
      
      							    ...
      							    goto bad_fork_*
      
      							  bad_fork_cleanup_perf:
      							    perf_event_free_task()
      
      	  perf_release()
      	    perf_event_release_kernel()
      	      list_for_each_entry()
      		mutex_lock(ctx->mutex)
      		mutex_lock(event->child_mutex)
      		// event is from the failing inherit
      		// on the other CPU
      		perf_remove_from_context()
      		list_move()
      		mutex_unlock(event->child_mutex)
      		mutex_unlock(ctx->mutex)
      
      							      mutex_lock(ctx->mutex)
      							      list_for_each_entry_safe()
      							        // event already stolen
      							      mutex_unlock(ctx->mutex)
      
      							    delayed_free_task()
      							      free_task()
      
      	     list_for_each_entry_safe()
      	       list_del()
      	       free_event()
      	         _free_event()
      		   // and so event->hw.target
      		   // is the already freed failed clone()
      		   if (event->hw.target)
      		     put_task_struct(event->hw.target)
      		       // WHOOPSIE, already quite dead
      
      Which gives the lie to the comment on perf_event_free_task():
      'unexposed, unused context', not so much.
      
      Which is a 'fun' confluence of failures: copy_process() doing an
      unconditional free_task() and not respecting refcounts, and perf having
      creative locking. In particular:
      
        82d94856 ("perf/core: Fix lock inversion between perf,trace,cpuhp")
      
      seems to have overlooked this 'fun' parade.
      
      Solve it by using the fact that detached events still have a reference
      count on their (previous) context. With this perf_event_free_task()
      can detect when events have escaped and wait for their destruction.
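
      A hedged sketch of the waiting side (wait_var_event()/wake_up_var() are
      the stock kernel primitives; their use here is an assumption based on
      the description above):

        /*
         * Any escaped child event still pins ctx; wait until we hold the
         * last reference, then freeing the task is safe.
         */
        wait_var_event(&ctx->refcount,
                       refcount_read(&ctx->refcount) == 1);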
      
      Debugged-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Reported-by: <syzbot+a24c397a29ad22d86c98@syzkaller.appspotmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 82d94856 ("perf/core: Fix lock inversion between perf,trace,cpuhp")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. Jul 04, 2019
    • ptrace: Fix ->ptracer_cred handling for PTRACE_TRACEME · 6994eefb
      Jann Horn authored
      
      Fix two issues:
      
      When called for PTRACE_TRACEME, ptrace_link() would obtain an RCU
      reference to the parent's objective credentials, then give that pointer
      to get_cred().  However, the object lifetime rules for things like
      struct cred do not permit unconditionally turning an RCU reference into
      a stable reference.
      
      PTRACE_TRACEME records the parent's credentials as if the parent was
      acting as the subject, but that's not the case.  If a malicious
      unprivileged child uses PTRACE_TRACEME and the parent is privileged, and
      at a later point, the parent process becomes attacker-controlled
      (because it drops privileges and calls execve()), the attacker ends up
      with control over two processes with a privileged ptrace relationship,
      which can be abused to ptrace a suid binary and obtain root privileges.
      
      Fix both of these by always recording the credentials of the process
      that is requesting the creation of the ptrace relationship:
      current_cred() can't change under us, and current is the proper subject
      for access control.
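
      A sketch consistent with that description (the signature of the internal
      __ptrace_link() helper is assumed):

        static void ptrace_link(struct task_struct *child,
                                struct task_struct *new_parent)
        {
                /* Record the requester's credentials, not the parent's. */
                __ptrace_link(child, new_parent, current_cred());
        }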
      
      This change is theoretically userspace-visible, but I am not aware of
      any code that it will actually break.
      
      Fixes: 64b875f7 ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP")
      Signed-off-by: Jann Horn <jannh@google.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. Jul 01, 2019
    • fork: return proper negative error code · 28dd29c0
      Christian Brauner authored
      Make sure to return a proper negative error code from copy_process()
      when anon_inode_getfile() fails with CLONE_PIDFD.
      Otherwise _do_fork() will not detect an error and get_task_pid() will
      operate on a nonsensical pointer:
      
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
      R13: 00007ffc15fbb0ff R14: 00007ff07e47e9c0 R15: 0000000000000000
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 1 PID: 7990 Comm: syz-executor290 Not tainted 5.2.0-rc6+ #9
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
      RIP: 0010:get_task_pid+0xe1/0x210 kernel/pid.c:372
      Code: 89 ff e8 62 27 5f 00 49 8b 07 44 89 f1 4c 8d bc c8 90 01 00 00 eb 0c
      e8 0d fe 25 00 49 81 c7 38 05 00 00 4c 89 f8 48 c1 e8 03 <80> 3c 18 00 74
      08 4c 89 ff e8 31 27 5f 00 4d 8b 37 e8 f9 47 12 00
      RSP: 0018:ffff88808a4a7d78 EFLAGS: 00010203
      RAX: 00000000000000a7 RBX: dffffc0000000000 RCX: ffff888088180600
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff88808a4a7d90 R08: ffffffff814fb3a8 R09: ffffed1015d66bf8
      R10: ffffed1015d66bf8 R11: 1ffff11015d66bf7 R12: 0000000000041ffc
      R13: 1ffff11011494fbc R14: 0000000000000000 R15: 000000000000053d
      FS:  00007ff07e47e700(0000) GS:ffff8880aeb00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000004b5100 CR3: 0000000094df2000 CR4: 00000000001406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
        _do_fork+0x1b9/0x5f0 kernel/fork.c:2360
        __do_sys_clone kernel/fork.c:2454 [inline]
        __se_sys_clone kernel/fork.c:2448 [inline]
        __x64_sys_clone+0xc1/0xd0 kernel/fork.c:2448
        do_syscall_64+0xfe/0x140 arch/x86/entry/common.c:301
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
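
      A hedged sketch of the corrected error path in copy_process() (variable
      names are assumptions; the point is propagating PTR_ERR() instead of
      leaving retval at 0):

        pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
                                     O_RDWR | O_CLOEXEC);
        if (IS_ERR(pidfile)) {
                put_unused_fd(pidfd);
                retval = PTR_ERR(pidfile);      /* previously left unset */
                goto bad_fork_free_pid;
        }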
      
      Link: https://lore.kernel.org/lkml/000000000000e0dc0d058c9e7142@google.com
      
      
      Reported-and-tested-by: <syzbot+002e636502bc4b64eb5c@syzkaller.appspotmail.com>
      Fixes: 6fd2fe49 ("copy_process(): don't use ksys_close() on cleanups")
      Cc: Jann Horn <jannh@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Christian Brauner <christian@brauner.io>
  4. Jun 28, 2019
    • devmap: Allow map lookups from eBPF · 0cdbb4b0
      Toke Høiland-Jørgensen authored
      
      We don't currently allow lookups into a devmap from eBPF, because the map
      lookup returns a pointer directly to the dev->ifindex, which shouldn't be
      modifiable from eBPF.
      
      However, being able to do lookups in devmaps is useful to know (e.g.)
      whether forwarding to a specific interface is enabled. Currently, programs
      work around this by keeping a shadow map of another type which indicates
      whether a map index is valid.
      
      Since we now have a flag to make maps read-only from the eBPF side, we can
      simply lift the lookup restriction if we make sure this flag is always set.
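
      A hedged example of what a program can now do (2019-era
      bpf_map_def/SEC() style; the map name and sizes are illustrative):

        #include <linux/bpf.h>
        #include "bpf_helpers.h"

        struct bpf_map_def SEC("maps") tx_ports = {
                .type           = BPF_MAP_TYPE_DEVMAP,
                .key_size       = sizeof(int),
                .value_size     = sizeof(int),
                .max_entries    = 64,
        };

        SEC("xdp")
        int xdp_fwd_if_enabled(struct xdp_md *ctx)
        {
                int key = 0;

                /* The lookup is now allowed; the value is read-only to us. */
                if (bpf_map_lookup_elem(&tx_ports, &key))
                        return bpf_redirect_map(&tx_ports, key, 0);

                return XDP_PASS;
        }

        char _license[] SEC("license") = "GPL";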
      
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • devmap/cpumap: Use flush list instead of bitmap · d5df2830
      Toke Høiland-Jørgensen authored
      
      The socket map uses a linked list instead of a bitmap to keep track of
      which entries to flush. Do the same for devmap and cpumap, as this means we
      don't have to care about the map index when enqueueing things into the
      map (and so we can cache the map lookup).
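
      A hedged sketch of the pattern (types and helpers here are illustrative,
      not the actual devmap/cpumap code; __list_del_clearprev() is the helper
      from the xskmap entry below):

        struct bulk_queue {
                struct list_head flush_node;  /* ->prev == NULL: not queued */
                unsigned int count;           /* packets buffered so far */
        };

        static void bq_enqueue(struct bulk_queue *bq,
                               struct list_head *flush_list)
        {
                /* Queue the entry itself; no bitmap of map indices needed. */
                if (!bq->flush_node.prev)
                        list_add(&bq->flush_node, flush_list);
                bq->count++;
        }

        static void bq_flush_all(struct list_head *flush_list)
        {
                struct bulk_queue *bq, *tmp;

                /* Walk exactly the queued entries instead of every index;
                 * transmitting bq->count packets is omitted here. */
                list_for_each_entry_safe(bq, tmp, flush_list, flush_node)
                        __list_del_clearprev(&bq->flush_node);
        }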
      
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xskmap: Move non-standard list manipulation to helper · c8af5cd7
      Toke Høiland-Jørgensen authored
      
      Add a helper in list.h for the non-standard way of clearing a list that
      is used in xskmap. This makes it easier to reuse in the other map types,
      and also makes sure this usage is not forgotten in any future list
      refactorings.
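
      A plausible shape for the helper (a sketch based on the description: the
      flush list uses a NULL ->prev pointer as its "not queued" marker, so
      only ->prev is cleared on deletion):

        static inline void __list_del_clearprev(struct list_head *entry)
        {
                __list_del(entry->prev, entry->next);
                entry->prev = NULL;
        }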
      
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • tracing/snapshot: Resize spare buffer if size changed · 46cc0b44
      Eiichi Tsukata authored
      The current snapshot implementation swaps two ring buffers even if their
      sizes differ, which can cause an inconsistency between the contents of
      the buffer_size_kb file and the actual buffer size.
      
      For example:
      
        # cat buffer_size_kb
        7 (expanded: 1408)
        # echo 1 > events/enable
        # grep bytes per_cpu/cpu0/stats
        bytes: 1441020
        # echo 1 > snapshot             // current:1408, spare:1408
        # echo 123 > buffer_size_kb     // current:123,  spare:1408
        # echo 1 > snapshot             // current:1408, spare:123
        # grep bytes per_cpu/cpu0/stats
        bytes: 1443700
        # cat buffer_size_kb
        123                             // != current:1408
      
      And also, a similar per-cpu case hits the following WARNING:
      
      Reproducer:
      
        # echo 1 > per_cpu/cpu0/snapshot
        # echo 123 > buffer_size_kb
        # echo 1 > per_cpu/cpu0/snapshot
      
      WARNING:
      
        WARNING: CPU: 0 PID: 1946 at kernel/trace/trace.c:1607 update_max_tr_single.part.0+0x2b8/0x380
        Modules linked in:
        CPU: 0 PID: 1946 Comm: bash Not tainted 5.2.0-rc6 #20
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
        RIP: 0010:update_max_tr_single.part.0+0x2b8/0x380
        Code: ff e8 dc da f9 ff 0f 0b e9 88 fe ff ff e8 d0 da f9 ff 44 89 ee bf f5 ff ff ff e8 33 dc f9 ff 41 83 fd f5 74 96 e8 b8 da f9 ff <0f> 0b eb 8d e8 af da f9 ff 0f 0b e9 bf fd ff ff e8 a3 da f9 ff 48
        RSP: 0018:ffff888063e4fca0 EFLAGS: 00010093
        RAX: ffff888066214380 RBX: ffffffff99850fe0 RCX: ffffffff964298a8
        RDX: 0000000000000000 RSI: 00000000fffffff5 RDI: 0000000000000005
        RBP: 1ffff1100c7c9f96 R08: ffff888066214380 R09: ffffed100c7c9f9b
        R10: ffffed100c7c9f9a R11: 0000000000000003 R12: 0000000000000000
        R13: 00000000ffffffea R14: ffff888066214380 R15: ffffffff99851060
        FS:  00007f9f8173c700(0000) GS:ffff88806d000000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000714dc0 CR3: 0000000066fa6000 CR4: 00000000000006f0
        Call Trace:
         ? trace_array_printk_buf+0x140/0x140
         ? __mutex_lock_slowpath+0x10/0x10
         tracing_snapshot_write+0x4c8/0x7f0
         ? trace_printk_init_buffers+0x60/0x60
         ? selinux_file_permission+0x3b/0x540
         ? tracer_preempt_off+0x38/0x506
         ? trace_printk_init_buffers+0x60/0x60
         __vfs_write+0x81/0x100
         vfs_write+0x1e1/0x560
         ksys_write+0x126/0x250
         ? __ia32_sys_read+0xb0/0xb0
         ? do_syscall_64+0x1f/0x390
         do_syscall_64+0xc1/0x390
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This patch adds a resize_buffer_duplicate_size() call to check whether
      there is a difference between the current and spare buffer sizes, and to
      resize the spare buffer if necessary.
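
      A hedged sketch of where the check lands in the snapshot write path
      (placement and variable names are assumptions):

        if (tr->allocated_snapshot)
                ret = resize_buffer_duplicate_size(&tr->max_buffer,
                                                   &tr->trace_buffer,
                                                   iter->cpu_file);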
      
      Link: http://lkml.kernel.org/r/20190625012910.13109-1-devel@etsukata.com
      
      
      
      Cc: stable@vger.kernel.org
      Fixes: ad909e21 ("tracing: Add internal tracing_snapshot() functions")
      Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Fix memory leak in tracing_err_log_open() · d122ed62
      Takeshi Misawa authored
      When tracing_err_log_open() calls seq_open(), the allocated seq_file
      memory is never freed.
      
      kmemleak report:
      
      unreferenced object 0xffff92c0781d1100 (size 128):
        comm "tail", pid 15116, jiffies 4295163855 (age 22.704s)
        hex dump (first 32 bytes):
          00 f0 08 e5 c0 92 ff ff 00 10 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<000000000d0687d5>] kmem_cache_alloc+0x11f/0x1e0
          [<000000003e3039a8>] seq_open+0x2f/0x90
          [<000000008dd36b7d>] tracing_err_log_open+0x67/0x140
          [<000000005a431ae2>] do_dentry_open+0x1df/0x3a0
          [<00000000a2910603>] vfs_open+0x2f/0x40
          [<0000000038b0a383>] path_openat+0x2e8/0x1690
          [<00000000fe025bda>] do_filp_open+0x9b/0x110
          [<00000000483a5091>] do_sys_open+0x1ba/0x260
          [<00000000c558b5fd>] __x64_sys_openat+0x20/0x30
          [<000000006881ec07>] do_syscall_64+0x5a/0x130
          [<00000000571c2e94>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fix this by calling seq_release() in tracing_err_log_fops.release().
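
      A plausible shape for the release handler (a sketch; the
      trace_array_put() teardown mirrors what sibling handlers in
      kernel/trace/trace.c do):

        static int tracing_err_log_release(struct inode *inode,
                                           struct file *file)
        {
                struct trace_array *tr = inode->i_private;

                trace_array_put(tr);

                /* seq_open() was only done for readers. */
                if (file->f_mode & FMODE_READ)
                        seq_release(inode, file);

                return 0;
        }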
      
      Link: http://lkml.kernel.org/r/20190628105640.GA1863@DESKTOP
      
      
      
      Fixes: 8a062902 ("tracing: Add tracing error log")
      Reviewed-by: Tom Zanussi <zanussi@kernel.org>
      Signed-off-by: Takeshi Misawa <jeliantsurux@gmail.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code() · d5b844a2
      Petr Mladek authored
      The commit 9f255b63 ("module: Fix livepatch/ftrace module text
      permissions race") causes a possible deadlock between register_kprobe()
      and ftrace_run_update_code() when ftrace is using stop_machine().
      
      The existing dependency chain (in reverse order) is:
      
      -> #1 (text_mutex){+.+.}:
             validate_chain.isra.21+0xb32/0xd70
             __lock_acquire+0x4b8/0x928
             lock_acquire+0x102/0x230
             __mutex_lock+0x88/0x908
             mutex_lock_nested+0x32/0x40
             register_kprobe+0x254/0x658
             init_kprobes+0x11a/0x168
             do_one_initcall+0x70/0x318
             kernel_init_freeable+0x456/0x508
             kernel_init+0x22/0x150
             ret_from_fork+0x30/0x34
             kernel_thread_starter+0x0/0xc
      
      -> #0 (cpu_hotplug_lock.rw_sem){++++}:
             check_prev_add+0x90c/0xde0
             validate_chain.isra.21+0xb32/0xd70
             __lock_acquire+0x4b8/0x928
             lock_acquire+0x102/0x230
             cpus_read_lock+0x62/0xd0
             stop_machine+0x2e/0x60
             arch_ftrace_update_code+0x2e/0x40
             ftrace_run_update_code+0x40/0xa0
             ftrace_startup+0xb2/0x168
             register_ftrace_function+0x64/0x88
             klp_patch_object+0x1a2/0x290
             klp_enable_patch+0x554/0x980
             do_one_initcall+0x70/0x318
             do_init_module+0x6e/0x250
             load_module+0x1782/0x1990
             __s390x_sys_finit_module+0xaa/0xf0
             system_call+0xd8/0x2d0
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(text_mutex);
                                     lock(cpu_hotplug_lock.rw_sem);
                                     lock(text_mutex);
        lock(cpu_hotplug_lock.rw_sem);
      
      This is a similar problem to the one solved by commit 2d1e38f5
      ("kprobes: Cure hotplug lock ordering issues"). Many locks are involved.
      To be on the safe side, text_mutex must become a low-level lock taken
      after cpu_hotplug_lock.rw_sem.
      
      This can't be achieved easily with the current ftrace design.
      For example, arm already calls set_all_modules_text_rw() in
      ftrace_arch_code_modify_prepare(), see arch/arm/kernel/ftrace.c.
      This function is called:
      
        + outside stop_machine() from ftrace_run_update_code()
        + without stop_machine() from ftrace_module_enable()
      
      Fortunately, the problematic fix is needed only on x86_64. It is
      the only architecture that calls set_all_modules_text_rw()
      in the ftrace path and supports livepatching at the same time.
      
      Therefore it is enough to move text_mutex handling from the generic
      kernel/trace/ftrace.c into arch/x86/kernel/ftrace.c:
      
         ftrace_arch_code_modify_prepare()
         ftrace_arch_code_modify_post_process()
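
      A hedged sketch of the resulting x86 hooks (close to the described end
      state; the set_*_text_*() calls were already part of these paths):

        int ftrace_arch_code_modify_prepare(void)
        {
                /* Now taken after cpu_hotplug_lock.rw_sem, fixing the order. */
                mutex_lock(&text_mutex);
                set_kernel_text_rw();
                set_all_modules_text_rw();
                return 0;
        }

        int ftrace_arch_code_modify_post_process(void)
        {
                set_all_modules_text_ro();
                set_kernel_text_ro();
                mutex_unlock(&text_mutex);
                return 0;
        }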
      
      This patch basically reverts the ftrace part of the problematic
      commit 9f255b63 ("module: Fix livepatch/ftrace module
      text permissions race") and provides an x86_64-specific fix.
      
      Some refactoring of the ftrace code will be needed when livepatching
      is implemented for arm or nds32. These architectures call
      set_all_modules_text_rw() and use stop_machine() at the same time.
      
      Link: http://lkml.kernel.org/r/20190627081334.12793-1-pmladek@suse.com
      
      
      
      Fixes: 9f255b63 ("module: Fix livepatch/ftrace module text permissions race")
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Miroslav Benes <mbenes@suse.cz>
      Reviewed-by: Miroslav Benes <mbenes@suse.cz>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      [
        As reviewed by Miroslav Benes <mbenes@suse.cz>, removed return value of
        ftrace_run_update_code() as it is a void function.
      ]
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • pid: add pidfd_open() · 32fcb426
      Christian Brauner authored
      
      This adds the pidfd_open() syscall. It allows a caller to retrieve
      pollable pidfds for a process which did not get created via CLONE_PIDFD,
      i.e. for a process that was created via traditional fork()/clone() calls
      and is only referenced by a PID:
      
      int pidfd = pidfd_open(1234, 0);
      int ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
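
      For reference, a self-contained userspace version of the snippet above
      (a hedged example: glibc had no wrappers at the time, so the raw syscall
      numbers are used; 434 and 424 are the unified numbers for pidfd_open()
      and pidfd_send_signal()):

        #define _GNU_SOURCE
        #include <signal.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #ifndef __NR_pidfd_open
        #define __NR_pidfd_open 434
        #endif
        #ifndef __NR_pidfd_send_signal
        #define __NR_pidfd_send_signal 424
        #endif

        int main(int argc, char *argv[])
        {
                if (argc < 2) {
                        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                        return 1;
                }

                int pidfd = (int)syscall(__NR_pidfd_open,
                                         (pid_t)atoi(argv[1]), 0);
                if (pidfd < 0) {
                        perror("pidfd_open");
                        return 1;
                }

                /* Signal the process race-free via its fd. */
                if (syscall(__NR_pidfd_send_signal, pidfd, SIGSTOP,
                            NULL, 0) < 0)
                        perror("pidfd_send_signal");

                close(pidfd);
                return 0;
        }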
      
      With the introduction of pidfds through CLONE_PIDFD it is possible to
      create pidfds at process creation time.
      However, a lot of processes get created with traditional PID-based calls
      such as fork() or clone() (without CLONE_PIDFD). For these processes a
      caller can currently not create a pollable pidfd. This is a problem for
      Android's low memory killer (LMK) and service managers such as systemd.
      Both are examples of tools that want to make use of pidfds to get reliable
      notification of process exit for non-parents (pidfd polling) and race-free
      signal sending (pidfd_send_signal()). They intend to switch to this API for
      process supervision/management as soon as possible. Having no way to get
      pollable pidfds from PID-only processes is one of the biggest blockers for
      them in adopting this API. With pidfd_open() making it possible to
      retrieve pidfds for PID-based processes, we enable them to adopt this
      API.
      
      In line with Arnd's recent changes to consolidate syscall numbers across
      architectures, I have added the pidfd_open() syscall to all architectures
      at the same time.
      
      Signed-off-by: Christian Brauner <christian@brauner.io>
      Reviewed-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
    • pidfd: add polling support · b53b0b9d
      Joel Fernandes (Google) authored
      
      This patch adds polling support to pidfd.
      
      Android low memory killer (LMK) needs to know when a process dies once
      it has been sent the kill signal. It does so by checking for the
      existence of /proc/pid, which is both racy and slow: for example, if a
      PID is reused between when LMK sends a kill signal and when it checks
      for the existence of the PID, the wrong PID may end up being checked for
      existence. Using the polling support, LMK will be able to get notified
      when a process exits in a race-free and fast way, and it allows the LMK
      to do other things (such as polling on other fds) while awaiting the
      process being killed to die.
      
      For notification to polling processes, we follow the same existing
      mechanism in the kernel used when the parent of the task group is to be
      notified of a child's death (do_notify_parent). This is precisely when the
      tasks waiting on a poll of pidfd are also awakened in this patch.
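
      A hedged end-to-end illustration (pidfd_open(), from the entry above,
      provides the fd; 434 is its unified syscall number):

        #define _GNU_SOURCE
        #include <poll.h>
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #ifndef __NR_pidfd_open
        #define __NR_pidfd_open 434
        #endif

        int main(void)
        {
                pid_t child = fork();

                if (child == 0) {       /* child: exit after a moment */
                        sleep(1);
                        _exit(0);
                }

                int pidfd = (int)syscall(__NR_pidfd_open, child, 0);
                if (pidfd < 0) {
                        perror("pidfd_open");
                        return 1;
                }

                struct pollfd pfd = { .fd = pidfd, .events = POLLIN };

                /* Blocks until the child dies: no /proc scan, no PID-reuse
                 * race. */
                poll(&pfd, 1, -1);
                printf("child %d exited\n", child);
                close(pidfd);
                return 0;
        }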
      
      We have decided to include the waitqueue in struct pid for the following
      reasons:
      1. The wait queue has to survive for the lifetime of the poll. Including
         it in task_struct would not be an option in this case because the
         task can be reaped and destroyed before the poll returns.
      
      2. Including the waitqueue in struct pid means that during de_thread(),
         the new thread group leader automatically gets the new waitqueue/pid
         even though its task_struct is different.
      
      Appropriate test cases are added in the second patch to provide coverage of
      all the cases the patch is handling.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: kernel-team@android.com
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Co-developed-by: Daniel Colascione <dancol@google.com>
      Signed-off-by: Daniel Colascione <dancol@google.com>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Christian Brauner <christian@brauner.io>