Skip to content
Snippets Groups Projects
  1. Apr 03, 2019
    • Mel Gorman's avatar
      sched/fair: Do not re-read ->h_load_next during hierarchical load calculation · 0e9f0245
      Mel Gorman authored
      
      A NULL pointer dereference bug was reported on a distribution kernel but
      the same issue should be present on mainline kernel. It occured on s390
      but should not be arch-specific.  A partial oops looks like:
      
        Unable to handle kernel pointer dereference in virtual kernel address space
        ...
        Call Trace:
          ...
          try_to_wake_up+0xfc/0x450
          vhost_poll_wakeup+0x3a/0x50 [vhost]
          __wake_up_common+0xbc/0x178
          __wake_up_common_lock+0x9e/0x160
          __wake_up_sync_key+0x4e/0x60
          sock_def_readable+0x5e/0x98
      
      The bug hits any time between 1 hour to 3 days. The dereference occurs
      in update_cfs_rq_h_load when accumulating h_load. The problem is that
      cfq_rq->h_load_next is not protected by any locking and can be updated
      by parallel calls to task_h_load. Depending on the compiler, code may be
      generated that re-reads cfq_rq->h_load_next after the check for NULL and
      then oops when reading se->avg.load_avg. The dissassembly showed that it
      was possible to reread h_load_next after the check for NULL.
      
      While this does not appear to be an issue for later compilers, it's still
      an accident if the correct code is generated. Full locking in this path
      would have high overhead so this patch uses READ_ONCE to read h_load_next
      only once and check for NULL before dereferencing. It was confirmed that
      there were no further oops after 10 days of testing.
      
      As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any
      potential problems with store tearing.
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarValentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Fixes: 68520796 ("sched: Move h_load calculation to task_h_load()")
      Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.net
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      0e9f0245
  2. Mar 29, 2019
  3. Mar 28, 2019
    • Thomas Gleixner's avatar
      cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n · 206b9235
      Thomas Gleixner authored
      
      Tianyu reported a crash in a CPU hotplug teardown callback when booting a
      kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot
      parameter.
      
      It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken
      forever in case that a bringup callback fails. Unfortunately this issue was
      not recognized when the CPU hotplug code was reworked, so the shortcoming
      just stayed in place.
      
      When a bringup callback fails, the CPU hotplug code rolls back the
      operation and takes the CPU offline.
      
      The 'nosmt' command line argument uses a bringup failure to abort the
      bringup of SMT sibling CPUs. This partial bringup is required due to the
      MCE misdesign on Intel CPUs.
      
      With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but
      CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level
      teardown of a CPU including the synchronizations in various facilities like
      RCU, NOHZ and others.
      
      As a consequence the teardown callbacks which must be executed on the
      outgoing CPU within stop machine with interrupts disabled are executed on
      the control CPU in interrupt enabled and preemptible context causing the
      kernel to crash and burn. The pre state machine code has a different
      failure mode which is more subtle and resulting in a less obvious use after
      free crash because the control side frees resources which are still in use
      by the undead CPU.
      
      But this is not a x86 only problem. Any architecture which supports the
      SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less
      likely to be triggered because in 99.99999% of the cases all bringup
      callbacks succeed.
      
      The easy solution of making HOTPLUG_CPU mandatory for SMP is not working on
      all architectures as the following architectures have either no hotplug
      support at all or not all subarchitectures support it:
      
       alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial).
      
      Crashing the kernel in such a situation is not an acceptable state
      either.
      
      Implement a minimal rollback variant by limiting the teardown to the point
      where all regular teardown callbacks have been invoked and leave the CPU in
      the 'dead' idle state. This has the following consequences:
      
       - the CPU is brought down to the point where the stop_machine takedown
         would happen.
      
       - the CPU stays there forever and is idle
      
       - The CPU is cleared in the CPU active mask, but not in the CPU online
         mask which is a legit state.
      
       - Interrupts are not forced away from the CPU
      
       - All facilities which only look at online mask would still see it, but
         that is the case during normal hotplug/unplug operations as well. It's
         just a (way) longer time frame.
      
      This will expose issues, which haven't been exposed before or only seldom,
      because now the normally transient state of being non active but online is
      a permanent state. In testing this exposed already an issue vs. work queues
      where the vmstat code schedules work on the almost dead CPU which ends up
      in an unbound workqueue and triggers 'preemtible context' warnings. This is
      not a problem of this change, it merily exposes an already existing issue.
      Still this is better than crashing fully without a chance to debug it.
      
      This is mainly thought as workaround for those architectures which do not
      support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP.
      
      Fixes: 2e1a3483 ("cpu/hotplug: Split out the state walk into functions")
      Reported-by: default avatarTianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarTianyu Lan <Tianyu.Lan@microsoft.com>
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Konrad Wilk <konrad.wilk@oracle.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Micheal Kelley <michael.h.kelley@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.de
      206b9235
    • Thomas Gleixner's avatar
      watchdog: Respect watchdog cpumask on CPU hotplug · 7dd47617
      Thomas Gleixner authored
      
      The rework of the watchdog core to use cpu_stop_work broke the watchdog
      cpumask on CPU hotplug.
      
      The watchdog_enable/disable() functions are now called unconditionally from
      the hotplug callback, i.e. even on CPUs which are not in the watchdog
      cpumask. As a consequence the watchdog can become unstoppable.
      
      Only invoke them when the plugged CPU is in the watchdog cpumask.
      
      Fixes: 9cf57731 ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
      Reported-by: default avatarMaxime Coquelin <maxime.coquelin@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarMaxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903262245490.1789@nanos.tec.linutronix.de
      7dd47617
  4. Mar 26, 2019
    • Hariprasad Kelam's avatar
      ftrace: Fix warning using plain integer as NULL & spelling corrections · 9efb85c5
      Hariprasad Kelam authored
      Changed  0 --> NULL to avoid sparse warning
      Corrected spelling mistakes reported by checkpatch.pl
      Sparse warning below:
      
      sudo make C=2 CF=-D__CHECK_ENDIAN__ M=kernel/trace
      
      CHECK   kernel/trace/ftrace.c
      kernel/trace/ftrace.c:3007:24: warning: Using plain integer as NULL pointer
      kernel/trace/ftrace.c:4758:37: warning: Using plain integer as NULL pointer
      
      Link: http://lkml.kernel.org/r/20190323183523.GA2244@hari-Inspiron-1545
      
      
      
      Signed-off-by: default avatarHariprasad Kelam <hariprasad.kelam@gmail.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      9efb85c5
    • Frank Rowand's avatar
      tracing: initialize variable in create_dyn_event() · 3dee10da
      Frank Rowand authored
      Fix compile warning in create_dyn_event(): 'ret' may be used uninitialized
      in this function [-Wuninitialized].
      
      Link: http://lkml.kernel.org/r/1553237900-8555-1-git-send-email-frowand.list@gmail.com
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Cc: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Fixes: 5448d44c ("tracing: Add unified dynamic event framework")
      Signed-off-by: default avatarFrank Rowand <frank.rowand@sony.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      3dee10da
    • Tom Zanussi's avatar
      tracing: Remove unnecessary var_ref destroy in track_data_destroy() · ff9d31d0
      Tom Zanussi authored
      Commit 656fe2ba (tracing: Use hist trigger's var_ref array to
      destroy var_refs) centralized the destruction of all the var_refs
      in one place so that other code didn't have to do it.
      
      The track_data_destroy() added later ignored that and also destroyed
      the track_data var_ref, causing a double-free error flagged by KASAN.
      
      ==================================================================
      BUG: KASAN: use-after-free in destroy_hist_field+0x30/0x70
      Read of size 8 at addr ffff888086df2210 by task bash/1694
      
      CPU: 6 PID: 1694 Comm: bash Not tainted 5.1.0-rc1-test+ #15
      Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03
      07/14/2016
      Call Trace:
       dump_stack+0x71/0xa0
       ? destroy_hist_field+0x30/0x70
       print_address_description.cold.3+0x9/0x1fb
       ? destroy_hist_field+0x30/0x70
       ? destroy_hist_field+0x30/0x70
       kasan_report.cold.4+0x1a/0x33
       ? __kasan_slab_free+0x100/0x150
       ? destroy_hist_field+0x30/0x70
       destroy_hist_field+0x30/0x70
       track_data_destroy+0x55/0xe0
       destroy_hist_data+0x1f0/0x350
       hist_unreg_all+0x203/0x220
       event_trigger_open+0xbb/0x130
       do_dentry_open+0x296/0x700
       ? stacktrace_count_trigger+0x30/0x30
       ? generic_permission+0x56/0x200
       ? __x64_sys_fchdir+0xd0/0xd0
       ? inode_permission+0x55/0x200
       ? security_inode_permission+0x18/0x60
       path_openat+0x633/0x22b0
       ? path_lookupat.isra.50+0x420/0x420
       ? __kasan_kmalloc.constprop.12+0xc1/0xd0
       ? kmem_cache_alloc+0xe5/0x260
       ? getname_flags+0x6c/0x2a0
       ? do_sys_open+0x149/0x2b0
       ? do_syscall_64+0x73/0x1b0
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
       ? _raw_write_lock_bh+0xe0/0xe0
       ? __kernel_text_address+0xe/0x30
       ? unwind_get_return_address+0x2f/0x50
       ? __list_add_valid+0x2d/0x70
       ? deactivate_slab.isra.62+0x1f4/0x5a0
       ? getname_flags+0x6c/0x2a0
       ? set_track+0x76/0x120
       do_filp_open+0x11a/0x1a0
       ? may_open_dev+0x50/0x50
       ? _raw_spin_lock+0x7a/0xd0
       ? _raw_write_lock_bh+0xe0/0xe0
       ? __alloc_fd+0x10f/0x200
       do_sys_open+0x1db/0x2b0
       ? filp_open+0x50/0x50
       do_syscall_64+0x73/0x1b0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x7fa7b24a4ca2
      Code: 25 00 00 41 00 3d 00 00 41 00 74 4c 48 8d 05 85 7a 0d 00 8b 00 85 c0
      75 6d 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff
      0f 87 a2 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
      RSP: 002b:00007fffbafb3af0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
      RAX: ffffffffffffffda RBX: 000055d3648ade30 RCX: 00007fa7b24a4ca2
      RDX: 0000000000000241 RSI: 000055d364a55240 RDI: 00000000ffffff9c
      RBP: 00007fffbafb3bf0 R08: 0000000000000020 R09: 0000000000000002
      R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000000
      R13: 0000000000000003 R14: 0000000000000001 R15: 000055d364a55240
      ==================================================================
      
      So remove the track_data_destroy() destroy_hist_field() call for that
      var_ref.
      
      Link: http://lkml.kernel.org/r/1deffec420f6a16d11dd8647318d34a66d1989a9.camel@linux.intel.com
      
      
      
      Fixes: 466f4528 ("tracing: Generalize hist trigger onmax and save action")
      Reported-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarTom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      ff9d31d0
  5. Mar 23, 2019
  6. Mar 22, 2019
  7. Mar 21, 2019
    • Xu Yu's avatar
      bpf: do not restore dst_reg when cur_state is freed · 0803278b
      Xu Yu authored
      
      Syzkaller hit 'KASAN: use-after-free Write in sanitize_ptr_alu' bug.
      
      Call trace:
      
        dump_stack+0xbf/0x12e
        print_address_description+0x6a/0x280
        kasan_report+0x237/0x360
        sanitize_ptr_alu+0x85a/0x8d0
        adjust_ptr_min_max_vals+0x8f2/0x1ca0
        adjust_reg_min_max_vals+0x8ed/0x22e0
        do_check+0x1ca6/0x5d00
        bpf_check+0x9ca/0x2570
        bpf_prog_load+0xc91/0x1030
        __se_sys_bpf+0x61e/0x1f00
        do_syscall_64+0xc8/0x550
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fault injection trace:
      
        kfree+0xea/0x290
        free_func_state+0x4a/0x60
        free_verifier_state+0x61/0xe0
        push_stack+0x216/0x2f0	          <- inject failslab
        sanitize_ptr_alu+0x2b1/0x8d0
        adjust_ptr_min_max_vals+0x8f2/0x1ca0
        adjust_reg_min_max_vals+0x8ed/0x22e0
        do_check+0x1ca6/0x5d00
        bpf_check+0x9ca/0x2570
        bpf_prog_load+0xc91/0x1030
        __se_sys_bpf+0x61e/0x1f00
        do_syscall_64+0xc8/0x550
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      When kzalloc() fails in push_stack(), free_verifier_state() will free
      current verifier state. As push_stack() returns, dst_reg was restored
      if ptr_is_dst_reg is false. However, as member of the cur_state,
      dst_reg is also freed, and error occurs when dereferencing dst_reg.
      Simply fix it by testing ret of push_stack() before restoring dst_reg.
      
      Fixes: 979d63d5 ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Signed-off-by: default avatarXu Yu <xuyu@linux.alibaba.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      0803278b
    • Bart Van Assche's avatar
      workqueue: Only unregister a registered lockdep key · 82efcab3
      Bart Van Assche authored
      
      The recent change to prevent use after free and a memory leak introduced an
      unconditional call to wq_unregister_lockdep() in the error handling
      path. If the lockdep key had not been registered yet, then the lockdep core
      emits a warning.
      
      Only call wq_unregister_lockdep() if wq_register_lockdep() has been
      called first.
      
      Fixes: 009bb421 ("workqueue, lockdep: Fix an alloc_workqueue() error path")
      Reported-by: default avatar <syzbot+be0c198232f86389c3dd@syzkaller.appspotmail.com>
      Signed-off-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Qian Cai <cai@lca.pw>
      Link: https://lkml.kernel.org/r/20190311230255.176081-1-bvanassche@acm.org
      82efcab3
    • Martin KaFai Lau's avatar
      bpf: Only print ref_obj_id for refcounted reg · cba368c1
      Martin KaFai Lau authored
      
      Naresh reported that test_align fails because of the mismatch at the
      verbose printout of the register states.  The reason is due to the newly
      added ref_obj_id.
      
      ref_obj_id is only useful for refcounted reg.  Thus, this patch fixes it
      by only printing ref_obj_id for refcounted reg.  While at it, it also uses
      comma instead of space to separate between "id" and "ref_obj_id".
      
      Fixes: 1b986589 ("bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release")
      Reported-by: default avatarNaresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      cba368c1
  8. Mar 19, 2019
  9. Mar 18, 2019
    • Martynas Pumputis's avatar
      bpf: Try harder when allocating memory for large maps · f01a7dbe
      Martynas Pumputis authored
      It has been observed that sometimes a higher order memory allocation
      for BPF maps fails when there is no obvious memory pressure in a system.
      E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
      could not be created due to vmalloc unable to allocate 75497472B,
      when the system's memory consumption (in MB) was the following:
      
          Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727
      
      Later analysis [1] by Michal Hocko showed that the vmalloc was not trying
      to reclaim memory from the page cache and was failing prematurely due to
      __GFP_NORETRY.
      
      Considering dcda9b04 ("mm, tree wide: replace __GFP_REPEAT by
      __GFP_RETRY_MAYFAIL with more useful semantic") and [1], we can replace
      __GFP_NORETRY with __GFP_RETRY_MAYFAIL, as it won't invoke OOM killer
      and will try harder to fulfil allocation requests.
      
      Unfortunately, replacing the body of the BPF map memory allocation
      function with the kvmalloc_node helper function is not an option at
      this point in time, given 1) kmalloc is non-optional for higher order
      allocations, and 2) passing __GFP_RETRY_MAYFAIL to the kmalloc would
      stress the slab allocator too much for large requests.
      
      The change has been tested with the workloads mentioned above and by
      observing oom_kill value from /proc/vmstat.
      
      [1]: https://lore.kernel.org/bpf/20190310071318.GW5232@dhcp22.suse.cz/
      
      
      
      Signed-off-by: default avatarMartynas Pumputis <m@lambda.lt>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20190318153940.GL8924@dhcp22.suse.cz/
      f01a7dbe
  10. Mar 14, 2019
  11. Mar 13, 2019
    • Mathieu Malaterre's avatar
      blkcg: annotate implicit fall through · f6d85f04
      Mathieu Malaterre authored
      
      There is a plan to build the kernel with -Wimplicit-fallthrough and
      this place in the code produced a warning (W=1).
      
      This commit remove the following warning:
      
        kernel/trace/blktrace.c:725:9: warning: this statement may fall through [-Wimplicit-fallthrough=]
      
      Signed-off-by: default avatarMathieu Malaterre <malat@debian.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f6d85f04
    • Martin KaFai Lau's avatar
      bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release · 1b986589
      Martin KaFai Lau authored
      
      Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk)
      can still be accessed after bpf_sk_release(sk).
      Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue.
      This patch addresses them together.
      
      A simple reproducer looks like this:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) ... */
      	tp = bpf_tcp_sock(sk);
      	/* if (!tp) ... */
      	bpf_sk_release(sk);
      	snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */
      
      The problem is the verifier did not scrub the register's states of
      the tcp_sock ptr (tp) after bpf_sk_release(sk).
      
      [ Note that when calling bpf_tcp_sock(sk), the sk is not always
        refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works
        fine for this case. ]
      
      Currently, the verifier does not track if a helper's return ptr (in REG_0)
      is "carry"-ing one of its argument's refcount status. To carry this info,
      the reg1->id needs to be stored in reg0.
      
      One approach was tried, like "reg0->id = reg1->id", when calling
      "bpf_tcp_sock()".  The main idea was to avoid adding another "ref_obj_id"
      for the same reg.  However, overlapping the NULL marking and ref
      tracking purpose in one "id" does not work well:
      
      	ref_sk = bpf_sk_lookup_tcp();
      	fullsock = bpf_sk_fullsock(ref_sk);
      	tp = bpf_tcp_sock(ref_sk);
      	if (!fullsock) {
      	     bpf_sk_release(ref_sk);
      	     return 0;
      	}
      	/* fullsock_reg->id is marked for NOT-NULL.
      	 * Same for tp_reg->id because they have the same id.
      	 */
      
      	/* oops. verifier did not complain about the missing !tp check */
      	snd_cwnd = tp->snd_cwnd;
      
      Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state".
      With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can
      scrub all reg states which has a ref_obj_id match.  It is done with the
      changes in release_reg_references() in this patch.
      
      While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and
      bpf_sk_fullsock() to avoid these helpers from returning
      another ptr. It will make bpf_sk_release(tp) possible:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) ... */
      	tp = bpf_tcp_sock(sk);
      	/* if (!tp) ... */
      	bpf_sk_release(tp);
      
      A separate helper "bpf_get_listener_sock()" will be added in a later
      patch to do sk_to_full_sk().
      
      Misc change notes:
      - To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed
        from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON.  ARG_PTR_TO_SOCKET
        is removed from bpf.h since no helper is using it.
      
      - arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted()
        because ARG_PTR_TO_SOCK_COMMON is the only one and skb->sk is not
        refcounted.  All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock()
        take ARG_PTR_TO_SOCK_COMMON.
      
      - check_refcount_ok() ensures is_acquire_function() cannot take
        arg_type_may_be_refcounted() as its argument.
      
      - The check_func_arg() can only allow one refcount-ed arg.  It is
        guaranteed by check_refcount_ok() which ensures at most one arg can be
        refcounted.  Hence, it is a verifier internal error if >1 refcount arg
        found in check_func_arg().
      
      - In release_reference(), release_reference_state() is called
        first to ensure a match on "reg->ref_obj_id" can be found before
        scrubbing the reg states with release_reg_references().
      
      - reg_is_refcounted() is no longer needed.
        1. In mark_ptr_or_null_regs(), its usage is replaced by
           "ref_obj_id && ref_obj_id == id" because,
           when is_null == true, release_reference_state() should only be
           called on the ref_obj_id obtained by a acquire helper (i.e.
           is_acquire_function() == true).  Otherwise, the following
           would happen:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) { ... } */
      	fullsock = bpf_sk_fullsock(sk);
      	if (!fullsock) {
      		/*
      		 * release_reference_state(fullsock_reg->ref_obj_id)
      		 * where fullsock_reg->ref_obj_id == sk_reg->ref_obj_id.
      		 *
      		 * Hence, the following bpf_sk_release(sk) will fail
      		 * because the ref state has already been released in the
      		 * earlier release_reference_state(fullsock_reg->ref_obj_id).
      		 */
      		bpf_sk_release(sk);
      	}
      
        2. In release_reg_references(), the current reg_is_refcounted() call
           is unnecessary because the id check is enough.
      
      - The type_is_refcounted() and type_is_refcounted_or_null()
        are no longer needed also because reg_is_refcounted() is removed.
      
      Fixes: 655a51e5 ("bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock")
      Reported-by: default avatarLorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      1b986589
    • Douglas Anderson's avatar
      tracing: kdb: Fix ftdump to not sleep · 31b265b3
      Douglas Anderson authored
      As reported back in 2016-11 [1], the "ftdump" kdb command triggers a
      BUG for "sleeping function called from invalid context".
      
      kdb's "ftdump" command wants to call ring_buffer_read_prepare() in
      atomic context.  A very simple solution for this is to add allocation
      flags to ring_buffer_read_prepare() so kdb can call it without
      triggering the allocation error.  This patch does that.
      
      Note that in the original email thread about this, it was suggested
      that perhaps the solution for kdb was to either preallocate the buffer
      ahead of time or create our own iterator.  I'm hoping that this
      alternative of adding allocation flags to ring_buffer_read_prepare()
      can be considered since it means I don't need to duplicate more of the
      core trace code into "trace_kdb.c" (for either creating my own
      iterator or re-preparing a ring allocator whose memory was already
      allocated).
      
      NOTE: another option for kdb is to actually figure out how to make it
      reuse the existing ftrace_dump() function and totally eliminate the
      duplication.  This sounds very appealing and actually works (the "sr
      z" command can be seen to properly dump the ftrace buffer).  The
      downside here is that ftrace_dump() fully consumes the trace buffer.
      Unless that is changed I'd rather not use it because it means "ftdump
      | grep xyz" won't be very useful to search the ftrace buffer since it
      will throw away the whole trace on the first grep.  A future patch to
      dump only the last few lines of the buffer will also be hard to
      implement.
      
      [1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com
      
      Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.org
      
      
      
      Reported-by: default avatarBrian Norris <briannorris@chromium.org>
      Signed-off-by: default avatarDouglas Anderson <dianders@chromium.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      31b265b3
  12. Mar 12, 2019
  13. Mar 11, 2019
    • Stephane Eranian's avatar
      perf/core: Restore mmap record type correctly · d9c1bb2f
      Stephane Eranian authored
      
      On mmap(), perf_events generates a RECORD_MMAP record and then checks
      which events are interested in this record. There are currently 2
      versions of mmap records: RECORD_MMAP and RECORD_MMAP2. MMAP2 is larger.
      The event configuration controls which version the user level tool
      accepts.
      
      If the event->attr.mmap2=1 field then MMAP2 record is returned.  The
      perf_event_mmap_output() takes care of this. It checks attr->mmap2 and
      corrects the record fields before putting it in the sampling buffer of
      the event.  At the end the function restores the modified MMAP record
      fields.
      
      The problem is that the function restores the size but not the type.
      Thus, if a subsequent event only accepts MMAP type, then it would
      instead receive an MMAP2 record with a size of MMAP record.
      
      This patch fixes the problem by restoring the record type on exit.
      
      Signed-off-by: default avatarStephane Eranian <eranian@google.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Fixes: 13d7a241 ("perf: Add attr->mmap2 attribute to an event")
      Link: http://lkml.kernel.org/r/20190307185233.225521-1-eranian@google.com
      
      
      Signed-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      d9c1bb2f
  14. Mar 09, 2019
    • Qian Cai's avatar
      workqueue, lockdep: Fix a memory leak in wq->lock_name · 69a106c0
      Qian Cai authored
      
      The following commit:
      
        669de8bd ("kernel/workqueue: Use dynamic lockdep keys for workqueues")
      
      introduced a memory leak as wq_free_lockdep() calls kfree(wq->lock_name),
      but wq_init_lockdep() does not point wq->lock_name to the newly allocated
      slab object.
      
      This can be reproduced by running LTP fallocate04 followed by oom01 tests:
      
       unreferenced object 0xc0000005876384d8 (size 64):
        comm "fallocate04", pid 26972, jiffies 4297139141 (age 40370.480s)
        hex dump (first 32 bytes):
          28 77 71 5f 63 6f 6d 70 6c 65 74 69 6f 6e 29 65  (wq_completion)e
          78 74 34 2d 72 73 76 2d 63 6f 6e 76 65 72 73 69  xt4-rsv-conversi
        backtrace:
          [<00000000cb452883>] kvasprintf+0x6c/0xe0
          [<000000004654ddac>] kasprintf+0x34/0x60
          [<000000001c68f311>] alloc_workqueue+0x1f8/0x6ac
          [<0000000003c2ad83>] ext4_fill_super+0x23d4/0x3c80 [ext4]
          [<0000000006610538>] mount_bdev+0x25c/0x290
          [<00000000bcf955ec>] ext4_mount+0x28/0x50 [ext4]
          [<0000000016e08fd3>] legacy_get_tree+0x4c/0xb0
          [<0000000042b6a5fc>] vfs_get_tree+0x6c/0x190
          [<00000000268ab022>] do_mount+0xb9c/0x1100
          [<00000000698e6898>] ksys_mount+0x158/0x180
          [<0000000064e391fd>] sys_mount+0x20/0x30
          [<00000000ba378f12>] system_call+0x5c/0x70
      
      Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: catalin.marinas@arm.com
      Cc: jiangshanlai@gmail.com
      Cc: tj@kernel.org
      Fixes: 669de8bd ("kernel/workqueue: Use dynamic lockdep keys for workqueues")
      Link: https://lkml.kernel.org/r/20190307002731.47371-1-cai@lca.pw
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      69a106c0
    • Bart Van Assche's avatar
      workqueue, lockdep: Fix an alloc_workqueue() error path · 009bb421
      Bart Van Assche authored
      
      This patch fixes a use-after-free and a memory leak in an alloc_workqueue()
      error path.
      
      Repoted by syzkaller and KASAN:
      
        BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:197 [inline]
        BUG: KASAN: use-after-free in lockdep_register_key+0x3b9/0x490 kernel/locking/lockdep.c:1023
        Read of size 8 at addr ffff888090fc2698 by task syz-executor134/7858
      
        CPU: 1 PID: 7858 Comm: syz-executor134 Not tainted 5.0.0-rc8-next-20190301 #1
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x172/0x1f0 lib/dump_stack.c:113
         print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
         kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
         __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
         __read_once_size include/linux/compiler.h:197 [inline]
         lockdep_register_key+0x3b9/0x490 kernel/locking/lockdep.c:1023
         wq_init_lockdep kernel/workqueue.c:3444 [inline]
         alloc_workqueue+0x427/0xe70 kernel/workqueue.c:4263
         ucma_open+0x76/0x290 drivers/infiniband/core/ucma.c:1732
         misc_open+0x398/0x4c0 drivers/char/misc.c:141
         chrdev_open+0x247/0x6b0 fs/char_dev.c:417
         do_dentry_open+0x488/0x1160 fs/open.c:771
         vfs_open+0xa0/0xd0 fs/open.c:880
         do_last fs/namei.c:3416 [inline]
         path_openat+0x10e9/0x46e0 fs/namei.c:3533
         do_filp_open+0x1a1/0x280 fs/namei.c:3563
         do_sys_open+0x3fe/0x5d0 fs/open.c:1063
         __do_sys_openat fs/open.c:1090 [inline]
         __se_sys_openat fs/open.c:1084 [inline]
         __x64_sys_openat+0x9d/0x100 fs/open.c:1084
         do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        Allocated by task 7789:
         save_stack+0x45/0xd0 mm/kasan/common.c:75
         set_track mm/kasan/common.c:87 [inline]
         __kasan_kmalloc mm/kasan/common.c:497 [inline]
         __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:470
         kasan_kmalloc+0x9/0x10 mm/kasan/common.c:511
         __do_kmalloc mm/slab.c:3726 [inline]
         __kmalloc+0x15c/0x740 mm/slab.c:3735
         kmalloc include/linux/slab.h:553 [inline]
         kzalloc include/linux/slab.h:743 [inline]
         alloc_workqueue+0x13c/0xe70 kernel/workqueue.c:4236
         ucma_open+0x76/0x290 drivers/infiniband/core/ucma.c:1732
         misc_open+0x398/0x4c0 drivers/char/misc.c:141
         chrdev_open+0x247/0x6b0 fs/char_dev.c:417
         do_dentry_open+0x488/0x1160 fs/open.c:771
         vfs_open+0xa0/0xd0 fs/open.c:880
         do_last fs/namei.c:3416 [inline]
         path_openat+0x10e9/0x46e0 fs/namei.c:3533
         do_filp_open+0x1a1/0x280 fs/namei.c:3563
         do_sys_open+0x3fe/0x5d0 fs/open.c:1063
         __do_sys_openat fs/open.c:1090 [inline]
         __se_sys_openat fs/open.c:1084 [inline]
         __x64_sys_openat+0x9d/0x100 fs/open.c:1084
         do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        Freed by task 7789:
         save_stack+0x45/0xd0 mm/kasan/common.c:75
         set_track mm/kasan/common.c:87 [inline]
         __kasan_slab_free+0x102/0x150 mm/kasan/common.c:459
         kasan_slab_free+0xe/0x10 mm/kasan/common.c:467
         __cache_free mm/slab.c:3498 [inline]
         kfree+0xcf/0x230 mm/slab.c:3821
         alloc_workqueue+0xc3e/0xe70 kernel/workqueue.c:4295
         ucma_open+0x76/0x290 drivers/infiniband/core/ucma.c:1732
         misc_open+0x398/0x4c0 drivers/char/misc.c:141
         chrdev_open+0x247/0x6b0 fs/char_dev.c:417
         do_dentry_open+0x488/0x1160 fs/open.c:771
         vfs_open+0xa0/0xd0 fs/open.c:880
         do_last fs/namei.c:3416 [inline]
         path_openat+0x10e9/0x46e0 fs/namei.c:3533
         do_filp_open+0x1a1/0x280 fs/namei.c:3563
         do_sys_open+0x3fe/0x5d0 fs/open.c:1063
         __do_sys_openat fs/open.c:1090 [inline]
         __se_sys_openat fs/open.c:1084 [inline]
         __x64_sys_openat+0x9d/0x100 fs/open.c:1084
         do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        The buggy address belongs to the object at ffff888090fc2580
         which belongs to the cache kmalloc-512 of size 512
        The buggy address is located 280 bytes inside of
         512-byte region [ffff888090fc2580, ffff888090fc2780)
      
      Reported-by: default avatar <syzbot+17335689e239ce135d8b@syzkaller.appspotmail.com>
      Signed-off-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Fixes: 669de8bd ("kernel/workqueue: Use dynamic lockdep keys for workqueues")
      Link: https://lkml.kernel.org/r/20190303220046.29448-1-bvanassche@acm.org
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      009bb421
Loading