  1. Apr 14, 2019
  2. Apr 12, 2019
    • perf/core: Fix perf_event_disable_inatomic() race · 1d54ad94
      Peter Zijlstra authored
      
      Thomas-Mich Richter reported he triggered a WARN()ing from event_function_local()
      on his s390. The problem boils down to:
      
      	CPU-A				CPU-B
      
      	perf_event_overflow()
      	  perf_event_disable_inatomic()
      	    @pending_disable = 1
      	    irq_work_queue();
      
      	sched-out
      	  event_sched_out()
      	    @pending_disable = 0
      
      					sched-in
      					perf_event_overflow()
      					  perf_event_disable_inatomic()
      					    @pending_disable = 1;
      					    irq_work_queue(); // FAILS
      
      	irq_work_run()
      	  perf_pending_event()
      	    if (@pending_disable)
      	      perf_event_disable_local(); // WHOOPS
      
      The problem exists in generic code, but s390 is particularly sensitive
      because it doesn't implement arch_irq_work_raise(), nor does it call
      irq_work_run() from its PMU interrupt handler (nor would that be
      sufficient in this case, because s390 also generates
      perf_event_overflow() from pmu::stop). Add to that the fact that s390
      is a virtual architecture and (virtual) CPU-A can stall long enough
      for the above race to happen, even if it would self-IPI.
      
      Adding an irq_work_sync() to event_sched_in() would work for all hardware
      PMUs that properly use irq_work_run() but fails for software PMUs.
      
      Instead encode the CPU number in @pending_disable, such that we can
      tell which CPU requested the disable. This then allows us to detect
      the above scenario and even redirect the IPI to make up for the failed
      queue.
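
      The encoding idea can be sketched in user space as follows. This is only
      an illustration under stated assumptions, not the kernel code: the
      struct and the toy_* helpers are hypothetical, and -1 is assumed to mean
      "no disable pending" while any value >= 0 names the requesting CPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the event; -1 means no disable is pending. */
struct toy_event {
    int pending_disable;   /* CPU that requested the disable, or -1 */
    bool disabled;
};

/* Overflow path: record which CPU asked for the disable. */
static void toy_disable_inatomic(struct toy_event *e, int cpu)
{
    assert(cpu >= 0);
    e->pending_disable = cpu;
}

/* Sched-out clears the request. */
static void toy_sched_out(struct toy_event *e)
{
    e->pending_disable = -1;
}

/*
 * Deferred handler running on this_cpu. Returns the CPU the work should be
 * redirected to, or -1 when nothing further is needed.
 */
static int toy_pending_event(struct toy_event *e, int this_cpu)
{
    int cpu = e->pending_disable;

    if (cpu < 0)
        return -1;          /* nothing pending */
    if (cpu != this_cpu)
        return cpu;         /* stale work: another CPU owns the request */
    e->pending_disable = -1;
    e->disabled = true;     /* we are the requesting CPU, disable locally */
    return -1;
}
```

      Replaying the race from the diagram, the stale handler on CPU-A sees a
      request owned by CPU-B and hands it over instead of disabling locally.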
      
      Reported-by: Thomas-Mich Richter <tmricht@linux.ibm.com>
      Tested-by: Thomas Richter <tmricht@linux.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hendrik Brueckner <brueckner@linux.ibm.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1d54ad94
  3. Apr 11, 2019
    • dma-debug: only skip one stackframe entry · 8c516543
      Scott Wood authored
      
      With skip set to 1, I get a traceback like this:
      
      [  106.867637] DMA-API: Mapped at:
      [  106.870784]  afu_dma_map_region+0x2cd/0x4f0 [dfl_afu]
      [  106.875839]  afu_ioctl+0x258/0x380 [dfl_afu]
      [  106.880108]  do_vfs_ioctl+0xa9/0x720
      [  106.883688]  ksys_ioctl+0x60/0x90
      [  106.887007]  __x64_sys_ioctl+0x16/0x20
      
      With the previous value of 2, afu_dma_map_region was being omitted.  I
      suspect that the code paths have simply changed since the value of 2 was
      chosen a decade ago, but it's also possible that it varies based on which
      mapping function was used, compiler inlining choices, etc.  In any case,
      it's best to err on the side of skipping less.
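
      The effect of the skip count can be illustrated with a small sketch. The
      helper below is hypothetical and only models trimming leading entries
      from a captured trace; it is not the dma-debug implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Copy frames[skip..n) into out and return how many entries were kept. */
static size_t toy_trim_trace(const char *const *frames, size_t n,
                             size_t skip, const char **out)
{
    size_t kept = 0;
    size_t i;

    assert(frames != NULL && out != NULL);
    for (i = skip; i < n; i++)
        out[kept++] = frames[i];
    return kept;
}
```

      If only one internal frame precedes the caller, a skip of 2 drops the
      caller itself, which is the omission described above.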
      
      Signed-off-by: Scott Wood <swood@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      8c516543
  4. Apr 10, 2019
    • alarmtimer: Return correct remaining time · 07d7e120
      Andrei Vagin authored
      
      To calculate the remaining time, the current time must be subtracted
      from the expiration time. In alarm_timer_remaining() the arguments of
      ktime_sub() are swapped.
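
      A minimal sketch of the operand order, using a hypothetical 64-bit
      nanosecond type in place of ktime_t. The toy_* names are illustrative
      only.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t toy_ktime_t;   /* nanoseconds; stand-in for ktime_t */

/* Same operand order as ktime_sub(lhs, rhs): lhs - rhs. */
static toy_ktime_t toy_ktime_sub(toy_ktime_t lhs, toy_ktime_t rhs)
{
    return lhs - rhs;
}

/* Remaining time = expiration time - current time. */
static toy_ktime_t toy_remaining(toy_ktime_t expires, toy_ktime_t now)
{
    return toy_ktime_sub(expires, now);
}
```

      With the operands swapped, a timer that still has time left reports a
      negative value of the same magnitude.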
      
      Fixes: d653d845 ("alarmtimer: Implement remaining callback")
      Signed-off-by: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Stephen Boyd <sboyd@kernel.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190408041542.26338-1-avagin@gmail.com
      07d7e120
    • locking/lockdep: Zap lock classes even with lock debugging disabled · 90c1cba2
      Bart Van Assche authored
      
      The following commit:
      
        a0b0fd53 ("locking/lockdep: Free lock classes that are no longer in use")
      
      changed the behavior of lockdep_free_key_range() from
      unconditionally zapping lock classes into only zapping lock classes if
      debug_lock == true. Not zapping lock classes if debug_lock == false leaves
      dangling pointers in several lockdep data structures, e.g. lock_class::name
      in the all_lock_classes list.

      The shell command "cat /proc/lockdep" causes the kernel to iterate the
      all_lock_classes list. Hence the "unable to handle kernel paging request" crash
      that Shenghui encountered by running cat /proc/lockdep.
      
      Since the new behavior can cause cat /proc/lockdep to crash, restore the
      pre-v5.1 behavior.
      
      This patch prevents cat /proc/lockdep from triggering the following crash
      with debug_lock == false:
      
        BUG: unable to handle kernel paging request at fffffbfff40ca448
        RIP: 0010:__asan_load1+0x28/0x50
        Call Trace:
         string+0xac/0x180
         vsnprintf+0x23e/0x820
         seq_vprintf+0x82/0xc0
         seq_printf+0x92/0xb0
         print_name+0x34/0xb0
         l_show+0x184/0x200
         seq_read+0x59e/0x6c0
         proc_reg_read+0x11f/0x170
         __vfs_read+0x4d/0x90
         vfs_read+0xc5/0x1f0
         ksys_read+0xab/0x130
         __x64_sys_read+0x43/0x50
         do_syscall_64+0x71/0x210
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Reported-by: shenghui <shhuiw@foxmail.com>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Fixes: a0b0fd53 ("locking/lockdep: Free lock classes that are no longer in use") # v5.1-rc1.
      Link: https://lkml.kernel.org/r/20190403233552.124673-1-bvanassche@acm.org

      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      90c1cba2
  5. Apr 06, 2019
    • kernel/sysctl.c: fix out-of-bounds access when setting file-max · 9002b214
      Will Deacon authored

      Commit 32a5ad9c ("sysctl: handle overflow for file-max") hooked up
      min/max values for the file-max sysctl parameter via the .extra1 and
      .extra2 fields in the corresponding struct ctl_table entry.
      
      Unfortunately, the minimum value points at the global 'zero' variable,
      which is an int.  This results in a KASAN splat when accessed as a long
      by proc_doulongvec_minmax on 64-bit architectures:
      
        | BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x5d8/0x6a0
        | Read of size 8 at addr ffff2000133d1c20 by task systemd/1
        |
        | CPU: 0 PID: 1 Comm: systemd Not tainted 5.1.0-rc3-00012-g40b114779944 #2
        | Hardware name: linux,dummy-virt (DT)
        | Call trace:
        |  dump_backtrace+0x0/0x228
        |  show_stack+0x14/0x20
        |  dump_stack+0xe8/0x124
        |  print_address_description+0x60/0x258
        |  kasan_report+0x140/0x1a0
        |  __asan_report_load8_noabort+0x18/0x20
        |  __do_proc_doulongvec_minmax+0x5d8/0x6a0
        |  proc_doulongvec_minmax+0x4c/0x78
        |  proc_sys_call_handler.isra.19+0x144/0x1d8
        |  proc_sys_write+0x34/0x58
        |  __vfs_write+0x54/0xe8
        |  vfs_write+0x124/0x3c0
        |  ksys_write+0xbc/0x168
        |  __arm64_sys_write+0x68/0x98
        |  el0_svc_common+0x100/0x258
        |  el0_svc_handler+0x48/0xc0
        |  el0_svc+0x8/0xc
        |
        | The buggy address belongs to the variable:
        |  zero+0x0/0x40
        |
        | Memory state around the buggy address:
        |  ffff2000133d1b00: 00 00 00 00 00 00 00 00 fa fa fa fa 04 fa fa fa
        |  ffff2000133d1b80: fa fa fa fa 04 fa fa fa fa fa fa fa 04 fa fa fa
        | >ffff2000133d1c00: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 00 00
        |                                ^
        |  ffff2000133d1c80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00
        |  ffff2000133d1d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      
      Fix the splat by introducing an unsigned long 'zero_ul' and using that
      instead.
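
      The type mismatch can be illustrated with a small sketch. The handler
      below is hypothetical and only mimics a helper that reads its bounds as
      unsigned long; toy_zero_ul plays the role of 'zero_ul'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The bound must have the type the handler reads it as. */
static unsigned long toy_zero_ul = 0;

/*
 * Hypothetical range check that dereferences min and max as unsigned long.
 * Handing it the address of an int would make it read
 * sizeof(unsigned long) bytes from a sizeof(int) object.
 */
static bool toy_in_range(unsigned long val,
                         const unsigned long *min, const unsigned long *max)
{
    if (min != NULL && val < *min)
        return false;
    if (max != NULL && val > *max)
        return false;
    return true;
}
```

      Because the parameters are typed pointers here, the compiler would flag
      an int bound; the sysctl table stores bounds as untyped pointers, so the
      mismatch only shows up at run time.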
      
      Link: http://lkml.kernel.org/r/20190403153409.17307-1-will.deacon@arm.com

      Fixes: 32a5ad9c ("sysctl: handle overflow for file-max")
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Christian Brauner <christian@brauner.io>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9002b214
  6. Apr 05, 2019
    • genirq: Respect IRQCHIP_SKIP_SET_WAKE in irq_chip_set_wake_parent() · 325aa195
      Stephen Boyd authored
      
      If a child irqchip calls irq_chip_set_wake_parent() but its parent irqchip
      has the IRQCHIP_SKIP_SET_WAKE flag set, an error is returned.
      
      This is inconsistent behaviour vs. set_irq_wake_real() which returns 0 when
      the irqchip has the IRQCHIP_SKIP_SET_WAKE flag set. It doesn't attempt to
      walk the chain of parents and set irq wake on any chips that don't have the
      flag set either. If the intent is to call the .irq_set_wake() callback of
      the parent irqchip, then we expect irqchip implementations to omit the
      IRQCHIP_SKIP_SET_WAKE flag and implement an .irq_set_wake() function that
      calls irq_chip_set_wake_parent().
      
      The problem has been observed on a Qualcomm sdm845 device where set wake
      fails on any GPIO interrupts after applying work in progress wakeup irq
      patches to the GPIO driver. The chain of chips looks like this:
      
           QCOM GPIO -> QCOM PDC (SKIP) -> ARM GIC (SKIP)
      
      The GPIO controller's parent is the QCOM PDC irqchip, which in turn has ARM
      GIC as parent.  The QCOM PDC irqchip has the IRQCHIP_SKIP_SET_WAKE flag
      set, and so does the grandparent ARM GIC.
      
      The GPIO driver doesn't know if the parent needs to set wake or not, so it
      unconditionally calls irq_chip_set_wake_parent() causing this function to
      return a failure because the parent irqchip (PDC) doesn't have the
      .irq_set_wake() callback set. Returning 0 instead makes everything work and
      irqs from the GPIO controller can be configured for wakeup.
      
      Make it consistent by returning 0 (success) from irq_chip_set_wake_parent()
      when a parent chip has IRQCHIP_SKIP_SET_WAKE set.
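
      The intended behaviour can be sketched as below. The struct, the flag
      value and the helper are hypothetical simplifications, not the genirq
      data structures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_SKIP_SET_WAKE 0x1u   /* stand-in for IRQCHIP_SKIP_SET_WAKE */

struct toy_chip {
    unsigned int flags;
    int (*set_wake)(struct toy_chip *chip, unsigned int on);
    struct toy_chip *parent;
};

/* Forward a wake request from a child chip to its parent. */
static int toy_set_wake_parent(struct toy_chip *chip, unsigned int on)
{
    struct toy_chip *parent;

    assert(chip != NULL);
    parent = chip->parent;
    if (parent == NULL)
        return -ENOSYS;
    if (parent->flags & TOY_SKIP_SET_WAKE)
        return 0;                /* parent opted out: report success */
    if (parent->set_wake != NULL)
        return parent->set_wake(parent, on);
    return -ENOSYS;              /* no callback and no opt-out flag */
}
```

      For the GPIO -> PDC (SKIP) -> GIC (SKIP) chain above, the call from the
      GPIO chip returns 0 in this sketch because the PDC carries the flag.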
      
      [ tglx: Massaged changelog ]
      
      Fixes: 08b55e2a ("genirq: Add irqchip_set_wake_parent")
      Signed-off-by: Stephen Boyd <swboyd@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-gpio@vger.kernel.org
      Cc: Lina Iyer <ilina@codeaurora.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190325181026.247796-1-swboyd@chromium.org
      325aa195
    • syscalls: Remove start and number from syscall_get_arguments() args · b35f549d
      Steven Rostedt (Red Hat) authored

      At Linux Plumbers, Andy Lutomirski approached me and pointed out that the
      function call syscall_get_arguments() implemented in x86 was horribly
      written and not optimized for the standard case of passing in 0 and 6 for
      the starting index and the number of arguments to get. When looking at
      all the users of this function, I discovered that all instances pass in only
      0 and 6 for these arguments. Instead of having this function handle
      different cases that are never used, simply rewrite it to return the first 6
      arguments of a system call.
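
      The simplified shape can be sketched as follows. The register layout
      and names are hypothetical; each architecture reads the arguments from
      its own pt_regs layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_NR_ARGS 6

/* Hypothetical register file holding the six syscall arguments. */
struct toy_regs {
    unsigned long arg[TOY_NR_ARGS];
};

/* Always copy all six arguments; no start index or count to validate. */
static void toy_syscall_get_arguments(const struct toy_regs *regs,
                                      unsigned long *args)
{
    assert(regs != NULL && args != NULL);
    memcpy(args, regs->arg, sizeof(regs->arg));
}
```

      Callers pass a buffer with room for six values and no longer supply a
      range.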
      
      This should help out the performance of tracing system calls by ptrace,
      ftrace and perf.
      
      Link: http://lkml.kernel.org/r/20161107213233.754809394@goodmis.org

      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Dave Martin <dave.martin@arm.com>
      Cc: "Dmitry V. Levin" <ldv@altlinux.org>
      Cc: x86@kernel.org
      Cc: linux-snps-arc@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: uclinux-h8-devel@lists.sourceforge.jp
      Cc: linux-hexagon@vger.kernel.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-mips@vger.kernel.org
      Cc: nios2-dev@lists.rocketboards.org
      Cc: openrisc@lists.librecores.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-riscv@lists.infradead.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-um@lists.infradead.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts
      Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes
      Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86
      Reviewed-by: Dmitry V. Levin <ldv@altlinux.org>
      Reported-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      b35f549d
    • genirq: Initialize request_mutex if CONFIG_SPARSE_IRQ=n · e8458e7a
      Kefeng Wang authored
      
      When CONFIG_SPARSE_IRQ is disabled, the request_mutex in struct irq_desc
      is not initialized, which causes a malfunction.
      
      Fixes: 9114014c ("genirq: Add mutex to irq desc to serialize request/free_irq()")
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: <linux-arm-kernel@lists.infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190404074512.145533-1-wangkefeng.wang@huawei.com
      e8458e7a
  7. Apr 04, 2019
  8. Apr 03, 2019
    • sched/fair: Do not re-read ->h_load_next during hierarchical load calculation · 0e9f0245
      Mel Gorman authored
      
      A NULL pointer dereference bug was reported on a distribution kernel but
      the same issue should be present on the mainline kernel. It occurred on
      s390 but should not be arch-specific.  A partial oops looks like:
      
        Unable to handle kernel pointer dereference in virtual kernel address space
        ...
        Call Trace:
          ...
          try_to_wake_up+0xfc/0x450
          vhost_poll_wakeup+0x3a/0x50 [vhost]
          __wake_up_common+0xbc/0x178
          __wake_up_common_lock+0x9e/0x160
          __wake_up_sync_key+0x4e/0x60
          sock_def_readable+0x5e/0x98
      
      The bug hits any time between 1 hour and 3 days. The dereference occurs
      in update_cfs_rq_h_load when accumulating h_load. The problem is that
      cfs_rq->h_load_next is not protected by any locking and can be updated
      by parallel calls to task_h_load. Depending on the compiler, code may be
      generated that re-reads cfs_rq->h_load_next after the check for NULL and
      then oopses when reading se->avg.load_avg. The disassembly showed that it
      was possible to reread h_load_next after the check for NULL.
      
      While this does not appear to be an issue for later compilers, it's still
      an accident if the correct code is generated. Full locking in this path
      would have high overhead so this patch uses READ_ONCE to read h_load_next
      only once and check for NULL before dereferencing. It was confirmed that
      there were no further oops after 10 days of testing.
      
      As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any
      potential problems with store tearing.
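
      The read-once pattern can be sketched in user space as below. The
      volatile accessors are simplified analogues of READ_ONCE()/WRITE_ONCE()
      written for one pointer type, and the toy_* structs are hypothetical.
      They only stop the compiler from re-reading or tearing the access and
      say nothing about hardware ordering.

```c
#include <assert.h>
#include <stddef.h>

struct toy_se {
    unsigned long load_avg;
};

struct toy_cfs_rq {
    struct toy_se *h_load_next;   /* may be updated concurrently */
};

/* Single load of the pointer, in the spirit of READ_ONCE(). */
static struct toy_se *toy_read_once_next(struct toy_cfs_rq *rq)
{
    return *(struct toy_se * volatile *)&rq->h_load_next;
}

/* Single store of the pointer, in the spirit of WRITE_ONCE(). */
static void toy_write_once_next(struct toy_cfs_rq *rq, struct toy_se *se)
{
    *(struct toy_se * volatile *)&rq->h_load_next = se;
}

/* Load once, then test and dereference only the local copy. */
static unsigned long toy_accumulate(struct toy_cfs_rq *rq, unsigned long load)
{
    struct toy_se *se = toy_read_once_next(rq);

    if (se == NULL)
        return load;
    return load + se->load_avg;
}
```

      The NULL check and the dereference both use the local copy, so a
      concurrent update between them cannot turn a checked pointer into NULL.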
      
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Fixes: 68520796 ("sched: Move h_load calculation to task_h_load()")
      Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.net

      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0e9f0245
  9. Apr 01, 2019
    • signal: don't silently convert SI_USER signals to non-current pidfd · 556a888a
      Jann Horn authored
      
      The current sys_pidfd_send_signal() silently turns signals with explicit
      SI_USER context that are sent to non-current tasks into signals with
      kernel-generated siginfo.
      This is unlike do_rt_sigqueueinfo(), which returns -EPERM in this case.
      If a user actually wants to send a signal with kernel-provided siginfo,
      they can do that with pidfd_send_signal(pidfd, sig, NULL, 0); so allowing
      this case is unnecessary.
      
      Instead of silently replacing the siginfo, just bail out with an error;
      this is consistent with other interfaces and avoids special-casing behavior
      based on security checks.
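
      The resulting policy can be sketched as a small decision function. This
      is a hypothetical model, not the syscall implementation: it assumes that
      a non-negative si_code marks siginfo that claims a kernel or kill()
      origin, and it uses the Linux values of SI_USER and SI_QUEUE under
      toy names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_SI_USER   0      /* value of SI_USER on Linux */
#define TOY_SI_QUEUE  (-1)   /* value of SI_QUEUE on Linux */

/*
 * Decide whether caller-supplied siginfo is acceptable. Returns 0 when the
 * signal may be sent as given, -EPERM when it must be refused.
 */
static int toy_check_siginfo(bool target_is_current, int si_code)
{
    if (!target_is_current && si_code >= TOY_SI_USER)
        return -EPERM;   /* refuse instead of silently rewriting siginfo */
    return 0;
}
```

      A caller that wants kernel-provided siginfo passes no siginfo at all, so
      this check never has to rewrite anything.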
      
      Fixes: 3eb39f47 ("signal: add pidfd_send_signal() syscall")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Christian Brauner <christian@brauner.io>
      556a888a
  10. Mar 29, 2019
    • xdp: fix cpumap redirect SKB creation bug · 676e4a6f
      Jesper Dangaard Brouer authored
      
      We want to avoid leaking pointer info from xdp_frame (which is placed at
      the top of the frame), as in commit 6dfb970d ("xdp: avoid leaking info
      stored in frame data on page reuse") and the follow-up commit 97e19cce
      ("bpf: reserve xdp_frame size in xdp headroom"), which reserves this
      headroom.

      These changes also affected how cpumap constructed SKBs: as the
      xdpf->headroom size changed, the skb data starting point was in effect
      shifted by 32 bytes (sizeof xdp_frame). This was still okay, as the cpumap
      frame_size calculation also included xdpf->headroom, which was reduced by
      the same amount.
      
      A bug was introduced in commit 77ea5f4c ("bpf/cpumap: make sure
      frame_size for build_skb is aligned if headroom isn't"), where the
      xdpf->headroom became part of the SKB_DATA_ALIGN rounding up. This
      round-up to find the frame_size is in principle still correct as it does
      not exceed the 2048 bytes frame_size (which is max for ixgbe and i40e),
      but the 32 bytes offset of pkt_data_start puts this over the 2048 bytes
      limit. This causes skb_shared_info to spill into the next frame. It is a
      little hard to trigger, as the SKB needs to use more than 15
      skb_shinfo->frags[] as far as I calculate. This does happen in practice
      for TCP streams when skb_try_coalesce() kicks in.
      
      KASAN can be used to detect these wrong memory accesses, I've seen:
       BUG: KASAN: use-after-free in skb_try_coalesce+0x3cb/0x760
       BUG: KASAN: wild-memory-access in skb_release_data+0xe2/0x250
      
      Driver veth also constructs an SKB from xdp_frame in this way, but is not
      affected, as it doesn't reserve/deduct the room (used by xdp_frame) from
      the SKB headroom. Instead it clears the pointers via xdp_scrub_frame(),
      and allows the SKB to use this area.
      
      The fix in this patch is to do like veth and instead allow SKB to (re)use
      the area occupied by xdp_frame, by clearing via xdp_scrub_frame().  (This
      does kill the idea of the SKB being able to access (mem) info from this
      area, but I guess it was a bad idea anyhow, and it was already killed by
      the veth changes.)
      
      Fixes: 77ea5f4c ("bpf/cpumap: make sure frame_size for build_skb is aligned if headroom isn't")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      676e4a6f
    • ptrace: take into account saved_sigmask in PTRACE{GET,SET}SIGMASK · fcfc2aa0
      Andrei Vagin authored

      There are a few system calls (pselect, ppoll, etc) which replace a task
      sigmask while they are running in kernel space.

      When a task calls one of these syscalls, the kernel saves the current
      sigmask in task->saved_sigmask and sets a syscall sigmask.
      
      On syscall-exit-stop, ptrace traps a task before restoring the
      saved_sigmask, so PTRACE_GETSIGMASK returns the syscall sigmask and
      PTRACE_SETSIGMASK does nothing, because its sigmask is replaced by
      saved_sigmask, when the task returns to user-space.
      
      This patch fixes this problem.  PTRACE_GETSIGMASK returns saved_sigmask
      if it's set.  PTRACE_SETSIGMASK drops the TIF_RESTORE_SIGMASK flag.
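
      The fixed behaviour can be modelled with a small sketch. The struct and
      helpers are hypothetical: sigmasks are reduced to a single word and the
      TIF_RESTORE_SIGMASK flag to a boolean.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_task {
    unsigned long blocked;         /* sigmask in effect during the syscall */
    unsigned long saved_sigmask;   /* mask to restore on return to user space */
    bool restore_sigmask;          /* stand-in for TIF_RESTORE_SIGMASK */
};

/* GETSIGMASK: report the mask user space will see after the syscall. */
static unsigned long toy_getsigmask(const struct toy_task *t)
{
    return t->restore_sigmask ? t->saved_sigmask : t->blocked;
}

/* SETSIGMASK: install the mask and cancel the pending restore. */
static void toy_setsigmask(struct toy_task *t, unsigned long mask)
{
    t->blocked = mask;
    t->restore_sigmask = false;
}

/* Return to user space: restore the saved mask only if still requested. */
static void toy_exit_to_user(struct toy_task *t)
{
    if (t->restore_sigmask) {
        t->blocked = t->saved_sigmask;
        t->restore_sigmask = false;
    }
}
```

      Without clearing the flag in the setter, the exit path would overwrite
      the tracer's mask with the saved one.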
      
      Link: http://lkml.kernel.org/r/20181120060616.6043-1-avagin@gmail.com

      Fixes: 29000cae ("ptrace: add ability to get/set signal-blocked mask")
      Signed-off-by: Andrei Vagin <avagin@gmail.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fcfc2aa0
  11. Mar 28, 2019
    • cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n · 206b9235
      Thomas Gleixner authored
      
      Tianyu reported a crash in a CPU hotplug teardown callback when booting a
      kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot
      parameter.
      
      It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken
      forever in case that a bringup callback fails. Unfortunately this issue was
      not recognized when the CPU hotplug code was reworked, so the shortcoming
      just stayed in place.
      
      When a bringup callback fails, the CPU hotplug code rolls back the
      operation and takes the CPU offline.
      
      The 'nosmt' command line argument uses a bringup failure to abort the
      bringup of SMT sibling CPUs. This partial bringup is required due to the
      MCE misdesign on Intel CPUs.
      
      With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but
      CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level
      teardown of a CPU including the synchronizations in various facilities like
      RCU, NOHZ and others.
      
      As a consequence the teardown callbacks which must be executed on the
      outgoing CPU within stop machine with interrupts disabled are executed on
      the control CPU in interrupt enabled and preemptible context causing the
      kernel to crash and burn. The pre state machine code has a different
      failure mode which is more subtle and results in a less obvious
      use-after-free crash because the control side frees resources which are
      still in use by the undead CPU.
      
      But this is not an x86-only problem. Any architecture which supports the
      SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less
      likely to be triggered because in 99.99999% of the cases all bringup
      callbacks succeed.
      
      The easy solution of making HOTPLUG_CPU mandatory for SMP is not working on
      all architectures as the following architectures have either no hotplug
      support at all or not all subarchitectures support it:
      
       alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial).
      
      Crashing the kernel in such a situation is not an acceptable state
      either.
      
      Implement a minimal rollback variant by limiting the teardown to the point
      where all regular teardown callbacks have been invoked and leave the CPU in
      the 'dead' idle state. This has the following consequences:
      
       - the CPU is brought down to the point where the stop_machine takedown
         would happen.
      
       - the CPU stays there forever and is idle
      
       - The CPU is cleared in the CPU active mask, but not in the CPU online
         mask which is a legit state.
      
       - Interrupts are not forced away from the CPU
      
       - All facilities which only look at online mask would still see it, but
         that is the case during normal hotplug/unplug operations as well. It's
         just a (way) longer time frame.
      
      This will expose issues which haven't been exposed before, or only seldom,
      because now the normally transient state of being non-active but online is
      a permanent state. In testing this already exposed an issue vs. work queues
      where the vmstat code schedules work on the almost dead CPU which ends up
      in an unbound workqueue and triggers 'preemptible context' warnings. This is
      not a problem of this change, it merely exposes an already existing issue.
      Still this is better than crashing fully without a chance to debug it.
      
      This is mainly intended as a workaround for those architectures which do not
      support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP.
      
      Fixes: 2e1a3483 ("cpu/hotplug: Split out the state walk into functions")
      Reported-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Konrad Wilk <konrad.wilk@oracle.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Micheal Kelley <michael.h.kelley@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.de
      206b9235
    • watchdog: Respect watchdog cpumask on CPU hotplug · 7dd47617
      Thomas Gleixner authored
      
      The rework of the watchdog core to use cpu_stop_work broke the watchdog
      cpumask on CPU hotplug.
      
      The watchdog_enable/disable() functions are now called unconditionally from
      the hotplug callback, i.e. even on CPUs which are not in the watchdog
      cpumask. As a consequence the watchdog can become unstoppable.
      
      Only invoke them when the plugged CPU is in the watchdog cpumask.
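
      The guard can be sketched as follows. The mask is modelled as one
      unsigned long and the helpers are hypothetical; the kernel uses its
      cpumask API rather than raw bit operations.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define TOY_MAX_CPUS (sizeof(unsigned long) * CHAR_BIT)

/* Test whether a CPU is part of the mask. */
static bool toy_cpu_in_mask(unsigned long mask, unsigned int cpu)
{
    assert(cpu < TOY_MAX_CPUS);
    return (mask >> cpu) & 1UL;
}

/*
 * Hotplug online callback: enable the watchdog only for CPUs in the
 * watchdog mask. Returns true when the watchdog was enabled.
 */
static bool toy_watchdog_online(unsigned long watchdog_mask, unsigned int cpu,
                                unsigned long *enabled)
{
    if (!toy_cpu_in_mask(watchdog_mask, cpu))
        return false;            /* CPU excluded: leave the watchdog off */
    *enabled |= 1UL << cpu;
    return true;
}
```

      The offline callback would apply the same test before disabling, so the
      enable and disable paths stay symmetric.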
      
      Fixes: 9cf57731 ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
      Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903262245490.1789@nanos.tec.linutronix.de
      7dd47617
  12. Mar 26, 2019
    • bpf: remove incorrect 'verifier bug' warning · 927cb781
      Paul Chaignon authored
      
      The BPF verifier checks the maximum number of call stack frames twice,
      first in the main CFG traversal (do_check) and then in a subsequent
      traversal (check_max_stack_depth).  If the second check fails, it logs a
      'verifier bug' warning and errors out, as the number of call stack frames
      should have been verified already.
      
      However, the second check may fail without indicating a verifier bug: if
      the excessive function calls reside in dead code, the main CFG traversal
      may not visit them; the subsequent traversal visits all instructions,
      including dead code.
      
      This case raises the question of how invalid dead code should be treated.
      This patch implements the conservative option and rejects such code.
      
      Signed-off-by: Paul Chaignon <paul.chaignon@orange.com>
      Tested-by: Xiao Han <xiao.han@orange.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      927cb781
    • ftrace: Fix warning using plain integer as NULL & spelling corrections · 9efb85c5
      Hariprasad Kelam authored

      Changed  0 --> NULL to avoid sparse warning
      Corrected spelling mistakes reported by checkpatch.pl
      Sparse warning below:
      
      sudo make C=2 CF=-D__CHECK_ENDIAN__ M=kernel/trace
      
      CHECK   kernel/trace/ftrace.c
      kernel/trace/ftrace.c:3007:24: warning: Using plain integer as NULL pointer
      kernel/trace/ftrace.c:4758:37: warning: Using plain integer as NULL pointer
      
      Link: http://lkml.kernel.org/r/20190323183523.GA2244@hari-Inspiron-1545

      Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      9efb85c5
    • tracing: initialize variable in create_dyn_event() · 3dee10da
      Frank Rowand authored

      Fix compile warning in create_dyn_event(): 'ret' may be used uninitialized
      in this function [-Wuninitialized].
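
      The class of warning can be illustrated with a sketch. The loop shape,
      the handler type and the initial value -ENODEV are assumptions chosen
      for illustration, not taken from create_dyn_event().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler table entry. */
struct toy_ops {
    int (*create)(const char *cmd);
};

/* Example handlers: one that rejects every command, one that accepts. */
static int toy_reject(const char *cmd) { (void)cmd; return -EINVAL; }
static int toy_accept(const char *cmd) { (void)cmd; return 0; }

/* Try each handler in turn; stop at the first one that succeeds. */
static int toy_create_dyn_event(const struct toy_ops *ops, size_t n,
                                const char *cmd)
{
    int ret = -ENODEV;   /* defined result even if the loop never runs */
    size_t i;

    for (i = 0; i < n; i++) {
        ret = ops[i].create(cmd);
        if (ret == 0)
            break;
    }
    return ret;
}
```

      Without the initializer, the n == 0 case would return an indeterminate
      value, which is the situation the compiler warns about.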
      
      Link: http://lkml.kernel.org/r/1553237900-8555-1-git-send-email-frowand.list@gmail.com

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Cc: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Fixes: 5448d44c ("tracing: Add unified dynamic event framework")
      Signed-off-by: Frank Rowand <frank.rowand@sony.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      3dee10da
    • Tom Zanussi's avatar
      tracing: Remove unnecessary var_ref destroy in track_data_destroy() · ff9d31d0
      Tom Zanussi authored
      Commit 656fe2ba (tracing: Use hist trigger's var_ref array to
      destroy var_refs) centralized the destruction of all the var_refs
      in one place so that other code didn't have to do it.
      
      The track_data_destroy() added later ignored that and also destroyed
      the track_data var_ref, causing a double-free error flagged by KASAN.
      
      ==================================================================
      BUG: KASAN: use-after-free in destroy_hist_field+0x30/0x70
      Read of size 8 at addr ffff888086df2210 by task bash/1694
      
      CPU: 6 PID: 1694 Comm: bash Not tainted 5.1.0-rc1-test+ #15
      Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03
      07/14/2016
      Call Trace:
       dump_stack+0x71/0xa0
       ? destroy_hist_field+0x30/0x70
       print_address_description.cold.3+0x9/0x1fb
       ? destroy_hist_field+0x30/0x70
       ? destroy_hist_field+0x30/0x70
       kasan_report.cold.4+0x1a/0x33
       ? __kasan_slab_free+0x100/0x150
       ? destroy_hist_field+0x30/0x70
       destroy_hist_field+0x30/0x70
       track_data_destroy+0x55/0xe0
       destroy_hist_data+0x1f0/0x350
       hist_unreg_all+0x203/0x220
       event_trigger_open+0xbb/0x130
       do_dentry_open+0x296/0x700
       ? stacktrace_count_trigger+0x30/0x30
       ? generic_permission+0x56/0x200
       ? __x64_sys_fchdir+0xd0/0xd0
       ? inode_permission+0x55/0x200
       ? security_inode_permission+0x18/0x60
       path_openat+0x633/0x22b0
       ? path_lookupat.isra.50+0x420/0x420
       ? __kasan_kmalloc.constprop.12+0xc1/0xd0
       ? kmem_cache_alloc+0xe5/0x260
       ? getname_flags+0x6c/0x2a0
       ? do_sys_open+0x149/0x2b0
       ? do_syscall_64+0x73/0x1b0
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
       ? _raw_write_lock_bh+0xe0/0xe0
       ? __kernel_text_address+0xe/0x30
       ? unwind_get_return_address+0x2f/0x50
       ? __list_add_valid+0x2d/0x70
       ? deactivate_slab.isra.62+0x1f4/0x5a0
       ? getname_flags+0x6c/0x2a0
       ? set_track+0x76/0x120
       do_filp_open+0x11a/0x1a0
       ? may_open_dev+0x50/0x50
       ? _raw_spin_lock+0x7a/0xd0
       ? _raw_write_lock_bh+0xe0/0xe0
       ? __alloc_fd+0x10f/0x200
       do_sys_open+0x1db/0x2b0
       ? filp_open+0x50/0x50
       do_syscall_64+0x73/0x1b0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x7fa7b24a4ca2
      Code: 25 00 00 41 00 3d 00 00 41 00 74 4c 48 8d 05 85 7a 0d 00 8b 00 85 c0
      75 6d 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff
      0f 87 a2 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
      RSP: 002b:00007fffbafb3af0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
      RAX: ffffffffffffffda RBX: 000055d3648ade30 RCX: 00007fa7b24a4ca2
      RDX: 0000000000000241 RSI: 000055d364a55240 RDI: 00000000ffffff9c
      RBP: 00007fffbafb3bf0 R08: 0000000000000020 R09: 0000000000000002
      R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000000
      R13: 0000000000000003 R14: 0000000000000001 R15: 000055d364a55240
      ==================================================================
      
      So remove the track_data_destroy() destroy_hist_field() call for that
      var_ref.
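      The ownership rule behind this fix can be sketched in userspace. All
      types and names below are hypothetical stand-ins for the hist trigger
      structures: one array owns every reference, and other holders only
      borrow pointers into it.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_REFS 8

struct var_ref {
	int value;
};

struct action {
	struct var_ref *track_ref;	/* borrowed; owned by hist.refs[] */
};

struct hist {
	struct var_ref *refs[MAX_REFS];	/* the single owner of all refs */
	size_t n_refs;
	struct action act;
};

static int frees;	/* counts releases so the sketch can be checked */

static struct var_ref *hist_new_ref(struct hist *h, int value)
{
	struct var_ref *r;

	if (h->n_refs >= MAX_REFS)
		return NULL;
	r = malloc(sizeof(*r));
	if (!r)
		return NULL;
	r->value = value;
	h->refs[h->n_refs++] = r;
	return r;
}

static void action_destroy(struct action *a)
{
	/* Only drop the borrowed pointer; freeing it here would double-free. */
	a->track_ref = NULL;
}

static void hist_destroy(struct hist *h)
{
	action_destroy(&h->act);
	for (size_t i = 0; i < h->n_refs; i++) {
		free(h->refs[i]);
		frees++;
	}
	h->n_refs = 0;
}
```

      Each reference is released exactly once, by the array that owns it,
      regardless of how many other structures point at it.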
      
      Link: http://lkml.kernel.org/r/1deffec420f6a16d11dd8647318d34a66d1989a9.camel@linux.intel.com
      
      Fixes: 466f4528 ("tracing: Generalize hist trigger onmax and save action")
      Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      ff9d31d0
    • Daniel Borkmann's avatar
      bpf: fix use after free in bpf_evict_inode · 1da6c4d9
      Daniel Borkmann authored
      
      syzkaller was able to generate the following UAF in bpf:
      
        BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline]
        BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
        Read of size 1 at addr ffff8801c4865c47 by task syz-executor2/9423
      
        CPU: 0 PID: 9423 Comm: syz-executor2 Not tainted 4.20.0-rc1-next-20181109+
        #110
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
        Google 01/01/2011
        Call Trace:
          __dump_stack lib/dump_stack.c:77 [inline]
          dump_stack+0x244/0x39d lib/dump_stack.c:113
          print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
          kasan_report_error mm/kasan/report.c:354 [inline]
          kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
          __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
          lookup_last fs/namei.c:2269 [inline]
          path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
          filename_lookup+0x26a/0x520 fs/namei.c:2348
          user_path_at_empty+0x40/0x50 fs/namei.c:2608
          user_path include/linux/namei.h:62 [inline]
          do_mount+0x180/0x1ff0 fs/namespace.c:2980
          ksys_mount+0x12d/0x140 fs/namespace.c:3258
          __do_sys_mount fs/namespace.c:3272 [inline]
          __se_sys_mount fs/namespace.c:3269 [inline]
          __x64_sys_mount+0xbe/0x150 fs/namespace.c:3269
          do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x457569
        Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
        48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
        ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
        RSP: 002b:00007fde6ed96c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
        RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457569
        RDX: 0000000020000040 RSI: 0000000020000000 RDI: 0000000000000000
        RBP: 000000000072bf00 R08: 0000000020000340 R09: 0000000000000000
        R10: 0000000000200000 R11: 0000000000000246 R12: 00007fde6ed976d4
        R13: 00000000004c2c24 R14: 00000000004d4990 R15: 00000000ffffffff
      
        Allocated by task 9424:
          save_stack+0x43/0xd0 mm/kasan/kasan.c:448
          set_track mm/kasan/kasan.c:460 [inline]
          kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
          __do_kmalloc mm/slab.c:3722 [inline]
          __kmalloc_track_caller+0x157/0x760 mm/slab.c:3737
          kstrdup+0x39/0x70 mm/util.c:49
          bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356
          vfs_symlink+0x37a/0x5d0 fs/namei.c:4127
          do_symlinkat+0x242/0x2d0 fs/namei.c:4154
          __do_sys_symlink fs/namei.c:4173 [inline]
          __se_sys_symlink fs/namei.c:4171 [inline]
          __x64_sys_symlink+0x59/0x80 fs/namei.c:4171
          do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        Freed by task 9425:
          save_stack+0x43/0xd0 mm/kasan/kasan.c:448
          set_track mm/kasan/kasan.c:460 [inline]
          __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
          kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
          __cache_free mm/slab.c:3498 [inline]
          kfree+0xcf/0x230 mm/slab.c:3817
          bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565
          evict+0x4b9/0x980 fs/inode.c:558
          iput_final fs/inode.c:1550 [inline]
          iput+0x674/0xa90 fs/inode.c:1576
          do_unlinkat+0x733/0xa30 fs/namei.c:4069
          __do_sys_unlink fs/namei.c:4110 [inline]
          __se_sys_unlink fs/namei.c:4108 [inline]
          __x64_sys_unlink+0x42/0x50 fs/namei.c:4108
          do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      In this scenario path lookup under RCU is racing with the final
      unlink in case of symlinks. As Linus puts it in his analysis:
      
        [...] We actually RCU-delay the inode freeing itself, but
        when we do the final iput(), the "evict()" function is called
        synchronously. Now, the simple fix would seem to just RCU-delay
        the kfree() of the symlink data in bpf_evict_inode(). Maybe
        that's the right thing to do. [...]
      
      Al suggested piggy-backing on the ->destroy_inode() callback to
      implement RCU deferral there, which can then kfree() inode->i_link
      right before the inode is put back into the inode cache. By reusing
      free_inode_nonrcu() from there, we avoid the need for our own inode
      cache and keep using the generic one, as we currently do.
      
      In fact, on top of all this we should get rid of bpf_evict_inode()
      entirely. truncate_inode_pages_final() and clear_inode() will then
      simply be called by the fs core via evict(). Dropping the reference
      should really only be done once the inode is unhashed and no longer
      reachable, so it is better moved into the final ->destroy_inode()
      callback as well.
      
      Fixes: 0f98621b ("bpf, inode: add support for symlinks and fix mtime/ctime")
      Reported-by: <syzbot+fb731ca573367b7f6564@syzkaller.appspotmail.com>
      Reported-by: <syzbot+a13e5ead792d6df37818@syzkaller.appspotmail.com>
      Reported-by: <syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com>
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/lkml/0000000000006946d2057bbd0eef@google.com/T/
      1da6c4d9
  13. Mar 23, 2019
  14. Mar 22, 2019
  15. Mar 21, 2019
    • Xu Yu's avatar
      bpf: do not restore dst_reg when cur_state is freed · 0803278b
      Xu Yu authored
      
      Syzkaller hit 'KASAN: use-after-free Write in sanitize_ptr_alu' bug.
      
      Call trace:
      
        dump_stack+0xbf/0x12e
        print_address_description+0x6a/0x280
        kasan_report+0x237/0x360
        sanitize_ptr_alu+0x85a/0x8d0
        adjust_ptr_min_max_vals+0x8f2/0x1ca0
        adjust_reg_min_max_vals+0x8ed/0x22e0
        do_check+0x1ca6/0x5d00
        bpf_check+0x9ca/0x2570
        bpf_prog_load+0xc91/0x1030
        __se_sys_bpf+0x61e/0x1f00
        do_syscall_64+0xc8/0x550
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fault injection trace:
      
        kfree+0xea/0x290
        free_func_state+0x4a/0x60
        free_verifier_state+0x61/0xe0
        push_stack+0x216/0x2f0	          <- inject failslab
        sanitize_ptr_alu+0x2b1/0x8d0
        adjust_ptr_min_max_vals+0x8f2/0x1ca0
        adjust_reg_min_max_vals+0x8ed/0x22e0
        do_check+0x1ca6/0x5d00
        bpf_check+0x9ca/0x2570
        bpf_prog_load+0xc91/0x1030
        __se_sys_bpf+0x61e/0x1f00
        do_syscall_64+0xc8/0x550
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      When kzalloc() fails in push_stack(), free_verifier_state() frees the
      current verifier state. After push_stack() returns, dst_reg is restored
      if ptr_is_dst_reg is false. However, dst_reg is a member of cur_state
      and has therefore been freed as well, so dereferencing it is a
      use-after-free. Fix this by testing the return value of push_stack()
      before restoring dst_reg.
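      The pattern can be modelled in userspace. The types and helpers below
      are simplified, hypothetical stand-ins for the verifier's state
      handling, and the allocation failure is simulated with a flag.

```c
#include <assert.h>
#include <stdlib.h>

struct state {
	int regs[4];
	struct state *next;
};

struct env {
	struct state *cur;	/* state currently being verified */
	struct state *stack;	/* states pushed for later exploration */
};

/* Push a copy of env->cur. On failure, env->cur is freed, as in the report. */
static int push_state(struct env *env, int simulate_failure)
{
	struct state *copy = simulate_failure ? NULL : malloc(sizeof(*copy));

	if (!copy) {
		free(env->cur);
		env->cur = NULL;
		return -1;
	}
	*copy = *env->cur;
	copy->next = env->stack;
	env->stack = copy;
	return 0;
}

static int sanitize(struct env *env, int tmp_val, int simulate_failure)
{
	int *dst = &env->cur->regs[0];
	int saved = *dst;
	int ret;

	*dst = tmp_val;		/* temporary value for the pushed path */
	ret = push_state(env, simulate_failure);
	if (ret)
		return ret;	/* env->cur was freed, so dst is dangling */
	*dst = saved;		/* safe: the current state is still alive */
	return 0;
}
```

      Checking the return value first means the restore never touches
      memory that the failure path has already released.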
      
      Fixes: 979d63d5 ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      0803278b
    • Bart Van Assche's avatar
      workqueue: Only unregister a registered lockdep key · 82efcab3
      Bart Van Assche authored
      
      The recent change to prevent use after free and a memory leak introduced an
      unconditional call to wq_unregister_lockdep() in the error handling
      path. If the lockdep key had not been registered yet, then the lockdep core
      emits a warning.
      
      Only call wq_unregister_lockdep() if wq_register_lockdep() has been
      called first.
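      One common way to get this ordering right is a separate error label
      per acquired resource. The sketch below is a userspace model with
      hypothetical names, and the counter stands in for the lockdep
      warning. It is not the workqueue code itself.

```c
#include <assert.h>
#include <stdbool.h>

struct key {
	bool registered;
};

struct wq {
	struct key key;
};

static int unregister_warnings;	/* models the warning from the lockdep core */

static void key_register(struct key *k)
{
	k->registered = true;
}

static void key_unregister(struct key *k)
{
	if (!k->registered)
		unregister_warnings++;	/* unregistering an unknown key */
	k->registered = false;
}

static int wq_setup(struct wq *wq, bool fail_early, bool fail_late)
{
	if (fail_early)
		goto err_free;		/* key not registered yet: skip unregister */

	key_register(&wq->key);

	if (fail_late)
		goto err_unreg;		/* key registered: must unregister */

	return 0;

err_unreg:
	key_unregister(&wq->key);
err_free:
	return -1;
}
```

      Failures before registration jump past the unregister step, so the
      warning path is never reached.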
      
      Fixes: 009bb421 ("workqueue, lockdep: Fix an alloc_workqueue() error path")
      Reported-by: <syzbot+be0c198232f86389c3dd@syzkaller.appspotmail.com>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Qian Cai <cai@lca.pw>
      Link: https://lkml.kernel.org/r/20190311230255.176081-1-bvanassche@acm.org
      82efcab3
    • Martin KaFai Lau's avatar
      bpf: Only print ref_obj_id for refcounted reg · cba368c1
      Martin KaFai Lau authored
      
      Naresh reported that test_align fails because of a mismatch in the
      verbose printout of the register states.  The cause is the newly
      added ref_obj_id.
      
      ref_obj_id is only useful for a refcounted reg.  This patch therefore
      prints ref_obj_id only for a refcounted reg.  While at it, it also uses
      a comma instead of a space to separate "id" and "ref_obj_id".
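      A minimal sketch of conditional printing along these lines. The
      struct, the function and the exact output format are assumptions made
      for illustration and do not reproduce the verifier's log format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, reduced register state. */
struct reg_state {
	int id;
	int ref_obj_id;
	bool refcounted;
};

/* Format a register; returns the length written, or -1 on truncation. */
static int format_reg(char *buf, size_t len, const struct reg_state *reg)
{
	int n;

	if (reg->refcounted)
		n = snprintf(buf, len, "(id=%d,ref_obj_id=%d)",
			     reg->id, reg->ref_obj_id);
	else
		n = snprintf(buf, len, "(id=%d)", reg->id);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

      Output for registers that are not refcounted stays unchanged, which is
      what keeps existing expected-output tests stable.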
      
      Fixes: 1b986589 ("bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release")
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      cba368c1
  16. Mar 19, 2019
  17. Mar 18, 2019
    • Martynas Pumputis's avatar
      bpf: Try harder when allocating memory for large maps · f01a7dbe
      Martynas Pumputis authored
      It has been observed that a higher-order memory allocation for BPF maps
      sometimes fails even when there is no obvious memory pressure in the
      system. E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56,
      max_elems=524288) could not be created because vmalloc was unable to
      allocate 75497472 B, when the system's memory consumption (in MB) was
      the following:
      
          Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727
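      For scale, the failed request is 72 MiB, or 144 bytes for each of the
      524288 elements. How those 144 bytes divide between key, value and
      per-element bookkeeping is not stated here, so the sketch below only
      checks the arithmetic. The helper names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Average number of bytes requested per map element. */
static uint64_t bytes_per_elem(uint64_t total_bytes, uint64_t max_elems)
{
	return total_bytes / max_elems;
}

/* Convert a byte count to whole MiB (1 MiB = 2^20 bytes). */
static uint64_t to_mib(uint64_t bytes)
{
	return bytes >> 20;
}
```

      The request is therefore about half of the 138 MB reported as free,
      while most of the remaining memory sits in the page cache.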
      
      Later analysis [1] by Michal Hocko showed that the vmalloc was not trying
      to reclaim memory from the page cache and was failing prematurely due to
      __GFP_NORETRY.
      
      Considering dcda9b04 ("mm, tree wide: replace __GFP_REPEAT by
      __GFP_RETRY_MAYFAIL with more useful semantic") and [1], we can replace
      __GFP_NORETRY with __GFP_RETRY_MAYFAIL, as it won't invoke OOM killer
      and will try harder to fulfil allocation requests.
      
      Unfortunately, replacing the body of the BPF map memory allocation
      function with the kvmalloc_node helper function is not an option at
      this point in time, given 1) kmalloc is non-optional for higher order
      allocations, and 2) passing __GFP_RETRY_MAYFAIL to the kmalloc would
      stress the slab allocator too much for large requests.
      
      The change has been tested with the workloads mentioned above and by
      observing oom_kill value from /proc/vmstat.
      
      [1]: https://lore.kernel.org/bpf/20190310071318.GW5232@dhcp22.suse.cz/
      
      Signed-off-by: Martynas Pumputis <m@lambda.lt>
      Acked-by: Yonghong Song <yhs@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20190318153940.GL8924@dhcp22.suse.cz/
      f01a7dbe
  18. Mar 14, 2019