  1. May 16, 2019
    • signal: unconditionally leave the frozen state in ptrace_stop() · 05b28926
      Roman Gushchin authored
      
      Alex Xu reported a regression in strace, caused by the introduction of
      the cgroup v2 freezer. The regression can be reproduced by stracing
      the following simple program:
      
        #include <unistd.h>
      
        int main() {
            write(1, "a", 1);
            return 0;
        }
      
      An attempt to run strace ./a.out leads to an infinite loop:
        [ pre-main omitted ]
        write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
        write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
        write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
        write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
        write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
        write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
        [ repeats forever ]
      
      The problem occurs because the traced task leaves ptrace_stop()
      (and the signal handling loop) with the frozen bit set. So let's
      call cgroup_leave_frozen(true) unconditionally after sleeping
      in ptrace_stop().
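      
      A minimal sketch of the resulting control flow in ptrace_stop()
      (kernel/signal.c); this is illustrative rather than the verbatim
      diff, what matters is that the call now happens on every wakeup:
      
        /* the tracee sleeps here while the tracer inspects it ... */
        freezable_schedule();
        /* ... and unconditionally drops the cgroup v2 frozen state on
         * wakeup, so it cannot re-enter the signal handling loop with
         * the frozen bit still set */
        cgroup_leave_frozen(true);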
      
      With this patch applied, strace works as expected:
        [ pre-main omitted ]
        write(1, "a", 1)                        = 1
        exit_group(0)                           = ?
        +++ exited with 0 +++
      
      Reported-by: Alex Xu <alex_y_xu@yahoo.ca>
      Fixes: 76f969e8 ("cgroup: cgroup v2 freezer")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  2. May 15, 2019
  3. May 14, 2019
    • Christoph Hellwig · 640be2d1
    • mm: memblock: make keeping memblock memory opt-in rather than opt-out · 350e88ba
      Mike Rapoport authored
      Most architectures do not need the memblock memory after the page
      allocator is initialized, but only a few enable ARCH_DISCARD_MEMBLOCK
      in their arch Kconfig.
      
      Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting
      the logic makes it clear which architectures actually use memblock
      after system initialization, and removes the need to add
      ARCH_DISCARD_MEMBLOCK to the architectures that are still missing
      that option.
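      
      A hedged sketch of what the inversion looks like on the C side, in
      the style of the memblock annotations (the exact macro layout is
      assumed, not quoted from the patch):
      
        #ifdef CONFIG_ARCH_KEEP_MEMBLOCK
        /* the architecture opted in: memblock data stays after boot */
        #define __init_memblock
        #define __initdata_memblock
        #else
        /* default: memblock metadata is discarded once the page
         * allocator is up */
        #define __init_memblock     __meminit
        #define __initdata_memblock __meminitdata
        #endif
      
      Architectures that still need memblock at runtime then simply
      select ARCH_KEEP_MEMBLOCK in their Kconfig.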
      
      Link: http://lkml.kernel.org/r/1556102150-32517-1-git-send-email-rppt@linux.ibm.com
      
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: provide a more generic restrictions for memory hotplug · 940519f0
      Michal Hocko authored
      arch_add_memory and __add_pages take a want_memblock parameter which
      controls whether the newly added memory should get the sysfs memblock
      user API (e.g. ZONE_DEVICE users do not want/need this interface).
      Some callers even want to control where the memmap is allocated from,
      by configuring an altmap.
      
      Add a more generic hotplug context for arch_add_memory and __add_pages.
      struct mhp_restrictions contains flags which carry additional features
      to be enabled by the memory hotplug (currently only MHP_MEMBLOCK_API)
      and an altmap for the alternative memmap allocator.
      
      This patch shouldn't introduce any functional change.
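      
      A sketch of the new context, with field names taken from the
      changelog above (the exact definition lives in the memory hotplug
      headers):
      
        struct mhp_restrictions {
                unsigned long flags;        /* MHP_ flags, e.g. MHP_MEMBLOCK_API */
                struct vmem_altmap *altmap; /* alternative memmap allocator, or NULL */
        };
      
      arch_add_memory() and __add_pages() then take a pointer to this
      structure instead of the bare want_memblock boolean.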
      
      [akpm@linux-foundation.org: build fix]
      Link: http://lkml.kernel.org/r/20190408082633.2864-3-osalvador@suse.de
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: use correct mmu_notifier events for each invalidation · 7269f999
      Jérôme Glisse authored
      This updates each existing invalidation to use the correct mmu notifier
      event that represents what is happening to the CPU page table.  See the
      patch which introduced the events for the rationale behind this.
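      
      For example (illustrative only, one of the converted call sites), a
      change-protection invalidation now passes the matching event instead
      of the generic MMU_NOTIFY_UNMAP default:
      
        mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0, vma,
                                vma->vm_mm, start, end);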
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: contextual information for event triggering invalidation · 6f4f13e8
      Jérôme Glisse authored
      CPU page table updates can happen for many reasons, not only as a result
      of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also
      as a result of kernel activities (memory compaction, reclaim, migration,
      ...).
      
      Users of the mmu notifier API track changes to the CPU page table and
      take specific actions for them.  However, the current API only provides
      the range of virtual addresses affected by the change, not why the
      change is happening.
      
      This patchset does the initial mechanical conversion of all the places
      that call mmu_notifier_range_init to also provide the default
      MMU_NOTIFY_UNMAP event, as well as the vma if it is known (most
      invalidations happen against a given vma).  Passing down the vma allows
      the users of the mmu notifier to inspect the new vma page protection.
      
      MMU_NOTIFY_UNMAP is always the safe default, as users of the mmu
      notifier should assume that everything in the range is going away when
      that event happens.  A later patch converts the mm call paths to use
      more appropriate events for each call.
      
      This is done as 2 patches so that no call site is forgotten, especially
      as it uses the following coccinelle patch:
      
      %<----------------------------------------------------------------------
      @@
      identifier I1, I2, I3, I4;
      @@
      static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
      +enum mmu_notifier_event event,
      +unsigned flags,
      +struct vm_area_struct *vma,
      struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
      
      @@
      @@
      -#define mmu_notifier_range_init(range, mm, start, end)
      +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
      
      @@
      expression E1, E3, E4;
      identifier I1;
      @@
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, I1,
      I1->vm_mm, E3, E4)
      ...>
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(..., struct vm_area_struct *VMA, ...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(...) {
      struct vm_area_struct *VMA;
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN;
      @@
      FN(...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, NULL,
      E2, E3, E4)
      ...> }
      ---------------------------------------------------------------------->%
      
      Applied with:
      spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
      spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
      spatch --sp-file mmu-notifier.spatch --dir mm --in-place
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: change GUP fast to use flags rather than a write 'bool' · 73b0140b
      Ira Weiny authored
      To facilitate additional options to get_user_pages_fast(), change the
      singular write parameter to be gup_flags.
      
      This patch does not change any functionality.  New functionality will
      follow in subsequent patches.
      
      Some of the get_user_pages_fast() call sites were unchanged because they
      already passed FOLL_WRITE or 0 for the write parameter.
      
      NOTE: It was suggested to change the ordering of the get_user_pages_fast()
      arguments to ensure that callers were converted.  This breaks the current
      GUP call site convention of having the returned pages be the final
      parameter.  So the suggestion was rejected.
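      
      A sketch of a typical call-site conversion (illustrative caller, not
      a specific file):
      
        /* before: ret = get_user_pages_fast(start, nr_pages, 1, pages); */
        ret = get_user_pages_fast(start, nr_pages, FOLL_WRITE, pages);
        /* read-only callers keep passing 0, which is why some call sites
         * did not need to change */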
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
      
      
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Mike Marshall <hubcap@omnibond.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd/sysctl: add vm.unprivileged_userfaultfd · cefdca0a
      Peter Xu authored
      Userfaultfd can be misused to make it easier to exploit existing
      use-after-free (and similar) bugs that might otherwise only make a
      short window or race condition available.  By using userfaultfd to
      stall a kernel thread, a malicious program can keep some state that it
      wrote, stable for an extended period, which it can then access using an
      existing exploit.  While it doesn't cause the exploit itself, and while
      it's not the only thing that can stall a kernel thread when accessing a
      memory location, it's one of the few that never needs privilege.
      
      We can add a flag, allowing userfaultfd to be restricted, so that in
      general it won't be usable by arbitrary user programs, but in
      environments that require userfaultfd it can be turned back on.
      
      Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
      whether userfaultfd is allowed by unprivileged users.  When this is
      set to zero, only privileged users (root user, or users with the
      CAP_SYS_PTRACE capability) will be able to use the userfaultfd
      syscalls.
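      
      A hedged sketch of the gate at userfaultfd syscall entry in
      fs/userfaultfd.c (the variable name is assumed from the sysctl knob):
      
        if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
                return -EPERM;
      
      With the knob set to 1 the check passes for everyone, preserving the
      existing behaviour.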
      
      Andrea said:
      
      : The only difference between the bpf sysctl and the userfaultfd sysctl
      : this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
      : requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
      : because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
      : already if it's doing other kind of tracking on processes runtime, in
      : addition of userfaultfd.  In other words both syscalls works only for
      : root, when the two sysctl are opt-in set to 1.
      
      [dgilbert@redhat.com: changelog additions]
      [akpm@linux-foundation.org: documentation tweak, per Mike]
      Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
      
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/sys.c: prctl: fix false positive in validate_prctl_map() · a9e73998
      Cyrill Gorcunov authored
      While validating a new map we require @start_data to be strictly less
      than @end_data, which is fine for regular applications (this is why this
      nit didn't trigger for that long).  These members are set by executable
      loaders such as the ELF handlers, but it is perfectly valid to have a
      loadable data section with zero size in the file; in that case
      start_data is equal to end_data once the kernel loader finishes.
      
      As a result, when we try to restore such programs the procedure fails
      and the kernel returns -EINVAL.  From the image dump of such a program:
      
       | "mm_start_code": "0x400000",
       | "mm_end_code": "0x8f5fb4",
       | "mm_start_data": "0xf1bfb0",
       | "mm_end_data": "0xf1bfb0",
      
      Thus we need to change validate_prctl_map() to use a less-or-equal
      comparison instead of strictly-less.
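      
      A hedged sketch of the relaxed validation in validate_prctl_map()
      (the real code in kernel/sys.c wraps the comparisons in a helper
      macro; the whole fix is the operator change for the data segment):
      
        /* was: reject when start_data >= end_data */
        if (prctl_map->start_data > prctl_map->end_data)
                return -EINVAL;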
      
      Link: http://lkml.kernel.org/r/20190408143554.GY1421@uranus.lan
      
      
      Fixes: f606b77f ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Andrey Vagin <avagin@gmail.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kdb: Fix bound check compiler warning · ca976bfb
      Wenlin Kang authored and Daniel Thompson committed
      
      The strncpy() function may leave the destination string buffer
      unterminated; better to use strscpy() instead.
      
      This fixes the following warning with gcc 8.2:
      
      kernel/debug/kdb/kdb_io.c: In function 'kdb_getstr':
      kernel/debug/kdb/kdb_io.c:449:3: warning: 'strncpy' specified bound 256 equals destination size [-Wstringop-truncation]
         strncpy(kdb_prompt_str, prompt, CMD_BUFLEN);
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
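      
      The corresponding replacement is a one-liner; strscpy() takes the same
      (dest, src, size) arguments but guarantees NUL-termination, which also
      silences the truncation warning:
      
         strscpy(kdb_prompt_str, prompt, CMD_BUFLEN);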
      
      Signed-off-by: Wenlin Kang <wenlin.kang@windriver.com>
      Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
  4. May 13, 2019
    • bpf: fix undefined behavior in narrow load handling · e2f7fc0a
      Krzesimir Nowak authored
      
      Commit 31fd8581 ("bpf: permits narrower load from bpf program
      context fields") made the verifier add AND instructions to clear the
      unwanted bits with a mask when doing a narrow load. The mask is
      computed with
      
        (1 << size * 8) - 1
      
      where "size" is the size of the narrow load. When doing a 4 byte load
      of a an 8 byte field the verifier shifts the literal 1 by 32 places to
      the left. This results in an overflow of a signed integer, which is an
      undefined behavior. Typically, the computed mask was zero, so the
      result of the narrow load ended up being zero too.
      
      Cast the literal to long long to avoid the overflow. Note that a narrow
      load of a 4 byte field does not have the undefined behavior, because the
      load size can only be either 1 or 2 bytes, so shifting 1 by 8 or 16
      places cannot overflow. And reading 4 bytes would not be a narrow load
      of a 4 byte field.
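      
      A standalone illustration of the mask computation (userspace toy, not
      the verifier code): for a 4 byte narrow load the 32-place shift
      overflows a signed int, while the widened form yields the intended
      0xffffffff mask.
      
        #include <stdio.h>
        
        int main(void)
        {
                int size = 4;  /* narrow load size in bytes */
        
                /* unsigned long long bad = (1 << size * 8) - 1;  UB: 1 << 32 */
                unsigned long long ok = (1ULL << size * 8) - 1;   /* 0xffffffff */
        
                printf("mask = 0x%llx\n", ok);
                return 0;
        }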
      
      Fixes: 31fd8581 ("bpf: permits narrower load from bpf program context fields")
      Reviewed-by: Alban Crequy <alban@kinvolk.io>
      Reviewed-by: Iago López Galeiras <iago@kinvolk.io>
      Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
      Cc: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  5. May 12, 2019
  6. May 11, 2019
    • bpf: fix out of bounds backwards jmps due to dead code removal · af959b18
      Daniel Borkmann authored
      
      systemtap folks reported the following splat recently:
      
        [ 7790.862212] WARNING: CPU: 3 PID: 26759 at arch/x86/kernel/kprobes/core.c:1022 kprobe_fault_handler+0xec/0xf0
        [...]
        [ 7790.864113] CPU: 3 PID: 26759 Comm: sshd Not tainted 5.1.0-0.rc7.git1.1.fc31.x86_64 #1
        [ 7790.864198] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS[...]
        [ 7790.864314] RIP: 0010:kprobe_fault_handler+0xec/0xf0
        [ 7790.864375] Code: 48 8b 50 [...]
        [ 7790.864714] RSP: 0018:ffffc06800bdbb48 EFLAGS: 00010082
        [ 7790.864812] RAX: ffff9e2b75a16320 RBX: 0000000000000000 RCX: 0000000000000000
        [ 7790.865306] RDX: ffffffffffffffff RSI: 000000000000000e RDI: ffffc06800bdbbf8
        [ 7790.865514] RBP: ffffc06800bdbbf8 R08: 0000000000000000 R09: 0000000000000000
        [ 7790.865960] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc06800bdbbf8
        [ 7790.866037] R13: ffff9e2ab56a0418 R14: ffff9e2b6d0bb400 R15: ffff9e2b6d268000
        [ 7790.866114] FS:  00007fde49937d80(0000) GS:ffff9e2b75a00000(0000) knlGS:0000000000000000
        [ 7790.866193] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 7790.866318] CR2: 0000000000000000 CR3: 000000012f312000 CR4: 00000000000006e0
        [ 7790.866419] Call Trace:
        [ 7790.866677]  do_user_addr_fault+0x64/0x480
        [ 7790.867513]  do_page_fault+0x33/0x210
        [ 7790.868002]  async_page_fault+0x1e/0x30
        [ 7790.868071] RIP: 0010:          (null)
        [ 7790.868144] Code: Bad RIP value.
        [ 7790.868229] RSP: 0018:ffffc06800bdbca8 EFLAGS: 00010282
        [ 7790.868362] RAX: ffff9e2b598b60f8 RBX: ffffc06800bdbe48 RCX: 0000000000000004
        [ 7790.868629] RDX: 0000000000000004 RSI: ffffc06800bdbc6c RDI: ffff9e2b598b60f0
        [ 7790.868834] RBP: ffffc06800bdbcf8 R08: 0000000000000000 R09: 0000000000000004
        [ 7790.870432] R10: 00000000ff6f7a03 R11: 0000000000000000 R12: 0000000000000001
        [ 7790.871859] R13: ffffc06800bdbcb8 R14: 0000000000000000 R15: ffff9e2acd0a5310
        [ 7790.873455]  ? vfs_read+0x5/0x170
        [ 7790.874639]  ? vfs_read+0x1/0x170
        [ 7790.875834]  ? trace_call_bpf+0xf6/0x260
        [ 7790.877044]  ? vfs_read+0x1/0x170
        [ 7790.878208]  ? vfs_read+0x5/0x170
        [ 7790.879345]  ? kprobe_perf_func+0x233/0x260
        [ 7790.880503]  ? vfs_read+0x1/0x170
        [ 7790.881632]  ? vfs_read+0x5/0x170
        [ 7790.882751]  ? kprobe_ftrace_handler+0x92/0xf0
        [ 7790.883926]  ? __vfs_read+0x30/0x30
        [ 7790.885050]  ? ftrace_ops_assist_func+0x94/0x100
        [ 7790.886183]  ? vfs_read+0x1/0x170
        [ 7790.887283]  ? vfs_read+0x5/0x170
        [ 7790.888348]  ? ksys_read+0x5a/0xe0
        [ 7790.889389]  ? do_syscall_64+0x5c/0xa0
        [ 7790.890401]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      After some debugging, it turns out that the logic in 2cbd95a5
      ("bpf: change parameters of call/branch offset adjustment") has
      a bug that is exposed after 52875a04 ("bpf: verifier: remove
      dead code"): we miss some of the jump offset adjustments after
      code patching when we remove dead code, more concretely upon a
      backward jump spanning over the area that is being removed.
      
      BPF insns of a case that was hit pre 52875a04:
      
        [...]
        676: (85) call bpf_perf_event_output#-47616
        677: (05) goto pc-636
        678: (62) *(u32 *)(r10 -64) = 0
        679: (bf) r7 = r10
        680: (07) r7 += -64
        681: (05) goto pc-44
        682: (05) goto pc-1
        683: (05) goto pc-1
      
      BPF insns afterwards:
      
        [...]
        618: (85) call bpf_perf_event_output#-47616
        619: (05) goto pc-638
        620: (62) *(u32 *)(r10 -64) = 0
        621: (bf) r7 = r10
        622: (07) r7 += -64
        623: (05) goto pc-44
      
      To illustrate the bug, the situation looks as follows:
           ____
        0 |    | <-- foo: [...]
        1 |____|
        2 |____| <-- pos / end_new  ^
        3 |    |                    |
        4 |    |                    |  len
        5 |____|                    |  (remove region)
        6 |    | <-- end_old        v
        7 |    |
        8 |    | <-- curr  (jmp foo)
        9 |____|
      
      The condition curr >= end_new && curr + off + 1 < end_new in the
      branch delta adjustments is never hit, because curr + off + 1 <
      end_new is compared as unsigned: the insns are memmove()'d before
      the offset adjustments, so curr + off + 1 becomes negative and, in
      the unsigned realm, ends up greater than end_new.
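      
      A standalone toy that mimics the comparison with the numbers from
      the diagram above (not the verifier code itself): the jmp insn has
      already moved from index 8 to index 4 by the memmove(), its offset
      still targets foo, and the mixed signed/unsigned arithmetic makes
      the adjustment condition fail.
      
        #include <stdio.h>
        
        int main(void)
        {
                unsigned int curr = 4, end_new = 2; /* jmp index after memmove() */
                int off = -9;                       /* stale offset of the backward jmp */
        
                /* off is converted to unsigned, so the negative sum wraps
                 * around and the adjustment condition never fires */
                if (curr >= end_new && curr + off + 1 < end_new)
                        printf("offset adjusted\n");
                else
                        printf("skipped, curr + off + 1 = %u\n", curr + off + 1);
                return 0;
        }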
      
      Correct BPF insns after this fix:
      
        [...]
        618: (85) call bpf_perf_event_output#-47216
        619: (05) goto pc-578
        620: (62) *(u32 *)(r10 -64) = 0
        621: (bf) r7 = r10
        622: (07) r7 += -64
        623: (05) goto pc-44
      
      Note that the unprivileged case is not affected by this.
      
      Fixes: 52875a04 ("bpf: verifier: remove dead code")
      Fixes: 2cbd95a5 ("bpf: change parameters of call/branch offset adjustment")
      Reported-by: Frank Ch. Eigler <fche@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  7. May 10, 2019