  1. May 15, 2019
  2. May 14, 2019
    • hugetlbfs: always use address space in inode for resv_map pointer · f27a5136
      Mike Kravetz authored
      Continuing discussion about 58b6e5e8 ("hugetlbfs: fix memory leak for
      resv_map") brought up the issue that inode->i_mapping may not point to the
      address space embedded within the inode at inode eviction time.  The
      hugetlbfs truncate routine handles this by explicitly using inode->i_data.
      However, code cleaning up the resv_map will still use the address space
      pointed to by inode->i_mapping.  Luckily, private_data is NULL for address
      spaces in all such cases today, but there is no guarantee this will
      continue.
      
      Change all hugetlbfs code getting a resv_map pointer to explicitly get it
      from the address space embedded within the inode.  In addition, add more
      comments in the code to indicate why this is being done.
      
      Link: http://lkml.kernel.org/r/20190419204435.16984-1-mike.kravetz@oracle.com
      
      
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Yufen Yu <yuyufen@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback · c553ea4f
      Amir Goldstein authored
      23d01270 ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
      writeback") claims that sync_file_range(2) syscall was "created for
      userspace to be able to issue background writeout and so waiting for
      in-flight IO is undesirable there" and changes the writeback (back) to
      WB_SYNC_NONE.
      
      This claim is only partially true.  It is true for users that use the flag
      SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was the
      reason for changing to WB_SYNC_NONE writeback.
      
      However, that claim is not true for users that use the flag combination
      SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|WAIT_AFTER}.  Those users explicitly
      requested to wait for in-flight IO as well as to write back dirty pages.
      
      Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
      WB_SYNC_ALL writeback to perform the full range sync request.
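From the userspace side, the two usage patterns described above can be sketched roughly as follows. This is a minimal illustration, not code from the patch: the helper names `start_writeout` and `write_and_wait` are hypothetical, and the fallback definition of SYNC_FILE_RANGE_WRITE_AND_WAIT only assumes what the message states, namely that the new name is the combination of the three existing flags.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/* Older userspace headers may not provide the combined name.  Per the
 * commit message it is simply the three existing flags together. */
#ifndef SYNC_FILE_RANGE_WRITE_AND_WAIT
#define SYNC_FILE_RANGE_WRITE_AND_WAIT \
	(SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | \
	 SYNC_FILE_RANGE_WAIT_AFTER)
#endif

/* Hypothetical helper: only start background writeout of the range
 * (the SYNC_FILE_RANGE_WRITE-by-itself case from the message). */
int start_writeout(int fd, off64_t offset, off64_t nbytes)
{
	assert(offset >= 0 && nbytes >= 0);
	return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/* Hypothetical helper: write the range back and wait for in-flight IO
 * before and after (the full range sync request case). */
int write_and_wait(int fd, off64_t offset, off64_t nbytes)
{
	assert(offset >= 0 && nbytes >= 0);
	return sync_file_range(fd, offset, nbytes,
			       SYNC_FILE_RANGE_WRITE_AND_WAIT);
}
```

A caller that only wants to hint at writeout would use the first helper; one that needs the data in the range written and the IO finished before continuing would use the second.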
      
      Link: http://lkml.kernel.org/r/20190409114922.30095-1-amir73il@gmail.com
      Link: http://lkml.kernel.org/r/20190419072938.31320-1-amir73il@gmail.com
      
      
      Fixes: 23d01270 ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Acked-by: Jan Kara <jack@suse.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: use correct mmu_notifier events for each invalidation · 7269f999
      Jérôme Glisse authored
      This updates each existing invalidation to use the correct mmu notifier
      event that represents what is happening to the CPU page table.  See the
      patch which introduced the events to see the rationale behind this.
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: contextual information for event triggering invalidation · 6f4f13e8
      Jérôme Glisse authored
      CPU page table updates can happen for many reasons, not only as a result
      of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
      a result of kernel activities (memory compression, reclaim, migration,
      ...).

      Users of the mmu notifier API track changes to the CPU page table and take
      specific actions for them.  However, the current API only provides the
      range of virtual addresses affected by the change, not why the change is
      happening.

      This patchset does the initial mechanical conversion of all the places that
      call mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
      event as well as the vma if it is known (most invalidations happen against
      a given vma).  Passing down the vma allows the users of mmu notifier to
      inspect the new vma page protection.

      MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
      should assume that everything for the range is going away when that event
      happens.  A later patch converts the mm call paths to use a more
      appropriate event for each call.
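To make the shape of the converted call concrete, here is a small userspace model of the extended initializer. It is only a sketch under stated assumptions: the struct definitions are stand-ins reduced to the fields used here, the enum lists only the default event named in the message, and `listener_must_drop_everything` is a hypothetical listener-side helper rather than anything from the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types (assumption: only the
 * fields needed by this sketch are modelled). */
struct mm_struct { int unused; };
struct vm_area_struct { struct mm_struct *vm_mm; };

/* Only the default event from the message is modelled here. */
enum mmu_notifier_event { MMU_NOTIFY_UNMAP = 0 };

struct mmu_notifier_range {
	struct vm_area_struct *vma;
	struct mm_struct *mm;
	unsigned long start;
	unsigned long end;
	unsigned flags;
	enum mmu_notifier_event event;
};

/* Same parameter order as the converted macro in the coccinelle patch:
 * (range, event, flags, vma, mm, start, end). */
void mmu_notifier_range_init(struct mmu_notifier_range *range,
			     enum mmu_notifier_event event,
			     unsigned flags,
			     struct vm_area_struct *vma,
			     struct mm_struct *mm,
			     unsigned long start, unsigned long end)
{
	assert(range != NULL && start <= end);
	range->vma = vma;
	range->event = event;
	range->mm = mm;
	range->start = start;
	range->end = end;
	range->flags = flags;
}

/* Hypothetical listener helper: with the safe default event, assume
 * everything in [start, end) is going away. */
int listener_must_drop_everything(const struct mmu_notifier_range *range)
{
	return range->event == MMU_NOTIFY_UNMAP;
}
```

A call site that has a vma at hand would pass it along with `vma->vm_mm`; one without a vma would pass NULL, matching the last rule of the semantic patch.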
      
      This is done as 2 patches so that no call site is forgotten, especially
      as it uses the following coccinelle patch:
      
      %<----------------------------------------------------------------------
      @@
      identifier I1, I2, I3, I4;
      @@
      static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
      +enum mmu_notifier_event event,
      +unsigned flags,
      +struct vm_area_struct *vma,
      struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
      
      @@
      @@
      -#define mmu_notifier_range_init(range, mm, start, end)
      +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
      
      @@
      expression E1, E3, E4;
      identifier I1;
      @@
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, I1,
      I1->vm_mm, E3, E4)
      ...>
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(..., struct vm_area_struct *VMA, ...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(...) {
      struct vm_area_struct *VMA;
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN;
      @@
      FN(...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, NULL,
      E2, E3, E4)
      ...> }
      ---------------------------------------------------------------------->%
      
      Applied with:
      spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
      spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
      spatch --sp-file mmu-notifier.spatch --dir mm --in-place
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: use same fault hash key for shared and private mappings · 1b426bac
      Mike Kravetz authored
      hugetlb uses a fault mutex hash table to prevent concurrent page faults
      on the same pages.  The key for shared and private mappings is
      different.  Shared keys off address_space and file index.  Private keys
      off mm and virtual address.  Consider a private mapping of a populated
      hugetlbfs file.  A fault will map the page from the file and if needed
      do a COW to map a writable page.
      
      Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
      pages.  It uses the address_space file index key.  However, private
      mappings will use a different key and could race with this code to map
      the file page.  This causes problems (BUG) for the page cache remove
      code as it expects the page to be unmapped.  A sample stack is:
      
      page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
      kernel BUG at mm/filemap.c:169!
      ...
      RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
      ...
      Call Trace:
      __delete_from_page_cache+0x39/0x220
      delete_from_page_cache+0x45/0x70
      remove_inode_hugepages+0x13c/0x380
      ? __add_to_page_cache_locked+0x162/0x380
      hugetlbfs_fallocate+0x403/0x540
      ? _cond_resched+0x15/0x30
      ? __inode_security_revalidate+0x5d/0x70
      ? selinux_file_permission+0x100/0x130
      vfs_fallocate+0x13f/0x270
      ksys_fallocate+0x3c/0x80
      __x64_sys_fallocate+0x1a/0x20
      do_syscall_64+0x5b/0x180
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      There seems to be another potential COW issue/race with this approach
      of different private and shared keys as noted in commit 8382d914
      ("mm, hugetlb: improve page-fault scalability").
      
      Since every hugetlb mapping (even anon and private) is actually a file
      mapping, just use the address_space index key for all mappings.  This
      results in potentially more hash collisions.  However, this should not
      be the common case.
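The effect of the key change can be illustrated with a toy userspace model. Everything here is an assumption made for illustration: `mix` is a stand-in hash and not the kernel's, `NUM_FAULT_MUTEXES` is an arbitrary table size, and the two functions only mirror the key choice described in the message (shared: address_space and index; private: mm and virtual address; new: address_space and index for all).

```c
#include <assert.h>
#include <stdint.h>

#define NUM_FAULT_MUTEXES 64u	/* hypothetical table size, power of two */

/* Toy mixing function standing in for the kernel's hash. */
static uint32_t mix(uint64_t a, uint64_t b)
{
	uint64_t h = a * 0x9E3779B97F4A7C15ull;

	h ^= b + 0x9E3779B97F4A7C15ull + (h << 6) + (h >> 2);
	return (uint32_t)(h ^ (h >> 32));
}

/* Old scheme: the key depends on whether the mapping is shared. */
uint32_t old_fault_hash(int shared, const void *mapping, unsigned long idx,
			const void *mm, unsigned long address)
{
	if (shared)
		return mix((uintptr_t)mapping, idx) & (NUM_FAULT_MUTEXES - 1);
	return mix((uintptr_t)mm, address) & (NUM_FAULT_MUTEXES - 1);
}

/* New scheme: every mapping keys off address_space and file index, so a
 * private fault and a hole punch on the same file page pick the same
 * mutex slot. */
uint32_t new_fault_hash(const void *mapping, unsigned long idx)
{
	assert(mapping != 0);
	return mix((uintptr_t)mapping, idx) & (NUM_FAULT_MUTEXES - 1);
}
```

With the old scheme a private fault could land on a different slot than the hole punch code for the same file page, which is the race described above. With the new scheme both always compute the same slot, at the cost of unrelated private mappings of the same file index now sharing a slot.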
      
      Link: http://lkml.kernel.org/r/20190328234704.27083-3-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5
      
      
      Fixes: b5cec28d ("hugetlbfs: truncate_hugepages() takes a range of pages")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_mkclean vs MADV_DONTNEED race · 024eee0e
      Aneesh Kumar K.V authored
      MADV_DONTNEED is handled with mmap_sem taken in read mode.  We call
      page_mkclean without holding mmap_sem.
      
      MADV_DONTNEED implies that pages in the region are unmapped and subsequent
      access to the pages in that range is handled as a new page fault.  This
      implies that if we don't have parallel access to the region when
      MADV_DONTNEED is run we expect that range to be unallocated.

      With respect to page_mkclean() we need to make sure that we don't break
      the MADV_DONTNEED semantics.  MADV_DONTNEED checks for pmd_none without
      holding pmd_lock.  This implies we skip the pmd if we temporarily mark
      pmd none.  Avoid doing that while marking the page clean.

      Keep the sequence the same for dax too, even though we don't support
      MADV_DONTNEED for dax mappings.

      The bug was noticed by code review and I didn't observe any failures in
      test runs.  This is similar to
      
      commit 58ceeb6b
      Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Date:   Thu Apr 13 14:56:26 2017 -0700
      
          thp: fix MADV_DONTNEED vs. MADV_FREE race
      
      commit ced10803
      Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Date:   Thu Apr 13 14:56:20 2017 -0700
      
          thp: fix MADV_DONTNEED vs. numa balancing race
      
      Link: http://lkml.kernel.org/r/20190321040610.14226-1-aneesh.kumar@linux.ibm.com
      
      
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: change GUP fast to use flags rather than a write 'bool' · 73b0140b
      Ira Weiny authored
      To facilitate additional options to get_user_pages_fast() change the
      singular write parameter to be gup_flags.
      
      This patch does not change any functionality.  New functionality will
      follow in subsequent patches.
      
      Some of the get_user_pages_fast() call sites were unchanged because they
      already passed FOLL_WRITE or 0 for the write parameter.
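The conversion at a call site amounts to mapping the old boolean onto a flag word. The following userspace sketch shows that mapping only; `gup_flags_from_write` and `gup_wants_write` are hypothetical helpers, and the FOLL_WRITE value is defined locally for the example rather than taken from kernel headers.

```c
#include <assert.h>

/* Local definition for this sketch; in the kernel the flag comes from
 * the mm headers. */
#define FOLL_WRITE 0x01u

/* Hypothetical helper: what a converted call site passes instead of the
 * old 'write' argument.  Any non-zero 'write' becomes FOLL_WRITE. */
unsigned int gup_flags_from_write(int write)
{
	return write ? FOLL_WRITE : 0;
}

/* Hypothetical helper: how the callee recovers the old meaning.  This
 * sketch understands no flag other than FOLL_WRITE. */
int gup_wants_write(unsigned int gup_flags)
{
	assert((gup_flags & ~FOLL_WRITE) == 0);
	return (gup_flags & FOLL_WRITE) != 0;
}
```

This also shows why call sites that already passed FOLL_WRITE or 0 needed no change: those values mean the same thing under both interpretations.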
      
      NOTE: It was suggested to change the ordering of the get_user_pages_fast()
      arguments to ensure that callers were converted.  This breaks the current
      GUP call site convention of having the returned pages be the final
      parameter.  So the suggestion was rejected.
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
      
      
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Mike Marshall <hubcap@omnibond.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM · 932f4a63
      Ira Weiny authored
      Patch series "Add FOLL_LONGTERM to GUP fast and use it".

      HFI1, qib, and mthca use get_user_pages_fast() due to its performance
      advantages.  These pages can be held for a significant time.  But
      get_user_pages_fast() does not protect against mapping FS DAX pages.
      
      Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
      retains the performance while also adding the FS DAX checks.  XDP has also
      shown interest in using this functionality.[1]
      
      In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
      and remove the specialized get_user_pages_longterm call.
      
      [1] https://lkml.org/lkml/2019/3/19/939
      
      This patch (of 7):
      
      This patch starts a series which aims to support FOLL_LONGTERM in
      get_user_pages_fast().  Some callers would like to do a longterm (user
      controlled) pin of pages with the fast variant of GUP for performance
      purposes.
      
      Rather than have a separate get_user_pages_longterm() call, introduce
      FOLL_LONGTERM and change the longterm callers to use it.
      
      This patch does not change any functionality.  In the short term
      "longterm" or user controlled pins are unsafe for Filesystems and FS DAX
      in particular has been blocked.  However, callers of get_user_pages_fast()
      were not "protected".
      
      FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
      requires vmas to determine if DAX is in use.
      
      NOTE: In merging with the CMA changes we opt to change the
      get_user_pages() call in check_and_migrate_cma_pages() to a call of
      __get_user_pages_locked() on the newly migrated pages.  This makes the
      code read better in that we are calling __get_user_pages_locked() on the
      pages before and after a potential migration.
      
      As a side effect some of the interfaces are cleaned up but this is not the
      primary purpose of the series.
      
      In review[1] it was asked:
      
      <quote>
      > This I don't get - if you do lock down long term mappings performance
      > of the actual get_user_pages call shouldn't matter to start with.
      >
      > What do I miss?
      
      A couple of points.
      
      First "longterm" is a relative thing and at this point is probably a
      misnomer.  This is really flagging a pin which is going to be given to
      hardware and can't move.  I've thought of a couple of alternative names
      but I think we have to settle on if we are going to use FL_LAYOUT or
      something else to solve the "longterm" problem.  Then I think we can
      change the flag to a better name.
      
      Second, it depends on how often you are registering memory.  I have spoken
      with some RDMA users who consider MR in the performance path...  For the
      overall application performance.  I don't have the numbers as the tests
      for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.

      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they are reworking the use of mmap_sem) and as soon as
      they are accepted I'll submit a patch to convert the RDMA core as well.
      Also to this point others are looking to use *_fast.

      As an aside, Jason pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      </quote>
      
      [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
      
      [ira.weiny@intel.com: v3]
        Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
      
      
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd/sysctl: add vm.unprivileged_userfaultfd · cefdca0a
      Peter Xu authored
      Userfaultfd can be misused to make it easier to exploit existing
      use-after-free (and similar) bugs that might otherwise only make a
      short window or race condition available.  By using userfaultfd to
      stall a kernel thread, a malicious program can keep some state that it
      wrote, stable for an extended period, which it can then access using an
      existing exploit.  While it doesn't cause the exploit itself, and while
      it's not the only thing that can stall a kernel thread when accessing a
      memory location, it's one of the few that never needs privilege.
      
      We can add a flag, allowing userfaultfd to be restricted, so that in
      general it won't be useable by arbitrary user programs, but in
      environments that require userfaultfd it can be turned back on.
      
      Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
      whether userfaultfd is allowed by unprivileged users.  When this is
      set to zero, only privileged users (root user, or users with the
      CAP_SYS_PTRACE capability) will be able to use the userfaultfd
      syscalls.
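The rule stated above can be written down as a small decision function. This is a userspace model for illustration only, not the kernel code: `userfaultfd_permission` is a hypothetical name, the sysctl value and the capability check are passed in as plain arguments, and returning -EPERM for the denied case is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Model of the policy: when vm.unprivileged_userfaultfd is 0, only
 * callers holding CAP_SYS_PTRACE (which includes root) may proceed. */
int userfaultfd_permission(int sysctl_unprivileged_userfaultfd,
			   bool has_cap_sys_ptrace)
{
	/* This sketch only models the two values discussed in the message. */
	assert(sysctl_unprivileged_userfaultfd == 0 ||
	       sysctl_unprivileged_userfaultfd == 1);

	if (!sysctl_unprivileged_userfaultfd && !has_cap_sys_ptrace)
		return -EPERM;	/* assumed error code for the denied case */
	return 0;
}
```

Setting the knob to 1 restores the previous behaviour for every user, while 0 leaves userfaultfd available to monitors that already hold CAP_SYS_PTRACE.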
      
      Andrea said:
      
      : The only difference between the bpf sysctl and the userfaultfd sysctl
      : this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
      : requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
      : because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
      : already if it's doing other kind of tracking on processes runtime, in
      : addition of userfaultfd.  In other words both syscalls works only for
      : root, when the two sysctl are opt-in set to 1.
      
      [dgilbert@redhat.com: changelog additions]
      [akpm@linux-foundation.org: documentation tweak, per Mike]
      Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
      
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix ocfs2 read inode data panic in ocfs2_iget · e091eab0
      Shuning Zhang authored
      In some cases, ocfs2_iget() reads the data of an inode which has been
      deleted for some reason.  That will make the system panic.  So we should
      check whether this inode has been deleted, and tell the caller that the
      inode is a bad inode.

      For example, ocfs2 is used as the backend of nfs, and the client is
      nfsv3.  This issue can be reproduced by the following steps.
      
      on the nfs server side,
      ..../patha/pathb
      
      Step 1: The process A was scheduled before calling the function fh_verify.
      
      Step 2: Process B is removing 'pathb', and has just completed the call
      to function dput.  The dentry of 'pathb' has now been deleted from the
      dcache, and all ancestors have been deleted as well.  The link between
      dentry and inode was removed through the function hlist_del_init.  The
      following is the call stack.
      dentry_iput->hlist_del_init(&dentry->d_u.d_alias)
      
      At this time, the inode is still in the dcache.
      
      Step 3: Process A calls the function ocfs2_get_dentry, which gets the
      inode from dcache.  The refcount of the inode is then 1.  The following is
      the call stack.
      nfsd3_proc_getacl->fh_verify->exportfs_decode_fh->fh_to_dentry(ocfs2_get_dentry)
      
      Step 4: Dirty pages are flushed by bdi threads.  So the inode of 'patha'
      is evicted, and this directory is deleted.  But the inode of 'pathb'
      can't be evicted, because the refcount of the inode is 1.

      Step 5: Process A keeps running, and calls the function
      reconnect_path (in exportfs_decode_fh), which calls the function
      ocfs2_get_parent of ocfs2.  It gets the block number of the parent
      directory (patha) by the name of ...  Then it reads the data from disk by
      the block number.  But this inode has been deleted, so the system panics.
      
      Process A                                             Process B
      1. in nfsd3_proc_getacl                   |
      2.                                        |        dput
      3. fh_to_dentry(ocfs2_get_dentry)         |
      4. bdi flush dirty cache                  |
      5. ocfs2_iget                             |
      
      [283465.542049] OCFS2: ERROR (device sdp): ocfs2_validate_inode_block:
      Invalid dinode #580640: OCFS2_VALID_FL not set
      
      [283465.545490] Kernel panic - not syncing: OCFS2: (device sdp): panic forced
      after error
      
      [283465.546889] CPU: 5 PID: 12416 Comm: nfsd Tainted: G        W
      4.1.12-124.18.6.el6uek.bug28762940v3.x86_64 #2
      [283465.548382] Hardware name: VMware, Inc. VMware Virtual Platform/440BX
      Desktop Reference Platform, BIOS 6.00 09/21/2015
      [283465.549657]  0000000000000000 ffff8800a56fb7b8 ffffffff816e839c
      ffffffffa0514758
      [283465.550392]  000000000008dc20 ffff8800a56fb838 ffffffff816e62d3
      0000000000000008
      [283465.551056]  ffff880000000010 ffff8800a56fb848 ffff8800a56fb7e8
      ffff88005df9f000
      [283465.551710] Call Trace:
      [283465.552516]  [<ffffffff816e839c>] dump_stack+0x63/0x81
      [283465.553291]  [<ffffffff816e62d3>] panic+0xcb/0x21b
      [283465.554037]  [<ffffffffa04e66b0>] ocfs2_handle_error+0xf0/0xf0 [ocfs2]
      [283465.554882]  [<ffffffffa04e7737>] __ocfs2_error+0x67/0x70 [ocfs2]
      [283465.555768]  [<ffffffffa049c0f9>] ocfs2_validate_inode_block+0x229/0x230
      [ocfs2]
      [283465.556683]  [<ffffffffa047bcbc>] ocfs2_read_blocks+0x46c/0x7b0 [ocfs2]
      [283465.557408]  [<ffffffffa049bed0>] ? ocfs2_inode_cache_io_unlock+0x20/0x20
      [ocfs2]
      [283465.557973]  [<ffffffffa049f0eb>] ocfs2_read_inode_block_full+0x3b/0x60
      [ocfs2]
      [283465.558525]  [<ffffffffa049f5ba>] ocfs2_iget+0x4aa/0x880 [ocfs2]
      [283465.559082]  [<ffffffffa049146e>] ocfs2_get_parent+0x9e/0x220 [ocfs2]
      [283465.559622]  [<ffffffff81297c05>] reconnect_path+0xb5/0x300
      [283465.560156]  [<ffffffff81297f46>] exportfs_decode_fh+0xf6/0x2b0
      [283465.560708]  [<ffffffffa062faf0>] ? nfsd_proc_getattr+0xa0/0xa0 [nfsd]
      [283465.561262]  [<ffffffff810a8196>] ? prepare_creds+0x26/0x110
      [283465.561932]  [<ffffffffa0630860>] fh_verify+0x350/0x660 [nfsd]
      [283465.562862]  [<ffffffffa0637804>] ? nfsd_cache_lookup+0x44/0x630 [nfsd]
      [283465.563697]  [<ffffffffa063a8b9>] nfsd3_proc_getattr+0x69/0xf0 [nfsd]
      [283465.564510]  [<ffffffffa062cf60>] nfsd_dispatch+0xe0/0x290 [nfsd]
      [283465.565358]  [<ffffffffa05eb892>] ? svc_tcp_adjust_wspace+0x12/0x30
      [sunrpc]
      [283465.566272]  [<ffffffffa05ea652>] svc_process_common+0x412/0x6a0 [sunrpc]
      [283465.567155]  [<ffffffffa05eaa03>] svc_process+0x123/0x210 [sunrpc]
      [283465.568020]  [<ffffffffa062c90f>] nfsd+0xff/0x170 [nfsd]
      [283465.568962]  [<ffffffffa062c810>] ? nfsd_destroy+0x80/0x80 [nfsd]
      [283465.570112]  [<ffffffff810a622b>] kthread+0xcb/0xf0
      [283465.571099]  [<ffffffff810a6160>] ? kthread_create_on_node+0x180/0x180
      [283465.572114]  [<ffffffff816f11b8>] ret_from_fork+0x58/0x90
      [283465.573156]  [<ffffffff810a6160>] ? kthread_create_on_node+0x180/0x180
      
      Link: http://lkml.kernel.org/r/1554185919-3010-1-git-send-email-sunny.s.zhang@oracle.com
      
      
      Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: piaojun <piaojun@huawei.com>
      Cc: "Gang He" <ghe@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e091eab0
    • ocfs2: use common file type conversion · 9dc2108d
      Phillip Potter authored
      Deduplicate the ocfs2 file type conversion implementation and remove
      OCFS2_FT_* definitions - file systems that use the same file types as
      defined by POSIX do not need to define their own versions and can use the
      common helper functions declared in fs_types.h and implemented in
      fs_types.c.
      
      Common implementation can be found via bbe7449e ("fs: common
      implementation of file type").
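
      For illustration only (not part of the commit): a minimal userspace
      sketch of the kind of table lookup such a common helper performs,
      mapping on-disk file types to dirent types. All EX_* and ex_* names
      are hypothetical stand-ins; the numeric values are assumed to match
      the usual FT_* and DT_* constants.

```c
/* Hypothetical stand-ins for the common on-disk file type codes (FT_*). */
enum {
    EX_FT_UNKNOWN, EX_FT_REG_FILE, EX_FT_DIR, EX_FT_CHRDEV,
    EX_FT_BLKDEV, EX_FT_FIFO, EX_FT_SOCK, EX_FT_SYMLINK, EX_FT_MAX
};

/* Hypothetical stand-ins for the dirent type codes (DT_*). */
enum {
    EX_DT_UNKNOWN = 0, EX_DT_FIFO = 1, EX_DT_CHR = 2, EX_DT_DIR = 4,
    EX_DT_BLK = 6, EX_DT_REG = 8, EX_DT_LNK = 10, EX_DT_SOCK = 12
};

/* One shared table replaces per-filesystem copies of the same mapping. */
static const unsigned char ex_ftype_to_dtype_table[EX_FT_MAX] = {
    [EX_FT_UNKNOWN]  = EX_DT_UNKNOWN,
    [EX_FT_REG_FILE] = EX_DT_REG,
    [EX_FT_DIR]      = EX_DT_DIR,
    [EX_FT_CHRDEV]   = EX_DT_CHR,
    [EX_FT_BLKDEV]   = EX_DT_BLK,
    [EX_FT_FIFO]     = EX_DT_FIFO,
    [EX_FT_SOCK]     = EX_DT_SOCK,
    [EX_FT_SYMLINK]  = EX_DT_LNK,
};

/* Convert an on-disk file type to a dirent type; out-of-range input
 * is reported as unknown rather than indexing past the table. */
static unsigned char ex_ftype_to_dtype(unsigned int filetype)
{
    if (filetype >= EX_FT_MAX)
        return EX_DT_UNKNOWN;
    return ex_ftype_to_dtype_table[filetype];
}
```

      A file system whose on-disk codes already follow this numbering can
      call the shared helper instead of keeping a private table, which is
      the deduplication described above.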
      
      Link: http://lkml.kernel.org/r/20190326213919.GA20878@pathfinder
      
      
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9dc2108d
    • mm/huge_memory: fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addresses · fce86ff5
      Dan Williams authored
      Starting with c6f3c5ee ("mm/huge_memory.c: fix modifying of page
      protection by insert_pfn_pmd()") vmf_insert_pfn_pmd() internally calls
      pmdp_set_access_flags().  That helper enforces a pmd aligned @address
      argument via VM_BUG_ON() assertion.
      
      Update the implementation to take a 'struct vm_fault' argument directly
      and apply the address alignment fixup internally to fix crash signatures
      like:
      
          kernel BUG at arch/x86/mm/pgtable.c:515!
          invalid opcode: 0000 [#1] SMP NOPTI
          CPU: 51 PID: 43713 Comm: java Tainted: G           OE     4.19.35 #1
          [..]
          RIP: 0010:pmdp_set_access_flags+0x48/0x50
          [..]
          Call Trace:
           vmf_insert_pfn_pmd+0x198/0x350
           dax_iomap_fault+0xe82/0x1190
           ext4_dax_huge_fault+0x103/0x1f0
           ? __switch_to_asm+0x40/0x70
           __handle_mm_fault+0x3f6/0x1370
           ? __switch_to_asm+0x34/0x70
           ? __switch_to_asm+0x40/0x70
           handle_mm_fault+0xda/0x200
           __do_page_fault+0x249/0x4f0
           do_page_fault+0x32/0x110
           ? page_fault+0x8/0x30
           page_fault+0x1e/0x30
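
      For illustration only (not part of the commit): the alignment fixup
      amounts to rounding the faulting address down to a PMD boundary
      before it reaches a helper that asserts alignment. A minimal
      userspace sketch, assuming 2 MiB PMD mappings (x86-64 with 4 KiB
      base pages); the EXAMPLE_* and example_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a PMD entry maps 2 MiB, i.e. a shift of 21 bits. */
#define EXAMPLE_PMD_SHIFT 21
#define EXAMPLE_PMD_SIZE  ((uint64_t)1 << EXAMPLE_PMD_SHIFT)
#define EXAMPLE_PMD_MASK  (~(EXAMPLE_PMD_SIZE - 1))

/* Round a faulting address down to the start of its PMD-sized region. */
static inline uint64_t example_pmd_align(uint64_t fault_addr)
{
    uint64_t aligned = fault_addr & EXAMPLE_PMD_MASK;

    /* Stands in for the alignment check a callee would enforce. */
    assert((aligned & (EXAMPLE_PMD_SIZE - 1)) == 0);
    return aligned;
}
```

      Doing the masking inside the insert routine, rather than in each
      caller, means a fault anywhere within the huge page yields the same
      aligned address.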
      
      Link: http://lkml.kernel.org/r/155741946350.372037.11148198430068238140.stgit@dwillia2-desk3.amr.corp.intel.com
      
      
      Fixes: c6f3c5ee ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Piotr Balcer <piotr.balcer@intel.com>
      Tested-by: Yan Ma <yan.ma@intel.com>
      Tested-by: Pankaj Gupta <pagupta@redhat.com>
      Reviewed-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Chandan Rajendra <chandan@linux.ibm.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fce86ff5
  3. May 13, 2019
  4. May 09, 2019