Skip to content
Snippets Groups Projects
  1. May 14, 2019
    • Ira Weiny's avatar
      mm/gup: change GUP fast to use flags rather than a write 'bool' · 73b0140b
      Ira Weiny authored
      To facilitate additional options to get_user_pages_fast() change the
      singular write parameter to be gup_flags.
      
      This patch does not change any functionality.  New functionality will
      follow in subsequent patches.
      
      Some of the get_user_pages_fast() call sites were unchanged because they
      already passed FOLL_WRITE or 0 for the write parameter.
      
      NOTE: It was suggested to change the ordering of the get_user_pages_fast()
      arguments to ensure that callers were converted.  This breaks the current
      GUP call site convention of having the returned pages be the final
      parameter.  So the suggestion was rejected.
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
      
      
      Signed-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarMike Marshall <hubcap@omnibond.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      73b0140b
    • Ira Weiny's avatar
      mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM · 932f4a63
      Ira Weiny authored
      Pach series "Add FOLL_LONGTERM to GUP fast and use it".
      
      HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
      advantages.  These pages can be held for a significant time.  But
      get_user_pages_fast() does not protect against mapping FS DAX pages.
      
      Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
      retains the performance while also adding the FS DAX checks.  XDP has also
      shown interest in using this functionality.[1]
      
      In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
      and remove the specialized get_user_pages_longterm call.
      
      [1] https://lkml.org/lkml/2019/3/19/939
      
      "longterm" is a relative thing and at this point is probably a misnomer.
      This is really flagging a pin which is going to be given to hardware and
      can't move.  I've thought of a couple of alternative names but I think we
      have to settle on if we are going to use FL_LAYOUT or something else to
      solve the "longterm" problem.  Then I think we can change the flag to a
      better name.
      
      Secondly, it depends on how often you are registering memory.  I have
      spoken with some RDMA users who consider MR in the performance path...
      For the overall application performance.  I don't have the numbers as the
      tests for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they reworking the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an aside, Jasons pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      This patch (of 7):
      
      This patch starts a series which aims to support FOLL_LONGTERM in
      get_user_pages_fast().  Some callers who would like to do a longterm (user
      controlled pin) of pages with the fast variant of GUP for performance
      purposes.
      
      Rather than have a separate get_user_pages_longterm() call, introduce
      FOLL_LONGTERM and change the longterm callers to use it.
      
      This patch does not change any functionality.  In the short term
      "longterm" or user controlled pins are unsafe for Filesystems and FS DAX
      in particular has been blocked.  However, callers of get_user_pages_fast()
      were not "protected".
      
      FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
      requires vmas to determine if DAX is in use.
      
      NOTE: In merging with the CMA changes we opt to change the
      get_user_pages() call in check_and_migrate_cma_pages() to a call of
      __get_user_pages_locked() on the newly migrated pages.  This makes the
      code read better in that we are calling __get_user_pages_locked() on the
      pages before and after a potential migration.
      
      As a side affect some of the interfaces are cleaned up but this is not the
      primary purpose of the series.
      
      In review[1] it was asked:
      
      <quote>
      > This I don't get - if you do lock down long term mappings performance
      > of the actual get_user_pages call shouldn't matter to start with.
      >
      > What do I miss?
      
      A couple of points.
      
      First "longterm" is a relative thing and at this point is probably a
      misnomer.  This is really flagging a pin which is going to be given to
      hardware and can't move.  I've thought of a couple of alternative names
      but I think we have to settle on if we are going to use FL_LAYOUT or
      something else to solve the "longterm" problem.  Then I think we can
      change the flag to a better name.
      
      Second, It depends on how often you are registering memory.  I have spoken
      with some RDMA users who consider MR in the performance path...  For the
      overall application performance.  I don't have the numbers as the tests
      for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they reworking the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an asside, Jasons pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      </quote>
      
      [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
      
      [ira.weiny@intel.com: v3]
        Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
      
      
      Signed-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      932f4a63
    • Peter Xu's avatar
      userfaultfd/sysctl: add vm.unprivileged_userfaultfd · cefdca0a
      Peter Xu authored
      Userfaultfd can be misued to make it easier to exploit existing
      use-after-free (and similar) bugs that might otherwise only make a
      short window or race condition available.  By using userfaultfd to
      stall a kernel thread, a malicious program can keep some state that it
      wrote, stable for an extended period, which it can then access using an
      existing exploit.  While it doesn't cause the exploit itself, and while
      it's not the only thing that can stall a kernel thread when accessing a
      memory location, it's one of the few that never needs privilege.
      
      We can add a flag, allowing userfaultfd to be restricted, so that in
      general it won't be useable by arbitrary user programs, but in
      environments that require userfaultfd it can be turned back on.
      
      Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
      whether userfaultfd is allowed by unprivileged users.  When this is
      set to zero, only privileged users (root user, or users with the
      CAP_SYS_PTRACE capability) will be able to use the userfaultfd
      syscalls.
      
      Andrea said:
      
      : The only difference between the bpf sysctl and the userfaultfd sysctl
      : this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
      : requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
      : because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
      : already if it's doing other kind of tracking on processes runtime, in
      : addition of userfaultfd.  In other words both syscalls works only for
      : root, when the two sysctl are opt-in set to 1.
      
      [dgilbert@redhat.com: changelog additions]
      [akpm@linux-foundation.org: documentation tweak, per Mike]
      Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
      
      
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Suggested-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Suggested-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cefdca0a
    • Shuning Zhang's avatar
      ocfs2: fix ocfs2 read inode data panic in ocfs2_iget · e091eab0
      Shuning Zhang authored
      In some cases, ocfs2_iget() reads the data of inode, which has been
      deleted for some reason.  That will make the system panic.  So We should
      judge whether this inode has been deleted, and tell the caller that the
      inode is a bad inode.
      
      For example, the ocfs2 is used as the backed of nfs, and the client is
      nfsv3.  This issue can be reproduced by the following steps.
      
      on the nfs server side,
      ..../patha/pathb
      
      Step 1: The process A was scheduled before calling the function fh_verify.
      
      Step 2: The process B is removing the 'pathb', and just completed the call
      to function dput.  Then the dentry of 'pathb' has been deleted from the
      dcache, and all ancestors have been deleted also.  The relationship of
      dentry and inode was deleted through the function hlist_del_init.  The
      following is the call stack.
      dentry_iput->hlist_del_init(&dentry->d_u.d_alias)
      
      At this time, the inode is still in the dcache.
      
      Step 3: The process A call the function ocfs2_get_dentry, which get the
      inode from dcache.  Then the refcount of inode is 1.  The following is the
      call stack.
      nfsd3_proc_getacl->fh_verify->exportfs_decode_fh->fh_to_dentry(ocfs2_get_dentry)
      
      Step 4: Dirty pages are flushed by bdi threads.  So the inode of 'patha'
      is evicted, and this directory was deleted.  But the inode of 'pathb'
      can't be evicted, because the refcount of the inode was 1.
      
      Step 5: The process A keep running, and call the function
      reconnect_path(in exportfs_decode_fh), which call function
      ocfs2_get_parent of ocfs2.  Get the block number of parent
      directory(patha) by the name of ...  Then read the data from disk by the
      block number.  But this inode has been deleted, so the system panic.
      
      Process A                                             Process B
      1. in nfsd3_proc_getacl                   |
      2.                                        |        dput
      3. fh_to_dentry(ocfs2_get_dentry)         |
      4. bdi flush dirty cache                  |
      5. ocfs2_iget                             |
      
      [283465.542049] OCFS2: ERROR (device sdp): ocfs2_validate_inode_block:
      Invalid dinode #580640: OCFS2_VALID_FL not set
      
      [283465.545490] Kernel panic - not syncing: OCFS2: (device sdp): panic forced
      after error
      
      [283465.546889] CPU: 5 PID: 12416 Comm: nfsd Tainted: G        W
      4.1.12-124.18.6.el6uek.bug28762940v3.x86_64 #2
      [283465.548382] Hardware name: VMware, Inc. VMware Virtual Platform/440BX
      Desktop Reference Platform, BIOS 6.00 09/21/2015
      [283465.549657]  0000000000000000 ffff8800a56fb7b8 ffffffff816e839c
      ffffffffa0514758
      [283465.550392]  000000000008dc20 ffff8800a56fb838 ffffffff816e62d3
      0000000000000008
      [283465.551056]  ffff880000000010 ffff8800a56fb848 ffff8800a56fb7e8
      ffff88005df9f000
      [283465.551710] Call Trace:
      [283465.552516]  [<ffffffff816e839c>] dump_stack+0x63/0x81
      [283465.553291]  [<ffffffff816e62d3>] panic+0xcb/0x21b
      [283465.554037]  [<ffffffffa04e66b0>] ocfs2_handle_error+0xf0/0xf0 [ocfs2]
      [283465.554882]  [<ffffffffa04e7737>] __ocfs2_error+0x67/0x70 [ocfs2]
      [283465.555768]  [<ffffffffa049c0f9>] ocfs2_validate_inode_block+0x229/0x230
      [ocfs2]
      [283465.556683]  [<ffffffffa047bcbc>] ocfs2_read_blocks+0x46c/0x7b0 [ocfs2]
      [283465.557408]  [<ffffffffa049bed0>] ? ocfs2_inode_cache_io_unlock+0x20/0x20
      [ocfs2]
      [283465.557973]  [<ffffffffa049f0eb>] ocfs2_read_inode_block_full+0x3b/0x60
      [ocfs2]
      [283465.558525]  [<ffffffffa049f5ba>] ocfs2_iget+0x4aa/0x880 [ocfs2]
      [283465.559082]  [<ffffffffa049146e>] ocfs2_get_parent+0x9e/0x220 [ocfs2]
      [283465.559622]  [<ffffffff81297c05>] reconnect_path+0xb5/0x300
      [283465.560156]  [<ffffffff81297f46>] exportfs_decode_fh+0xf6/0x2b0
      [283465.560708]  [<ffffffffa062faf0>] ? nfsd_proc_getattr+0xa0/0xa0 [nfsd]
      [283465.561262]  [<ffffffff810a8196>] ? prepare_creds+0x26/0x110
      [283465.561932]  [<ffffffffa0630860>] fh_verify+0x350/0x660 [nfsd]
      [283465.562862]  [<ffffffffa0637804>] ? nfsd_cache_lookup+0x44/0x630 [nfsd]
      [283465.563697]  [<ffffffffa063a8b9>] nfsd3_proc_getattr+0x69/0xf0 [nfsd]
      [283465.564510]  [<ffffffffa062cf60>] nfsd_dispatch+0xe0/0x290 [nfsd]
      [283465.565358]  [<ffffffffa05eb892>] ? svc_tcp_adjust_wspace+0x12/0x30
      [sunrpc]
      [283465.566272]  [<ffffffffa05ea652>] svc_process_common+0x412/0x6a0 [sunrpc]
      [283465.567155]  [<ffffffffa05eaa03>] svc_process+0x123/0x210 [sunrpc]
      [283465.568020]  [<ffffffffa062c90f>] nfsd+0xff/0x170 [nfsd]
      [283465.568962]  [<ffffffffa062c810>] ? nfsd_destroy+0x80/0x80 [nfsd]
      [283465.570112]  [<ffffffff810a622b>] kthread+0xcb/0xf0
      [283465.571099]  [<ffffffff810a6160>] ? kthread_create_on_node+0x180/0x180
      [283465.572114]  [<ffffffff816f11b8>] ret_from_fork+0x58/0x90
      [283465.573156]  [<ffffffff810a6160>] ? kthread_create_on_node+0x180/0x180
      
      Link: http://lkml.kernel.org/r/1554185919-3010-1-git-send-email-sunny.s.zhang@oracle.com
      
      
      Signed-off-by: default avatarShuning Zhang <sunny.s.zhang@oracle.com>
      Reviewed-by: default avatarJoseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: piaojun <piaojun@huawei.com>
      Cc: "Gang He" <ghe@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e091eab0
    • Phillip Potter's avatar
      ocfs2: use common file type conversion · 9dc2108d
      Phillip Potter authored
      Deduplicate the ocfs2 file type conversion implementation and remove
      OCFS2_FT_* definitions - file systems that use the same file types as
      defined by POSIX do not need to define their own versions and can use the
      common helper functions decared in fs_types.h and implemented in
      fs_types.c
      
      Common implementation can be found via bbe7449e ("fs: common
      implementation of file type").
      
      Link: http://lkml.kernel.org/r/20190326213919.GA20878@pathfinder
      
      
      Signed-off-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarPhillip Potter <phil@philpotter.co.uk>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9dc2108d
    • Dan Williams's avatar
      mm/huge_memory: fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addresses · fce86ff5
      Dan Williams authored
      Starting with c6f3c5ee ("mm/huge_memory.c: fix modifying of page
      protection by insert_pfn_pmd()") vmf_insert_pfn_pmd() internally calls
      pmdp_set_access_flags().  That helper enforces a pmd aligned @address
      argument via VM_BUG_ON() assertion.
      
      Update the implementation to take a 'struct vm_fault' argument directly
      and apply the address alignment fixup internally to fix crash signatures
      like:
      
          kernel BUG at arch/x86/mm/pgtable.c:515!
          invalid opcode: 0000 [#1] SMP NOPTI
          CPU: 51 PID: 43713 Comm: java Tainted: G           OE     4.19.35 #1
          [..]
          RIP: 0010:pmdp_set_access_flags+0x48/0x50
          [..]
          Call Trace:
           vmf_insert_pfn_pmd+0x198/0x350
           dax_iomap_fault+0xe82/0x1190
           ext4_dax_huge_fault+0x103/0x1f0
           ? __switch_to_asm+0x40/0x70
           __handle_mm_fault+0x3f6/0x1370
           ? __switch_to_asm+0x34/0x70
           ? __switch_to_asm+0x40/0x70
           handle_mm_fault+0xda/0x200
           __do_page_fault+0x249/0x4f0
           do_page_fault+0x32/0x110
           ? page_fault+0x8/0x30
           page_fault+0x1e/0x30
      
      Link: http://lkml.kernel.org/r/155741946350.372037.11148198430068238140.stgit@dwillia2-desk3.amr.corp.intel.com
      
      
      Fixes: c6f3c5ee ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()")
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reported-by: default avatarPiotr Balcer <piotr.balcer@intel.com>
      Tested-by: default avatarYan Ma <yan.ma@intel.com>
      Tested-by: default avatarPankaj Gupta <pagupta@redhat.com>
      Reviewed-by: default avatarMatthew Wilcox <willy@infradead.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Chandan Rajendra <chandan@linux.ibm.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fce86ff5
  2. May 09, 2019
  3. May 08, 2019
Loading