Skip to content
Snippets Groups Projects
  1. Mar 25, 2019
    • Jeff Layton's avatar
      locks: wake any locks blocked on request before deadlock check · 945ab8f6
      Jeff Layton authored
      Andreas reported that he was seeing the tdbtorture test fail in some
      cases with -EDEADLCK when it wasn't before. Some debugging showed that
      deadlock detection was sometimes discovering the caller's lock request
      itself in a dependency chain.
      
      While we remove the request from the blocked_lock_hash prior to
      reattempting to acquire it, any locks that are blocked on that request
      will still be present in the hash and will still have their fl_blocker
      pointer set to the current request.
      
      This causes posix_locks_deadlock to find a deadlock dependency chain
      when it shouldn't, as a lock request cannot block itself.
      
      We are going to end up waking all of those blocked locks anyway when we
      go to reinsert the request back into the blocked_lock_hash, so just do
      it prior to checking for deadlocks. This ensures that any lock blocked
      on the current request will no longer be part of any blocked request
      chain.
      
      URL: https://bugzilla.kernel.org/show_bug.cgi?id=202975
      
      
      Fixes: 5946c431 ("fs/locks: allow a lock request to block other requests.")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarAndreas Schneider <asn@redhat.com>
      Signed-off-by: default avatarNeil Brown <neilb@suse.com>
      Signed-off-by: default avatarJeff Layton <jlayton@kernel.org>
      945ab8f6
  2. Mar 23, 2019
    • Darrick J. Wong's avatar
      ext4: prohibit fstrim in norecovery mode · 18915b58
      Darrick J. Wong authored
      
      The ext4 fstrim implementation uses the block bitmaps to find free space
      that can be discarded.  If we haven't replayed the journal, the bitmaps
      will be stale and we absolutely *cannot* use stale metadata to zap the
      underlying storage.
      
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      18915b58
    • Trond Myklebust's avatar
      pNFS/flexfiles: Fix layoutstats handling during read failovers · 166bd5b8
      Trond Myklebust authored
      
      During a read failover, we may end up changing the value of
      the pgio_mirror_idx, so make sure that we record the layout
      stats before that update.
      
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      166bd5b8
    • Trond Myklebust's avatar
      NFS: Fix a typo in nfs_init_timeout_values() · 5a698243
      Trond Myklebust authored
      
      Specifying a retrans=0 mount parameter to a NFS/TCP mount, is
      inadvertently causing the NFS client to rewrite any specified
      timeout parameter to the default of 60 seconds.
      
      Fixes: a956beda ("NFS: Allow the mount option retrans=0")
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      5a698243
    • zhangyi (F)'s avatar
      ext4: cleanup bh release code in ext4_ind_remove_space() · 5e86bdda
      zhangyi (F) authored
      
      Currently, we are releasing the indirect buffer where we are done with
      it in ext4_ind_remove_space(), so we can see the brelse() and
      BUFFER_TRACE() everywhere.  It seems fragile and hard to read, and we
      may probably forget to release the buffer some day.  This patch cleans
      up the code by putting of the code which releases the buffers to the
      end of the function.
      
      Signed-off-by: default avatarzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      5e86bdda
    • zhangyi (F)'s avatar
      ext4: brelse all indirect buffer in ext4_ind_remove_space() · 674a2b27
      zhangyi (F) authored
      
      All indirect buffers get by ext4_find_shared() should be released no
      mater the branch should be freed or not. But now, we forget to release
      the lower depth indirect buffers when removing space from the same
      higher depth indirect block. It will lead to buffer leak and futher
      more, it may lead to quota information corruption when using old quota,
      consider the following case.
      
       - Create and mount an empty ext4 filesystem without extent and quota
         features,
       - quotacheck and enable the user & group quota,
       - Create some files and write some data to them, and then punch hole
         to some files of them, it may trigger the buffer leak problem
         mentioned above.
       - Disable quota and run quotacheck again, it will create two new
         aquota files and write the checked quota information to them, which
         probably may reuse the freed indirect block(the buffer and page
         cache was not freed) as data block.
       - Enable quota again, it will invoke
         vfs_load_quota_inode()->invalidate_bdev() to try to clean unused
         buffers and pagecache. Unfortunately, because of the buffer of quota
         data block is still referenced, quota code cannot read the up to date
         quota info from the device and lead to quota information corruption.
      
      This problem can be reproduced by xfstests generic/231 on ext3 file
      system or ext4 file system without extent and quota features.
      
      This patch fix this problem by releasing the missing indirect buffers,
      in ext4_ind_remove_space().
      
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      674a2b27
    • Kairui Song's avatar
      x86/gart: Exclude GART aperture from kcore · ffc8599a
      Kairui Song authored
      
      On machines where the GART aperture is mapped over physical RAM,
      /proc/kcore contains the GART aperture range. Accessing the GART range via
      /proc/kcore results in a kernel crash.
      
      vmcore used to have the same issue, until it was fixed with commit
      2a3e83c6 ("x86/gart: Exclude GART aperture from vmcore")', leveraging
      existing hook infrastructure in vmcore to let /proc/vmcore return zeroes
      when attempting to read the aperture region, and so it won't read from the
      actual memory.
      
      Apply the same workaround for kcore. First implement the same hook
      infrastructure for kcore, then reuse the hook functions introduced in the
      previous vmcore fix. Just with some minor adjustment, rename some functions
      for more general usage, and simplify the hook infrastructure a bit as there
      is no module usage yet.
      
      Suggested-by: default avatarBaoquan He <bhe@redhat.com>
      Signed-off-by: default avatarKairui Song <kasong@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJiri Bohac <jbohac@suse.cz>
      Acked-by: default avatarBaoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Dave Young <dyoung@redhat.com>
      Link: https://lkml.kernel.org/r/20190308030508.13548-1-kasong@redhat.com
      
      ffc8599a
    • Steve French's avatar
      cifs: update internal module version number · cf7d624f
      Steve French authored
      
      To 2.19
      
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      cf7d624f
    • Steve French's avatar
      SMB3: Fix SMB3.1.1 guest mounts to Samba · 8c11a607
      Steve French authored
      
      Workaround problem with Samba responses to SMB3.1.1
      null user (guest) mounts.  The server doesn't set the
      expected flag in the session setup response so we have
      to do a similar check to what is done in smb3_validate_negotiate
      where we also check if the user is a null user (but not sec=krb5
      since username might not be passed in on mount for Kerberos case).
      
      Note that the commit below tightened the conditions and forced signing
      for the SMB2-TreeConnect commands as per MS-SMB2.
      However, this should only apply to normal user sessions and not for
      cases where there is no user (even if server forgets to set the flag
      in the response) since we don't have anything useful to sign with.
      This is especially important now that the more secure SMB3.1.1 protocol
      is in the default dialect list.
      
      An earlier patch ("cifs: allow guest mounts to work for smb3.11") fixed
      the guest mounts to Windows.
      
          Fixes: 6188f28b ("Tree connect for SMB3.1.1 must be signed for non-encrypted shares")
      
      Reviewed-by: default avatarRonnie Sahlberg <lsahlber@redhat.com>
      Reviewed-by: default avatarPaulo Alcantara <palcantara@suse.de>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      8c11a607
    • Paulo Alcantara (SUSE)'s avatar
      cifs: Fix slab-out-of-bounds when tracing SMB tcon · 68ddb496
      Paulo Alcantara (SUSE) authored
      
      This patch fixes the following KASAN report:
      
      [  779.044746] BUG: KASAN: slab-out-of-bounds in string+0xab/0x180
      [  779.044750] Read of size 1 at addr ffff88814f327968 by task trace-cmd/2812
      
      [  779.044756] CPU: 1 PID: 2812 Comm: trace-cmd Not tainted 5.1.0-rc1+ #62
      [  779.044760] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
      [  779.044761] Call Trace:
      [  779.044769]  dump_stack+0x5b/0x90
      [  779.044775]  ? string+0xab/0x180
      [  779.044781]  print_address_description+0x6c/0x23c
      [  779.044787]  ? string+0xab/0x180
      [  779.044792]  ? string+0xab/0x180
      [  779.044797]  kasan_report.cold.3+0x1a/0x32
      [  779.044803]  ? string+0xab/0x180
      [  779.044809]  string+0xab/0x180
      [  779.044816]  ? widen_string+0x160/0x160
      [  779.044822]  ? vsnprintf+0x5bf/0x7f0
      [  779.044829]  vsnprintf+0x4e7/0x7f0
      [  779.044836]  ? pointer+0x4a0/0x4a0
      [  779.044841]  ? seq_buf_vprintf+0x79/0xc0
      [  779.044848]  seq_buf_vprintf+0x62/0xc0
      [  779.044855]  trace_seq_printf+0x113/0x210
      [  779.044861]  ? trace_seq_puts+0x110/0x110
      [  779.044867]  ? trace_raw_output_prep+0xd8/0x110
      [  779.044876]  trace_raw_output_smb3_tcon_class+0x9f/0xc0
      [  779.044882]  print_trace_line+0x377/0x890
      [  779.044888]  ? tracing_buffers_read+0x300/0x300
      [  779.044893]  ? ring_buffer_read+0x58/0x70
      [  779.044899]  s_show+0x6e/0x140
      [  779.044906]  seq_read+0x505/0x6a0
      [  779.044913]  vfs_read+0xaf/0x1b0
      [  779.044919]  ksys_read+0xa1/0x130
      [  779.044925]  ? kernel_write+0xa0/0xa0
      [  779.044931]  ? __do_page_fault+0x3d5/0x620
      [  779.044938]  do_syscall_64+0x63/0x150
      [  779.044944]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [  779.044949] RIP: 0033:0x7f62c2c2db31
      [ 779.044955] Code: fe ff ff 48 8d 3d 17 9e 09 00 48 83 ec 08 e8 96 02
      02 00 66 0f 1f 44 00 00 8b 05 fa fc 2c 00 48 63 ff 85 c0 75 13 31 c0
      0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 55 53 48 89 d5 48
      89
      [  779.044958] RSP: 002b:00007ffd6e116678 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
      [  779.044964] RAX: ffffffffffffffda RBX: 0000560a38be9260 RCX: 00007f62c2c2db31
      [  779.044966] RDX: 0000000000002000 RSI: 00007ffd6e116710 RDI: 0000000000000003
      [  779.044966] RDX: 0000000000002000 RSI: 00007ffd6e116710 RDI: 0000000000000003
      [  779.044969] RBP: 00007f62c2ef5420 R08: 0000000000000000 R09: 0000000000000003
      [  779.044972] R10: ffffffffffffffa8 R11: 0000000000000246 R12: 00007ffd6e116710
      [  779.044975] R13: 0000000000002000 R14: 0000000000000d68 R15: 0000000000002000
      
      [  779.044981] Allocated by task 1257:
      [  779.044987]  __kasan_kmalloc.constprop.5+0xc1/0xd0
      [  779.044992]  kmem_cache_alloc+0xad/0x1a0
      [  779.044997]  getname_flags+0x6c/0x2a0
      [  779.045003]  user_path_at_empty+0x1d/0x40
      [  779.045008]  do_faccessat+0x12a/0x330
      [  779.045012]  do_syscall_64+0x63/0x150
      [  779.045017]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      [  779.045019] Freed by task 1257:
      [  779.045023]  __kasan_slab_free+0x12e/0x180
      [  779.045029]  kmem_cache_free+0x85/0x1b0
      [  779.045034]  filename_lookup.part.70+0x176/0x250
      [  779.045039]  do_faccessat+0x12a/0x330
      [  779.045043]  do_syscall_64+0x63/0x150
      [  779.045048]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      [  779.045052] The buggy address belongs to the object at ffff88814f326600
      which belongs to the cache names_cache of size 4096
      [  779.045057] The buggy address is located 872 bytes to the right of
      4096-byte region [ffff88814f326600, ffff88814f327600)
      [  779.045058] The buggy address belongs to the page:
      [  779.045062] page:ffffea00053cc800 count:1 mapcount:0 mapping:ffff88815b191b40 index:0x0 compound_mapcount: 0
      [  779.045067] flags: 0x200000000010200(slab|head)
      [  779.045075] raw: 0200000000010200 dead000000000100 dead000000000200 ffff88815b191b40
      [  779.045081] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
      [  779.045083] page dumped because: kasan: bad access detected
      
      [  779.045085] Memory state around the buggy address:
      [  779.045089]  ffff88814f327800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  779.045093]  ffff88814f327880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  779.045097] >ffff88814f327900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  779.045099]                                                           ^
      [  779.045103]  ffff88814f327980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  779.045107]  ffff88814f327a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  779.045109] ==================================================================
      [  779.045110] Disabling lock debugging due to kernel taint
      
      Correctly assign tree name str for smb3_tcon event.
      
      Signed-off-by: default avatarPaulo Alcantara (SUSE) <paulo@paulo.ac>
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      68ddb496
    • Ronnie Sahlberg's avatar
      cifs: allow guest mounts to work for smb3.11 · e71ab2aa
      Ronnie Sahlberg authored
      
      Fix Guest/Anonymous sessions so that they work with SMB 3.11.
      
      The commit noted below tightened the conditions and forced signing for
      the SMB2-TreeConnect commands as per MS-SMB2.
      However, this should only apply to normal user sessions and not for
      Guest/Anonumous sessions.
      
      Fixes: 6188f28b ("Tree connect for SMB3.1.1 must be signed for non-encrypted shares")
      
      Signed-off-by: default avatarRonnie Sahlberg <lsahlber@redhat.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      e71ab2aa
    • Steve French's avatar
      fix incorrect error code mapping for OBJECTID_NOT_FOUND · 85f9987b
      Steve French authored
      
      It was mapped to EIO which can be confusing when user space
      queries for an object GUID for an object for which the server
      file system doesn't support (or hasn't saved one).
      
      As Amir Goldstein suggested this is similar to ENOATTR
      (equivalently ENODATA in Linux errno definitions) so
      changing NT STATUS code mapping for OBJECTID_NOT_FOUND
      to ENODATA.
      
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      CC: Amir Goldstein <amir73il@gmail.com>
      85f9987b
    • Xiaoli Feng's avatar
      cifs: fix that return -EINVAL when do dedupe operation · b073a080
      Xiaoli Feng authored
      dedupe_file_range operations is combiled into remap_file_range.
      But it's always skipped for dedupe operations in function
      cifs_remap_file_range.
      
      Example to test:
      Before this patch:
        # dd if=/dev/zero of=cifs/file bs=1M count=1
        # xfs_io -c "dedupe cifs/file 4k 64k 4k" cifs/file
        XFS_IOC_FILE_EXTENT_SAME: Invalid argument
      
      After this patch:
        # dd if=/dev/zero of=cifs/file bs=1M count=1
        # xfs_io -c "dedupe cifs/file 4k 64k 4k" cifs/file
        XFS_IOC_FILE_EXTENT_SAME: Operation not supported
      
      Influence for xfstests:
      generic/091
      generic/112
      generic/127
      generic/263
      These tests report this error "do_copy_range:: Invalid
      argument" instead of "FIDEDUPERANGE: Invalid argument".
      Because there are still two bugs cause these test failed.
      https://bugzilla.kernel.org/show_bug.cgi?id=202935
      https://bugzilla.kernel.org/show_bug.cgi?id=202785
      
      
      
      Signed-off-by: default avatarXiaoli Feng <fengxiaoli0714@gmail.com>
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      b073a080
    • Long Li's avatar
      CIFS: Fix an issue with re-sending rdata when transport returning -EAGAIN · 0b0dfd59
      Long Li authored
      
      When sending a rdata, transport may return -EAGAIN. In this case
      we should re-obtain credits because the session may have been
      reconnected.
      
      Change in v2: adjust_credits before re-sending
      
      Signed-off-by: default avatarLong Li <longli@microsoft.com>
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      Reviewed-by: default avatarPavel Shilovsky <pshilov@microsoft.com>
      0b0dfd59
    • Long Li's avatar
      CIFS: Fix an issue with re-sending wdata when transport returning -EAGAIN · d53e292f
      Long Li authored
      
      When sending a wdata, transport may return -EAGAIN. In this case
      we should re-obtain credits because the session may have been
      reconnected.
      
      Change in v2: adjust_credits before re-sending
      
      Signed-off-by: default avatarLong Li <longli@microsoft.com>
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      Reviewed-by: default avatarPavel Shilovsky <pshilov@microsoft.com>
      d53e292f
  3. Mar 20, 2019
    • Filipe Manana's avatar
      Btrfs: fix assertion failure on fsync with NO_HOLES enabled · 0ccc3876
      Filipe Manana authored
      
      Back in commit a89ca6f2 ("Btrfs: fix fsync after truncate when
      no_holes feature is enabled") I added an assertion that is triggered when
      an inline extent is found to assert that the length of the (uncompressed)
      data the extent represents is the same as the i_size of the inode, since
      that is true most of the time I couldn't find or didn't remembered about
      any exception at that time. Later on the assertion was expanded twice to
      deal with a case of a compressed inline extent representing a range that
      matches the sector size followed by an expanding truncate, and another
      case where fallocate can update the i_size of the inode without adding
      or updating existing extents (if the fallocate range falls entirely within
      the first block of the file). These two expansion/fixes of the assertion
      were done by commit 7ed586d0 ("Btrfs: fix assertion on fsync of
      regular file when using no-holes feature") and commit 6399fb5a
      ("Btrfs: fix assertion failure during fsync in no-holes mode").
      These however missed the case where an falloc expands the i_size of an
      inode to exactly the sector size and inline extent exists, for example:
      
       $ mkfs.btrfs -f -O no-holes /dev/sdc
       $ mount /dev/sdc /mnt
      
       $ xfs_io -f -c "pwrite -S 0xab 0 1096" /mnt/foobar
       wrote 1096/1096 bytes at offset 0
       1 KiB, 1 ops; 0.0002 sec (4.448 MiB/sec and 4255.3191 ops/sec)
      
       $ xfs_io -c "falloc 1096 3000" /mnt/foobar
       $ xfs_io -c "fsync" /mnt/foobar
       Segmentation fault
      
       $ dmesg
       [701253.602385] assertion failed: len == i_size || (len == fs_info->sectorsize && btrfs_file_extent_compression(leaf, extent) != BTRFS_COMPRESS_NONE) || (len < i_size && i_size < fs_info->sectorsize), file: fs/btrfs/tree-log.c, line: 4727
       [701253.602962] ------------[ cut here ]------------
       [701253.603224] kernel BUG at fs/btrfs/ctree.h:3533!
       [701253.603503] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
       [701253.603774] CPU: 2 PID: 7192 Comm: xfs_io Tainted: G        W         5.0.0-rc8-btrfs-next-45 #1
       [701253.604054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [701253.604650] RIP: 0010:assfail.constprop.23+0x18/0x1a [btrfs]
       (...)
       [701253.605591] RSP: 0018:ffffbb48c186bc48 EFLAGS: 00010286
       [701253.605914] RAX: 00000000000000de RBX: ffff921d0a7afc08 RCX: 0000000000000000
       [701253.606244] RDX: 0000000000000000 RSI: ffff921d36b16868 RDI: ffff921d36b16868
       [701253.606580] RBP: ffffbb48c186bcf0 R08: 0000000000000000 R09: 0000000000000000
       [701253.606913] R10: 0000000000000003 R11: 0000000000000000 R12: ffff921d05d2de18
       [701253.607247] R13: ffff921d03b54000 R14: 0000000000000448 R15: ffff921d059ecf80
       [701253.607769] FS:  00007f14da906700(0000) GS:ffff921d36b00000(0000) knlGS:0000000000000000
       [701253.608163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [701253.608516] CR2: 000056087ea9f278 CR3: 00000002268e8001 CR4: 00000000003606e0
       [701253.608880] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [701253.609250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [701253.609608] Call Trace:
       [701253.609994]  btrfs_log_inode+0xdfb/0xe40 [btrfs]
       [701253.610383]  btrfs_log_inode_parent+0x2be/0xa60 [btrfs]
       [701253.610770]  ? do_raw_spin_unlock+0x49/0xc0
       [701253.611150]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
       [701253.611537]  btrfs_sync_file+0x3b2/0x440 [btrfs]
       [701253.612010]  ? do_sysinfo+0xb0/0xf0
       [701253.612552]  do_fsync+0x38/0x60
       [701253.612988]  __x64_sys_fsync+0x10/0x20
       [701253.613360]  do_syscall_64+0x60/0x1b0
       [701253.613733]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [701253.614103] RIP: 0033:0x7f14da4e66d0
       (...)
       [701253.615250] RSP: 002b:00007fffa670fdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
       [701253.615647] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f14da4e66d0
       [701253.616047] RDX: 000056087ea9c260 RSI: 000056087ea9c260 RDI: 0000000000000003
       [701253.616450] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000010
       [701253.616854] R10: 000000000000009b R11: 0000000000000246 R12: 000056087ea9c260
       [701253.617257] R13: 000056087ea9c240 R14: 0000000000000000 R15: 000056087ea9dd10
       (...)
       [701253.619941] ---[ end trace e088d74f132b6da5 ]---
      
      Updating the assertion again to allow for this particular case would result
      in a meaningless assertion, plus there is currently no risk of logging
      content that would result in any corruption after a log replay if the size
      of the data encoded in an inline extent is greater than the inode's i_size
      (which is not currently possibe either with or without compression),
      therefore just remove the assertion.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0ccc3876
  4. Mar 19, 2019
  5. Mar 18, 2019
    • Andrea Righi's avatar
      btrfs: raid56: properly unmap parity page in finish_parity_scrub() · 3897b6f0
      Andrea Righi authored
      Parity page is incorrectly unmapped in finish_parity_scrub(), triggering
      a reference counter bug on i386, i.e.:
      
       [ 157.662401] kernel BUG at mm/highmem.c:349!
       [ 157.666725] invalid opcode: 0000 [#1] SMP PTI
      
      The reason is that kunmap(p_page) was completely left out, so we never
      did an unmap for the p_page and the loop unmapping the rbio page was
      iterating over the wrong number of stripes: unmapping should be done
      with nr_data instead of rbio->real_stripes.
      
      Test case to reproduce the bug:
      
       - create a raid5 btrfs filesystem:
         # mkfs.btrfs -m raid5 -d raid5 /dev/sdb /dev/sdc /dev/sdd /dev/sde
      
       - mount it:
         # mount /dev/sdb /mnt
      
       - run btrfs scrub in a loop:
         # while :; do btrfs scrub start -BR /mnt; done
      
      BugLink: https://bugs.launchpad.net/bugs/1812845
      
      
      Fixes: 5a6ac9ea ("Btrfs, raid56: support parity scrub on raid56")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarAndrea Righi <andrea.righi@canonical.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3897b6f0
    • Catalin Marinas's avatar
      NFS: Fix nfs4_lock_state refcounting in nfs4_alloc_{lock,unlock}data() · 3028efe0
      Catalin Marinas authored
      
      Commit 7b587e1a ("NFS: use locks_copy_lock() to copy locks.")
      changed the lock copying from memcpy() to the dedicated
      locks_copy_lock() function. The latter correctly increments the
      nfs4_lock_state.ls_count via nfs4_fl_copy_lock(), however, this refcount
      has already been incremented in the nfs4_alloc_{lock,unlock}data().
      Kmemleak subsequently reports an unreferenced nfs4_lock_state object as
      below (arm64 platform):
      
      unreferenced object 0xffff8000fce0b000 (size 256):
        comm "systemd-sysuser", pid 1608, jiffies 4294892825 (age 32.348s)
        hex dump (first 32 bytes):
          20 57 4c fb 00 80 ff ff 20 57 4c fb 00 80 ff ff   WL..... WL.....
          00 57 4c fb 00 80 ff ff 01 00 00 00 00 00 00 00  .WL.............
        backtrace:
          [<000000000d15010d>] kmem_cache_alloc+0x178/0x208
          [<00000000d7c1d264>] nfs4_set_lock_state+0x124/0x1f0
          [<000000009c867628>] nfs4_proc_lock+0x90/0x478
          [<000000001686bd74>] do_setlk+0x64/0xe8
          [<00000000e01500d4>] nfs_lock+0xe8/0x1f0
          [<000000004f387d8d>] vfs_lock_file+0x18/0x40
          [<00000000656ab79b>] do_lock_file_wait+0x68/0xf8
          [<00000000f17c4a4b>] fcntl_setlk+0x224/0x280
          [<0000000052a242c6>] do_fcntl+0x418/0x730
          [<000000004f47291a>] __arm64_sys_fcntl+0x84/0xd0
          [<00000000d6856e01>] el0_svc_common+0x80/0xf0
          [<000000009c4bd1df>] el0_svc_handler+0x2c/0x80
          [<00000000b1a0d479>] el0_svc+0x8/0xc
          [<0000000056c62a0f>] 0xffffffffffffffff
      
      This patch removes the original refcount_inc(&lsp->ls_count) that was
      paired with the memcpy() lock copying.
      
      Fixes: 7b587e1a ("NFS: use locks_copy_lock() to copy locks.")
      Cc: <stable@vger.kernel.org> # 5.0.x-
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      3028efe0
    • Jens Axboe's avatar
      block: add BIO_NO_PAGE_REF flag · 399254aa
      Jens Axboe authored
      
      If bio_iov_iter_get_pages() is called on an iov_iter that is flagged
      with NO_REF, then we don't need to add a page reference for the pages
      that we add.
      
      Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows
      not to drop a reference to these pages.
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      399254aa
    • Jens Axboe's avatar
      iov_iter: add ITER_BVEC_FLAG_NO_REF flag · 875f1d07
      Jens Axboe authored
      
      For ITER_BVEC, if we're holding on to kernel pages, the caller
      doesn't need to grab a reference to the bvec pages, and drop that
      same reference on IO completion. This is essentially safe for any
      ITER_BVEC, but some use cases end up reusing pages and uncondtionally
      dropping a page reference on completion. And example of that is
      sendfile(2), that ends up being a splice_in + splice_out on the
      pipe pages.
      
      Add a flag that tells us it's fine to not grab a page reference
      to the bvec pages, since that caller knows not to drop a reference
      when it's done with the pages.
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      875f1d07
    • Jens Axboe's avatar
      io_uring: retry bulk slab allocs as single allocs · fd6fab2c
      Jens Axboe authored
      
      I've seen cases where bulk alloc fails, since the bulk alloc API
      is all-or-nothing - either we get the number we ask for, or it
      returns 0 as number of entries.
      
      If we fail a batch bulk alloc, retry a "normal" kmem_cache_alloc()
      and just use that instead of failing with -EAGAIN.
      
      While in there, ensure we use GFP_KERNEL. That was an oversight in
      the original code, when we switched away from GFP_ATOMIC.
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fd6fab2c
    • Jan Kara's avatar
      udf: Propagate errors from udf_truncate_extents() · 2b42be5e
      Jan Kara authored
      
      Make udf_truncate_extents() properly propagate errors to its callers and
      let udf_setsize() handle the error properly as well. This lets userspace
      know in case there's some error when truncating blocks.
      
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      2b42be5e
    • Jan Kara's avatar
      udf: Fix crash on IO error during truncate · d3ca4651
      Jan Kara authored
      
      When truncate(2) hits IO error when reading indirect extent block the
      code just bugs with:
      
      kernel BUG at linux-4.15.0/fs/udf/truncate.c:249!
      ...
      
      Fix the problem by bailing out cleanly in case of IO error.
      
      CC: stable@vger.kernel.org
      Reported-by: default avatarjean-luc malet <jeanluc.malet@gmail.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      d3ca4651
  6. Mar 16, 2019
  7. Mar 15, 2019
    • Jens Axboe's avatar
      io_uring: fix poll races · 8c838788
      Jens Axboe authored
      
      This is a straight port of Al's fix for the aio poll implementation,
      since the io_uring version is heavily based on that. The below
      description is almost straight from that patch, just modified to
      fit the io_uring situation.
      
      io_poll() has to cope with several unpleasant problems:
      	* requests that might stay around indefinitely need to
      be made visible for io_cancel(2); that must not be done to
      a request already completed, though.
      	* in cases when ->poll() has placed us on a waitqueue,
      wakeup might have happened (and request completed) before ->poll()
      returns.
      	* worse, in some early wakeup cases request might end
      up re-added into the queue later - we can't treat "woken up and
      currently not in the queue" as "it's not going to stick around
      indefinitely"
      	* ... moreover, ->poll() might have decided not to
      put it on any queues to start with, and that needs to be distinguished
      from the previous case
      	* ->poll() might have tried to put us on more than one queue.
      Only the first will succeed for io poll, so we might end up missing
      wakeups.  OTOH, we might very well notice that only after the
      wakeup hits and request gets completed (all before ->poll() gets
      around to the second poll_wait()).  In that case it's too late to
      decide that we have an error.
      
      req->woken was an attempt to deal with that.  Unfortunately, it was
      broken.  What we need to keep track of is not that wakeup has happened -
      the thing might come back after that.  It's that async reference is
      already gone and won't come back, so we can't (and needn't) put the
      request on the list of cancellables.
      
      The easiest case is "request hadn't been put on any waitqueues"; we
      can tell by seeing NULL apt.head, and in that case there won't be
      anything async.  We should either complete the request ourselves
      (if vfs_poll() reports anything of interest) or return an error.
      
      In all other cases we get exclusion with wakeups by grabbing the
      queue lock.
      
      If request is currently on queue and we have something interesting
      from vfs_poll(), we can steal it and complete the request ourselves.
      
      If it's on queue and vfs_poll() has not reported anything interesting,
      we either put it on the cancellable list, or, if we know that it
      hadn't been put on all queues ->poll() wanted it on, we steal it and
      return an error.
      
      If it's _not_ on queue, it's either been already dealt with (in which
      case we do nothing), or there's io_poll_complete_work() about to be
      executed.  In that case we either put it on the cancellable list,
      or, if we know it hadn't been put on all queues ->poll() wanted it on,
      simulate what cancel would've done.
      
      Fixes: 221c5eb2 ("io_uring: add support for IORING_OP_POLL")
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8c838788
    • Jens Axboe's avatar
      io_uring: fix fget/fput handling · 09bb8394
      Jens Axboe authored
      
      This isn't a straight port of commit 84c4e1f8 for aio.c, since
      io_uring doesn't use files in exactly the same way. But it's pretty
      close. See the commit message for that commit.
      
      This essentially fixes a use-after-free with the poll command
      handling, but it takes cue from Linus's approach to just simplifying
      the file handling. We move the setup of the file into a higher level
      location, so the individual commands don't have to deal with it. And
      then we release the reference when we free the associated io_kiocb.
      
      Fixes: 221c5eb2 ("io_uring: add support for IORING_OP_POLL")
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      09bb8394
    • Jens Axboe's avatar
      io_uring: add prepped flag · d530a402
      Jens Axboe authored
      
      We currently use the fact that if ->ki_filp is already set, then we've
      done the prep. In preparation for moving the file assignment earlier,
      use a separate flag to tell whether the request has been prepped for
      IO or not.
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d530a402
    • Jens Axboe's avatar
      io_uring: make io_read/write return an integer · e0c5c576
      Jens Axboe authored
      
      The callers all convert to an integer, and we only return 0/-ERROR
      anyway.
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e0c5c576
    • Lukas Czerner's avatar
      ext4: report real fs size after failed resize · 6c732840
      Lukas Czerner authored
      
      Currently when the file system resize using ext4_resize_fs() fails it
      will report into log that "resized filesystem to <requested block
      count>".  However this may not be true in the case of failure.  Use the
      current block count as returned by ext4_blocks_count() to report the
      block count.
      
      Additionally, report a warning that "error occurred during file system
      resize"
      
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      6c732840
    • Lukas Czerner's avatar
      ext4: add missing brelse() in add_new_gdb_meta_bg() · d64264d6
      Lukas Czerner authored
      
      Currently in add_new_gdb_meta_bg() there is a missing brelse of gdb_bh
      in case ext4_journal_get_write_access() fails.
      Additionally kvfree() is missing in the same error path. Fix it by
      moving the ext4_journal_get_write_access() before the ext4 sb update as
      Ted suggested and release n_group_desc and gdb_bh in case it fails.
      
      Fixes: 61a9c11e ("ext4: add missing brelse() add_new_gdb_meta_bg()'s error path")
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      d64264d6
    • Jens Axboe's avatar
      io_uring: use regular request ref counts · e65ef56d
      Jens Axboe authored
      
      Get rid of the special casing of "normal" requests not having
      any references to the io_kiocb. We initialize the ref count to 2,
      one for the submission side, and one or the completion side.
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e65ef56d
    • Jason Yan's avatar
      ext4: remove useless ext4_pin_inode() · 7cf77140
      Jason Yan authored
      
      This function is never used from the beginning (and is commented out);
      let's remove it.
      
      Signed-off-by: default avatarJason Yan <yanaijie@huawei.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      7cf77140
    • Jan Kara's avatar
      ext4: avoid panic during forced reboot · 1dc1097f
      Jan Kara authored
      
      When admin calls "reboot -f" - i.e., does a hard system reboot by
      directly calling reboot(2) - ext4 filesystem mounted with errors=panic
      can panic the system. This happens because the underlying device gets
      disabled without unmounting the filesystem and thus some syscall running
      in parallel to reboot(2) can result in the filesystem getting IO errors.
      
      This is somewhat surprising to the users so try improve the behavior by
      switching to errors=remount-ro behavior when the system is running
      reboot(2).
      
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      1dc1097f
    • Lukas Czerner's avatar
      ext4: fix data corruption caused by unaligned direct AIO · 372a03e0
      Lukas Czerner authored
      
      Ext4 needs to serialize unaligned direct AIO because the zeroing of
      partial blocks of two competing unaligned AIOs can result in data
      corruption.
      
      However it decides not to serialize if the potentially unaligned aio is
      past i_size with the rationale that no pending writes are possible past
      i_size. Unfortunately if the i_size is not block aligned and the second
      unaligned write lands past i_size, but still into the same block, it has
      the potential of corrupting the previous unaligned write to the same
      block.
      
      This is (very simplified) reproducer from Frank
      
          // 41472 = (10 * 4096) + 512
          // 37376 = 41472 - 4096
      
          ftruncate(fd, 41472);
          io_prep_pwrite(iocbs[0], fd, buf[0], 4096, 37376);
          io_prep_pwrite(iocbs[1], fd, buf[1], 4096, 41472);
      
          io_submit(io_ctx, 1, &iocbs[1]);
          io_submit(io_ctx, 1, &iocbs[2]);
      
          io_getevents(io_ctx, 2, 2, events, NULL);
      
      Without this patch the 512B range from 40960 up to the start of the
      second unaligned write (41472) is going to be zeroed overwriting the data
      written by the first write. This is a data corruption.
      
      00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
      *
      00009200  30 30 30 30 30 30 30 30  30 30 30 30 30 30 30 30
      *
      0000a000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
      *
      0000a200  31 31 31 31 31 31 31 31  31 31 31 31 31 31 31 31
      
      With this patch the data corruption is avoided because we will recognize
      the unaligned_aio and wait for the unwritten extent conversion.
      
      00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
      *
      00009200  30 30 30 30 30 30 30 30  30 30 30 30 30 30 30 30
      *
      0000a200  31 31 31 31 31 31 31 31  31 31 31 31 31 31 31 31
      *
      0000b200
      
      Reported-by: default avatarFrank Sorenson <fsorenso@redhat.com>
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Fixes: e9e3bcec ("ext4: serialize unaligned asynchronous DIO")
      Cc: stable@vger.kernel.org
      372a03e0
    • Jiufei Xue's avatar
      ext4: fix NULL pointer dereference while journal is aborted · fa30dde3
      Jiufei Xue authored
      
      We see the following NULL pointer dereference while running xfstests
      generic/475:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      PGD 8000000c84bad067 P4D 8000000c84bad067 PUD c84e62067 PMD 0
      Oops: 0000 [#1] SMP PTI
      CPU: 7 PID: 9886 Comm: fsstress Kdump: loaded Not tainted 5.0.0-rc8 #10
      RIP: 0010:ext4_do_update_inode+0x4ec/0x760
      ...
      Call Trace:
      ? jbd2_journal_get_write_access+0x42/0x50
      ? __ext4_journal_get_write_access+0x2c/0x70
      ? ext4_truncate+0x186/0x3f0
      ext4_mark_iloc_dirty+0x61/0x80
      ext4_mark_inode_dirty+0x62/0x1b0
      ext4_truncate+0x186/0x3f0
      ? unmap_mapping_pages+0x56/0x100
      ext4_setattr+0x817/0x8b0
      notify_change+0x1df/0x430
      do_truncate+0x5e/0x90
      ? generic_permission+0x12b/0x1a0
      
      This is triggered because the NULL pointer handle->h_transaction was
      dereferenced in function ext4_update_inode_fsync_trans().
      I found that the h_transaction was set to NULL in jbd2__journal_restart
      but failed to attached to a new transaction while the journal is aborted.
      
      Fix this by checking the handle before updating the inode.
      
      Fixes: b436b9be ("ext4: Wait for proper transaction commit on fsync")
      Signed-off-by: default avatarJiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@linux.alibaba.com>
      Cc: stable@kernel.org
      fa30dde3
Loading