Skip to content
Snippets Groups Projects
  1. Apr 14, 2019
  2. Apr 12, 2019
    • Jason Yan's avatar
      block: fix the return errno for direct IO · a89afe58
      Jason Yan authored
      
      If the last bio returned is not dio->bio, the status of the bio will
      not assigned to dio->bio if it is error. This will cause the whole IO
      status wrong.
      
          ksoftirqd/21-117   [021] ..s.  4017.966090:   8,0    C   N 4883648 [0]
                <idle>-0     [018] ..s.  4017.970888:   8,0    C  WS 4924800 + 1024 [0]
                <idle>-0     [018] ..s.  4017.970909:   8,0    D  WS 4935424 + 1024 [<idle>]
                <idle>-0     [018] ..s.  4017.970924:   8,0    D  WS 4936448 + 321 [<idle>]
          ksoftirqd/21-117   [021] ..s.  4017.995033:   8,0    C   R 4883648 + 336 [65475]
          ksoftirqd/21-117   [021] d.s.  4018.001988: myprobe1: (blkdev_bio_end_io+0x0/0x168) bi_status=7
          ksoftirqd/21-117   [021] d.s.  4018.001992: myprobe: (aio_complete_rw+0x0/0x148) x0=0xffff802f2595ad80 res=0x12a000 res2=0x0
      
      We always have to assign bio->bi_status to dio->bio.bi_status because we
      will only check dio->bio.bi_status when we return the whole IO to
      the upper layer.
      
      Fixes: 542ff7bf ("block: new direct I/O implementation")
      Cc: stable@vger.kernel.org
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJason Yan <yanaijie@huawei.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a89afe58
  3. Apr 11, 2019
  4. Apr 08, 2019
  5. Apr 06, 2019
    • Kirill Smelkov's avatar
      fs: stream_open - opener for stream-like files so that read and write can run... · 10dce8af
      Kirill Smelkov authored
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      
      This caused regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such regression for particular case of
      /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However even though 2006'th version of Linus's patch was adding f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      is doing so irregardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on passed ppos at all. And for
      those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write could be never made to go until
      read is done, and read is waiting for some, potentially external, event,
      for potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above another regression caused by f_pos
      locking is that now FUSE filesystems that implement open with
      FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
      stream-like files - for the same reason as above e.g. read can deadlock
      write locking on file.f_pos in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. With semantic patch search and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not
         depend on ppos and where there is no other methods in file_operations
         which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         steam_open if that bit is present in filesystem open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
      
      
         so if we would do such a change it will break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26 first appeared).
      
         This will allow to patch OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      unknown reason despite file being opened with nonseekable_open (e.g.
      drivers/input/mousedev.c)
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Signed-off-by: default avatarKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      10dce8af
    • Mike Kravetz's avatar
      hugetlbfs: fix memory leak for resv_map · 58b6e5e8
      Mike Kravetz authored
      When mknod is used to create a block special file in hugetlbfs, it will
      allocate an inode and kmalloc a 'struct resv_map' via resv_map_alloc().
      inode->i_mapping->private_data will point the newly allocated resv_map.
      However, when the device special file is opened bd_acquire() will set
      inode->i_mapping to bd_inode->i_mapping.  Thus the pointer to the
      allocated resv_map is lost and the structure is leaked.
      
      Programs to reproduce:
              mount -t hugetlbfs nodev hugetlbfs
              mknod hugetlbfs/dev b 0 0
              exec 30<> hugetlbfs/dev
              umount hugetlbfs/
      
      resv_map structures are only needed for inodes which can have associated
      page allocations.  To fix the leak, only allocate resv_map for those
      inodes which could possibly be associated with page allocations.
      
      Link: http://lkml.kernel.org/r/20190401213101.16476-1-mike.kravetz@oracle.com
      
      
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reported-by: default avatarYufen Yu <yuyufen@huawei.com>
      Suggested-by: default avatarYufen Yu <yuyufen@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      58b6e5e8
  6. Apr 05, 2019
  7. Apr 04, 2019
  8. Apr 03, 2019
    • Dan Carpenter's avatar
      aio: Fix an error code in __io_submit_one() · 18bfb9c6
      Dan Carpenter authored
      
      This accidentally returns the wrong variable.  The "req->ki_eventfd"
      pointer is NULL so this return success.
      
      Fixes: 7316b49c ("aio: move sanity checks and request allocation to io_submit_one()")
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      18bfb9c6
    • Jens Axboe's avatar
      io_uring: fix double free in case of fileset regitration failure · 25adf50f
      Jens Axboe authored
      
      Will Deacon reported the following KASAN complaint:
      
      [  149.890370] ==================================================================
      [  149.891266] BUG: KASAN: double-free or invalid-free in io_sqe_files_unregister+0xa8/0x140
      [  149.892218]
      [  149.892411] CPU: 113 PID: 3974 Comm: io_uring_regist Tainted: G    B             5.1.0-rc3-00012-g40b114779944 #3
      [  149.893623] Hardware name: linux,dummy-virt (DT)
      [  149.894169] Call trace:
      [  149.894539]  dump_backtrace+0x0/0x228
      [  149.895172]  show_stack+0x14/0x20
      [  149.895747]  dump_stack+0xe8/0x124
      [  149.896335]  print_address_description+0x60/0x258
      [  149.897148]  kasan_report_invalid_free+0x78/0xb8
      [  149.897936]  __kasan_slab_free+0x1fc/0x228
      [  149.898641]  kasan_slab_free+0x10/0x18
      [  149.899283]  kfree+0x70/0x1f8
      [  149.899798]  io_sqe_files_unregister+0xa8/0x140
      [  149.900574]  io_ring_ctx_wait_and_kill+0x190/0x3c0
      [  149.901402]  io_uring_release+0x2c/0x48
      [  149.902068]  __fput+0x18c/0x510
      [  149.902612]  ____fput+0xc/0x18
      [  149.903146]  task_work_run+0xf0/0x148
      [  149.903778]  do_notify_resume+0x554/0x748
      [  149.904467]  work_pending+0x8/0x10
      [  149.905060]
      [  149.905331] Allocated by task 3974:
      [  149.905934]  __kasan_kmalloc.isra.0.part.1+0x48/0xf8
      [  149.906786]  __kasan_kmalloc.isra.0+0xb8/0xd8
      [  149.907531]  kasan_kmalloc+0xc/0x18
      [  149.908134]  __kmalloc+0x168/0x248
      [  149.908724]  __arm64_sys_io_uring_register+0x2b8/0x15a8
      [  149.909622]  el0_svc_common+0x100/0x258
      [  149.910281]  el0_svc_handler+0x48/0xc0
      [  149.910928]  el0_svc+0x8/0xc
      [  149.911425]
      [  149.911696] Freed by task 3974:
      [  149.912242]  __kasan_slab_free+0x114/0x228
      [  149.912955]  kasan_slab_free+0x10/0x18
      [  149.913602]  kfree+0x70/0x1f8
      [  149.914118]  __arm64_sys_io_uring_register+0xc2c/0x15a8
      [  149.915009]  el0_svc_common+0x100/0x258
      [  149.915670]  el0_svc_handler+0x48/0xc0
      [  149.916317]  el0_svc+0x8/0xc
      [  149.916817]
      [  149.917101] The buggy address belongs to the object at ffff8004ce07ed00
      [  149.917101]  which belongs to the cache kmalloc-128 of size 128
      [  149.919197] The buggy address is located 0 bytes inside of
      [  149.919197]  128-byte region [ffff8004ce07ed00, ffff8004ce07ed80)
      [  149.921142] The buggy address belongs to the page:
      [  149.921953] page:ffff7e0013381f00 count:1 mapcount:0 mapping:ffff800503417c00 index:0x0 compound_mapcount: 0
      [  149.923595] flags: 0x1ffff00000010200(slab|head)
      [  149.924388] raw: 1ffff00000010200 dead000000000100 dead000000000200 ffff800503417c00
      [  149.925706] raw: 0000000000000000 0000000080400040 00000001ffffffff 0000000000000000
      [  149.927011] page dumped because: kasan: bad access detected
      [  149.927956]
      [  149.928224] Memory state around the buggy address:
      [  149.929054]  ffff8004ce07ec00: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
      [  149.930274]  ffff8004ce07ec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  149.931494] >ffff8004ce07ed00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  149.932712]                    ^
      [  149.933281]  ffff8004ce07ed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  149.934508]  ffff8004ce07ee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  149.935725] ==================================================================
      
      which is due to a failure in registrering a fileset. This frees the
      ctx->user_files pointer, but doesn't clear it. When the io_uring
      instance is later freed through the normal channels, we free this
      pointer again. At this point it's invalid.
      
      Ensure we clear the pointer when we free it for the error case.
      
      Reported-by: default avatarWill Deacon <will.deacon@arm.com>
      Tested-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      25adf50f
  9. Apr 01, 2019
  10. Mar 29, 2019
    • YueHaibing's avatar
      fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links · 23da9588
      YueHaibing authored
      Syzkaller reports:
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN PTI
      CPU: 1 PID: 5373 Comm: syz-executor.0 Not tainted 5.0.0-rc8+ #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      RIP: 0010:put_links+0x101/0x440 fs/proc/proc_sysctl.c:1599
      Code: 00 0f 85 3a 03 00 00 48 8b 43 38 48 89 44 24 20 48 83 c0 38 48 89 c2 48 89 44 24 28 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 fe 02 00 00 48 8b 74 24 20 48 c7 c7 60 2a 9d 91
      RSP: 0018:ffff8881d828f238 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: ffff8881e01b1140 RCX: ffffffff8ee98267
      RDX: 0000000000000007 RSI: ffffc90001479000 RDI: ffff8881e01b1178
      RBP: dffffc0000000000 R08: ffffed103ee27259 R09: ffffed103ee27259
      R10: 0000000000000001 R11: ffffed103ee27258 R12: fffffffffffffff4
      R13: 0000000000000006 R14: ffff8881f59838c0 R15: dffffc0000000000
      FS:  00007f072254f700(0000) GS:ffff8881f7100000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fff8b286668 CR3: 00000001f0542002 CR4: 00000000007606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       drop_sysctl_table+0x152/0x9f0 fs/proc/proc_sysctl.c:1629
       get_subdir fs/proc/proc_sysctl.c:1022 [inline]
       __register_sysctl_table+0xd65/0x1090 fs/proc/proc_sysctl.c:1335
       br_netfilter_init+0xbc/0x1000 [br_netfilter]
       do_one_initcall+0xfa/0x5ca init/main.c:887
       do_init_module+0x204/0x5f6 kernel/module.c:3460
       load_module+0x66b2/0x8570 kernel/module.c:3808
       __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
       do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x462e99
      Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f072254ec58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
      RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99
      RDX: 0000000000000000 RSI: 0000000020000280 RDI: 0000000000000003
      RBP: 00007f072254ec70 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f072254f6bc
      R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004
      Modules linked in: br_netfilter(+) dvb_usb_dibusb_mc_common dib3000mc dibx000_common dvb_usb_dibusb_common dvb_usb_dw2102 dvb_usb classmate_laptop palmas_regulator cn videobuf2_v4l2 v4l2_common snd_soc_bd28623 mptbase snd_usb_usx2y snd_usbmidi_lib snd_rawmidi wmi libnvdimm lockd sunrpc grace rc_kworld_pc150u rc_core rtc_da9063 sha1_ssse3 i2c_cros_ec_tunnel adxl34x_spi adxl34x nfnetlink lib80211 i5500_temp dvb_as102 dvb_core videobuf2_common videodev media videobuf2_vmalloc videobuf2_memops udc_core lnbp22 leds_lp3952 hid_roccat_ryos s1d13xxxfb mtd vport_geneve openvswitch nf_conncount nf_nat_ipv6 nsh geneve udp_tunnel ip6_udp_tunnel snd_soc_mt6351 sis_agp phylink snd_soc_adau1761_spi snd_soc_adau1761 snd_soc_adau17x1 snd_soc_core snd_pcm_dmaengine ac97_bus snd_compress snd_soc_adau_utils snd_soc_sigmadsp_regmap snd_soc_sigmadsp raid_class hid_roccat_konepure hid_roccat_common hid_roccat c2port_duramar2150 core mdio_bcm_unimac iptable_security iptable_raw iptable_mangle
       iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim devlink vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel joydev mousedev ide_pci_generic piix aesni_intel aes_x86_64 ide_core crypto_simd atkbd cryptd glue_helper serio_raw ata_generic pata_acpi i2c_piix4 floppy sch_fq_codel ip_tables x_tables ipv6 [last unloaded: lm73]
      Dumping ftrace buffer:
         (ftrace buffer empty)
      ---[ end trace 770020de38961fd0 ]---
      
      A new dir entry can be created in get_subdir and its 'header->parent' is
      set to NULL.  Only after insert_header success, it will be set to 'dir',
      otherwise 'header->parent' is set to NULL and drop_sysctl_table is called.
      However in err handling path of get_subdir, drop_sysctl_table also be
      called on 'new->header' regardless its value of parent pointer.  Then
      put_links is called, which triggers NULL-ptr deref when access member of
      header->parent.
      
      In fact we have multiple error paths which call drop_sysctl_table() there,
      upon failure on insert_links() we also call drop_sysctl_table().And even
      in the successful case on __register_sysctl_table() we still always call
      drop_sysctl_table().This patch fix it.
      
      Link: http://lkml.kernel.org/r/20190314085527.13244-1-yuehaibing@huawei.com
      
      
      Fixes: 0e47c99d ("sysctl: Replace root_list with links between sysctl_table_sets")
      Signed-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Acked-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: <stable@vger.kernel.org>    [3.4+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      23da9588
    • Randy Dunlap's avatar
      fs: fs_parser: fix printk format warning · 26203278
      Randy Dunlap authored
      Fix printk format warning (seen on i386 builds) by using ptrdiff format
      specifier (%t):
      
        fs/fs_parser.c:413:6: warning: format `%lu' expects argument of type `long unsigned int', but argument 3 has type `int' [-Wformat=]
      
      Link: http://lkml.kernel.org/r/19432668-ffd3-fbb2-af4f-1c8e48f6cc81@infradead.org
      
      
      Signed-off-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Acked-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      26203278
    • YueHaibing's avatar
      fs/proc/kcore.c: make kcore_modules static · eebf3648
      YueHaibing authored
      Fix sparse warning:
      
        fs/proc/kcore.c:591:19: warning:
         symbol 'kcore_modules' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20190320135417.13272-1-yuehaibing@huawei.com
      
      
      Signed-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Acked-by: default avatarMukesh Ojha <mojha@codeaurora.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eebf3648
    • Darrick J. Wong's avatar
      ocfs2: fix inode bh swapping mixup in ocfs2_reflink_inodes_lock · e6a9467e
      Darrick J. Wong authored
      ocfs2_reflink_inodes_lock() can swap the inode1/inode2 variables so that
      we always grab cluster locks in order of increasing inode number.
      
      Unfortunately, we forget to swap the inode record buffer head pointers
      when we've done this, which leads to incorrect bookkeepping when we're
      trying to make the two inodes have the same refcount tree.
      
      This has the effect of causing filesystem shutdowns if you're trying to
      reflink data from inode 100 into inode 97, where inode 100 already has a
      refcount tree attached and inode 97 doesn't.  The reflink code decides
      to copy the refcount tree pointer from 100 to 97, but uses inode 97's
      inode record to open the tree root (which it doesn't have) and blows up.
      This issue causes filesystem shutdowns and metadata corruption!
      
      Link: http://lkml.kernel.org/r/20190312214910.GK20533@magnolia
      
      
      Fixes: 29ac8e85 ("ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features")
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarJoseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e6a9467e
    • Tetsuo Handa's avatar
      fs/open.c: allow opening only regular files during execve() · 73601ea5
      Tetsuo Handa authored
      syzbot is hitting lockdep warning [1] due to trying to open a fifo
      during an execve() operation.  But we don't need to open non regular
      files during an execve() operation, for all files which we will need are
      the executable file itself and the interpreter programs like /bin/sh and
      ld-linux.so.2 .
      
      Since the manpage for execve(2) says that execve() returns EACCES when
      the file or a script interpreter is not a regular file, and the manpage
      for uselib(2) says that uselib() can return EACCES, and we use
      FMODE_EXEC when opening for execve()/uselib(), we can bail out if a non
      regular file is requested with FMODE_EXEC set.
      
      Since this deadlock followed by khungtaskd warnings is trivially
      reproducible by a local unprivileged user, and syzbot's frequent crash
      due to this deadlock defers finding other bugs, let's workaround this
      deadlock until we get a chance to find a better solution.
      
      [1] https://syzkaller.appspot.com/bug?id=b5095bfec44ec84213bac54742a82483aad578ce
      
      Link: http://lkml.kernel.org/r/1552044017-7890-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      
      
      Reported-by: default avatarsyzbot <syzbot+e93a80c1bb7c5c56e522461c149f8bf55eab1b2b@syzkaller.appspotmail.com>
      Fixes: 8924feff ("splice: lift pipe_lock out of splice_to_pipe()")
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: <stable@vger.kernel.org>	[4.9+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      73601ea5
  11. Mar 28, 2019
    • Filipe Manana's avatar
      Btrfs: do not allow trimming when a fs is mounted with the nologreplay option · f35f06c3
      Filipe Manana authored
      
      Whan a filesystem is mounted with the nologreplay mount option, which
      requires it to be mounted in RO mode as well, we can not allow discard on
      free space inside block groups, because log trees refer to extents that
      are not pinned in a block group's free space cache (pinning the extents is
      precisely the first phase of replaying a log tree).
      
      So do not allow the fitrim ioctl to do anything when the filesystem is
      mounted with the nologreplay option, because later it can be mounted RW
      without that option, which causes log replay to happen and result in
      either a failure to replay the log trees (leading to a mount failure), a
      crash or some silent corruption.
      
      Reported-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Fixes: 96da0919 ("btrfs: Introduce new mount option to disable tree log replay")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f35f06c3
    • David Howells's avatar
      afs: Fix StoreData op marshalling · 8c7ae38d
      David Howells authored
      
      The marshalling of AFS.StoreData, AFS.StoreData64 and YFS.StoreData64 calls
      generated by ->setattr() ops for the purpose of expanding a file is
      incorrect due to older documentation incorrectly describing the way the RPC
      'FileLength' parameter is meant to work.
      
      The older documentation says that this is the length the file is meant to
      end up at the end of the operation; however, it was never implemented this
      way in any of the servers, but rather the file is truncated down to this
      before the write operation is effected, and never expanded to it (and,
      indeed, it was renamed to 'TruncPos' in 2014).
      
      Fix this by setting the position parameter to the new file length and doing
      a zero-lengh write there.
      
      The bug causes Xwayland to SIGBUS due to unexpected non-expansion of a file
      it then mmaps.  This can be tested by giving the following test program a
      filename in an AFS directory:
      
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <fcntl.h>
      	#include <sys/mman.h>
      	int main(int argc, char *argv[])
      	{
      		char *p;
      		int fd;
      		if (argc != 2) {
      			fprintf(stderr,
      				"Format: test-trunc-mmap <file>\n");
      			exit(2);
      		}
      		fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC);
      		if (fd < 0) {
      			perror(argv[1]);
      			exit(1);
      		}
      		if (ftruncate(fd, 0x140008) == -1) {
      			perror("ftruncate");
      			exit(1);
      		}
      		p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
      			 MAP_SHARED, fd, 0);
      		if (p == MAP_FAILED) {
      			perror("mmap");
      			exit(1);
      		}
      		p[0] = 'a';
      		if (munmap(p, 4096) < 0) {
      			perror("munmap");
      			exit(1);
      		}
      		if (close(fd) < 0) {
      			perror("close");
      			exit(1);
      		}
      		exit(0);
      	}
      
      Fixes: 31143d5d ("AFS: implement basic file write support")
      Reported-by: default avatarJonathan Billings <jsbillin@umich.edu>
      Tested-by: default avatarJonathan Billings <jsbillin@umich.edu>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c7ae38d
  12. Mar 27, 2019
  13. Mar 26, 2019
    • Brian Foster's avatar
      xfs: serialize unaligned dio writes against all other dio writes · 2032a8a2
      Brian Foster authored
      
      XFS applies more strict serialization constraints to unaligned
      direct writes to accommodate things like direct I/O layer zeroing,
      unwritten extent conversion, etc. Unaligned submissions acquire the
      exclusive iolock and wait for in-flight dio to complete to ensure
      multiple submissions do not race on the same block and cause data
      corruption.
      
      This generally works in the case of an aligned dio followed by an
      unaligned dio, but the serialization is lost if I/Os occur in the
      opposite order. If an unaligned write is submitted first and
      immediately followed by an overlapping, aligned write, the latter
      submits without the typical unaligned serialization barriers because
      there is no indication of an unaligned dio still in-flight. This can
      lead to unpredictable results.
      
      To provide proper unaligned dio serialization, require that such
      direct writes are always the only dio allowed in-flight at one time
      for a particular inode. We already acquire the exclusive iolock and
      drain pending dio before submitting the unaligned dio. Wait once
      more after the dio submission to hold the iolock across the I/O and
      prevent further submissions until the unaligned I/O completes. This
      is heavy handed, but consistent with the current pre-submission
      serialization for unaligned direct writes.
      
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      2032a8a2
  14. Mar 25, 2019
  15. Mar 23, 2019
    • Darrick J. Wong's avatar
      ext4: prohibit fstrim in norecovery mode · 18915b58
      Darrick J. Wong authored
      
      The ext4 fstrim implementation uses the block bitmaps to find free space
      that can be discarded.  If we haven't replayed the journal, the bitmaps
      will be stale and we absolutely *cannot* use stale metadata to zap the
      underlying storage.
      
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      18915b58
    • Trond Myklebust's avatar
      pNFS/flexfiles: Fix layoutstats handling during read failovers · 166bd5b8
      Trond Myklebust authored
      
      During a read failover, we may end up changing the value of
      the pgio_mirror_idx, so make sure that we record the layout
      stats before that update.
      
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      166bd5b8
    • Trond Myklebust's avatar
      NFS: Fix a typo in nfs_init_timeout_values() · 5a698243
      Trond Myklebust authored
      
      Specifying a retrans=0 mount parameter to a NFS/TCP mount, is
      inadvertently causing the NFS client to rewrite any specified
      timeout parameter to the default of 60 seconds.
      
      Fixes: a956beda ("NFS: Allow the mount option retrans=0")
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      5a698243
    • zhangyi (F)'s avatar
      ext4: cleanup bh release code in ext4_ind_remove_space() · 5e86bdda
      zhangyi (F) authored
      
      Currently, we are releasing the indirect buffer where we are done with
      it in ext4_ind_remove_space(), so we can see the brelse() and
      BUFFER_TRACE() everywhere.  It seems fragile and hard to read, and we
      may probably forget to release the buffer some day.  This patch cleans
      up the code by putting of the code which releases the buffers to the
      end of the function.
      
      Signed-off-by: default avatarzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      5e86bdda
    • zhangyi (F)'s avatar
      ext4: brelse all indirect buffer in ext4_ind_remove_space() · 674a2b27
      zhangyi (F) authored
      
      All indirect buffers get by ext4_find_shared() should be released no
      mater the branch should be freed or not. But now, we forget to release
      the lower depth indirect buffers when removing space from the same
      higher depth indirect block. It will lead to buffer leak and futher
      more, it may lead to quota information corruption when using old quota,
      consider the following case.
      
       - Create and mount an empty ext4 filesystem without extent and quota
         features,
       - quotacheck and enable the user & group quota,
       - Create some files and write some data to them, and then punch hole
         to some files of them, it may trigger the buffer leak problem
         mentioned above.
       - Disable quota and run quotacheck again, it will create two new
         aquota files and write the checked quota information to them, which
         probably may reuse the freed indirect block(the buffer and page
         cache was not freed) as data block.
       - Enable quota again, it will invoke
         vfs_load_quota_inode()->invalidate_bdev() to try to clean unused
         buffers and pagecache. Unfortunately, because of the buffer of quota
         data block is still referenced, quota code cannot read the up to date
         quota info from the device and lead to quota information corruption.
      
      This problem can be reproduced by xfstests generic/231 on ext3 file
      system or ext4 file system without extent and quota features.
      
      This patch fix this problem by releasing the missing indirect buffers,
      in ext4_ind_remove_space().
      
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      674a2b27
    • Kairui Song's avatar
      x86/gart: Exclude GART aperture from kcore · ffc8599a
      Kairui Song authored
      
      On machines where the GART aperture is mapped over physical RAM,
      /proc/kcore contains the GART aperture range. Accessing the GART range via
      /proc/kcore results in a kernel crash.
      
      vmcore used to have the same issue, until it was fixed with commit
      2a3e83c6 ("x86/gart: Exclude GART aperture from vmcore")', leveraging
      existing hook infrastructure in vmcore to let /proc/vmcore return zeroes
      when attempting to read the aperture region, and so it won't read from the
      actual memory.
      
      Apply the same workaround for kcore. First implement the same hook
      infrastructure for kcore, then reuse the hook functions introduced in the
      previous vmcore fix. Just with some minor adjustment, rename some functions
      for more general usage, and simplify the hook infrastructure a bit as there
      is no module usage yet.
      
      Suggested-by: default avatarBaoquan He <bhe@redhat.com>
      Signed-off-by: default avatarKairui Song <kasong@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJiri Bohac <jbohac@suse.cz>
      Acked-by: default avatarBaoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Dave Young <dyoung@redhat.com>
      Link: https://lkml.kernel.org/r/20190308030508.13548-1-kasong@redhat.com
      
      ffc8599a
Loading