Skip to content
Snippets Groups Projects
  1. Apr 10, 2019
    • Ming Lei's avatar
      blk-mq: introduce blk_mq_complete_request_sync() · 1b8f21b7
      Ming Lei authored
      
      In NVMe's error handler, follows the typical steps of tearing down
      hardware for recovering controller:
      
      1) stop blk_mq hw queues
      2) stop the real hw queues
      3) cancel in-flight requests via
      	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
      cancel_request():
      	mark the request as abort
      	blk_mq_complete_request(req);
      4) destroy real hw queues
      
      However, there may be race between #3 and #4, because blk_mq_complete_request()
      may run q->mq_ops->complete(rq) remotelly and asynchronously, and
      ->complete(rq) may be run after #4.
      
      This patch introduces blk_mq_complete_request_sync() for fixing the
      above race.
      
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: linux-nvme@lists.infradead.org
      Reviewed-by: default avatarKeith Busch <keith.busch@intel.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1b8f21b7
  2. Apr 08, 2019
  3. Apr 06, 2019
    • Kirill Smelkov's avatar
      fs: stream_open - opener for stream-like files so that read and write can run... · 10dce8af
      Kirill Smelkov authored
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      
      This caused regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such regression for particular case of
      /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However even though 2006'th version of Linus's patch was adding f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      is doing so irregardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on passed ppos at all. And for
      those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write could be never made to go until
      read is done, and read is waiting for some, potentially external, event,
      for potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above another regression caused by f_pos
      locking is that now FUSE filesystems that implement open with
      FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
      stream-like files - for the same reason as above e.g. read can deadlock
      write locking on file.f_pos in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. With semantic patch search and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not
         depend on ppos and where there is no other methods in file_operations
         which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         steam_open if that bit is present in filesystem open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
      
      
         so if we would do such a change it will break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26 first appeared).
      
         This will allow to patch OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      unknown reason despite file being opened with nonseekable_open (e.g.
      drivers/input/mousedev.c)
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Signed-off-by: default avatarKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      10dce8af
    • Greg Thelen's avatar
      mm: writeback: use exact memcg dirty counts · 0b3d6e6f
      Greg Thelen authored
      Since commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting") memcg dirty and writeback counters are managed
      as:
      
       1) per-memcg per-cpu values in range of [-32..32]
      
       2) per-memcg atomic counter
      
      When a per-cpu counter cannot fit in [-32..32] it's flushed to the
      atomic.  Stat readers only check the atomic.  Thus readers such as
      balance_dirty_pages() may see a nontrivial error margin: 32 pages per
      cpu.
      
      Assuming 100 cpus:
         4k x86 page_size:  13 MiB error per memcg
        64k ppc page_size: 200 MiB error per memcg
      
      Considering that dirty+writeback are used together for some decisions the
      errors double.
      
      This inaccuracy can lead to undeserved oom kills.  One nasty case is
      when all per-cpu counters hold positive values offsetting an atomic
      negative value (i.e.  per_cpu[*]=32, atomic=n_cpu*-32).
      balance_dirty_pages() only consults the atomic and does not consider
      throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
      13..200 MiB range then there's absolutely no dirty throttling, which
      burdens vmscan with only dirty+writeback pages thus resorting to oom
      kill.
      
      It could be argued that tiny containers are not supported, but it's more
      subtle.  It's the amount the space available for file lru that matters.
      If a container has memory.max-200MiB of non reclaimable memory, then it
      will also suffer such oom kills on a 100 cpu machine.
      
      The following test reliably ooms without this patch.  This patch avoids
      oom kills.
      
        $ cat test
        mount -t cgroup2 none /dev/cgroup
        cd /dev/cgroup
        echo +io +memory > cgroup.subtree_control
        mkdir test
        cd test
        echo 10M > memory.max
        (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
        (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
      
        $ cat memcg-writeback-stress.c
        /*
         * Dirty pages from all but one cpu.
         * Clean pages from the non dirtying cpu.
         * This is to stress per cpu counter imbalance.
         * On a 100 cpu machine:
         * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
         * - per memcg atomic is -99*32 pages
         * - thus the complete dirty limit: sum of all counters 0
         * - balance_dirty_pages() only sees atomic count -99*32 pages, which
         *   it max()s to 0.
         * - So a workload can dirty -99*32 pages before balance_dirty_pages()
         *   cares.
         */
        #define _GNU_SOURCE
        #include <err.h>
        #include <fcntl.h>
        #include <sched.h>
        #include <stdlib.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <sys/sysinfo.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        static char *buf;
        static int bufSize;
      
        static void set_affinity(int cpu)
        {
        	cpu_set_t affinity;
      
        	CPU_ZERO(&affinity);
        	CPU_SET(cpu, &affinity);
        	if (sched_setaffinity(0, sizeof(affinity), &affinity))
        		err(1, "sched_setaffinity");
        }
      
        static void dirty_on(int output_fd, int cpu)
        {
        	int i, wrote;
      
        	set_affinity(cpu);
        	for (i = 0; i < 32; i++) {
        		for (wrote = 0; wrote < bufSize; ) {
        			int ret = write(output_fd, buf+wrote, bufSize-wrote);
        			if (ret == -1)
        				err(1, "write");
        			wrote += ret;
        		}
        	}
        }
      
        int main(int argc, char **argv)
        {
        	int cpu, flush_cpu = 1, output_fd;
        	const char *output;
      
        	if (argc != 2)
        		errx(1, "usage: output_file");
      
        	output = argv[1];
        	bufSize = getpagesize();
        	buf = malloc(getpagesize());
        	if (buf == NULL)
        		errx(1, "malloc failed");
      
        	output_fd = open(output, O_CREAT|O_RDWR);
        	if (output_fd == -1)
        		err(1, "open(%s)", output);
      
        	for (cpu = 0; cpu < get_nprocs(); cpu++) {
        		if (cpu != flush_cpu)
        			dirty_on(output_fd, cpu);
        	}
      
        	set_affinity(flush_cpu);
        	if (fsync(output_fd))
        		err(1, "fsync(%s)", output);
        	if (close(output_fd))
        		err(1, "close(%s)", output);
        	free(buf);
        }
      
      Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
      collect exact per memcg counters.  This avoids the aforementioned oom
      kills.
      
      This does not affect the overhead of memory.stat, which still reads the
      single atomic counter.
      
      Why not use percpu_counter? memcg already handles cpus going offline, so
      no need for that overhead from percpu_counter.  And the percpu_counter
      spinlocks are more heavyweight than is required.
      
      It probably also makes sense to use exact dirty and writeback counters
      in memcg oom reports.  But that is saved for later.
      
      Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
      
      
      Signed-off-by: default avatarGreg Thelen <gthelen@google.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.16+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0b3d6e6f
    • Jann Horn's avatar
      mm: fix vm_fault_t cast in VM_FAULT_GET_HINDEX() · fcae96ff
      Jann Horn authored
      Symmetrically to VM_FAULT_SET_HINDEX(), we need a force-cast in
      VM_FAULT_GET_HINDEX() to tell sparse that this is intentional.
      
      Sparse complains about the current code when building a kernel with
      CONFIG_MEMORY_FAILURE:
      
        arch/x86/mm/fault.c:1058:53: warning: restricted vm_fault_t degrades to integer
      
      Link: http://lkml.kernel.org/r/20190327204117.35215-1-jannh@google.com
      
      
      Fixes: 3d353901 ("mm: create the new vm_fault_t type")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fcae96ff
    • Arnd Bergmann's avatar
      include/linux/bitrev.h: fix constant bitrev · 6147e136
      Arnd Bergmann authored
      clang points out with hundreds of warnings that the bitrev macros have a
      problem with constant input:
      
        drivers/hwmon/sht15.c:187:11: error: variable '__x' is uninitialized when used within its own initialization
              [-Werror,-Wuninitialized]
                u8 crc = bitrev8(data->val_status & 0x0F);
                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        include/linux/bitrev.h:102:21: note: expanded from macro 'bitrev8'
                __constant_bitrev8(__x) :                       \
                ~~~~~~~~~~~~~~~~~~~^~~~
        include/linux/bitrev.h:67:11: note: expanded from macro '__constant_bitrev8'
                u8 __x = x;                     \
                   ~~~   ^
      
      Both the bitrev and the __constant_bitrev macros use an internal
      variable named __x, which goes horribly wrong when passing one to the
      other.
      
      The obvious fix is to rename one of the variables, so this adds an extra
      '_'.
      
      It seems we got away with this because
      
       - there are only a few drivers using bitrev macros
      
       - usually there are no constant arguments to those
      
       - when they are constant, they tend to be either 0 or (unsigned)-1
         (drivers/isdn/i4l/isdnhdlc.o, drivers/iio/amplifiers/ad8366.c) and
         give the correct result by pure chance.
      
      In fact, the only driver that I could find that gets different results
      with this is drivers/net/wan/slic_ds26522.c, which in turn is a driver
      for fairly rare hardware (adding the maintainer to Cc for testing).
      
      Link: http://lkml.kernel.org/r/20190322140503.123580-1-arnd@arndb.de
      
      
      Fixes: 556d2f05 ("ARM: 8187/1: add CONFIG_HAVE_ARCH_BITREVERSE to support rbit instruction")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarNick Desaulniers <ndesaulniers@google.com>
      Cc: Zhao Qiang <qiang.zhao@nxp.com>
      Cc: Yalin Wang <yalin.wang@sonymobile.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6147e136
    • Nick Desaulniers's avatar
      lib/string.c: implement a basic bcmp · 5f074f3e
      Nick Desaulniers authored
      A recent optimization in Clang (r355672) lowers comparisons of the
      return value of memcmp against zero to comparisons of the return value
      of bcmp against zero.  This helps some platforms that implement bcmp
      more efficiently than memcmp.  glibc simply aliases bcmp to memcmp, but
      an optimized implementation is in the works.
      
      This results in linkage failures for all targets with Clang due to the
      undefined symbol.  For now, just implement bcmp as a tailcail to memcmp
      to unbreak the build.  This routine can be further optimized in the
      future.
      
      Other ideas discussed:
      
       * A weak alias was discussed, but breaks for architectures that define
         their own implementations of memcmp since aliases to declarations are
         not permitted (only definitions). Arch-specific memcmp
         implementations typically declare memcmp in C headers, but implement
         them in assembly.
      
       * -ffreestanding also is used sporadically throughout the kernel.
      
       * -fno-builtin-bcmp doesn't work when doing LTO.
      
      Link: https://bugs.llvm.org/show_bug.cgi?id=41035
      Link: https://code.woboq.org/userspace/glibc/string/memcmp.c.html#bcmp
      Link: https://github.com/llvm/llvm-project/commit/8e16d73346f8091461319a7dfc4ddd18eedcff13
      Link: https://github.com/ClangBuiltLinux/linux/issues/416
      Link: http://lkml.kernel.org/r/20190313211335.165605-1-ndesaulniers@google.com
      
      
      Signed-off-by: default avatarNick Desaulniers <ndesaulniers@google.com>
      Reported-by: default avatarNathan Chancellor <natechancellor@gmail.com>
      Reported-by: default avatarAdhemerval Zanella <adhemerval.zanella@linaro.org>
      Suggested-by: default avatarArnd Bergmann <arnd@arndb.de>
      Suggested-by: default avatarJames Y Knight <jyknight@google.com>
      Suggested-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      Suggested-by: default avatarNathan Chancellor <natechancellor@gmail.com>
      Suggested-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Reviewed-by: default avatarNathan Chancellor <natechancellor@gmail.com>
      Tested-by: default avatarNathan Chancellor <natechancellor@gmail.com>
      Reviewed-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      Reviewed-by: default avatarAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5f074f3e
  4. Apr 05, 2019
    • Steven Rostedt (VMware)'s avatar
      syscalls: Remove start and number from syscall_set_arguments() args · 32d92586
      Steven Rostedt (VMware) authored
      After removing the start and count arguments of syscall_get_arguments() it
      seems reasonable to remove them from syscall_set_arguments(). Note, as of
      today, there are no users of syscall_set_arguments(). But we are told that
      there will be soon. But for now, at least make it consistent with
      syscall_get_arguments().
      
      Link: http://lkml.kernel.org/r/20190327222014.GA32540@altlinux.org
      
      
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Dave Martin <dave.martin@arm.com>
      Cc: "Dmitry V. Levin" <ldv@altlinux.org>
      Cc: x86@kernel.org
      Cc: linux-snps-arc@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: uclinux-h8-devel@lists.sourceforge.jp
      Cc: linux-hexagon@vger.kernel.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-mips@vger.kernel.org
      Cc: nios2-dev@lists.rocketboards.org
      Cc: openrisc@lists.librecores.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-riscv@lists.infradead.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-um@lists.infradead.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes
      Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86
      Reviewed-by: default avatarDmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      32d92586
    • Steven Rostedt (Red Hat)'s avatar
      syscalls: Remove start and number from syscall_get_arguments() args · b35f549d
      Steven Rostedt (Red Hat) authored
      At Linux Plumbers, Andy Lutomirski approached me and pointed out that the
      function call syscall_get_arguments() implemented in x86 was horribly
      written and not optimized for the standard case of passing in 0 and 6 for
      the starting index and the number of system calls to get. When looking at
      all the users of this function, I discovered that all instances pass in only
      0 and 6 for these arguments. Instead of having this function handle
      different cases that are never used, simply rewrite it to return the first 6
      arguments of a system call.
      
      This should help out the performance of tracing system calls by ptrace,
      ftrace and perf.
      
      Link: http://lkml.kernel.org/r/20161107213233.754809394@goodmis.org
      
      
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Dave Martin <dave.martin@arm.com>
      Cc: "Dmitry V. Levin" <ldv@altlinux.org>
      Cc: x86@kernel.org
      Cc: linux-snps-arc@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: uclinux-h8-devel@lists.sourceforge.jp
      Cc: linux-hexagon@vger.kernel.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-mips@vger.kernel.org
      Cc: nios2-dev@lists.rocketboards.org
      Cc: openrisc@lists.librecores.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-riscv@lists.infradead.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-um@lists.infradead.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts
      Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes
      Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86
      Reviewed-by: default avatarDmitry V. Levin <ldv@altlinux.org>
      Reported-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      b35f549d
  5. Apr 04, 2019
    • Steven Rostedt (Red Hat)'s avatar
      ptrace: Remove maxargs from task_current_syscall() · 631b7aba
      Steven Rostedt (Red Hat) authored
      task_current_syscall() has a single user that passes in 6 for maxargs, which
      is the maximum arguments that can be used to get system calls from
      syscall_get_arguments(). Instead of passing in a number of arguments to
      grab, just get 6 arguments. The args argument even specifies that it's an
      array of 6 items.
      
      This will also allow changing syscall_get_arguments() to not get a variable
      number of arguments, but always grab 6.
      
      Linus also suggested not passing in a bunch of arguments to
      task_current_syscall() but to instead pass in a pointer to a structure, and
      just fill the structure. struct seccomp_data has almost all the parameters
      that is needed except for the stack pointer (sp). As seccomp_data is part of
      uapi, and I'm afraid to change it, a new structure was created
      "syscall_info", which includes seccomp_data and adds the "sp" field.
      
      Link: http://lkml.kernel.org/r/20161107213233.466776454@goodmis.org
      
      
      
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      631b7aba
  6. Apr 01, 2019
  7. Mar 29, 2019
  8. Mar 28, 2019
    • Masahiro Yamada's avatar
      KVM: export <linux/kvm_para.h> and <asm/kvm_para.h> iif KVM is supported · 3d9683cf
      Masahiro Yamada authored
      
      I do not see any consistency about headers_install of <linux/kvm_para.h>
      and <asm/kvm_para.h>.
      
      According to my analysis of Linux 5.1-rc1, there are 3 groups:
      
       [1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported
      
          alpha, arm, hexagon, mips, powerpc, s390, sparc, x86
      
       [2] <asm/kvm_para.h> is exported, but <linux/kvm_para.h> is not
      
          arc, arm64, c6x, h8300, ia64, m68k, microblaze, nios2, openrisc,
          parisc, sh, unicore32, xtensa
      
       [3] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported
      
          csky, nds32, riscv
      
      This does not match to the actual KVM support. At least, [2] is
      half-baked.
      
      Nor do arch maintainers look like they care about this. For example,
      commit 0add5371 ("microblaze: Add missing kvm_para.h to Kbuild")
      exported <asm/kvm_para.h> to user-space in order to fix an in-kernel
      build error.
      
      We have two ways to make this consistent:
      
       [A] export both <linux/kvm_para.h> and <asm/kvm_para.h> for all
           architectures, irrespective of the KVM support
      
       [B] Match the header export of <linux/kvm_para.h> and <asm/kvm_para.h>
           to the KVM support
      
      My first attempt was [A] because the code looks cleaner, but Paolo
      suggested [B].
      
      So, this commit goes with [B].
      
      For most architectures, <asm/kvm_para.h> was moved to the kernel-space.
      I changed include/uapi/linux/Kbuild so that it checks generated
      asm/kvm_para.h as well as check-in ones.
      
      After this commit, there will be two groups:
      
       [1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported
      
          arm, arm64, mips, powerpc, s390, x86
      
       [2] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported
      
          alpha, arc, c6x, csky, h8300, hexagon, ia64, m68k, microblaze,
          nds32, nios2, openrisc, parisc, riscv, sh, sparc, unicore32, xtensa
      
      Signed-off-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: default avatarCornelia Huck <cohuck@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3d9683cf
    • Claudiu Manoil's avatar
      net: mii: Fix PAUSE cap advertisement from linkmode_adv_to_lcl_adv_t() helper · 7f07e5f1
      Claudiu Manoil authored
      
      With a recent link mode advertisement code update this helper
      providing local pause capability translation used for flow
      control link mode negotiation got broken.
      For eth drivers using this helper, the issue is apparent only
      if either PAUSE or ASYM_PAUSE is being advertised.
      
      Fixes: 3c1bcc86 ("net: ethernet: Convert phydev advertize and supported from u32 to link mode")
      Signed-off-by: default avatarClaudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7f07e5f1
  9. Mar 27, 2019
  10. Mar 26, 2019
  11. Mar 25, 2019
  12. Mar 23, 2019
    • Kairui Song's avatar
      x86/gart: Exclude GART aperture from kcore · ffc8599a
      Kairui Song authored
      
      On machines where the GART aperture is mapped over physical RAM,
      /proc/kcore contains the GART aperture range. Accessing the GART range via
      /proc/kcore results in a kernel crash.
      
      vmcore used to have the same issue, until it was fixed with commit
      2a3e83c6 ("x86/gart: Exclude GART aperture from vmcore")', leveraging
      existing hook infrastructure in vmcore to let /proc/vmcore return zeroes
      when attempting to read the aperture region, and so it won't read from the
      actual memory.
      
      Apply the same workaround for kcore. First implement the same hook
      infrastructure for kcore, then reuse the hook functions introduced in the
      previous vmcore fix. Just with some minor adjustment, rename some functions
      for more general usage, and simplify the hook infrastructure a bit as there
      is no module usage yet.
      
      Suggested-by: default avatarBaoquan He <bhe@redhat.com>
      Signed-off-by: default avatarKairui Song <kasong@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJiri Bohac <jbohac@suse.cz>
      Acked-by: default avatarBaoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Dave Young <dyoung@redhat.com>
      Link: https://lkml.kernel.org/r/20190308030508.13548-1-kasong@redhat.com
      
      ffc8599a
  13. Mar 22, 2019
  14. Mar 21, 2019
    • Davide Caratti's avatar
      net/sched: let actions use RCU to access 'goto_chain' · ee3bbfe8
      Davide Caratti authored
      
      use RCU when accessing the action chain, to avoid use after free in the
      traffic path when 'goto chain' is replaced on existing TC actions (see
      script below). Since the control action is read in the traffic path
      without holding the action spinlock, we need to explicitly ensure that
      a->goto_chain is not NULL before dereferencing (i.e it's not sufficient
      to rely on the value of TC_ACT_GOTO_CHAIN bits). Not doing so caused NULL
      dereferences in tcf_action_goto_chain_exec() when the following script:
      
       # tc chain add dev dd0 chain 42 ingress protocol ip flower \
       > ip_proto udp action pass index 4
       # tc filter add dev dd0 ingress protocol ip flower \
       > ip_proto udp action csum udp goto chain 42 index 66
       # tc chain del dev dd0 chain 42 ingress
       (start UDP traffic towards dd0)
       # tc action replace action csum udp pass index 66
      
      was run repeatedly for several hours.
      
      Suggested-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Suggested-by: default avatarVlad Buslov <vladbu@mellanox.com>
      Signed-off-by: default avatarDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ee3bbfe8
    • Davide Caratti's avatar
      net/sched: don't dereference a->goto_chain to read the chain index · fe384e2f
      Davide Caratti authored
      
      callers of tcf_gact_goto_chain_index() can potentially read an old value
      of the chain index, or even dereference a NULL 'goto_chain' pointer,
      because 'goto_chain' and 'tcfa_action' are read in the traffic path
      without caring of concurrent write in the control path. The most recent
      value of chain index can be read also from a->tcfa_action (it's encoded
      there together with TC_ACT_GOTO_CHAIN bits), so we don't really need to
      dereference 'goto_chain': just read the chain id from the control action.
      
      Fixes: e457d86a ("net: sched: add couple of goto_chain helpers")
      Signed-off-by: default avatarDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fe384e2f
    • Davide Caratti's avatar
      net/sched: prepare TC actions to properly validate the control action · 85d0966f
      Davide Caratti authored
      
      - pass a pointer to struct tcf_proto in each actions's init() handler,
        to allow validating the control action, checking whether the chain
        exists and (eventually) refcounting it.
      - remove code that validates the control action after a successful call
        to the action's init() handler, and replace it with a test that forbids
        addition of actions having 'goto_chain' and NULL goto_chain pointer at
        the same time.
      - add tcf_action_check_ctrlact(), that will validate the control action
        and eventually allocate the action 'goto_chain' within the init()
        handler.
      - add tcf_action_set_ctrlact(), that will assign the control action and
        swap the current 'goto_chain' pointer with the new given one.
      
      This disallows 'goto_chain' on actions that don't initialize it properly
      in their init() handler, i.e. calling tcf_action_check_ctrlact() after
      successful IDR reservation and then calling tcf_action_set_ctrlact()
      to assign 'goto_chain' and 'tcf_action' consistently.
      
      By doing this, the kernel does not leak anymore refcounts when a valid
      'goto chain' handle is replaced in TC actions, causing kmemleak splats
      like the following one:
      
       # tc chain add dev dd0 chain 42 ingress protocol ip flower \
       > ip_proto tcp action drop
       # tc chain add dev dd0 chain 43 ingress protocol ip flower \
       > ip_proto udp action drop
       # tc filter add dev dd0 ingress matchall \
       > action gact goto chain 42 index 66
       # tc filter replace dev dd0 ingress matchall \
       > action gact goto chain 43 index 66
       # echo scan >/sys/kernel/debug/kmemleak
       <...>
       unreferenced object 0xffff93c0ee09f000 (size 1024):
       comm "tc", pid 2565, jiffies 4295339808 (age 65.426s)
       hex dump (first 32 bytes):
         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
         00 00 00 00 08 00 06 00 00 00 00 00 00 00 00 00  ................
       backtrace:
         [<000000009b63f92d>] tc_ctl_chain+0x3d2/0x4c0
         [<00000000683a8d72>] rtnetlink_rcv_msg+0x263/0x2d0
         [<00000000ddd88f8e>] netlink_rcv_skb+0x4a/0x110
         [<000000006126a348>] netlink_unicast+0x1a0/0x250
         [<00000000b3340877>] netlink_sendmsg+0x2c1/0x3c0
         [<00000000a25a2171>] sock_sendmsg+0x36/0x40
         [<00000000f19ee1ec>] ___sys_sendmsg+0x280/0x2f0
         [<00000000d0422042>] __sys_sendmsg+0x5e/0xa0
         [<000000007a6c61f9>] do_syscall_64+0x5b/0x180
         [<00000000ccd07542>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
         [<0000000013eaa334>] 0xffffffffffffffff
      
      Fixes: db50514f ("net: sched: add termination action to allow goto chain")
      Fixes: 97763dc0 ("net_sched: reject unknown tcfa_action values")
      Signed-off-by: default avatarDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      85d0966f
    • Peter Xu's avatar
      genirq: Fix typo in comment of IRQD_MOVE_PCNTXT · 551417af
      Peter Xu authored
      
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Dou Liyang <douliyangs@gmail.com>
      Cc: Julien Thierry <julien.thierry@arm.com>
      Link: https://lkml.kernel.org/r/20190318065123.11862-1-peterx@redhat.com
      551417af
  15. Mar 20, 2019
  16. Mar 19, 2019
Loading