  1. May 30, 2019
  2. May 21, 2019
  3. May 18, 2019
  4. May 15, 2019
    • Dan Williams
      mm: shuffle initial free memory to improve memory-side-cache utilization · e900a918
      Dan Williams authored
      Patch series "mm: Randomize free memory", v10.
      
      This patch (of 3):
      
      Randomization of the page allocator improves the average utilization of
      a direct-mapped memory-side-cache.  Memory-side caching is a platform
      capability to which Linux has previously been exposed in HPC
      (high-performance computing) environments on specialty platforms.  In
      that instance it was a smaller pool of high-bandwidth memory in front of
      higher-capacity / lower-bandwidth DRAM.  Now this capability is going
      to be found on general-purpose server platforms where DRAM is a cache in
      front of higher-latency persistent memory [1].
      
      Robert offered an explanation of the state of the art of Linux
      interactions with memory-side-caches [2], and I copy it here:
      
          It's been a problem in the HPC space:
          http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/
      
          A kernel module called zonesort is available to try to help:
          https://software.intel.com/en-us/articles/xeon-phi-software
      
          and this abandoned patch series proposed that for the kernel:
          https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com
      
          Dan's patch series doesn't attempt to ensure buffers won't conflict, but
          also reduces the chance that the buffers will. This will make performance
          more consistent, albeit slower than "optimal" (which is near impossible
          to attain in a general-purpose kernel).  That's better than forcing
          users to deploy remedies like:
              "To eliminate this gradual degradation, we have added a Stream
               measurement to the Node Health Check that follows each job;
               nodes are rebooted whenever their measured memory bandwidth
               falls below 300 GB/s."
      
      A replacement for zonesort was merged upstream in commit cc9aec03
      ("x86/numa_emulation: Introduce uniform split capability").  With this
      numa_emulation capability, memory can be split into cache sized
      ("near-memory" sized) numa nodes.  A bind operation to such a node, and
      disabling workloads on other nodes, enables full cache performance.
      However, once the workload exceeds the cache size then cache conflicts
      are unavoidable.  While HPC environments might be able to tolerate
      time-scheduling of cache sized workloads, for general purpose server
      platforms, the oversubscribed cache case will be the common case.
      
      The worst case scenario is that a server system owner benchmarks a
      workload at boot with an un-contended cache only to see that performance
      degrade over time, even below the average cache performance due to
      excessive conflicts.  Randomization clips the peaks and fills in the
      valleys of cache utilization to yield steady average performance.
      
      Here are some performance impact details of the patches:
      
      1/ An Intel internal synthetic memory bandwidth measurement tool saw a
         3X speedup in a contrived case that tries to force cache conflicts.
         The contrived case used the numa_emulation capability to force an
         instance of the benchmark to be run in two of the near-memory sized
         numa nodes.  If both instances were placed on the same emulated node
         they would fit and cause zero conflicts.  While on separate emulated
         nodes without randomization they underutilized the cache and
         conflicted unnecessarily due to the in-order allocation per node.
      
      2/ A well known Java server application benchmark was run with a heap
         size that exceeded cache size by 3X.  The cache conflict rate was 8%
         for the first run and degraded to 21% after page allocator aging.  With
         randomization enabled the rate levelled out at 11%.
      
      3/ A MongoDB workload did not observe measurable difference in
         cache-conflict rates, but the overall throughput dropped by 7% with
         randomization in one case.
      
      4/ Mel Gorman ran his suite of performance workloads with randomization
         enabled on platforms without a memory-side-cache and saw a mix of some
         improvements and some losses [3].
      
      While there is potentially significant improvement for applications that
      depend on low latency access across a wide working-set, the performance
      benefit may be negligible to negative for other workloads.  For this
      reason the shuffle capability defaults to off unless a direct-mapped
      memory-side-cache is detected.  Even then, the page_alloc.shuffle=0
      parameter can be specified to disable the randomization on those systems.
      
      Outside of memory-side-cache utilization concerns there is potentially a
      security benefit from randomization.  Some data exfiltration and
      return-oriented-programming attacks rely on the ability to infer the
      location of sensitive data objects.  The kernel page allocator, especially
      early in system boot, has predictable first-in-first-out behavior for
      physical pages.  Pages are freed in physical address order when first
      onlined.
      
      Quoting Kees:
          "While we already have a base-address randomization
           (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
           memory layouts would certainly be using the predictability of
           allocation ordering (i.e. for attacks where the base address isn't
           important: only the relative positions between allocated memory).
           This is common in lots of heap-style attacks. They try to gain
           control over ordering by spraying allocations, etc.
      
           I'd really like to see this because it gives us something similar
           to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."
      
      While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
      caches, it leaves the vast bulk of memory to be predictably allocated in
      order.  However, it should be noted, the concrete security benefits are
      hard to quantify, and no known CVE is mitigated by this randomization.
      
      Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
      a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
      are initially populated with free memory at boot and at hotplug time.  Do
      this based on either the presence of a page_alloc.shuffle=Y command line
      parameter, or autodetection of a memory-side-cache (to be added in a
      follow-on patch).
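
      For reference, a Fisher-Yates pass produces a uniformly random
      permutation in a single sweep.  A minimal stand-alone sketch of the
      algorithm, using libc rand() in place of the kernel's entropy sources
      and an array of page frame numbers standing in for the free_area lists
      (illustrative only, not the in-kernel implementation):

      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <time.h>

      	/* Swap each element with a randomly chosen slot at or below it. */
      	static void shuffle(unsigned long *pfn, size_t n)
      	{
      		for (size_t i = n - 1; i > 0; i--) {
      			size_t j = (size_t)rand() % (i + 1);	/* pick 0..i */
      			unsigned long tmp = pfn[i];

      			pfn[i] = pfn[j];
      			pfn[j] = tmp;
      		}
      	}

      	int main(void)
      	{
      		unsigned long pfn[16];

      		srand((unsigned)time(NULL));
      		for (size_t i = 0; i < 16; i++)
      			pfn[i] = i;	/* in-order, as pages are first onlined */
      		shuffle(pfn, 16);
      		for (size_t i = 0; i < 16; i++)
      			printf("%lu ", pfn[i]);
      		printf("\n");
      		return 0;
      	}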
      
      The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
      pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
      order 10 (4MB).  This trades off randomization granularity for time
      spent shuffling.  MAX_ORDER-1 was chosen to be minimally invasive to the
      page allocator while still showing memory-side-cache behavior
      improvements, and with the expectation that the security implications of
      finer granularity randomization are mitigated by
      CONFIG_SLAB_FREELIST_RANDOM.  The performance impact of the shuffling
      appears to be in the noise compared to other memory initialization work.
      
      This initial randomization can be undone over time, so a follow-on patch
      is introduced to inject entropy on page free decisions.  It is reasonable
      to ask whether the page free entropy alone would be sufficient; it is
      not, due to the in-order initial freeing of pages.  At the start of that
      process, putting page1 in front of or behind page0 still keeps them close
      together; page2 is still near page1 and has a high chance of being
      adjacent.  As more pages are added, ordering diversity improves, but
      there is still high page locality for the low address pages, and this
      leads to no significant impact on the cache conflict rate.
      
      [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
      [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
      [3]: https://lkml.org/lkml/2018/10/12/309
      
      [dan.j.williams@intel.com: fix shuffle enable]
        Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
      [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
        Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
      Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
      
      
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Robert Elliott <elliott@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e900a918
  5. May 14, 2019
  6. May 05, 2019
  7. Apr 30, 2019
  8. Apr 29, 2019
    • Joel Fernandes (Google)
      init/config: Do not select BUILD_BIN2C for IKCONFIG · bc0c6045
      Joel Fernandes (Google) authored
      
      Since commit 13610aa9 ("kernel/configs: use .incbin directive to
      embed config_data.gz"), IKCONFIG no longer uses BUILD_BIN2C so prevent
      it from being selected in Kconfig.
      
      Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc0c6045
    • Joel Fernandes (Google)
      Provide in-kernel headers to make extending kernel easier · 43d8ce9d
      Joel Fernandes (Google) authored
      Introduce in-kernel headers which are made available as an archive
      through proc (/proc/kheaders.tar.xz file). This archive makes it
      possible to run eBPF and other tracing programs that need to extend the
      kernel for tracing purposes without any dependency on the file system
      having headers.
      
      A GitHub PR has been sent for the corresponding BCC patch:
      https://github.com/iovisor/bcc/pull/2312
      
      On Android and embedded systems, it is common to switch kernels but not
      have kernel headers available on the file system.  Further, once a
      different kernel is booted, any headers stored on the file system will
      no longer be useful.  This issue is well known even to distros.  By
      storing the headers as a compressed archive within the kernel, we can
      avoid these issues that have been a hindrance for a long time.
      
      The best way to use this feature is by building it in.  Several users
      have a need for this: when they switch debug kernels, they do not want
      to update the filesystem or worry about where to store the headers on
      it.  However, the feature can also be built as a module in case the
      user prefers it not be part of the kernel image.  This makes it
      possible to load and unload the headers from memory on demand: a
      tracing program can load the module, do its operations, and then unload
      the module to save kernel memory.  The total memory needed is 3.3MB.
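
      For example, a tracing workflow might fetch the headers like this (a
      hypothetical session; the extraction path is arbitrary):

      	modprobe kheaders
      	mkdir -p /tmp/kheaders
      	tar -xf /proc/kheaders.tar.xz -C /tmp/kheaders
      	# build/run the eBPF tracing program against /tmp/kheaders ...
      	rmmod kheaders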
      
      By having the archive available at a fixed location, independent of
      filesystem dependencies and conventions, all debugging tools can refer
      directly to that location for the archive, without concern for where
      the headers live on a typical filesystem.  This significantly
      simplifies tooling that needs kernel headers.
      
      The code to read the headers is based on /proc/config.gz code and uses
      the same technique to embed the headers.
      
      Other approaches were discussed, such as having an in-memory mountable
      filesystem, but that has drawbacks such as requiring an in-kernel xz
      decompressor, which we don't have today, and requiring 42 MB of kernel
      memory to host the decompressed headers at any time.  This approach is
      also simpler than those alternatives.
      
      Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      43d8ce9d
  9. Apr 20, 2019
    • Kees Cook
      random: move rand_initialize() earlier · d5553523
      Kees Cook authored
      
      Right now rand_initialize() is run as an early_initcall(), but it only
      depends on timekeeping_init() (for mixing ktime_get_real() into the
      pools). However, the call to boot_init_stack_canary() for stack canary
      initialization runs earlier, which triggers a warning at boot:
      
      random: get_random_bytes called from start_kernel+0x357/0x548 with crng_init=0
      
      Instead, this moves rand_initialize() to after timekeeping_init(), and moves
      canary initialization here as well.
      
      Note that this warning may still remain for machines that do not have
      UEFI RNG support (which initializes the RNG pools during setup_arch()),
      or for x86 machines without RDRAND (or booting without
      "random.trust_cpu=on" or CONFIG_RANDOM_TRUST_CPU=y).
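
      In init/main.c terms, the resulting ordering is roughly the following
      (a simplified sketch of start_kernel(), not the verbatim diff):

      	timekeeping_init();
      	/*
      	 * Seed the pools first, then derive the initial stack canary, so
      	 * the canary benefits from timekeeping, arch (e.g. RDRAND) and
      	 * command-line entropy.
      	 */
      	rand_initialize();
      	add_latent_entropy();
      	add_device_randomness(command_line, strlen(command_line));
      	boot_init_stack_canary();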
      
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      d5553523
  10. Apr 19, 2019
    • Dan Williams
      init: initialize jump labels before command line option parsing · 6041186a
      Dan Williams authored
      When a module option or core kernel argument toggles a static key, it
      requires jump labels to be initialized early.  While x86, PowerPC, and
      ARM64 arrange for jump_label_init() to be called before parse_args(),
      ARM does not.
      
        Kernel command line: rdinit=/sbin/init page_alloc.shuffle=1 panic=-1 console=ttyAMA0,115200 page_alloc.shuffle=1
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 0 at ./include/linux/jump_label.h:303
        page_alloc_shuffle+0x12c/0x1ac
        static_key_enable(): static key 'page_alloc_shuffle_key+0x0/0x4' used
        before call to jump_label_init()
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted
        5.1.0-rc4-next-20190410-00003-g3367c36ce744 #1
        Hardware name: ARM Integrator/CP (Device Tree)
        [<c0011c68>] (unwind_backtrace) from [<c000ec48>] (show_stack+0x10/0x18)
        [<c000ec48>] (show_stack) from [<c07e9710>] (dump_stack+0x18/0x24)
        [<c07e9710>] (dump_stack) from [<c001bb1c>] (__warn+0xe0/0x108)
        [<c001bb1c>] (__warn) from [<c001bb88>] (warn_slowpath_fmt+0x44/0x6c)
        [<c001bb88>] (warn_slowpath_fmt) from [<c0b0c4a8>]
        (page_alloc_shuffle+0x12c/0x1ac)
        [<c0b0c4a8>] (page_alloc_shuffle) from [<c0b0c550>] (shuffle_store+0x28/0x48)
        [<c0b0c550>] (shuffle_store) from [<c003e6a0>] (parse_args+0x1f4/0x350)
        [<c003e6a0>] (parse_args) from [<c0ac3c00>] (start_kernel+0x1c0/0x488)
      
      Move the fallback call to jump_label_init() to occur before
      parse_args().
      
      The redundant calls to jump_label_init() in other archs are left intact
      in case they have static key toggling use cases that are even earlier
      than option parsing.
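
      The pattern that trips this up looks roughly like the following (a
      condensed sketch of the page_alloc.shuffle handling, not the exact
      code):

      	static DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);

      	static int shuffle_store(const char *val, const struct kernel_param *kp)
      	{
      		/*
      		 * parse_args() invokes this for page_alloc.shuffle=1; the
      		 * static_key_enable() below WARNs when jump_label_init()
      		 * has not run yet, which was the case on ARM.
      		 */
      		static_key_enable(&page_alloc_shuffle_key.key);
      		return 0;
      	}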
      
      Link: http://lkml.kernel.org/r/155544804466.1032396.13418949511615676665.stgit@dwillia2-desk3.amr.corp.intel.com
      
      
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Guenter Roeck <groeck@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Russell King <rmk@armlinux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6041186a
    • David Howells
      Make anon_inodes unconditional · 5dd50aae
      David Howells authored
      
      Make the anon_inodes facility unconditional so that it can be used by core
      VFS code and pidfd code.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      [christian@brauner.io: adapt commit message to mention pidfds]
      Signed-off-by: Christian Brauner <christian@brauner.io>
      5dd50aae
  11. Apr 09, 2019
    • Sakari Ailus
      treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively · d75f773c
      Sakari Ailus authored
      %pF and %pf are functionally equivalent to the %pS and %ps conversion
      specifiers.  The former are deprecated, therefore switch the current
      users to the preferred variants.
      
      The changes have been produced by the following command:
      
      	git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
      	while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done
      
      And verifying the result.
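
      For illustration, a typical conversion at a hypothetical call site:

      	- printk("initcall %pF returned %d\n", fn, ret);
      	+ printk("initcall %pS returned %d\n", fn, ret);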
      
      Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
      
      
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-um@lists.infradead.org
      Cc: xen-devel@lists.xenproject.org
      Cc: linux-acpi@vger.kernel.org
      Cc: linux-pm@vger.kernel.org
      Cc: drbd-dev@lists.linbit.com
      Cc: linux-block@vger.kernel.org
      Cc: linux-mmc@vger.kernel.org
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-pci@vger.kernel.org
      Cc: linux-scsi@vger.kernel.org
      Cc: linux-btrfs@vger.kernel.org
      Cc: linux-f2fs-devel@lists.sourceforge.net
      Cc: linux-mm@kvack.org
      Cc: ceph-devel@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
      Acked-by: David Sterba <dsterba@suse.com> (for btrfs)
      Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c)
      Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci)
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      d75f773c
  12. Mar 20, 2019
  13. Mar 12, 2019
    • Mike Rapoport
      init/main: add checks for the return value of memblock_alloc*() · f5c7310a
      Mike Rapoport authored
      Add panic() calls if memblock_alloc() returns NULL.
      
      The panic() format duplicates the one used by memblock itself; in order
      to avoid an explosion of long parameter lists, replace open-coded
      allocation size calculations with a local variable.
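
      The resulting call-site pattern is roughly the following (a sketch with
      hypothetical names; the real sites use memblock's own message format):

      	unsigned long size = nr_entries * sizeof(*table);

      	table = memblock_alloc(size, SMP_CACHE_BYTES);
      	if (!table)
      		panic("%s: Failed to allocate %lu bytes\n", __func__, size);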
      
      Link: http://lkml.kernel.org/r/1548057848-15136-18-git-send-email-rppt@linux.ibm.com
      
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5c7310a
  14. Mar 08, 2019
  15. Mar 06, 2019
    • Arnd Bergmann
      time: Make VIRT_CPU_ACCOUNTING_GEN depend on GENERIC_CLOCKEVENTS · 041a1574
      Arnd Bergmann authored
      
      Moving the CONTEXT_TRACKING Kconfig option into kernel/time/Kconfig added
      an implicit dependency on the surrounding GENERIC_CLOCKEVENTS option, but
      this is not always enabled when it is possible to select
      VIRT_CPU_ACCOUNTING_GEN:
      
      WARNING: unmet direct dependencies detected for CONTEXT_TRACKING
        Depends on [n]: GENERIC_CLOCKEVENTS [=n]
        Selected by [y]:
        - VIRT_CPU_ACCOUNTING_GEN [=y] && <choice> && HAVE_CONTEXT_TRACKING [=y] && HAVE_VIRT_CPU_ACCOUNTING_GEN [=y]
      
      Platforms without GENERIC_CLOCKEVENTS are rare enough that this corner
      case can simply be ignored.  Make it a dependency of
      VIRT_CPU_ACCOUNTING_GEN to simplify the configuration.
      
      Fixes: a4cffdad ("time: Move CONTEXT_TRACKING to kernel/time/Kconfig")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Paul E . McKenney" <paulmck@linux.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: https://lkml.kernel.org/r/20190304200202.1163250-1-arnd@arndb.de
      041a1574
    • Anshuman Khandual
      mm: replace all open encodings for NUMA_NO_NODE · 98fa15f3
      Anshuman Khandual authored
      Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
      
      All these places for replacement were found by running the following
      grep patterns on the entire kernel code.  Please let me know if this
      might have missed some instances.  This might also have replaced some
      false positives.  I will appreciate suggestions, inputs and review.
      
      1. git grep "nid == -1"
      2. git grep "node == -1"
      3. git grep "nid = -1"
      4. git grep "node = -1"
      
      This patch (of 2):
      
      At present there are multiple places where an invalid node number is
      encoded as -1.  Even though it is implicitly understood, it is always
      better to have macros for it.  Replace these open encodings of an
      invalid node number with the global macro NUMA_NO_NODE.  This helps
      remove NUMA-related assumptions like 'invalid node' from various
      places, redirecting them to a common definition.
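
      The conversion is mechanical; at a hypothetical call site:

      	/* before */
      	if (nid == -1)
      		nid = numa_mem_id();

      	/* after: same semantics, NUMA_NO_NODE is defined as -1 */
      	if (nid == NUMA_NO_NODE)
      		nid = numa_mem_id();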
      
      Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
      
      
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
      Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
      Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98fa15f3
  16. Mar 04, 2019
  17. Feb 28, 2019
    • Jens Axboe
      Add io_uring IO interface · 2b188cc1
      Jens Axboe authored
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time; this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
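
      As a rough feel for the raw interface, a minimal setup might look like
      the following userspace sketch.  It uses the raw x86-64 syscall number
      (425 for io_uring_setup) directly, maps only the SQ ring, and trims all
      error handling; it is illustrative, not a complete consumer:

      	#define _GNU_SOURCE
      	#include <linux/io_uring.h>
      	#include <string.h>
      	#include <sys/mman.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>

      	int main(void)
      	{
      		struct io_uring_params p;
      		void *sq_ring;
      		int fd;

      		memset(&p, 0, sizeof(p));
      		fd = syscall(425, 4, &p);	/* io_uring_setup(4 entries) */
      		if (fd < 0)
      			return 1;

      		/* SQ head/tail/flags/array all live inside this mapping */
      		sq_ring = mmap(NULL,
      			       p.sq_off.array + p.sq_entries * sizeof(__u32),
      			       PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
      			       fd, IORING_OFF_SQ_RING);
      		if (sq_ring == MAP_FAILED)
      			return 1;

      		/* the io_uring_sqe array is a separate IORING_OFF_SQES mapping */
      		return 0;
      	}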
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
      
      
      
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2b188cc1
  18. Feb 27, 2019
  19. Feb 21, 2019
  20. Feb 13, 2019
    • Qian Cai
      Revert "mm: use early_pfn_to_nid in page_ext_init" · 2f1ee091
      Qian Cai authored
      This reverts commit fe53ca54 ("mm: use early_pfn_to_nid in
      page_ext_init").
      
      When booting a system with "page_owner=on",
      
      start_kernel
        page_ext_init
          invoke_init_callbacks
            init_section_page_ext
              init_page_owner
                init_early_allocated_pages
                  init_zones_in_node
                    init_pages_in_zone
                      lookup_page_ext
                        page_to_nid
      
      The issue here is that page_to_nid() will not work since some page flags
      have no node information until later in page_alloc_init_late() due to
      DEFERRED_STRUCT_PAGE_INIT.  Hence, it could trigger an out-of-bounds
      access with an invalid nid.
      
        UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
        index 7 is out of range for type 'zone [5]'
      
      Also, the kernel will panic since the flags were poisoned earlier with:
      
      CONFIG_DEBUG_VM_PGFLAGS=y
      CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
      
      start_kernel
        setup_arch
          pagetable_init
            paging_init
              sparse_init
                sparse_init_nid
                  memblock_alloc_try_nid_raw
      
      This was not handled well in init_pages_in_zone(), which ends up
      calling page_to_nid().
      
        page:ffffea0004200000 is uninitialized and poisoned
        raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
        raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
        page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
        page_owner info is not active (free page?)
        kernel BUG at include/linux/mm.h:990!
        RIP: 0010:init_page_owner+0x486/0x520
      
      This means that assumptions behind commit fe53ca54 ("mm: use
      early_pfn_to_nid in page_ext_init") are incomplete.  Therefore, revert
      the commit for now.  A proper way to move the page_owner initialization
      to sooner is to hook into memmap initialization.
      
      Link: http://lkml.kernel.org/r/20190115202812.75820-1-cai@lca.pw
      
      
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f1ee091
  21. Feb 04, 2019
    • Elena Reshetova
      sched/core: Convert task_struct.stack_refcount to refcount_t · f0b89d39
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable task_struct.stack_refcount is used as a pure reference
      counter.  Convert it to refcount_t and fix up the operations.
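
      The conversion itself is mechanical.  For a counter with the above
      properties it looks roughly like this (illustrative sketch, not the
      exact diff):

      	/* before: open-coded refcounting with atomic_t */
      	atomic_set(&tsk->stack_refcount, 1);
      	atomic_inc(&tsk->stack_refcount);
      	if (atomic_dec_and_test(&tsk->stack_refcount))
      		release_task_stack(tsk);

      	/* after: refcount_t traps overflow, underflow and inc-from-zero */
      	refcount_set(&tsk->stack_refcount, 1);
      	refcount_inc(&tsk->stack_refcount);
      	if (refcount_dec_and_test(&tsk->stack_refcount))
      		release_task_stack(tsk);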
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in a
      state to be merged into the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the task_struct.stack_refcount it might make a difference in the
      following places:
      
       - try_get_task_stack(): increment in refcount_inc_not_zero() only
         guarantees control dependency on success vs. fully ordered
         atomic counterpart
       - put_task_stack(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-6-git-send-email-elena.reshetova@intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f0b89d39
    • Elena Reshetova
      sched/core: Convert task_struct.usage to refcount_t · ec1d2819
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable task_struct.usage is used as a pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in a
      state to be merged into the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the task_struct.usage it might make a difference in the following
      places:
      
       - put_task_struct(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-5-git-send-email-elena.reshetova@intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ec1d2819
    • Elena Reshetova
      sched/core: Convert signal_struct.sigcnt to refcount_t · 60d4de3f
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable signal_struct.sigcnt is used as a pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in a
      state to be merged into the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the signal_struct.sigcnt it might make a difference in the
      following places:
      
       - put_signal_struct(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-3-git-send-email-elena.reshetova@intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      60d4de3f
  22. Feb 01, 2019
  23. Jan 25, 2019
  24. Jan 14, 2019
    • Paul Burton
      kbuild: Disable LD_DEAD_CODE_DATA_ELIMINATION with ftrace & GCC <= 4.7 · 16fd20aa
      Paul Burton authored
      
      When building using GCC 4.7 or older, -ffunction-sections & the -pg flag
      used by ftrace are incompatible. This causes warnings or build failures
      (where -Werror applies) such as the following:
      
        arch/mips/generic/init.c:
          error: -ffunction-sections disabled; it makes profiling impossible
      
      This used to be taken into account by the ordering of calls to cc-option
      from within the top-level Makefile, which was introduced by commit
      90ad4052 ("kbuild: avoid conflict between -ffunction-sections and
      -pg on gcc-4.7"). Unfortunately this was broken when the
      CONFIG_LD_DEAD_CODE_DATA_ELIMINATION cc-option check was moved to
      Kconfig in commit e85d1d65 ("kbuild: test dead code/data elimination
      support in Kconfig"), because the flags used by this check no longer
      include -pg.
      
      Fix this by not allowing CONFIG_LD_DEAD_CODE_DATA_ELIMINATION to be
      enabled at the same time as ftrace/CONFIG_FUNCTION_TRACER when building
      using GCC 4.7 or older.
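
      Concretely, the added Kconfig dependency looks roughly like this
      (sketch):

      	config LD_DEAD_CODE_DATA_ELIMINATION
      		bool "Dead code and data elimination (EXPERIMENTAL)"
      		depends on HAVE_LD_DEAD_CODE_DATA_ELIMINATION
      		depends on !(FUNCTION_TRACER && CC_IS_GCC && GCC_VERSION < 40800)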
      
      Signed-off-by: Paul Burton <paul.burton@mips.com>
      Fixes: e85d1d65 ("kbuild: test dead code/data elimination support in Kconfig")
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      16fd20aa
  25. Jan 06, 2019
    • Masahiro Yamada
      jump_label: move 'asm goto' support test to Kconfig · e9666d10
      Masahiro Yamada authored
      
      Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".
      
      The jump label is controlled by HAVE_JUMP_LABEL, which is defined
      like this:
      
        #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
        # define HAVE_JUMP_LABEL
        #endif
      
      We can improve this by testing 'asm goto' support in Kconfig, then
      making JUMP_LABEL depend on CC_HAS_ASM_GOTO.
      
      The ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL
      will match the real kernel capability.
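
      With the Kconfig-level test, the dependency can be expressed directly
      (a sketch of the resulting arch/Kconfig entry):

      	config JUMP_LABEL
      		bool "Optimize very unlikely/likely branches"
      		depends on HAVE_ARCH_JUMP_LABEL
      		depends on CC_HAS_ASM_GOTO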
      
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      e9666d10