  1. Apr 14, 2019
    • mm: make page ref count overflow check tighter and more explicit · f958d7b5
      Linus Torvalds authored
      
      We have a VM_BUG_ON() to check that the page reference count doesn't
      underflow (or get close to overflow) by checking the sign of the count.
      
      That's all fine, but we actually want to allow people to use a "get page
      ref unless it's already very high" helper function, and we want that one
      to use the sign of the page ref (without triggering this VM_BUG_ON).
      
      Change the VM_BUG_ON to only check for small underflows (or _very_ close
      to overflowing), and ignore overflows which have strayed into negative
      territory.
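
      A rough userspace sketch of such a tightened check follows. The helper
      name and the margin of 127 are assumptions made for illustration, not
      text taken from the patch. Casting the count to unsigned and adding the
      margin makes zero and small negative counts trip the check, while counts
      that have overflowed deep into negative territory pass.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical margin: how far below zero still looks like an underflow. */
#define REF_MARGIN 127u

/*
 * True for 0 and for small negative counts (-1 .. -127).
 * False for ordinary positive counts and for large negative counts,
 * i.e. overflows that have strayed far into negative territory.
 */
bool ref_zero_or_close_to_overflow(int count)
{
	return (unsigned int)count + REF_MARGIN <= REF_MARGIN;
}

/* Small usage example. */
void ref_check_demo(void)
{
	assert(ref_zero_or_close_to_overflow(0));
	assert(ref_zero_or_close_to_overflow(-1));
	assert(!ref_zero_or_close_to_overflow(1));
	assert(!ref_zero_or_close_to_overflow(INT_MIN));
}
```

      A "get page ref unless it's already very high" helper can then test the
      sign of the count itself without tripping this check.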
      
      Acked-by: Matthew Wilcox <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Mar 02, 2019
  3. Feb 27, 2019
  4. Feb 25, 2019
    • net: avoid use IPCB in cipso_v4_error · 3da1ed7a
      Nazarov Sergey authored
      
      Extract IP options in cipso_v4_error and use __icmp_send.
      
      Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
      Acked-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add __icmp_send helper. · 9ef6b42a
      Nazarov Sergey authored
      
      Add an __icmp_send() function that takes a struct ip_options parameter.
      
      Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
      Reviewed-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses" · 53a41cb7
      Linus Torvalds authored
      
      This reverts commit 9da3f2b7.
      
      It was well-intentioned, but wrong.  Overriding the exception tables for
      instructions for random reasons is just wrong, and that is what the new
      code did.
      
      It caused problems for tracing, and it caused problems for strncpy_from_user(),
      because the new checks made perfectly valid use cases break, rather than
      catch things that did bad things.
      
      Unchecked user space accesses are a problem, but that's not a reason to
      add invalid checks that then people have to work around with silly flags
      (in this case, that 'kernel_uaccess_faults_ok' flag, which is just an
      odd way to say "this commit was wrong" and was sprinkled into random
      places to hide the wrongness).
      
      The real fix to unchecked user space accesses is to get rid of the
      special "let's not check __get_user() and __put_user() at all" logic.
      Make __{get|put}_user() be just aliases to the regular {get|put}_user()
      functions, and make it impossible to access user space without having
      the proper checks in places.
      
      The raison d'être of the special double-underscore versions used to be
      that the range check was expensive, and if you did multiple user
      accesses, you'd do the range check up front (like the signal frame
      handling code, for example).  But SMAP (on x86) and PAN (on ARM) have
      made that optimization pointless, because the _real_ expense is the "set
      CPU flag to allow user space access".
      
      So let's not break the valid cases to catch invalid cases that shouldn't
      even exist.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. Feb 24, 2019
    • net: phy: realtek: Dummy IRQ calls for RTL8366RB · 4c8e0459
      Linus Walleij authored
      
      This fixes a regression introduced by
      commit 0d2e778e
      "net: phy: replace PHY_HAS_INTERRUPT with a check for
      config_intr and ack_interrupt".
      
      That commit assumes that a PHY cannot trigger interrupts unless
      it has .config_intr() or .ack_interrupt() implemented.
      A later patch makes the code assume that both need to be
      implemented for interrupts to be present.
      
      But this PHY (which is inside a DSA) will happily
      fire interrupts without either callback.
      
      Implement dummy callbacks for .config_intr() and
      .ack_interrupt() in the phy header to fix this.
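
      The idea can be sketched in plain C as follows. The struct layout and
      all function names here are simplified, hypothetical stand-ins rather
      than the kernel's definitions; the sketch only shows how no-op callbacks
      satisfy a "both callbacks must be set" test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's PHY device and driver types. */
struct phy_device;

struct phy_driver {
	int (*config_intr)(struct phy_device *phydev);
	int (*ack_interrupt)(struct phy_device *phydev);
};

/* Dummy callbacks: nothing to configure or acknowledge, report success. */
int dummy_config_intr(struct phy_device *phydev)
{
	(void)phydev;
	return 0;
}

int dummy_ack_interrupt(struct phy_device *phydev)
{
	(void)phydev;
	return 0;
}

/* The assumption described above: IRQs exist only if both are set. */
bool drv_supports_irq(const struct phy_driver *drv)
{
	return drv->config_intr != NULL && drv->ack_interrupt != NULL;
}

/* Small usage example. */
void phy_irq_demo(void)
{
	struct phy_driver without = { NULL, NULL };
	struct phy_driver with = { dummy_config_intr, dummy_ack_interrupt };

	assert(!drv_supports_irq(&without));
	assert(drv_supports_irq(&with));
}
```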
      
      Tested on the RTL8366RB on D-Link DIR-685.
      
      Fixes: 0d2e778e ("net: phy: replace PHY_HAS_INTERRUPT with a check for config_intr and ack_interrupt")
      Cc: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. Feb 22, 2019
  7. Feb 21, 2019
  8. Feb 16, 2019
  9. Feb 15, 2019
    • keys: Fix dependency loop between construction record and auth key · 822ad64d
      David Howells authored
      
      In the request_key() upcall mechanism there's a dependency loop by which if
      a key type driver overrides the ->request_key hook and the userspace side
      manages to lose the authorisation key, the auth key and the internal
      construction record (struct key_construction) can keep each other pinned.
      
      Fix this by the following changes:
      
       (1) Killing off the construction record and using the auth key instead.
      
       (2) Including the operation name in the auth key payload and making the
           payload available outside of security/keys/.
      
       (3) The ->request_key hook is given the authkey instead of the cons
           record and operation name.
      
      Changes (2) and (3) allow the auth key to naturally be cleaned up if the
      keyring it is in is destroyed or cleared or the auth key is unlinked.
      
      Fixes: 7ee02a316600 ("keys: Fix dependency loop between construction record and auth key")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: James Morris <james.morris@microsoft.com>
    • include/linux/module.h: copy __init/__exit attrs to init/cleanup_module · a6e60d84
      Miguel Ojeda authored
      The upcoming GCC 9 release extends the -Wmissing-attributes warnings
      (enabled by -Wall) to C and aliases: it warns when particular function
      attributes are missing in the aliases but not in their target.
      
      In particular, it triggers for all the init/cleanup_module
      aliases in the kernel (defined by the module_init/exit macros),
      ending up being very noisy.
      
      These aliases point to the __init/__exit functions of a module,
      which are defined as __cold (among other attributes). However,
      the aliases themselves do not have the __cold attribute.
      
      Since the compiler behaves differently when compiling a __cold
      function as well as when compiling paths leading to calls
      to __cold functions, the warning is trying to point out
      the possibly-forgotten attribute in the alias.
      
      In order to keep the warning enabled, we decided to silence
      this case. Ideally, we would mark the aliases directly
      as __init/__exit. However, there are currently around 132 modules
      in the kernel which are missing __init/__exit in their init/cleanup
      functions (either because they are missing, or for other reasons,
      e.g. the functions being called from somewhere else); and
      a section mismatch is a hard error.
      
      A conservative alternative was to mark the aliases as __cold only.
      However, since we would like to eventually enforce __init/__exit
      to be always marked, we chose to use the new __copy function
      attribute (introduced by GCC 9 as well to deal with this).
      With it, we copy the attributes used by the target functions
      into the aliases. This way, functions that were not marked
      as __init/__exit won't have their aliases marked either,
      and therefore there won't be a section mismatch.
      
      Note that the warning would go away by marking either the extern
      declaration, the definition, or both. However, we only mark
      the definition of the alias, since we do not want callers
      (which only see the declaration) to be compiled as if the function
      was __cold (and therefore the paths leading to those calls
      would be assumed to be unlikely).
      
      Link: https://lore.kernel.org/lkml/20190123173707.GA16603@gmail.com/
      Link: https://lore.kernel.org/lkml/20190206175627.GA20399@gmail.com/
      
      
      Suggested-by: Martin Sebor <msebor@gcc.gnu.org>
      Acked-by: Jessica Yu <jeyu@kernel.org>
      Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
    • Compiler Attributes: add support for __copy (gcc >= 9) · c0d9782f
      Miguel Ojeda authored
      From the GCC manual:
      
        copy
        copy(function)
      
          The copy attribute applies the set of attributes with which function
          has been declared to the declaration of the function to which
          the attribute is applied. The attribute is designed for libraries
          that define aliases or function resolvers that are expected
          to specify the same set of attributes as their targets. The copy
          attribute can be used with functions, variables, or types. However,
          the kind of symbol to which the attribute is applied (either
          function or variable) must match the kind of symbol to which
          the argument refers. The copy attribute copies only syntactic and
          semantic attributes but not attributes that affect a symbol’s
          linkage or visibility such as alias, visibility, or weak.
          The deprecated attribute is also not copied.
      
        https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
      
      
      
      The upcoming GCC 9 release extends the -Wmissing-attributes warnings
      (enabled by -Wall) to C and aliases: it warns when particular function
      attributes are missing in the aliases but not in their target, e.g.:
      
          void __cold f(void) {}
          void __alias("f") g(void);
      
      diagnoses:
      
          warning: 'g' specifies less restrictive attribute than
          its target 'f': 'cold' [-Wmissing-attributes]
      
      Using __copy(f) we can copy the __cold attribute from f to g:
      
          void __cold f(void) {}
          void __copy(f) __alias("f") g(void);
      
      This attribute is most useful to deal with situations where an alias
      is declared but we don't know the exact attributes the target has.
      
      For instance, in the kernel, the widely used module_init/exit macros
      define the init/cleanup_module aliases, but those cannot be marked
      always as __init/__exit since some modules do not have their
      functions marked as such.
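
      A self-contained userspace sketch of the same pattern follows, assuming
      an ELF target where the alias attribute is available. COPY_ATTRS is a
      hypothetical stand-in for the kernel's __copy macro; it expands to
      nothing on compilers that lack the copy attribute. The function names
      are invented for the example.

```c
#include <assert.h>

/* Use the copy attribute where the compiler has it, otherwise nothing. */
#if defined(__has_attribute)
# if __has_attribute(__copy__)
#  define COPY_ATTRS(symbol) __attribute__((__copy__(symbol)))
# endif
#endif
#ifndef COPY_ATTRS
# define COPY_ATTRS(symbol)
#endif

/* Target function, marked cold the way an __init function would be. */
__attribute__((__cold__)) int real_init(void)
{
	return 42;
}

/* Alias that picks up real_init's attributes where copy is supported. */
int init_alias(void) COPY_ATTRS(real_init) __attribute__((__alias__("real_init")));

/* Small usage example: the alias resolves to the same function. */
void copy_attr_demo(void)
{
	assert(init_alias() == real_init());
}
```

      Because the attributes are copied rather than spelled out, the alias
      never claims more than its target actually declares.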
      
      Suggested-by: Martin Sebor <msebor@gcc.gnu.org>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
  10. Feb 14, 2019
  11. Feb 13, 2019
  12. Feb 12, 2019
    • inet_diag: fix reporting cgroup classid and fallback to priority · 1ec17dbd
      Konstantin Khlebnikov authored
      
      Field idiag_ext in struct inet_diag_req_v2, used as a bitmap of requested
      extensions, has only 8 bits. Thus extensions starting from DCTCPINFO
      cannot be requested directly. Some of them are included in the response
      unconditionally or hook into one of the lower 8 bits.
      
      Extension INET_DIAG_CLASS_ID has had no way to be requested from the
      beginning.
      
      This patch bundles it with INET_DIAG_TCLASS (ipv6 tos), fixes space
      reservation, and documents the behavior of other extensions.
      
      Also, this patch adds a fallback to reporting socket priority. This field
      is more widely used for traffic classification because ipv4 sockets
      automatically map TOS to priority and the default qdisc pfifo_fast knows
      about that. But priority can be changed via setsockopt SO_PRIORITY, so
      INET_DIAG_TOS isn't enough for predicting the class.
      
      Also, cgroup2 obsoletes net_cls classid (it is always zero), but we cannot
      reuse this field for reporting the cgroup2 id because it is 64-bit (ino+gen).
      
      So, after this patch INET_DIAG_CLASS_ID will report socket priority
      for the most common setup, when net_cls isn't set and/or cgroup2 is in use.
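
      The two ideas above, an 8-bit request bitmap and the priority fallback,
      can be sketched in userspace C. The extension numbers and all helper
      names below are assumptions for illustration, and the "bit (ext - 1)"
      encoding is likewise assumed rather than quoted from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed extension numbers, for illustration only. */
enum {
	EXT_TOS       = 5,
	EXT_TCLASS    = 6,
	EXT_DCTCPINFO = 9,
	EXT_CLASS_ID  = 17,
};

/* Only extensions 1..8 have a bit in an 8-bit request bitmap. */
bool ext_fits_in_request(int ext)
{
	return ext >= 1 && ext <= 8;
}

/* Assumed encoding: extension N is requested via bit (N - 1). */
bool ext_requested(uint8_t idiag_ext, int ext)
{
	return ext_fits_in_request(ext) && (idiag_ext & (1u << (ext - 1))) != 0;
}

/* CLASS_ID has no bit of its own, so it rides along with TCLASS. */
bool class_id_requested(uint8_t idiag_ext)
{
	return ext_requested(idiag_ext, EXT_TCLASS);
}

/* Fallback: report the socket priority when the classid is zero. */
uint32_t class_id_to_report(uint32_t classid, uint32_t priority)
{
	return classid ? classid : priority;
}

/* Small usage example. */
void inet_diag_demo(void)
{
	uint8_t req = (uint8_t)(1u << (EXT_TCLASS - 1));

	assert(!ext_fits_in_request(EXT_CLASS_ID));
	assert(class_id_requested(req));
	assert(class_id_to_report(0, 6) == 6);
}
```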
      
      Fixes: 0888e372 ("net: inet: diag: expose sockets cgroup classid")
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. Feb 11, 2019
    • perf/x86: Add check_period PMU callback · 81ec3f3c
      Jiri Olsa authored
      
      Vince (and later on Ravi) reported crashes in the BTS code during
      fuzzing with the following backtrace:
      
        general protection fault: 0000 [#1] SMP PTI
        ...
        RIP: 0010:perf_prepare_sample+0x8f/0x510
        ...
        Call Trace:
         <IRQ>
         ? intel_pmu_drain_bts_buffer+0x194/0x230
         intel_pmu_drain_bts_buffer+0x160/0x230
         ? tick_nohz_irq_exit+0x31/0x40
         ? smp_call_function_single_interrupt+0x48/0xe0
         ? call_function_single_interrupt+0xf/0x20
         ? call_function_single_interrupt+0xa/0x20
         ? x86_schedule_events+0x1a0/0x2f0
         ? x86_pmu_commit_txn+0xb4/0x100
         ? find_busiest_group+0x47/0x5d0
         ? perf_event_set_state.part.42+0x12/0x50
         ? perf_mux_hrtimer_restart+0x40/0xb0
         intel_pmu_disable_event+0xae/0x100
         ? intel_pmu_disable_event+0xae/0x100
         x86_pmu_stop+0x7a/0xb0
         x86_pmu_del+0x57/0x120
         event_sched_out.isra.101+0x83/0x180
         group_sched_out.part.103+0x57/0xe0
         ctx_sched_out+0x188/0x240
         ctx_resched+0xa8/0xd0
         __perf_event_enable+0x193/0x1e0
         event_function+0x8e/0xc0
         remote_function+0x41/0x50
         flush_smp_call_function_queue+0x68/0x100
         generic_smp_call_function_single_interrupt+0x13/0x30
         smp_call_function_single_interrupt+0x3e/0xe0
         call_function_single_interrupt+0xf/0x20
         </IRQ>
      
      The reason is that while the event init code does several checks
      for BTS events and prevents several unwanted config bits for
      BTS events (like precise_ip), PERF_EVENT_IOC_PERIOD allows
      creating a BTS event without those checks being done.
      
      The following sequence will cause the crash:
      
      If we create an 'almost' BTS event with precise_ip and callchains,
      and then turn it into a BTS event, it will crash the perf_prepare_sample()
      function because precise_ip events are expected to come
      in with callchain data initialized, but that's not the
      case for the intel_pmu_drain_bts_buffer() caller.
      
      Add a check_period callback to be called before the period
      is changed via PERF_EVENT_IOC_PERIOD. It will deny the change
      if the event would become BTS. Also add the limit_period
      check.
      
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20190204123532.GA4794@krava
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • bpf: only adjust gso_size on bytestream protocols · b90efd22
      Willem de Bruijn authored
      
      bpf_skb_change_proto and bpf_skb_adjust_room change skb header length.
      For GSO packets they adjust gso_size to maintain the same MTU.
      
      The gso size can only be safely adjusted on bytestream protocols.
      Commit d02f51cb ("bpf: fix bpf_skb_adjust_net/bpf_skb_proto_xlat
      to deal with gso sctp skbs") excluded SKB_GSO_SCTP.
      
      Since then type SKB_GSO_UDP_L4 has been added, whose contents are one
      gso_size unit per datagram. Also exclude these.
      
      Move from a blacklist to a whitelist check to future proof against
      additional such new GSO types, e.g., for fraglist based GRO.
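
      The difference between the two styles of check can be sketched as
      follows. The flag names and values are placeholders invented for the
      example, not the kernel's SKB_GSO_* definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder GSO type flags; names and values are hypothetical. */
enum {
	GSO_TCPV4  = 1 << 0,
	GSO_TCPV6  = 1 << 1,
	GSO_SCTP   = 1 << 2,
	GSO_UDP_L4 = 1 << 3,
};

/* Blacklist: every type is adjustable unless explicitly excluded. */
bool may_adjust_gso_size_blacklist(unsigned int gso_type)
{
	return (gso_type & GSO_SCTP) == 0;
}

/* Whitelist: only bytestream (TCP) types are adjustable. */
bool may_adjust_gso_size_whitelist(unsigned int gso_type)
{
	return (gso_type & (GSO_TCPV4 | GSO_TCPV6)) != 0;
}

/* Small usage example: a newly added type slips past the blacklist. */
void gso_check_demo(void)
{
	assert(may_adjust_gso_size_blacklist(GSO_UDP_L4));
	assert(!may_adjust_gso_size_whitelist(GSO_UDP_L4));
	assert(may_adjust_gso_size_whitelist(GSO_TCPV4));
}
```

      With the whitelist, a GSO type added later is rejected by default
      instead of being adjusted by accident.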
      
      Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  14. Feb 09, 2019
    • net: ipv4: use a dedicated counter for icmp_v4 redirect packets · c09551c6
      Lorenzo Bianconi authored
      
      According to the algorithm described in the comment block at the
      beginning of ip_rt_send_redirect, the host should try to send
      'ip_rt_redirect_number' ICMP redirect packets with an exponential
      backoff and then stop sending them at all assuming that the destination
      ignores redirects.
      If the device has previously sent some ICMP error packets that are
      rate-limited (e.g. TTL expired) and continues to receive traffic,
      the redirect packets will never be transmitted. This happens since
      peer->rate_tokens will be typically greater than 'ip_rt_redirect_number'
      and so it will never be reset even if the redirect silence timeout
      (ip_rt_redirect_silence) has elapsed without receiving any packet
      requiring redirects.
      
      Fix it by using a dedicated counter for the number of ICMP redirect
      packets that have been sent by the host.
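
      A simplified userspace model of the backoff with a dedicated counter is
      sketched below. The struct, the tunable values, and the use of plain
      integer ticks instead of jiffies are all assumptions made for
      illustration; the point is only that the redirect decision no longer
      depends on the token count shared with other ICMP rate limiting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tunables, in arbitrary time ticks. */
#define REDIRECT_NUMBER  3u    /* give up after this many redirects */
#define REDIRECT_LOAD    2u    /* base delay, doubled per redirect sent */
#define REDIRECT_SILENCE 1000u /* quiet period that resets the counter */

struct peer {
	unsigned long rate_last;  /* time of the last redirect event */
	unsigned int rate_tokens; /* shared with other ICMP rate limiting */
	unsigned int n_redirects; /* dedicated redirect counter */
};

/* Returns true if a redirect should be sent at time 'now'. */
bool should_send_redirect(struct peer *p, unsigned long now)
{
	/* Enough silence has passed: forget earlier redirects. */
	if (now > p->rate_last + REDIRECT_SILENCE)
		p->n_redirects = 0;

	/* The destination keeps ignoring redirects: stop sending. */
	if (p->n_redirects >= REDIRECT_NUMBER) {
		p->rate_last = now;
		return false;
	}

	/* Exponential backoff between consecutive redirects. */
	if (p->n_redirects == 0 ||
	    now > p->rate_last + (REDIRECT_LOAD << p->n_redirects)) {
		p->rate_last = now;
		p->n_redirects++;
		return true;
	}
	return false;
}

/* Small usage example: leftover tokens do not block the first redirect. */
void redirect_demo(void)
{
	struct peer p = { 0, 50, 0 };

	assert(should_send_redirect(&p, 1));
	assert(p.n_redirects == 1);
}
```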
      
      I have not been able to identify a given commit that introduced the
      issue since ip_rt_send_redirect implements the same rate-limiting
      algorithm from commit 1da177e4 ("Linux-2.6.12-rc2")
      
      Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. Feb 08, 2019
    • mmc: block: handle complete_work on separate workqueue · dcf6e2e3
      Zachary Hays authored
      
      The kblockd workqueue is created with the WQ_MEM_RECLAIM flag set.
      This generates a rescuer thread for that queue that will trigger when
      the CPU is under heavy load and collect the uncompleted work.
      
      In the case of mmc, this creates the possibility of a deadlock when
      there are multiple partitions on the device as other blk-mq work is
      also run on the same queue. For example:
      
      - worker 0 claims the mmc host to work on partition 1
      - worker 1 attempts to claim the host for partition 2 but has to wait
        for worker 0 to finish
      - worker 0 schedules complete_work to release the host
      - rescuer thread is triggered after time-out and collects the dangling
        work
      - rescuer thread attempts to complete the work in order starting with
        claim host
      - the task to release host is now blocked by a task to claim it and
        will never be called
      
      The above results in multiple hung tasks that lead to failures to
      mount partitions.
      
      Handling complete_work on a separate workqueue avoids this by keeping
      the work completion tasks separate from the other blk-mq work. This
      allows the host to be released without getting blocked by other tasks
      attempting to claim the host.
      
      Signed-off-by: Zachary Hays <zhays@lexmark.com>
      Fixes: 81196976 ("mmc: block: Add blk-mq support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
  16. Feb 07, 2019
  17. Feb 05, 2019
    • ALSA: compress: Fix stop handling on compressed capture streams · 4f2ab5e1
      Charles Keepax authored
      
      It is normal user behaviour to start, stop, then start a stream
      again without closing it. Currently this works for compressed
      playback streams but not capture ones.
      
      The states on a compressed capture stream go directly from OPEN to
      PREPARED, unlike a playback stream which moves to SETUP and waits
      for a write of data before moving to PREPARED. Currently however,
      when a stop is sent the state is set to SETUP for both types of
      streams. This leaves a capture stream in the situation where a new
      start can't be sent as that requires the state to be PREPARED and
      a new set_params can't be sent as that requires the state to be
      OPEN. The only option is to close the stream and then reopen it.
      
      Correct this issue by allowing snd_compr_drain_notify to set the
      state depending on the stream direction, as we already do in
      set_params.
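
      The state handling described above can be modelled in a few lines of C.
      The enum and function names are hypothetical simplifications of the
      stream states named in the text, not the ALSA definitions.

```c
#include <assert.h>
#include <stdbool.h>

enum stream_dir { DIR_PLAYBACK, DIR_CAPTURE };
enum stream_state { ST_OPEN, ST_SETUP, ST_PREPARED, ST_RUNNING };

/* After set_params: capture skips SETUP, as no data write is awaited. */
enum stream_state state_after_set_params(enum stream_dir dir)
{
	return dir == DIR_PLAYBACK ? ST_SETUP : ST_PREPARED;
}

/* After a stop: choose the state by direction, mirroring set_params. */
enum stream_state state_after_stop(enum stream_dir dir)
{
	return dir == DIR_PLAYBACK ? ST_SETUP : ST_PREPARED;
}

/* A start is only accepted from PREPARED. */
bool can_start(enum stream_state st)
{
	return st == ST_PREPARED;
}

/* Small usage example: a stopped capture stream can be started again. */
void compress_state_demo(void)
{
	assert(can_start(state_after_stop(DIR_CAPTURE)));
	/* Before the fix a stopped capture stream was left in SETUP. */
	assert(!can_start(ST_SETUP));
}
```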
      
      Fixes: 49bb6402 ("ALSA: compress_core: Add support for capture streams")
      Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
    • virtio: drop internal struct from UAPI · 9c0644ee
      Michael S. Tsirkin authored
      
      There's no reason to expose struct vring_packed in UAPI - if we do we
      won't be able to change or drop it, and it's not part of any interface.
      
      Let's move it to virtio_ring.c
      
      Cc: Tiwei Bie <tiwei.bie@intel.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • xfrm: destroy xfrm_state synchronously on net exit path · f75a2804
      Cong Wang authored
      
      xfrm_state_put() moves struct xfrm_state to the GC list
      and schedules the GC work to clean it up. On net exit call
      path, xfrm_state_flush() is called to clean up and
      xfrm_flush_gc() is called to wait for the GC work to complete
      before exit.
      
      However, this doesn't work because one of the ->destructor(),
      ipcomp_destroy(), schedules the same GC work again inside
      the GC work. It is hard to wait for such a nested async
      callback. This is also why syzbot still reports the following
      warning:
      
       WARNING: CPU: 1 PID: 33 at net/ipv6/xfrm6_tunnel.c:351 xfrm6_tunnel_net_exit+0x2cb/0x500 net/ipv6/xfrm6_tunnel.c:351
       ...
        ops_exit_list.isra.0+0xb0/0x160 net/core/net_namespace.c:153
        cleanup_net+0x51d/0xb10 net/core/net_namespace.c:551
        process_one_work+0xd0c/0x1ce0 kernel/workqueue.c:2153
        worker_thread+0x143/0x14a0 kernel/workqueue.c:2296
        kthread+0x357/0x430 kernel/kthread.c:246
        ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
      
      In fact, it is perfectly fine to bypass GC and destroy xfrm_state
      synchronously on net exit call path, because it is in process context
      and doesn't need a work struct to do any blocking work.
      
      This patch introduces xfrm_state_put_sync() which simply bypasses
      GC, and lets its callers decide whether to use this synchronous
      version. On net exit path, xfrm_state_fini() and
      xfrm6_tunnel_net_exit() use it. And, as ipcomp_destroy() itself is
      blocking, it can use xfrm_state_put_sync() directly too.
      
      Also rename xfrm_state_gc_destroy() to ___xfrm_state_destroy() to
      reflect this change.
      
      Fixes: b48c05ab ("xfrm: Fix warning in xfrm6_tunnel_net_exit.")
      Reported-and-tested-by: <syzbot+e9aebef558e3ed673934@syzkaller.appspotmail.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
  18. Feb 04, 2019
    • netfilter: nf_tables: unbind set in rule from commit path · f6ac8585
      Pablo Neira Ayuso authored
      
      Anonymous sets that are bound to rules from the same transaction trigger
      a kernel splat from the abort path due to double set list removal and
      double free.
      
      This patch updates the logic to search for the transaction that is
      responsible for creating the set and disable the set list removal and
      release, given the rule is now responsible for this. Lookup is reverse
      since the transaction that adds the set is likely to be at the tail of
      the list.
      
      Moreover, this patch adds the unbind step to deliver the event from the
      commit path.  This should not be done from the worker thread, since we
      have no guarantees of in-order delivery to the listener.
      
      This patch removes the assumption that both activate and deactivate
      callbacks need to be provided.
      
      Fixes: cd5125d8 ("netfilter: nf_tables: split set destruction in deactivate and destroy phase")
      Reported-by: Mikhail Morfikov <mmorfikov@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  19. Feb 02, 2019
    • x86/resctrl: Avoid confusion over the new X86_RESCTRL config · e6d42931
      Johannes Weiner authored
      
      "Resource Control" is a very broad term for this CPU feature, and a term
      that is also associated with containers, cgroups etc. This can easily
      cause confusion.
      
      Make the user prompt more specific. Match the config symbol name.
      
       [ bp: In the future, the corresponding ARM arch-specific code will be
         under ARM_CPU_RESCTRL and the arch-agnostic bits will be carved out
         under the CPU_RESCTRL umbrella symbol. ]
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Babu Moger <Babu.Moger@amd.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190130195621.GA30653@cmpxchg.org
  20. Feb 01, 2019
    • mm/hotplug: invalid PFNs from pfn_to_online_page() · b13bc351
      Qian Cai authored
      On an arm64 ThunderX2 server, the first kmemleak scan would crash [1]
      with CONFIG_DEBUG_VM_PGFLAGS=y because page_to_nid() found a pfn that is
      not directly mapped (MEMBLOCK_NOMAP).  Hence, the page->flags is
      uninitialized.
      
      This is because commit 9f1eb38e ("mm, kmemleak: little
      optimization while scanning") started to use pfn_to_online_page() instead
      of pfn_valid().  However, in the CONFIG_MEMORY_HOTPLUG=y case,
      pfn_to_online_page() does not call memblock_is_map_memory() while
      pfn_valid() does.
      
      Historically, commit 68709f45 ("arm64: only consider memblocks
      with NOMAP cleared for linear mapping") caused pages marked as nomap
      to no longer be reassigned to the new zone in memmap_init_zone() by
      calling __init_single_page().
      
      Commit 2d070eab ("mm: consider zone which is not fully
      populated to have holes") introduced pfn_to_online_page(), which was
      designed to return a valid pfn only, but it is clearly broken on arm64.
      
      Therefore, let pfn_to_online_page() call pfn_valid_within(), so it can
      handle nomap thanks to commit f52bb98f ("arm64: mm: always
      enable CONFIG_HOLES_IN_ZONE"), while it will be optimized away on
      architectures that have no HOLES_IN_ZONE.
      
      [1]
        Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006
        Mem abort info:
          ESR = 0x96000005
          Exception class = DABT (current EL), IL = 32 bits
          SET = 0, FnV = 0
          EA = 0, S1PTW = 0
        Data abort info:
          ISV = 0, ISS = 0x00000005
          CM = 0, WnR = 0
        Internal error: Oops: 96000005 [#1] SMP
        CPU: 60 PID: 1408 Comm: kmemleak Not tainted 5.0.0-rc2+ #8
        pstate: 60400009 (nZCv daif +PAN -UAO)
        pc : page_mapping+0x24/0x144
        lr : __dump_page+0x34/0x3dc
        sp : ffff00003a5cfd10
        x29: ffff00003a5cfd10 x28: 000000000000802f
        x27: 0000000000000000 x26: 0000000000277d00
        x25: ffff000010791f56 x24: ffff7fe000000000
        x23: ffff000010772f8b x22: ffff00001125f670
        x21: ffff000011311000 x20: ffff000010772f8b
        x19: fffffffffffffffe x18: 0000000000000000
        x17: 0000000000000000 x16: 0000000000000000
        x15: 0000000000000000 x14: ffff802698b19600
        x13: ffff802698b1a200 x12: ffff802698b16f00
        x11: ffff802698b1a400 x10: 0000000000001400
        x9 : 0000000000000001 x8 : ffff00001121a000
        x7 : 0000000000000000 x6 : ffff0000102c53b8
        x5 : 0000000000000000 x4 : 0000000000000003
        x3 : 0000000000000100 x2 : 0000000000000000
        x1 : ffff000010772f8b x0 : ffffffffffffffff
        Process kmemleak (pid: 1408, stack limit = 0x(____ptrval____))
        Call trace:
         page_mapping+0x24/0x144
         __dump_page+0x34/0x3dc
         dump_page+0x28/0x4c
         kmemleak_scan+0x4ac/0x680
         kmemleak_scan_thread+0xb4/0xdc
         kthread+0x12c/0x13c
         ret_from_fork+0x10/0x18
        Code: d503201f f9400660 36000040 d1000413 (f9400661)
        ---[ end trace 4d4bd7f573490c8e ]---
        Kernel panic - not syncing: Fatal exception
        SMP: stopping secondary CPUs
        Kernel Offset: disabled
        CPU features: 0x002,20000c38
        Memory Limit: none
        ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Link: http://lkml.kernel.org/r/20190122132916.28360-1-cai@lca.pw
      
      
      Fixes: 9f1eb38e ("mm, kmemleak: little optimization while scanning")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Tetsuo Handa's avatar
      oom, oom_reaper: do not enqueue same task twice · 9bcdeb51
      Tetsuo Handa authored
      Arkadiusz reported that enabling memcg's group oom killing causes
      strange memcg statistics where there is no task in a memcg even though
      the number of tasks in that memcg is not 0.  It turned out that there is
      a bug in wake_oom_reaper() which allows enqueuing the same task twice,
      which makes it impossible to decrease the number of tasks in that memcg
      due to a refcount leak.
      
      This bug existed since the OOM reaper became invokable from
      task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
      
        T1@P1     |T2@P1     |T3@P1     |OOM reaper
        ----------+----------+----------+------------
                                         # Processing an OOM victim in a different memcg domain.
                              try_charge()
                                mem_cgroup_out_of_memory()
                                  mutex_lock(&oom_lock)
                   try_charge()
                     mem_cgroup_out_of_memory()
                       mutex_lock(&oom_lock)
        try_charge()
          mem_cgroup_out_of_memory()
            mutex_lock(&oom_lock)
                                  out_of_memory()
                                    oom_kill_process(P1)
                                      do_send_sig_info(SIGKILL, @P1)
                                      mark_oom_victim(T1@P1)
                                      wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                                  mutex_unlock(&oom_lock)
                       out_of_memory()
                         mark_oom_victim(T2@P1)
                         wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                       mutex_unlock(&oom_lock)
            out_of_memory()
              mark_oom_victim(T1@P1)
              wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
            mutex_unlock(&oom_lock)
                                         # Completed processing an OOM victim in a different memcg domain.
                                         spin_lock(&oom_reaper_lock)
                                         # T1@P1 is dequeued.
                                         spin_unlock(&oom_reaper_lock)
      
      but memcg's group oom killing made it easier to trigger this bug by
      calling wake_oom_reaper() on the same task from one out_of_memory()
      request.
      
      Fix this bug using an approach used by commit 855b0183 ("oom,
      oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a
      side effect, this patch also avoids enqueuing multiple threads
      sharing memory via the task_will_free_mem(current) path.
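
      The failure mode in the diagram above, and one way to close it, can be
      sketched in userspace C. This is only an illustration under stated
      assumptions: struct victim, enqueue_head_check() and enqueue_once() are
      hypothetical names, the list is not protected by a lock, and the
      per-victim "queued" flag stands in for whatever marker the actual patch
      uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task handed to a reaper thread. */
struct victim {
    struct victim *next;   /* link in the reaper's singly linked list */
    atomic_bool queued;    /* set once the victim has been enqueued */
};

static struct victim *reaper_list;   /* head of the pending list */
static int enqueue_count;            /* number of enqueues that happened */

/*
 * Head-only check: rejects a victim that is the list head or that has a
 * successor. The tail of the list has next == NULL and is not the head,
 * so it can be enqueued a second time.
 */
static bool enqueue_head_check(struct victim *v)
{
    assert(v != NULL);
    if (v == reaper_list || v->next)
        return false;
    v->next = reaper_list;
    reaper_list = v;
    enqueue_count++;
    return true;
}

/*
 * Flag-based check: atomically test and set a per-victim marker, so only
 * the first caller enqueues and every later caller is rejected.
 * (Single-threaded demo: the list update itself is not locked here.)
 */
static bool enqueue_once(struct victim *v)
{
    assert(v != NULL);
    if (atomic_exchange(&v->queued, true))
        return false;
    v->next = reaper_list;
    reaper_list = v;
    enqueue_count++;
    return true;
}
```

      Enqueuing T1, then T2, then T1 again passes the head-only check three
      times, while the flag-based variant accepts only the first two calls.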
      
      Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
      Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
      
      
      Fixes: af8e15cc ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
      Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aleksa Sarai <asarai@suse.de>
      Cc: Jay Kamat <jgkamat@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9bcdeb51
    • Takashi Iwai's avatar
      ALSA: hda - Serialize codec registrations · 305a0ade
      Takashi Iwai authored
      
      In the current code, the codec registration may happen both at the
      codec bind time and at the end of the controller probe time.  On rare
      occasions, they race with each other, leading to an Oops due to the
      still uninitialized card device.
      
      This patch introduces a simple flag to prevent the codec registration
      at the codec bind time as long as the controller probe is going on.
      The controller probe invokes snd_card_register() that does the whole
      registration task, and we don't need to register each piece
      beforehand.
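
      A rough userspace sketch of such a guard flag, not the driver code
      itself: struct bus, in_probe, codec_bind() and probe_finish() are all
      hypothetical names, and the single registration done at the end of the
      probe is modelled as a simple counter update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified controller state. */
struct bus {
    bool in_probe;    /* true while the controller probe is running */
    int registered;   /* number of codec registrations performed */
};

/* Bind-time path: register right away only if no probe is in flight. */
static bool codec_bind(struct bus *bus)
{
    assert(bus != NULL);
    if (bus->in_probe)
        return false;         /* left for the end of the probe */
    bus->registered++;
    return true;
}

/* End of probe: register everything in one place, then clear the flag. */
static void probe_finish(struct bus *bus, int pending_codecs)
{
    assert(bus != NULL);
    bus->registered += pending_codecs;
    bus->in_probe = false;
}
```

      With in_probe set, bind-time calls register nothing; after
      probe_finish() a later bind registers immediately.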
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      305a0ade
  21. Jan 31, 2019
    • Alexei Starovoitov's avatar
      bpf: run bpf programs with preemption disabled · 6cab5e90
      Alexei Starovoitov authored
      
      Disabled preemption is necessary for proper access to per-cpu maps
      from BPF programs.
      
      But the sender side of socket filters didn't have preemption disabled:
      unix_dgram_sendmsg->sk_filter->sk_filter_trim_cap->bpf_prog_run_save_cb->BPF_PROG_RUN
      
      and a combination of af_packet with tun device didn't disable either:
      tpacket_snd->packet_direct_xmit->packet_pick_tx_queue->ndo_select_queue->
        tun_select_queue->tun_ebpf_select_queue->bpf_prog_run_clear_cb->BPF_PROG_RUN
      
      Disable preemption before executing BPF programs (both classic and extended).
      
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      6cab5e90
    • Jens Axboe's avatar
      ide: ensure atapi sense request aren't preempted · 9a6d5488
      Jens Axboe authored
      
      There's an issue with how sense requests are handled in IDE. If ide-cd
      encounters an error, it queues a sense request. With how IDE request
      handling is done, this is the next request we need to handle. But it's
      impossible to guarantee this, as another request could come in between
      the sense being queued, and ->queue_rq() being run and handling it. If
      that request ALSO fails, then we attempt to doubly queue the single
      sense request we have.
      
      Since we only support one active request at a time, defer request
      processing when a sense request is queued.
      
      Fixes: 60033520 "ide: convert to blk-mq"
      Reported-by: He Zhe <zhe.he@windriver.com>
      Tested-by: He Zhe <zhe.he@windriver.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9a6d5488
    • Zenghui Yu's avatar
      irqchip/gic-v3-its: Fix ITT_entry_size accessor · 56841070
      Zenghui Yu authored
      
      According to ARM IHI 0069C (ID070116), we should use GITS_TYPER's
      bits [7:4] as ITT_entry_size instead of [8:4]. Although this is
      pretty annoying, it only results in a potential over-allocation
      of memory, and nothing bad happens.
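
      The difference between the two field widths can be sketched as
      follows. The helper names are hypothetical, and the "+ 1" assumes the
      field encodes the entry size in bytes minus one; only the bit
      positions are taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define ITT_ENTRY_SIZE_SHIFT 4   /* field starts at bit 4 */

/* 4-bit accessor: bits [7:4], decoded as "size minus one" (assumption). */
static inline uint32_t itt_entry_size(uint64_t typer)
{
    uint32_t size = (uint32_t)((typer >> ITT_ENTRY_SIZE_SHIFT) & 0xf) + 1;

    assert(size >= 1 && size <= 16);   /* a 4-bit field cannot exceed 16 */
    return size;
}

/* 5-bit accessor: bits [8:4], which also picks up the neighbouring bit 8. */
static inline uint32_t itt_entry_size_wide(uint64_t typer)
{
    return (uint32_t)((typer >> ITT_ENTRY_SIZE_SHIFT) & 0x1f) + 1;
}
```

      The wide accessor never returns less than the narrow one, which is
      consistent with the over-allocation described above: whenever bit 8 is
      set it reports 16 bytes more per entry.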
      
      Fixes: 3dfa576b ("irqchip/gic-v3-its: Add probing for VLPI properties")
      Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
      [maz: massaged subject and commit message]
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      56841070
    • Jose Abreu's avatar
      net: stmmac: Fallback to Platform Data clock in Watchdog conversion · 4ec5302f
      Jose Abreu authored
      
      If we don't have DT then stmmac_clk will not be available. Let's add a
      new Platform Data field so that we can specify the refclk by this means.
      
      This way we can still use the coalesce command in PCI based setups.
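
      The fallback can be sketched roughly like this. The structures and the
      clk_ref_rate field name are hypothetical stand-ins, and the conversion
      assuming one watchdog tick per 256 reference clock cycles is an
      illustrative assumption, not something stated in this message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the driver's real structures. */
struct ref_clk {
    unsigned long rate_hz;
};

struct plat_data {
    struct ref_clk *clk;          /* NULL when no DT-provided clock exists */
    unsigned long clk_ref_rate;   /* platform-data fallback rate, in Hz */
};

/* Pick the reference clock rate: the clock first, platform data otherwise. */
static unsigned long ref_clk_rate(const struct plat_data *plat)
{
    assert(plat != NULL);
    if (plat->clk)
        return plat->clk->rate_hz;
    return plat->clk_ref_rate;
}

/* Convert microseconds to watchdog ticks (assumed: 256 cycles per tick). */
static unsigned long usec_to_wdt_ticks(unsigned long usec,
                                       const struct plat_data *plat)
{
    unsigned long rate = ref_clk_rate(plat);

    if (!rate)
        return 0;                 /* no clock information at all */
    return (usec * (rate / 1000000UL)) / 256UL;
}
```

      A PCI-style setup would leave clk as NULL and fill in only the
      fallback rate, so the conversion still has a clock rate to work with.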
      
      Signed-off-by: Jose Abreu <joabreu@synopsys.com>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ec5302f
    • Daniel Borkmann's avatar
      ipvlan, l3mdev: fix broken l3s mode wrt local routes · d5256083
      Daniel Borkmann authored
      While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
      I ran into the issue that while l3 mode is working fine, l3s mode
      does not have any connectivity to kube-apiserver and hence all pods
      end up in Error state as well. The ipvlan master device sits on
      top of a bond device and hostns traffic to kube-apiserver (also running
      in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
      where the latter is the address of the bond0. While in l3 mode, a
      curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
      works fine from hostns, neither of them does in case of l3s. In the
      latter only a curl to https://127.0.0.1:37573 appeared to work where
      for local addresses of bond0 I saw kernel suddenly starting to emit
      ARP requests to query HW address of bond0 which remained unanswered
      and neighbor entries in INCOMPLETE state. These ARP requests only
      happen while in l3s.
      
      Debugging this further, I found the issue is that l3s mode is piggy-
      backing on l3 master device, and in this case local routes are using
      l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
      f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev
      if relevant") and 5f02ce24 ("net: l3mdev: Allow the l3mdev to be
      a loopback"). I found that reverting them back into using the
      net->loopback_dev fixed ipvlan l3s connectivity and got everything
      working for the CNI.
      
      Now judging from 4fbae7d8 ("ipvlan: Introduce l3s mode") and the
      l3mdev paper in [0], the sole reason why ipvlan l3s is relying
      on l3 master device is to get the l3mdev_ip_rcv() receive hook for
      setting the dst entry of the input route without adding its own
      ipvlan specific hacks into the receive path, however, any l3 domain
      semantics beyond just that are breaking l3s operation. Note that
      ipvlan also has the ability to dynamically switch its internal
      operation from l3 to l3s for all ports via ipvlan_set_port_mode()
      at runtime. In any case, l3 vs l3s solely distinguishes itself by
      'de-confusing' netfilter through switching skb->dev to ipvlan slave
      device late in NF_INET_LOCAL_IN before handing the skb to L4.
      
      Minimal fix taken here is to add an IFF_L3MDEV_RX_HANDLER flag which,
      if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
      without any additional l3mdev semantics on top. This should also have
      minimal impact since dev->priv_flags is already hot in cache. With
      this set, l3s mode is working fine and I also get things like
      masquerading pod traffic on the ipvlan master properly working.
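
      The split between "receive hook only" and "full l3 domain semantics"
      can be sketched with two predicates over a flags word. Everything here
      is a hypothetical simplification: the flag values, struct dev and both
      helper names are made up for illustration and do not mirror the kernel
      definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the real values live in the kernel headers. */
#define FLAG_L3MDEV_SLAVE       (1u << 0)   /* device is enslaved to an l3 master */
#define FLAG_L3MDEV_RX_HANDLER  (1u << 1)   /* device only wants the receive hook */

struct dev {
    unsigned int priv_flags;
};

/* The receive hook runs for l3 slaves and for devices that opted in. */
static bool runs_l3_rx_hook(const struct dev *d)
{
    assert(d != NULL);
    return (d->priv_flags &
            (FLAG_L3MDEV_SLAVE | FLAG_L3MDEV_RX_HANDLER)) != 0;
}

/* Other l3 domain semantics (e.g. route lookups) apply to real slaves only. */
static bool uses_l3_domain(const struct dev *d)
{
    assert(d != NULL);
    return (d->priv_flags & FLAG_L3MDEV_SLAVE) != 0;
}
```

      A device carrying only the opt-in flag gets the hook without being
      treated as part of an l3 domain, which is the behaviour l3s mode needs
      according to the description above.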
      
        [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
      
      
      
      Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
      Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback")
      Fixes: 4fbae7d8 ("ipvlan: Introduce l3s mode")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Martynas Pumputis <m@lambda.lt>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d5256083