Skip to content
Snippets Groups Projects
  1. Mar 29, 2019
    • Qian Cai's avatar
      mm/hotplug: fix offline undo_isolate_page_range() · 9b7ea46a
      Qian Cai authored
      Commit f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded
      memory to zones until online") introduced move_pfn_range_to_zone() which
      calls memmap_init_zone() during onlining a memory block.
      memmap_init_zone() will reset pagetype flags and makes migrate type to
      be MOVABLE.
      
      However, in __offline_pages(), it also call undo_isolate_page_range()
      after offline_isolated_pages() to do the same thing.  Due to commit
      2ce13640 ("mm: __first_valid_page skip over offline pages") changed
      __first_valid_page() to skip offline pages, undo_isolate_page_range()
      here just waste CPU cycles looping around the offlining PFN range while
      doing nothing, because __first_valid_page() will return NULL as
      offline_isolated_pages() has already marked all memory sections within
      the pfn range as offline via offline_mem_sections().
      
      Also, after calling the "useless" undo_isolate_page_range() here, it
      reaches the point of no returning by notifying MEM_OFFLINE.  Those pages
      will be marked as MIGRATE_MOVABLE again once onlining.  The only thing
      left to do is to decrease the number of isolated pageblocks zone counter
      which would make some paths of the page allocation slower that the above
      commit introduced.
      
      Even if alloc_contig_range() can be used to isolate 16GB-hugetlb pages
      on ppc64, an "int" should still be enough to represent the number of
      pageblocks there.  Fix an incorrect comment along the way.
      
      [cai@lca.pw: v4]
        Link: http://lkml.kernel.org/r/20190314150641.59358-1-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190313143133.46200-1-cai@lca.pw
      
      
      Fixes: 2ce13640 ("mm: __first_valid_page skip over offline pages")
      Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9b7ea46a
  2. Mar 26, 2019
  3. Mar 25, 2019
    • Linus Torvalds's avatar
      Revert "parport: daisy: use new parport device model" · a3ac7917
      Linus Torvalds authored
      
      This reverts commit 1aec4211.
      
      Steven Rostedt reports that it causes a hang at bootup and bisected it
      to this commit.
      
      The troigger is apparently a module alias for "parport_lowlevel" that
      points to "parport_pc", which causes a hang with
      
          modprobe -q -- parport_lowlevel
      
      blocking forever with a backtrace like this:
      
          wait_for_completion_killable+0x1c/0x28
          call_usermodehelper_exec+0xa7/0x108
          __request_module+0x351/0x3d8
          get_lowlevel_driver+0x28/0x41 [parport]
          __parport_register_driver+0x39/0x1f4 [parport]
          daisy_drv_init+0x31/0x4f [parport]
          parport_bus_init+0x5d/0x7b [parport]
          parport_default_proc_register+0x26/0x1000 [parport]
          do_one_initcall+0xc2/0x1e0
          do_init_module+0x50/0x1d4
          load_module+0x1c2e/0x21b3
          sys_init_module+0xef/0x117
      
      Supid says:
       "Due to the new device model daisy driver will now try to find the
        parallel ports while trying to register its driver so that it can bind
        with them. Now, since daisy driver is loaded while parport bus is
        initialising the list of parport is still empty and it tries to load
        the lowlevel driver, which has an alias set to parport_pc, now causes
        a deadlock"
      
      But I don't think the daisy driver should be loaded by the parport
      initialization in the first place, so let's revert the whole change.
      
      If the daisy driver can just initialize separately on its own (like a
      driver should), instead of hooking into the parport init sequence
      directly, this issue probably would go away.
      
      Reported-and-bisected-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Reported-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a3ac7917
  4. Mar 23, 2019
    • Kairui Song's avatar
      x86/gart: Exclude GART aperture from kcore · ffc8599a
      Kairui Song authored
      
      On machines where the GART aperture is mapped over physical RAM,
      /proc/kcore contains the GART aperture range. Accessing the GART range via
      /proc/kcore results in a kernel crash.
      
      vmcore used to have the same issue, until it was fixed with commit
      2a3e83c6 ("x86/gart: Exclude GART aperture from vmcore")', leveraging
      existing hook infrastructure in vmcore to let /proc/vmcore return zeroes
      when attempting to read the aperture region, and so it won't read from the
      actual memory.
      
      Apply the same workaround for kcore. First implement the same hook
      infrastructure for kcore, then reuse the hook functions introduced in the
      previous vmcore fix. Just with some minor adjustment, rename some functions
      for more general usage, and simplify the hook infrastructure a bit as there
      is no module usage yet.
      
      Suggested-by: default avatarBaoquan He <bhe@redhat.com>
      Signed-off-by: default avatarKairui Song <kasong@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJiri Bohac <jbohac@suse.cz>
      Acked-by: default avatarBaoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Dave Young <dyoung@redhat.com>
      Link: https://lkml.kernel.org/r/20190308030508.13548-1-kasong@redhat.com
      
      ffc8599a
  5. Mar 22, 2019
  6. Mar 21, 2019
    • Davide Caratti's avatar
      net/sched: let actions use RCU to access 'goto_chain' · ee3bbfe8
      Davide Caratti authored
      
      use RCU when accessing the action chain, to avoid use after free in the
      traffic path when 'goto chain' is replaced on existing TC actions (see
      script below). Since the control action is read in the traffic path
      without holding the action spinlock, we need to explicitly ensure that
      a->goto_chain is not NULL before dereferencing (i.e it's not sufficient
      to rely on the value of TC_ACT_GOTO_CHAIN bits). Not doing so caused NULL
      dereferences in tcf_action_goto_chain_exec() when the following script:
      
       # tc chain add dev dd0 chain 42 ingress protocol ip flower \
       > ip_proto udp action pass index 4
       # tc filter add dev dd0 ingress protocol ip flower \
       > ip_proto udp action csum udp goto chain 42 index 66
       # tc chain del dev dd0 chain 42 ingress
       (start UDP traffic towards dd0)
       # tc action replace action csum udp pass index 66
      
      was run repeatedly for several hours.
      
      Suggested-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Suggested-by: default avatarVlad Buslov <vladbu@mellanox.com>
      Signed-off-by: default avatarDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ee3bbfe8
    • Davide Caratti's avatar
      net/sched: don't dereference a->goto_chain to read the chain index · fe384e2f
      Davide Caratti authored
      
      callers of tcf_gact_goto_chain_index() can potentially read an old value
      of the chain index, or even dereference a NULL 'goto_chain' pointer,
      because 'goto_chain' and 'tcfa_action' are read in the traffic path
      without caring of concurrent write in the control path. The most recent
      value of chain index can be read also from a->tcfa_action (it's encoded
      there together with TC_ACT_GOTO_CHAIN bits), so we don't really need to
      dereference 'goto_chain': just read the chain id from the control action.
      
      Fixes: e457d86a ("net: sched: add couple of goto_chain helpers")
      Signed-off-by: default avatarDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fe384e2f
    • Davide Caratti's avatar
      net/sched: prepare TC actions to properly validate the control action · 85d0966f
      Davide Caratti authored
      
      - pass a pointer to struct tcf_proto in each actions's init() handler,
        to allow validating the control action, checking whether the chain
        exists and (eventually) refcounting it.
      - remove code that validates the control action after a successful call
        to the action's init() handler, and replace it with a test that forbids
        addition of actions having 'goto_chain' and NULL goto_chain pointer at
        the same time.
      - add tcf_action_check_ctrlact(), that will validate the control action
        and eventually allocate the action 'goto_chain' within the init()
        handler.
      - add tcf_action_set_ctrlact(), that will assign the control action and
        swap the current 'goto_chain' pointer with the new given one.
      
      This disallows 'goto_chain' on actions that don't initialize it properly
      in their init() handler, i.e. calling tcf_action_check_ctrlact() after
      successful IDR reservation and then calling tcf_action_set_ctrlact()
      to assign 'goto_chain' and 'tcf_action' consistently.
      
      By doing this, the kernel does not leak anymore refcounts when a valid
      'goto chain' handle is replaced in TC actions, causing kmemleak splats
      like the following one:
      
       # tc chain add dev dd0 chain 42 ingress protocol ip flower \
       > ip_proto tcp action drop
       # tc chain add dev dd0 chain 43 ingress protocol ip flower \
       > ip_proto udp action drop
       # tc filter add dev dd0 ingress matchall \
       > action gact goto chain 42 index 66
       # tc filter replace dev dd0 ingress matchall \
       > action gact goto chain 43 index 66
       # echo scan >/sys/kernel/debug/kmemleak
       <...>
       unreferenced object 0xffff93c0ee09f000 (size 1024):
       comm "tc", pid 2565, jiffies 4295339808 (age 65.426s)
       hex dump (first 32 bytes):
         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
         00 00 00 00 08 00 06 00 00 00 00 00 00 00 00 00  ................
       backtrace:
         [<000000009b63f92d>] tc_ctl_chain+0x3d2/0x4c0
         [<00000000683a8d72>] rtnetlink_rcv_msg+0x263/0x2d0
         [<00000000ddd88f8e>] netlink_rcv_skb+0x4a/0x110
         [<000000006126a348>] netlink_unicast+0x1a0/0x250
         [<00000000b3340877>] netlink_sendmsg+0x2c1/0x3c0
         [<00000000a25a2171>] sock_sendmsg+0x36/0x40
         [<00000000f19ee1ec>] ___sys_sendmsg+0x280/0x2f0
         [<00000000d0422042>] __sys_sendmsg+0x5e/0xa0
         [<000000007a6c61f9>] do_syscall_64+0x5b/0x180
         [<00000000ccd07542>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
         [<0000000013eaa334>] 0xffffffffffffffff
      
      Fixes: db50514f ("net: sched: add termination action to allow goto chain")
      Fixes: 97763dc0 ("net_sched: reject unknown tcfa_action values")
      Signed-off-by: default avatarDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      85d0966f
    • Peter Xu's avatar
      genirq: Fix typo in comment of IRQD_MOVE_PCNTXT · 551417af
      Peter Xu authored
      
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Dou Liyang <douliyangs@gmail.com>
      Cc: Julien Thierry <julien.thierry@arm.com>
      Link: https://lkml.kernel.org/r/20190318065123.11862-1-peterx@redhat.com
      551417af
  7. Mar 20, 2019
  8. Mar 19, 2019
    • Dongli Zhang's avatar
      blk-mq: remove unused 'nr_expired' from blk_mq_hw_ctx · 9496c015
      Dongli Zhang authored
      
      There is no usage of 'nr_expired'.
      
      The 'nr_expired' was introduced by commit 1d9bd516 ("blk-mq: replace
      timeout synchronization with a RCU and generation based scheme"). Its usage
      was removed since commit 12f5b931 ("blk-mq: Remove generation
      seqeunce").
      
      Signed-off-by: default avatarDongli Zhang <dongli.zhang@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9496c015
    • Xin Long's avatar
      sctp: get sctphdr by offset in sctp_compute_cksum · 273160ff
      Xin Long authored
      
      sctp_hdr(skb) only works when skb->transport_header is set properly.
      
      But in Netfilter, skb->transport_header for ipv6 is not guaranteed
      to be right value for sctphdr. It would cause to fail to check the
      checksum for sctp packets.
      
      So fix it by using offset, which is always right in all places.
      
      v1->v2:
        - Fix the changelog.
      
      Fixes: e6d8b64b ("net: sctp: fix and consolidate SCTP checksumming code")
      Reported-by: default avatarLi Shuang <shuali@redhat.com>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      273160ff
    • Maxime Chevallier's avatar
      packets: Always register packet sk in the same order · a4dc6a49
      Maxime Chevallier authored
      
      When using fanouts with AF_PACKET, the demux functions such as
      fanout_demux_cpu will return an index in the fanout socket array, which
      corresponds to the selected socket.
      
      The ordering of this array depends on the order the sockets were added
      to a given fanout group, so for FANOUT_CPU this means sockets are bound
      to cpus in the order they are configured, which is OK.
      
      However, when stopping then restarting the interface these sockets are
      bound to, the sockets are reassigned to the fanout group in the reverse
      order, due to the fact that they were inserted at the head of the
      interface's AF_PACKET socket list.
      
      This means that traffic that was directed to the first socket in the
      fanout group is now directed to the last one after an interface restart.
      
      In the case of FANOUT_CPU, traffic from CPU0 will be directed to the
      socket that used to receive traffic from the last CPU after an interface
      restart.
      
      This commit introduces a helper to add a socket at the tail of a list,
      then uses it to register AF_PACKET sockets.
      
      Note that this changes the order in which sockets are listed in /proc and
      with sock_diag.
      
      Fixes: dc99f600 ("packet: Add fanout support")
      Signed-off-by: default avatarMaxime Chevallier <maxime.chevallier@bootlin.com>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a4dc6a49
  9. Mar 18, 2019
    • Jens Axboe's avatar
      block: add BIO_NO_PAGE_REF flag · 399254aa
      Jens Axboe authored
      
      If bio_iov_iter_get_pages() is called on an iov_iter that is flagged
      with NO_REF, then we don't need to add a page reference for the pages
      that we add.
      
      Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows
      not to drop a reference to these pages.
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      399254aa
    • Jens Axboe's avatar
      iov_iter: add ITER_BVEC_FLAG_NO_REF flag · 875f1d07
      Jens Axboe authored
      
      For ITER_BVEC, if we're holding on to kernel pages, the caller
      doesn't need to grab a reference to the bvec pages, and drop that
      same reference on IO completion. This is essentially safe for any
      ITER_BVEC, but some use cases end up reusing pages and uncondtionally
      dropping a page reference on completion. And example of that is
      sendfile(2), that ends up being a splice_in + splice_out on the
      pipe pages.
      
      Add a flag that tells us it's fine to not grab a page reference
      to the bvec pages, since that caller knows not to drop a reference
      when it's done with the pages.
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      875f1d07
    • Yishai Hadas's avatar
      IB/mlx5: Use mlx5 core to create/destroy a DEVX DCT · c5ae1954
      Yishai Hadas authored
      
      To prevent a hardware memory leak when a DEVX DCT object is destroyed
      without calling DRAIN DCT before, (e.g. under cleanup flow), need to
      manage its creation and destruction via mlx5 core.
      
      In that case the DRAIN DCT command will be called and only once that it
      will be completed the DESTROY DCT command will be called.  Otherwise, the
      DESTROY DCT may fail and a hardware leak may occur.
      
      As of that change the DRAIN DCT command should not be exposed any more
      from DEVX, it's managed internally by the driver to work as expected by
      the device specification.
      
      Fixes: 7efce369 ("IB/mlx5: Add obj create and destroy functionality")
      Signed-off-by: default avatarYishai Hadas <yishaih@mellanox.com>
      Reviewed-by: default avatarArtemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      c5ae1954
  10. Mar 17, 2019
  11. Mar 16, 2019
  12. Mar 15, 2019
    • Pedro Tammela's avatar
      net: add documentation to socket.c · 8a3c245c
      Pedro Tammela authored
      
      Adds missing sphinx documentation to the
      socket.c's functions. Also fixes some whitespaces.
      
      I also changed the style of older documentation as an
      effort to have an uniform documentation style.
      
      Signed-off-by: default avatarPedro Tammela <pctammela@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8a3c245c
    • YueHaibing's avatar
      appletalk: Fix potential NULL pointer dereference in unregister_snap_client · 9804501f
      YueHaibing authored
      
      register_snap_client may return NULL, all the callers
      check it, but only print a warning. This will result in
      NULL pointer dereference in unregister_snap_client and other
      places.
      
      It has always been used like this since v2.6
      
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9804501f
    • Josef Bacik's avatar
      filemap: kill page_cache_read usage in filemap_fault · a75d4c33
      Josef Bacik authored
      Patch series "drop the mmap_sem when doing IO in the fault path", v6.
      
      Now that we have proper isolation in place with cgroups2 we have started
      going through and fixing the various priority inversions.  Most are all
      gone now, but this one is sort of weird since it's not necessarily a
      priority inversion that happens within the kernel, but rather because of
      something userspace does.
      
      We have giant applications that we want to protect, and parts of these
      giant applications do things like watch the system state to determine how
      healthy the box is for load balancing and such.  This involves running
      'ps' or other such utilities.  These utilities will often walk
      /proc/<pid>/whatever, and these files can sometimes need to
      down_read(&task->mmap_sem).  Not usually a big deal, but we noticed when
      we are stress testing that sometimes our protected application has latency
      spikes trying to get the mmap_sem for tasks that are in lower priority
      cgroups.
      
      This is because any down_write() on a semaphore essentially turns it into
      a mutex, so even if we currently have it held for reading, any new readers
      will not be allowed on to keep from starving the writer.  This is fine,
      except a lower priority task could be stuck doing IO because it has been
      throttled to the point that its IO is taking much longer than normal.  But
      because a higher priority group depends on this completing it is now stuck
      behind lower priority work.
      
      In order to avoid this particular priority inversion we want to use the
      existing retry mechanism to stop from holding the mmap_sem at all if we
      are going to do IO.  This already exists in the read case sort of, but
      needed to be extended for more than just grabbing the page lock.  With
      io.latency we throttle at submit_bio() time, so the readahead stuff can
      block and even page_cache_read can block, so all these paths need to have
      the mmap_sem dropped.
      
      The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
      because we have to reserve space for the dirty page, which can be a very
      expensive operation.  We use the same retry method as the read path, and
      simply cache the page and verify the page is still setup properly the next
      pass through ->page_mkwrite().
      
      I've tested these patches with xfstests and there are no regressions.
      
      This patch (of 3):
      
      If we do not have a page at filemap_fault time we'll do this weird forced
      page_cache_read thing to populate the page, and then drop it again and
      loop around and find it.  This makes for 2 ways we can read a page in
      filemap_fault, and it's not really needed.  Instead add a FGP_FOR_MMAP
      flag so that pagecache_get_page() will return a unlocked page that's in
      pagecache.  Then use the normal page locking and readpage logic already in
      filemap_fault.  This simplifies the no page in page cache case
      significantly.
      
      [akpm@linux-foundation.org: fix comment text]
      [josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
        Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
      Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
      
      
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a75d4c33
  13. Mar 14, 2019
  14. Mar 13, 2019
    • Ricardo Biehl Pasquali's avatar
    • Martin KaFai Lau's avatar
      bpf: Add bpf_get_listener_sock(struct bpf_sock *sk) helper · dbafd7dd
      Martin KaFai Lau authored
      
      Add a new helper "struct bpf_sock *bpf_get_listener_sock(struct bpf_sock *sk)"
      which returns a bpf_sock in TCP_LISTEN state.  It will trace back to
      the listener sk from a request_sock if possible.  It returns NULL
      for all other cases.
      
      No reference is taken because the helper ensures the sk is
      in SOCK_RCU_FREE (where the TCP_LISTEN sock should be in).
      Hence, bpf_sk_release() is unnecessary and the verifier does not
      allow bpf_sk_release(listen_sk) to be called either.
      
      The following is also allowed because the bpf_prog is run under
      rcu_read_lock():
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) { ... } */
      	listen_sk = bpf_get_listener_sock(sk);
      	/* if (!listen_sk) { ... } */
      	bpf_sk_release(sk);
      	src_port = listen_sk->src_port; /* Allowed */
      
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      dbafd7dd
    • Martin KaFai Lau's avatar
      bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release · 1b986589
      Martin KaFai Lau authored
      
      Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk)
      can still be accessed after bpf_sk_release(sk).
      Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue.
      This patch addresses them together.
      
      A simple reproducer looks like this:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) ... */
      	tp = bpf_tcp_sock(sk);
      	/* if (!tp) ... */
      	bpf_sk_release(sk);
      	snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */
      
      The problem is the verifier did not scrub the register's states of
      the tcp_sock ptr (tp) after bpf_sk_release(sk).
      
      [ Note that when calling bpf_tcp_sock(sk), the sk is not always
        refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works
        fine for this case. ]
      
      Currently, the verifier does not track if a helper's return ptr (in REG_0)
      is "carry"-ing one of its argument's refcount status. To carry this info,
      the reg1->id needs to be stored in reg0.
      
      One approach was tried, like "reg0->id = reg1->id", when calling
      "bpf_tcp_sock()".  The main idea was to avoid adding another "ref_obj_id"
      for the same reg.  However, overlapping the NULL marking and ref
      tracking purpose in one "id" does not work well:
      
      	ref_sk = bpf_sk_lookup_tcp();
      	fullsock = bpf_sk_fullsock(ref_sk);
      	tp = bpf_tcp_sock(ref_sk);
      	if (!fullsock) {
      	     bpf_sk_release(ref_sk);
      	     return 0;
      	}
      	/* fullsock_reg->id is marked for NOT-NULL.
      	 * Same for tp_reg->id because they have the same id.
      	 */
      
      	/* oops. verifier did not complain about the missing !tp check */
      	snd_cwnd = tp->snd_cwnd;
      
      Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state".
      With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can
      scrub all reg states which has a ref_obj_id match.  It is done with the
      changes in release_reg_references() in this patch.
      
      While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and
      bpf_sk_fullsock() to avoid these helpers from returning
      another ptr. It will make bpf_sk_release(tp) possible:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) ... */
      	tp = bpf_tcp_sock(sk);
      	/* if (!tp) ... */
      	bpf_sk_release(tp);
      
      A separate helper "bpf_get_listener_sock()" will be added in a later
      patch to do sk_to_full_sk().
      
      Misc change notes:
      - To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed
        from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON.  ARG_PTR_TO_SOCKET
        is removed from bpf.h since no helper is using it.
      
      - arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted()
        because ARG_PTR_TO_SOCK_COMMON is the only one and skb->sk is not
        refcounted.  All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock()
        take ARG_PTR_TO_SOCK_COMMON.
      
      - check_refcount_ok() ensures is_acquire_function() cannot take
        arg_type_may_be_refcounted() as its argument.
      
      - The check_func_arg() can only allow one refcount-ed arg.  It is
        guaranteed by check_refcount_ok() which ensures at most one arg can be
        refcounted.  Hence, it is a verifier internal error if >1 refcount arg
        found in check_func_arg().
      
      - In release_reference(), release_reference_state() is called
        first to ensure a match on "reg->ref_obj_id" can be found before
        scrubbing the reg states with release_reg_references().
      
      - reg_is_refcounted() is no longer needed.
        1. In mark_ptr_or_null_regs(), its usage is replaced by
           "ref_obj_id && ref_obj_id == id" because,
           when is_null == true, release_reference_state() should only be
           called on the ref_obj_id obtained by a acquire helper (i.e.
           is_acquire_function() == true).  Otherwise, the following
           would happen:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) { ... } */
      	fullsock = bpf_sk_fullsock(sk);
      	if (!fullsock) {
      		/*
      		 * release_reference_state(fullsock_reg->ref_obj_id)
      		 * where fullsock_reg->ref_obj_id == sk_reg->ref_obj_id.
      		 *
      		 * Hence, the following bpf_sk_release(sk) will fail
      		 * because the ref state has already been released in the
      		 * earlier release_reference_state(fullsock_reg->ref_obj_id).
      		 */
      		bpf_sk_release(sk);
      	}
      
        2. In release_reg_references(), the current reg_is_refcounted() call
           is unnecessary because the id check is enough.
      
      - The type_is_refcounted() and type_is_refcounted_or_null()
        are no longer needed also because reg_is_refcounted() is removed.
      
      Fixes: 655a51e5 ("bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock")
      Reported-by: default avatarLorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      1b986589
    • Douglas Anderson's avatar
      tracing: kdb: Fix ftdump to not sleep · 31b265b3
      Douglas Anderson authored
      As reported back in 2016-11 [1], the "ftdump" kdb command triggers a
      BUG for "sleeping function called from invalid context".
      
      kdb's "ftdump" command wants to call ring_buffer_read_prepare() in
      atomic context.  A very simple solution for this is to add allocation
      flags to ring_buffer_read_prepare() so kdb can call it without
      triggering the allocation error.  This patch does that.
      
      Note that in the original email thread about this, it was suggested
      that perhaps the solution for kdb was to either preallocate the buffer
      ahead of time or create our own iterator.  I'm hoping that this
      alternative of adding allocation flags to ring_buffer_read_prepare()
      can be considered since it means I don't need to duplicate more of the
      core trace code into "trace_kdb.c" (for either creating my own
      iterator or re-preparing a ring allocator whose memory was already
      allocated).
      
      NOTE: another option for kdb is to actually figure out how to make it
      reuse the existing ftrace_dump() function and totally eliminate the
      duplication.  This sounds very appealing and actually works (the "sr
      z" command can be seen to properly dump the ftrace buffer).  The
      downside here is that ftrace_dump() fully consumes the trace buffer.
      Unless that is changed I'd rather not use it because it means "ftdump
      | grep xyz" won't be very useful to search the ftrace buffer since it
      will throw away the whole trace on the first grep.  A future patch to
      dump only the last few lines of the buffer will also be hard to
      implement.
      
      [1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com
      
      Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.org
      
      
      
      Reported-by: default avatarBrian Norris <briannorris@chromium.org>
      Signed-off-by: default avatarDouglas Anderson <dianders@chromium.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      31b265b3
    • Chao Yu's avatar
      f2fs: print more parameters in trace_f2fs_map_blocks · 76630f20
      Chao Yu authored
      
      for better map_blocks trace.
      
      Signed-off-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      76630f20
    • Chao Yu's avatar
      f2fs: trace f2fs_ioc_shutdown · 559e87c4
      Chao Yu authored
      
      This patch supports to trace f2fs_ioc_shutdown.
      
      Signed-off-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      559e87c4
    • Zeng Guangyue's avatar
      f2fs: correct spelling mistake · 68b79cdc
      Zeng Guangyue authored
      
      correct spelling mistake for "nunmber"
      
      Signed-off-by: default avatarZeng Guangyue <zengguangyue@hisilicon.com>
      Reviewed-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      68b79cdc
  15. Mar 12, 2019
    • Abel Vesa's avatar
      dt-bindings: clock: imx8mq: Fix numbering overlaps and gaps · 010d5166
      Abel Vesa authored
      
      IMX8MQ_CLK_USB_PHY_REF changes from 163 to 153, this way removing the gap.
      All the following clock ids are now decreased by 10 to keep the numbering
      right. Doing this, the IMX8MQ_CLK_CSI2_CORE is not overlapped with
      IMX8MQ_CLK_GPT1 anymore. IMX8MQ_CLK_GPT1_ROOT changes from 193 to 183 and
      all the following ids are updated accordingly.
      
      Reported-by: default avatarPatrick Wildt <patrick@blueri.se>
      Fixes: 1cf3817b ("dt-bindings: Add binding for i.MX8MQ CCM")
      Signed-off-by: default avatarAbel Vesa <abel.vesa@nxp.com>
      Signed-off-by: default avatarStephen Boyd <sboyd@kernel.org>
      010d5166
    • Olga Kornievskaia's avatar
      fix null pointer deref in tracepoints in back channel · f87b543a
      Olga Kornievskaia authored
      
      Backchannel doesn't have the rq_task->tk_clientid pointer set.
      
      Otherwise can lead to the following oops:
      ocalhost login: [  111.385319] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
      [  111.388073] #PF error: [normal kernel read fault]
      [  111.389452] PGD 80000000290d8067 P4D 80000000290d8067 PUD 75f25067 PMD 0
      [  111.391224] Oops: 0000 [#1] SMP PTI
      [  111.392151] CPU: 0 PID: 3533 Comm: NFSv4 callback Not tainted 5.0.0-rc7+ #1
      [  111.393787] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
      [  111.396340] RIP: 0010:trace_event_raw_event_xprt_enq_xmit+0x6f/0xf0 [sunrpc]
      [  111.397974] Code: 00 00 00 48 89 ee 48 89 e7 e8 bd 0a 85 d7 48 85 c0 74 4a 41 0f b7 94 24 e0 00 00 00 48 89 e7 89 50 08 49 8b 94 24 a8 00 00 00 <8b> 52 04 89 50 0c 49 8b 94 24 c0 00 00 00 8b 92 a8 00 00 00 0f ca
      [  111.402215] RSP: 0018:ffffb98743263cf8 EFLAGS: 00010286
      [  111.403406] RAX: ffffa0890fc3bc88 RBX: 0000000000000003 RCX: 0000000000000000
      [  111.405057] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb98743263cf8
      [  111.406656] RBP: ffffa0896f5368f0 R08: 0000000000000246 R09: 0000000000000000
      [  111.408437] R10: ffffe19b01c01500 R11: 0000000000000000 R12: ffffa08977d28a00
      [  111.410210] R13: 0000000000000004 R14: ffffa089315303f0 R15: ffffa08931530000
      [  111.411856] FS:  0000000000000000(0000) GS:ffffa0897bc00000(0000) knlGS:0000000000000000
      [  111.413699] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  111.415068] CR2: 0000000000000004 CR3: 000000002ac90004 CR4: 00000000001606f0
      [  111.416745] Call Trace:
      [  111.417339]  xprt_request_enqueue_transmit+0x2b6/0x4a0 [sunrpc]
      [  111.418709]  ? rpc_task_need_encode+0x40/0x40 [sunrpc]
      [  111.419957]  call_bc_transmit+0xd5/0x170 [sunrpc]
      [  111.421067]  __rpc_execute+0x7e/0x3f0 [sunrpc]
      [  111.422177]  rpc_run_bc_task+0x78/0xd0 [sunrpc]
      [  111.423212]  bc_svc_process+0x281/0x340 [sunrpc]
      [  111.424325]  nfs41_callback_svc+0x130/0x1c0 [nfsv4]
      [  111.425430]  ? remove_wait_queue+0x60/0x60
      [  111.426398]  kthread+0xf5/0x130
      [  111.427155]  ? nfs_callback_authenticate+0x50/0x50 [nfsv4]
      [  111.428388]  ? kthread_bind+0x10/0x10
      [  111.429270]  ret_from_fork+0x1f/0x30
      
      localhost login: [  467.462259] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
      [  467.464411] #PF error: [normal kernel read fault]
      [  467.465445] PGD 80000000728c1067 P4D 80000000728c1067 PUD 728c0067 PMD 0
      [  467.466980] Oops: 0000 [#1] SMP PTI
      [  467.467759] CPU: 0 PID: 3517 Comm: NFSv4 callback Not tainted 5.0.0-rc7+ #1
      [  467.469393] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
      [  467.471840] RIP: 0010:trace_event_raw_event_xprt_transmit+0x7c/0xf0 [sunrpc]
      [  467.473392] Code: f6 48 85 c0 74 4b 49 8b 94 24 98 00 00 00 48 89 e7 0f b7 92 e0 00 00 00 89 50 08 49 8b 94 24 98 00 00 00 48 8b 92 a8 00 00 00 <8b> 52 04 89 50 0c 41 8b 94 24 a8 00 00 00 0f ca 89 50 10 41 8b 94
      [  467.477605] RSP: 0018:ffffabe7434fbcd0 EFLAGS: 00010282
      [  467.478793] RAX: ffff99720fc3bce0 RBX: 0000000000000003 RCX: 0000000000000000
      [  467.480409] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffabe7434fbcd0
      [  467.482011] RBP: ffff99726f631948 R08: 0000000000000246 R09: 0000000000000000
      [  467.483591] R10: 0000000070000000 R11: 0000000000000000 R12: ffff997277dfcc00
      [  467.485226] R13: 0000000000000000 R14: 0000000000000000 R15: ffff99722fecdca8
      [  467.486830] FS:  0000000000000000(0000) GS:ffff99727bc00000(0000) knlGS:0000000000000000
      [  467.488596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  467.489931] CR2: 0000000000000004 CR3: 00000000270e6006 CR4: 00000000001606f0
      [  467.491559] Call Trace:
      [  467.492128]  xprt_transmit+0x303/0x3f0 [sunrpc]
      [  467.493143]  ? rpc_task_need_encode+0x40/0x40 [sunrpc]
      [  467.494328]  call_bc_transmit+0x49/0x170 [sunrpc]
      [  467.495379]  __rpc_execute+0x7e/0x3f0 [sunrpc]
      [  467.496451]  rpc_run_bc_task+0x78/0xd0 [sunrpc]
      [  467.497467]  bc_svc_process+0x281/0x340 [sunrpc]
      [  467.498507]  nfs41_callback_svc+0x130/0x1c0 [nfsv4]
      [  467.499751]  ? remove_wait_queue+0x60/0x60
      [  467.500686]  kthread+0xf5/0x130
      [  467.501438]  ? nfs_callback_authenticate+0x50/0x50 [nfsv4]
      [  467.502640]  ? kthread_bind+0x10/0x10
      [  467.503454]  ret_from_fork+0x1f/0x30
      
      Signed-off-by: default avatarOlga Kornievskaia <kolga@netapp.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      f87b543a
    • Kent Overstreet's avatar
      Drop flex_arrays · 586187d7
      Kent Overstreet authored
      All existing users have been converted to generic radix trees
      
      Link: http://lkml.kernel.org/r/20181217131929.11727-8-kent.overstreet@gmail.com
      
      
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@gmail.com>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Pravin B Shelar <pshelar@ovn.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      586187d7
    • Kent Overstreet's avatar
      sctp: convert to genradix · 2075e50c
      Kent Overstreet authored
      This also makes sctp_stream_alloc_(out|in) saner, in that they no longer
      allocate new flex_arrays/genradixes, they just preallocate more
      elements.
      
      This code does however have a suspicious lack of locking.
      
      Link: http://lkml.kernel.org/r/20181217131929.11727-7-kent.overstreet@gmail.com
      
      
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@gmail.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Pravin B Shelar <pshelar@ovn.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2075e50c
    • Kent Overstreet's avatar
      generic radix trees · ba20ba2e
      Kent Overstreet authored
      Very simple radix tree implementation that supports storing arbitrary
      size entries, up to PAGE_SIZE - upcoming patches will convert existing
      flex_array users to genradixes.  The new genradix code has a much
      simpler API and implementation, and doesn't have a hard limit on the
      number of elements like flex_array does.
      
      Link: http://lkml.kernel.org/r/20181217131929.11727-5-kent.overstreet@gmail.com
      
      
      Signed-off-by: default avatarKent Overstreet <kent.overstreet@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Pravin B Shelar <pshelar@ovn.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ba20ba2e
    • Mike Rapoport's avatar
      memblock: remove memblock_{set,clear}_region_flags · fe145124
      Mike Rapoport authored
      The memblock API provides dedicated helpers to set or clear a flag on a
      memory region, e.g.  memblock_{mark,clear}_hotplug().
      
      The memblock_{set,clear}_region_flags() functions are used only by the
      memblock internal function that adjusts the region flags.  Drop these
      functions and use open-coded implementation instead.
      
      Link: http://lkml.kernel.org/r/1549455025-17706-2-git-send-email-rppt@linux.ibm.com
      
      
      Signed-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fe145124
Loading