      netfilter: nf_tables: autoload modules from the abort path · eb014de4
      This patch introduces a list of pending module requests. This new module
      list is composed of nft_module_request objects that contain the module
      name and one status field that tells if the module has been already
      loaded (the 'done' field).
      In the first pass, from the preparation phase, the netlink command finds
      that a module is missing on this list. Then, a module request is
      allocated and added to this list and nft_request_module() returns
      -EAGAIN. This triggers the abort path with the autoload parameter set
      from nfnetlink; request_module() is called and the module request enters
      the 'done' state. Since the mutex is released when loading modules from
      the abort phase, the module list is zapped so that this iteration occurs
      over a local list. Therefore, the request_module() calls happen when
      object lists are in a consistent state (after fully aborting the
      transaction) and the commit list is empty.
      On the second pass, the netlink command will find that it already tried
      to load the module, so it does not request it again and
      nft_request_module() returns 0. Then, there is a look up to find the
      object that the command was missing. If the module was successfully
      loaded, the command proceeds normally since it finds the missing object
      in place, otherwise -ENOENT is reported to userspace.
      This patch also updates nfnetlink to include the reason to enter the
      abort phase, which is required for this new autoload module rationale.
      Fixes: ec7470b8 ("netfilter: nf_tables: store transaction list locally while requesting module")
      Reported-by: <syzbot+29125d208b3dae9a7019@syzkaller.appspotmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
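The two-pass flow described in this commit can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: `struct module_request`, `struct request_list`, `request_module_deferred()`, `autoload_from_abort()` and `stub_loader()` are hypothetical names invented for illustration, a plain singly linked list stands in for the kernel list, and no locking or concurrency is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model of one pending module request. */
struct module_request {
	struct module_request *next;
	char name[64];
	bool done;		/* true once a load has been attempted */
};

/* Stand-in for the shared list of pending requests. */
struct request_list {
	struct module_request *head;
};

/* Stub used in place of request_module(); it only counts attempts. */
static int loads_attempted;

static int stub_loader(const char *name)
{
	(void)name;
	loads_attempted++;
	return 0;
}

/*
 * Preparation phase: return 0 if a load was already attempted for this
 * name, otherwise make sure a request is queued and return -EAGAIN.
 */
static int request_module_deferred(struct request_list *list, const char *name)
{
	struct module_request *req;

	assert(list && name);
	for (req = list->head; req; req = req->next)
		if (strcmp(req->name, name) == 0)
			return req->done ? 0 : -EAGAIN;

	req = calloc(1, sizeof(*req));
	if (!req)
		return -ENOMEM;
	snprintf(req->name, sizeof(req->name), "%s", name);
	req->next = list->head;
	list->head = req;
	return -EAGAIN;
}

/*
 * Abort phase: detach the shared list, walk the local copy while the
 * (imagined) mutex is dropped, then put the entries back.
 */
static void autoload_from_abort(struct request_list *list,
				int (*load)(const char *name))
{
	struct module_request *local = list->head;
	struct module_request *req;

	list->head = NULL;		/* shared list is zapped */
	for (req = local; req; req = req->next) {
		if (!req->done) {
			load(req->name);
			req->done = true;
		}
	}
	list->head = local;		/* restore once the walk is over */
}
```

In this model the first call for a name returns -EAGAIN, the abort step attempts the load exactly once, and the second call returns 0 so the caller can go on to the object lookup, which decides between success and -ENOENT.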
      netfilter: nf_tables: add __nft_chain_type_get() · 82603549
      This new helper function validates that unknown family and chain type
      coming from userspace do not trigger an out-of-bound array access. Bail
      out in case __nft_chain_type_get() returns NULL from
      Fixes: 9370761c ("netfilter: nf_tables: convert built-in tables/chains to chain types")
      Reported-by: <syzbot+156a04714799b1d480bc@syzkaller.appspotmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
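The idea of validating family and chain type before indexing a table can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: the table dimensions `FAMILY_MAX` and `TYPE_MAX`, the `struct chain_type` layout and the `chain_type_get()` helper are hypothetical and do not reproduce the kernel's `__nft_chain_type_get()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table dimensions; the real limits live in kernel headers. */
#define FAMILY_MAX 4
#define TYPE_MAX   3

struct chain_type {
	const char *name;
};

static const struct chain_type example_filter = { "filter" };

/* Sparse lookup table indexed by [family][type]; unset slots are NULL. */
static const struct chain_type *chain_types[FAMILY_MAX][TYPE_MAX] = {
	[1][0] = &example_filter,
};

/* Return the registered type, or NULL for out-of-range or empty slots. */
static const struct chain_type *chain_type_get(unsigned int family,
					       unsigned int type)
{
	if (family >= FAMILY_MAX || type >= TYPE_MAX)
		return NULL;
	return chain_types[family][type];
}

/* Usage: callers bail out on NULL instead of indexing past the table. */
static void chain_type_example(void)
{
	assert(chain_type_get(1, 0) == &example_filter);
	assert(chain_type_get(0, 0) == NULL);		/* empty slot */
	assert(chain_type_get(FAMILY_MAX, 0) == NULL);	/* bad family */
	assert(chain_type_get(1, TYPE_MAX) == NULL);	/* bad type */
}
```

Folding the range check and the "nothing registered" case into one NULL return keeps every caller down to a single test.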
      netfilter: nf_tables_offload: fix check the chain offload flag · c83de17d
      In the nft_indr_block_cb the chain should check the flag with
      Fixes: 9a32669f ("netfilter: nf_tables_offload: support indr block call")
      Signed-off-by: wenxu <wenxu@ucloud.cn>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      netfilter: conntrack: sctp: use distinct states for new SCTP connections · ab658b9f
      The netlink notifications triggered by the INIT and INIT_ACK chunks
      for a tracked SCTP association do not include protocol information
      for the corresponding connection - SCTP state and verification tags
      for the original and reply direction are missing. Since the connection
      tracking implementation allows user space programs to receive
      notifications about a connection and then create a new connection
      based on the values received in a notification, it makes sense that
      INIT and INIT_ACK notifications should contain the SCTP state
      and verification tags available at the time when a notification
      is sent. The missing verification tags cause a newly created
      netfilter connection to fail to verify the tags of SCTP packets
      when this connection has been created from the values previously
      received in an INIT or INIT_ACK notification.
      A PROTOINFO event is cached in sctp_packet() when the state
      of a connection changes. The CLOSED and COOKIE_WAIT states will
      be used for connections that have seen an INIT and INIT_ACK chunk,
      respectively. The distinct states will cause a connection state
      change in sctp_packet().
      Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      net: systemport: Fixed queue mapping in internal ring map · 5a9ef194
      Florian Fainelli authored
      We would not be transmitting using the correct SYSTEMPORT transmit queue
      during ndo_select_queue() which looks up the internal TX ring map
      because while establishing the mapping we would be off by 4, so for
      instance, when we populate switch port mappings we would be doing:
      switch port 0, queue 0 -> ring index #0
      switch port 0, queue 1 -> ring index #1
      switch port 0, queue 3 -> ring index #3
      switch port 1, queue 0 -> ring index #8 (4 + 4 * 1)
      instead of using ring index #4. This would cause our ndo_select_queue()
      to use the fallback queue mechanism which would pick up an incorrect
      ring for that switch port. Fix this by using the correct switch queue
      number instead of SYSTEMPORT queue number.
      Fixes: 25c44070 ("net: systemport: Simplify queue mapping logic")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
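The off-by-four in the SYSTEMPORT ring map can be reproduced with plain arithmetic. The sketch below is not the driver's code: `ring_map_index()` and `QUEUES_PER_PORT` are hypothetical names, and the value 4 is taken from the "4 * 1" term in the commit's own example.

```c
#include <assert.h>

/* Assumed from the "4 * port" term in the commit's example. */
#define QUEUES_PER_PORT 4

/* Ring map slot for a (switch port, switch queue) pair. */
static unsigned int ring_map_index(unsigned int port, unsigned int queue)
{
	return queue + port * QUEUES_PER_PORT;
}

static void ring_map_example(void)
{
	/* Switch queue numbers 0..3 give a dense mapping. */
	assert(ring_map_index(0, 3) == 3);
	assert(ring_map_index(1, 0) == 4);

	/* Passing a SYSTEMPORT queue number (4 here) reproduces the
	 * off-by-four slot from the commit message: 4 + 4 * 1. */
	assert(ring_map_index(1, 4) == 8);
}
```

With the switch queue number as input, port 1 queue 0 lands on slot 4 directly after port 0's four slots, which is the mapping the commit message expects.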
      net: dsa: bcm_sf2: Configure IMP port for 2Gb/sec · 8f1880cb
      Florian Fainelli authored
      With the implementation of the system reset controller we lost a setting
      that is currently applied by the bootloader and which configures the IMP
      port for 2Gb/sec, the default is 1Gb/sec. This is needed given the
      number of ports and applications we expect to run so bring back that
      Fixes: 01b0ac07589e ("net: dsa: bcm_sf2: Add support for optional reset controller line")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      net: dsa: sja1105: Don't error out on disabled ports with no phy-mode · 27afe0d3
      Vladimir Oltean authored
      The sja1105_parse_ports_node function was tested only on device trees
      where all ports were enabled. Fix this check so that the driver
      continues to probe only with the ports where status is not "disabled",
      as expected.
      Fixes: 8aa9ebcc ("net: dsa: Introduce driver for NXP SJA1105 5-port L2 switch")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      net: phy: dp83867: Set FORCE_LINK_GOOD to default after reset · 86ffe920
      Michael Grzeschik authored
      According to the datasheet, this bit should be 0 (Normal operation) by
      default. With the FORCE_LINK_GOOD bit set, it is not possible to get a
      link. This patch sets FORCE_LINK_GOOD to the default value after
      resetting the phy.
      Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      net: hns: fix soft lockup when there is not enough memory · 49edd6a2
      Yonglong Liu authored
      When there is not enough memory and napi_alloc_skb() returns NULL,
      the HNS driver prints an error message and then tries again. If
      memory stays short for a while, the flood of error messages and the
      retry operation cause a soft lockup.
      When napi_alloc_skb() returns NULL because of no memory, we already
      get a warn_alloc() call trace, so this patch deletes the error
      message. We already use polling mode to handle irq, but the
      retry operation renders the polling weight inactive; this
      patch just returns budget when the rx is not completed to avoid
      a dead loop.
      Fixes: 36eedfde ("net: hns: Optimize hns_nic_common_poll for better performance")
      Fixes: b5996f11 ("net: add Hisilicon Network Subsystem basic ethernet support")
      Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key() · 53d37497
      Cong Wang authored
      syzbot reported some bogus lockdep warnings, for example bad unlock
      balance in sch_direct_xmit(). They are due to a race condition between
      slow path and fast path, that is qdisc_xmit_lock_key gets re-registered
      in netdev_update_lockdep_key() on slow path, while we could still
      acquire the queue->_xmit_lock on fast path in this small window:
      CPU A						CPU B
      In fact, unlike the addr_list_lock which has to be reordered when
      the master/slave device relationship changes, queue->_xmit_lock is
      only acquired on fast path and only when NETIF_F_LLTX is not set,
      so there is likely no nested locking for it.
      Therefore, we can just get rid of re-registration of
      Reported-by: <syzbot+4ec99438ed7450da6272@syzkaller.appspotmail.com>
      Fixes: ab92d68f ("net: core: add generic lockdep keys")
      Cc: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      net/sched: act_ife: initalize ife->metalist earlier · 44c23d71
      Eric Dumazet authored
      It seems better to init ife->metalist earlier in tcf_ife_init()
      to avoid the following crash:
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 10483 Comm: syz-executor216 Not tainted 5.5.0-rc5-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:_tcf_ife_cleanup net/sched/act_ife.c:412 [inline]
      RIP: 0010:tcf_ife_cleanup+0x6e/0x400 net/sched/act_ife.c:431
      Code: 48 c1 ea 03 80 3c 02 00 0f 85 94 03 00 00 49 8b bd f8 00 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8d 67 e8 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 5c 03 00 00 48 bb 00 00 00 00 00 fc ff df 48 8b
      RSP: 0018:ffffc90001dc6d00 EFLAGS: 00010246
      RAX: dffffc0000000000 RBX: ffffffff864619c0 RCX: ffffffff815bfa09
      RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000
      RBP: ffffc90001dc6d50 R08: 0000000000000004 R09: fffff520003b8d8e
      R10: fffff520003b8d8d R11: 0000000000000003 R12: ffffffffffffffe8
      R13: ffff8880a79fc000 R14: ffff88809aba0e00 R15: 0000000000000000
      FS:  0000000001b51880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000563f52cce140 CR3: 0000000093541000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       tcf_action_cleanup+0x62/0x1b0 net/sched/act_api.c:119
       __tcf_action_put+0xfa/0x130 net/sched/act_api.c:135
       __tcf_idr_release net/sched/act_api.c:165 [inline]
       __tcf_idr_release+0x59/0xf0 net/sched/act_api.c:145
       tcf_idr_release include/net/act_api.h:171 [inline]
       tcf_ife_init+0x97c/0x1870 net/sched/act_ife.c:616
       tcf_action_init_1+0x6b6/0xa40 net/sched/act_api.c:944
       tcf_action_init+0x21a/0x330 net/sched/act_api.c:1000
       tcf_action_add+0xf5/0x3b0 net/sched/act_api.c:1410
       tc_ctl_action+0x390/0x488 net/sched/act_api.c:1465
       rtnetlink_rcv_msg+0x45e/0xaf0 net/core/rtnetlink.c:5424
       netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
       rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5442
       netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
       netlink_unicast+0x58c/0x7d0 net/netlink/af_netlink.c:1328
       netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
       sock_sendmsg_nosec net/socket.c:639 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:659
       ____sys_sendmsg+0x753/0x880 net/socket.c:2330
       ___sys_sendmsg+0x100/0x170 net/socket.c:2384
       __sys_sendmsg+0x105/0x1d0 net/socket.c:2417
       __do_sys_sendmsg net/socket.c:2426 [inline]
       __se_sys_sendmsg net/socket.c:2424 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2424
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
      Fixes: 11a94d7f ("net/sched: act_ife: validate the control action inside init()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Davide Caratti <dcaratti@redhat.com>
      Reviewed-by: Davide Caratti <dcaratti@redhat.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · a72b6a1e
      Pablo Neira Ayuso says:
      Netfilter updates for net
      The following patchset contains Netfilter fixes for net:
      1) Fix use-after-free in ipset bitmap destroy path, from Cong Wang.
      2) Missing init netns in entry cleanup path of arp_tables,
         from Florian Westphal.
      3) Fix WARN_ON in set destroy path due to missing cleanup on
         transaction error.
      4) Incorrect netlink sanity check in tunnel, from Florian Westphal.
      5) Missing sanity check for erspan version netlink attribute, also
         from Florian.
      6) Remove WARN in nft_request_module() that can be triggered from
         userspace, from Florian Westphal.
      7) Memleak in NFTA_HOOK_DEVS netlink parser, from Dan Carpenter.
      8) List poison from commit path for flowtables that are added and
         deleted in the same batch, from Florian Westphal.
      9) Fix NAT ICMP packet corruption, from Eyal Birger.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Merge branch 'bpf-sockmap-tls-fixes' · 85ddd9c3
      John Fastabend says:
      To date our usage of sockmap/tls has been fairly simple, the BPF programs
      did only well-defined pop, push, pull and apply/cork operations.
      Now that we started to push more complex programs into sockmap we uncovered
      a series of issues addressed here. Further, OpenSSL 3.0 should be
      released soon with kTLS support, so it's important to get any remaining
      issues on BPF and kTLS support resolved.
      Additionally, I have a patch under development to allow sockmap to be
      enabled/disabled at runtime for Cilium endpoints. This allows us to stress
      the map insert/delete with kTLS more than previously where Cilium only
      added the socket to the map when it entered ESTABLISHED state and never
      touched it from the control path side again, relying on the socket's own
      close() hook to remove it.
      To test I have a set of test cases in test_sockmap.c that expose these
      issues. Once we get fixes here merged and in bpf-next I'll submit the
      tests to bpf-next tree to ensure we don't regress again. Also, I've run
      these patches in the Cilium CI with OpenSSL (master branch); this will
      run tools such as netperf, ab, wrk2, curl, etc. to get a broad set of
      I'm aware of two more issues that we are working to resolve in another
      couple (probably two) patches. First we see an auth tag corruption in
      kTLS when sending small 1-byte chunks under stress. I've not pinned this
      down yet. But, guessing because it's under 1B stress tests, it must be
      some error path being triggered. And second we need to ensure BPF RX
      programs are not skipped when kTLS ULP is loaded. This breaks some of the
      sockmap selftests when running with kTLS. I'll send a follow up for this.
      v2: I dropped a patch that added !0 size check in tls_push_record
          this originated from a panic I caught awhile ago with a trace
          in the crypto stack. But I can not reproduce it anymore so will
          dig into that and send another patch later if needed. Anyways
          after a bit of thought it would be nicer if tls/crypto/bpf didn't
          require special case handling for the !0 size.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      bpf: Sockmap/tls, fix pop data with SK_DROP return code · 7361d448
      John Fastabend authored
      When user returns SK_DROP we need to reset the number of copied bytes
      to indicate to the user the bytes were dropped and not sent. If we
      don't reset the copied arg sendmsg will return as if those bytes were
      copied giving the user a positive return value.
      This works as expected today except in the case where the user also
      pops bytes. In the pop case the sg.size is reduced but we don't correctly
      account for this when copied bytes is reset. The popped bytes are not
      accounted for and we return a small positive value potentially confusing
      the user.
      The reason this happens is due to a typo where we do the wrong comparison
      when accounting for pop bytes. In this fix notice the if/else is not
      needed and that we have a similar problem if we push data, except it's not
      visible to the user because if delta is larger than the sg.size we return a
      negative value so it appears as an error regardless.
      Fixes: 7246d8ed ("bpf: helper to pop data from messages")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-9-john.fastabend@gmail.com
      bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining · 9aaaa568
      John Fastabend authored
      It's possible through a set of push, pop, apply helper calls to construct
      a skmsg, which is just a ring of scatterlist elements, with the start
      value larger than the end value. For example,
            end       start
        |_0_|_1_| ... |_n_|_n+1_|
      Where end points at 1 and start points at n so that the valid elements are
      the set {n, n+1, 0, 1}.
      Currently, because we don't build the correct chain only {n, n+1} will
      be sent. This adds a check and sg_chain call to correctly submit the
      above to the crypto and tls send path.
      Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-8-john.fastabend@gmail.com
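The wrapped case in this commit, where start is larger than end, can be walked with modular indexing. This is a minimal sketch, not the sk_msg code: `ring_walk()` is a hypothetical helper, and treating end as part of the valid set follows the commit's {n, n+1, 0, 1} example rather than any particular kernel convention.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Collect the ring slots from start to end (both included, following the
 * commit's example), wrapping past the last slot. out needs ring_size slots.
 */
static size_t ring_walk(unsigned int start, unsigned int end,
			unsigned int ring_size, unsigned int *out)
{
	size_t count = 0;
	unsigned int i = start;

	assert(start < ring_size && end < ring_size);
	for (;;) {
		out[count++] = i;
		if (i == end)
			break;
		i = (i + 1) % ring_size;	/* wrap from the last slot to 0 */
	}
	return count;
}

static void ring_walk_example(void)
{
	unsigned int out[6];

	/* Ring of 6 slots with n = 4: start = 4, end = 1. */
	assert(ring_walk(4, 1, 6, out) == 4);
	assert(out[0] == 4 && out[1] == 5 && out[2] == 0 && out[3] == 1);
}
```

A walk that stops at the last array slot would visit only {4, 5} in this example, which corresponds to the truncated send the commit describes.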
      bpf: Sockmap/tls, tls_sw can create a plaintext buf > encrypt buf · d468e477
      John Fastabend authored
      It is possible to build a plaintext buffer using push helper that is larger
      than the allocated encrypt buffer. When this record is pushed to crypto
      layers this can result in a NULL pointer dereference because the crypto
      API expects the encrypt buffer is large enough to fit the plaintext
      buffer. Kernel splat below.
      To resolve this, catch the cases where this can happen and split the buffer into two
      records to send individually. Unfortunately, there is still one case to
      handle where the split creates a zero sized buffer. In this case we merge
      the buffers and unmark the split. This happens when apply is zero and user
      pushed data beyond encrypt buffer. This fixes the original case as well
      because the split allocated an encrypt buffer larger than the plaintext
      buffer and the merge simply moves the pointers around so we now have
      a reference to the new (larger) encrypt buffer.
      Perhaps it's not ideal but it seems the best solution for a fixes branch
      and avoids handling these two cases, (a) apply that needs split and (b)
      non apply case. These are edge cases anyway, so optimizing them seems not
      necessary unless someone wants to later in next branches.
      [  306.719107] BUG: kernel NULL pointer dereference, address: 0000000000000008
      [  306.747260] RIP: 0010:scatterwalk_copychunks+0x12f/0x1b0
      [  306.770350] Call Trace:
      [  306.770956]  scatterwalk_map_and_copy+0x6c/0x80
      [  306.772026]  gcm_enc_copy_hash+0x4b/0x50
      [  306.772925]  gcm_hash_crypt_remain_continue+0xef/0x110
      [  306.774138]  gcm_hash_crypt_continue+0xa1/0xb0
      [  306.775103]  ? gcm_hash_crypt_continue+0xa1/0xb0
      [  306.776103]  gcm_hash_assoc_remain_continue+0x94/0xa0
      [  306.777170]  gcm_hash_assoc_continue+0x9d/0xb0
      [  306.778239]  gcm_hash_init_continue+0x8f/0xa0
      [  306.779121]  gcm_hash+0x73/0x80
      [  306.779762]  gcm_encrypt_continue+0x6d/0x80
      [  306.780582]  crypto_gcm_encrypt+0xcb/0xe0
      [  306.781474]  crypto_aead_encrypt+0x1f/0x30
      [  306.782353]  tls_push_record+0x3b9/0xb20 [tls]
      [  306.783314]  ? sk_psock_msg_verdict+0x199/0x300
      [  306.784287]  bpf_exec_tx_verdict+0x3f2/0x680 [tls]
      [  306.785357]  tls_sw_sendmsg+0x4a3/0x6a0 [tls]
      test_sockmap test signature to trigger bug,
      [TEST]: (1, 1, 1, sendmsg, pass,redir,start 1,end 2,pop (1,2),ktls,):
      Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-7-john.fastabend@gmail.com
      bpf: Sockmap/tls, msg_push_data may leave end mark in place · cf21e9ba
      John Fastabend authored
      Leaving an incorrect end mark in place when passing to the crypto
      layer will cause the crypto layer to stop processing data before
      all data is encrypted. To fix this, clear the end mark on push
      data instead of expecting users of the helper to clear the
      mark value after the fact.
      This happens when we push data into the middle of a skmsg and
      have room for it so we don't do a set of copies that already
      clear the end flag.
      Fixes: 6fff607e ("bpf: sk_msg program helper bpf_msg_push_data")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-6-john.fastabend@gmail.com
      bpf: Sockmap, skmsg helper overestimates push, pull, and pop bounds · 6562e29c
      John Fastabend authored
      In the push, pull, and pop helpers operating on skmsg objects to make
      data writable or insert/remove data we use this bounds check to ensure
      specified data is valid,
       /* Bounds checks: start and pop must be inside message */
       if (start >= offset + l || last >= msg->sg.size)
           return -EINVAL;
      The problem here is that offset has already included the length of the
      current element, the 'l' above. So start could be past the end of
      the scatterlist element in the case where start also points into an
      offset on the last skmsg element.
      To fix this, do the accounting slightly differently by adding the length of
      the previous entry to offset at the start of the iteration. And
      ensure it's initialized to zero so that the first iteration does
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Fixes: 6fff607e ("bpf: sk_msg program helper bpf_msg_push_data")
      Fixes: 7246d8ed ("bpf: helper to pop data from messages")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-5-john.fastabend@gmail.com
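The accounting described in this commit, where the previous element's length is added to offset at the top of each iteration, can be sketched over a plain array of element lengths. This is an illustration only: `find_element()` and its parameters are hypothetical, the array stands in for the scatterlist ring, and only the `start` half of the bounds check is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Find the element that contains byte position start.
 * On success store its index and its starting offset and return 0;
 * return -EINVAL when start lies past the end of the data.
 */
static int find_element(const unsigned int *len, size_t n, unsigned int start,
			size_t *idx, unsigned int *offset_out)
{
	unsigned int offset = 0, l = 0;
	size_t i = 0;

	assert(len && idx && offset_out);
	if (n == 0)
		return -EINVAL;

	do {
		offset += l;	/* length of the previous element, 0 at first */
		l = len[i];
		if (start < offset + l)
			break;
		i++;
	} while (i < n);

	/* offset excludes the current element, so offset + l is its end. */
	if (start >= offset + l)
		return -EINVAL;

	*idx = i;
	*offset_out = offset;
	return 0;
}

static void find_element_example(void)
{
	const unsigned int len[] = { 4, 4, 4 };
	unsigned int off;
	size_t idx;

	assert(find_element(len, 3, 5, &idx, &off) == 0);
	assert(idx == 1 && off == 4);
	assert(find_element(len, 3, 12, &idx, &off) == -EINVAL);
}
```

Because offset never includes the current element in this version, `offset + l` is exactly the end of the element under inspection, so a start of 12 in a 12-byte message is rejected rather than accepted.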
      bpf: Sockmap/tls, push write_space updates through ulp updates · 33bfe20d
      John Fastabend authored
      When a sockmap sock with TLS enabled is removed we clean up bpf/psock state
      and call tcp_update_ulp() to push updates to TLS ULP on top. However, we
      don't push the write_space callback up and instead simply overwrite the
      op with the psock stored previous op. This may or may not be correct so
      to ensure we don't overwrite the TLS write space hook pass this field to
      the ULP and have it fixup the ctx.
      This completes a previous fix that pushed the ops through to the ULP
      but at the time missed doing this for write_space, presumably because
      write_space TLS hook was added around the same time.
      Fixes: 95fa1454 ("bpf: sockmap/tls, close can race with map free")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-4-john.fastabend@gmail.com
      bpf: Sockmap, ensure sock lock held during tear down · 7e81a353
      John Fastabend authored
      The sock_map_free() and sock_hash_free() paths used to delete sockmap
      and sockhash maps walk the maps and destroy psock and bpf state associated
      with the socks in the map. When done the socks no longer have BPF programs
      attached and will function normally. This can happen while the socks in
      the map are still "live" meaning data may be sent/received during the walk.
      Currently, though we don't take the sock_lock when the psock and bpf state
      is removed through this path. Specifically, this means we can be writing
      into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
      while they are also being called from the networking side. This is not
      safe; we never used proper READ_ONCE/WRITE_ONCE semantics here if we
      believed it was safe. Further, it's not clear to me it's even a good idea
      to try and do this on "live" sockets while the networking side might also
      be using the socket. Instead of trying to reason about using the socks
      from both sides, let's realize that every use case I'm aware of rarely
      deletes maps; in fact the kubernetes/Cilium case builds the map at init and
      never tears it down except on errors. So let's do the simple fix and
      grab the sock lock.
      This patch wraps sock deletes from maps in sock lock and adds some
      annotations so we catch any other cases easier.
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/bpf/20200111061206.8028-3-john.fastabend@gmail.com