- Mar 16, 2019
-
-
Björn Töpel authored
When the umem is cleaned up, the task that created it might already be gone. If the task was gone, the xdp_umem_release function did not free the pages member of struct xdp_umem. It turned out that the task lookup was not needed at all; The code was a left-over when we moved from task accounting to user accounting [1]. This patch fixes the memory leak by removing the task lookup logic completely. [1] https://lore.kernel.org/netdev/20180131135356.19134-3-bjorn.topel@gmail.com/ Link: https://lore.kernel.org/netdev/c1cb2ca8-6a14-3980-8672-f3de0bb38dfd@suse.cz/ Fixes: c0c77d8f ("xsk: add user memory registration support sockopt") Reported-by:
Jiri Slaby <jslaby@suse.cz> Signed-off-by:
Björn Töpel <bjorn.topel@intel.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net>
-
- Mar 14, 2019
-
-
Quentin Monnet authored
Add documentation for the BPF spinlock-related helpers to the doc in bpf.h. I added the constraints and restrictions coming with the use of spinlocks for BPF: not all of it is directly related to the use of the helper, but I thought it would be nice for users to find them in the man page. This list of restrictions is nearly a verbatim copy of the list in Alexei's commit log for those helpers. Signed-off-by:
Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by:
Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
Quentin Monnet authored
Another round of minor fixes for the documentation of the BPF helpers located in the UAPI bpf.h header file. Changes include: - Moving around description of some helpers, to keep the descriptions in the same order as helpers are declared (bpf_map_push_elem(), leftover from commit 90b1023f ("bpf: fix documentation for eBPF helpers"), bpf_rc_keydown(), and bpf_skb_ancestor_cgroup_id()). - Fixing typos ("contex" -> "context"). - Harmonising return types ("void* " -> "void *", "uint64_t" -> "u64"). - Addition of the "bpf_" prefix to bpf_get_storage(). - Light additions of RST markup on some keywords. - Empty line deletion between description and return value for bpf_tcp_sock(). - Edit for the description for bpf_skb_ecn_set_ce() (capital letters, acronym expansion, no effect if ECT not set, more details on return value). Signed-off-by:
Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by:
Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
- Mar 13, 2019
-
-
Martin KaFai Lau authored
Add a new helper "struct bpf_sock *bpf_get_listener_sock(struct bpf_sock *sk)" which returns a bpf_sock in TCP_LISTEN state. It will trace back to the listener sk from a request_sock if possible. It returns NULL for all other cases. No reference is taken because the helper ensures the sk is in SOCK_RCU_FREE (where the TCP_LISTEN sock should be in). Hence, bpf_sk_release() is unnecessary and the verifier does not allow bpf_sk_release(listen_sk) to be called either. The following is also allowed because the bpf_prog is run under rcu_read_lock(): sk = bpf_sk_lookup_tcp(); /* if (!sk) { ... } */ listen_sk = bpf_get_listener_sock(sk); /* if (!listen_sk) { ... } */ bpf_sk_release(sk); src_port = listen_sk->src_port; /* Allowed */ Signed-off-by:
Martin KaFai Lau <kafai@fb.com> Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk) can still be accessed after bpf_sk_release(sk). Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue. This patch addresses them together. A simple reproducer looks like this: sk = bpf_sk_lookup_tcp(); /* if (!sk) ... */ tp = bpf_tcp_sock(sk); /* if (!tp) ... */ bpf_sk_release(sk); snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */ The problem is the verifier did not scrub the register's states of the tcp_sock ptr (tp) after bpf_sk_release(sk). [ Note that when calling bpf_tcp_sock(sk), the sk is not always refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works fine for this case. ] Currently, the verifier does not track if a helper's return ptr (in REG_0) is "carry"-ing one of its argument's refcount status. To carry this info, the reg1->id needs to be stored in reg0. One approach was tried, like "reg0->id = reg1->id", when calling "bpf_tcp_sock()". The main idea was to avoid adding another "ref_obj_id" for the same reg. However, overlapping the NULL marking and ref tracking purpose in one "id" does not work well: ref_sk = bpf_sk_lookup_tcp(); fullsock = bpf_sk_fullsock(ref_sk); tp = bpf_tcp_sock(ref_sk); if (!fullsock) { bpf_sk_release(ref_sk); return 0; } /* fullsock_reg->id is marked for NOT-NULL. * Same for tp_reg->id because they have the same id. */ /* oops. verifier did not complain about the missing !tp check */ snd_cwnd = tp->snd_cwnd; Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state". With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can scrub all reg states which has a ref_obj_id match. It is done with the changes in release_reg_references() in this patch. While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and bpf_sk_fullsock() to avoid these helpers from returning another ptr. It will make bpf_sk_release(tp) possible: sk = bpf_sk_lookup_tcp(); /* if (!sk) ... */ tp = bpf_tcp_sock(sk); /* if (!tp) ... */ bpf_sk_release(tp); A separate helper "bpf_get_listener_sock()" will be added in a later patch to do sk_to_full_sk(). Misc change notes: - To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON. ARG_PTR_TO_SOCKET is removed from bpf.h since no helper is using it. - arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted() because ARG_PTR_TO_SOCK_COMMON is the only one and skb->sk is not refcounted. All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock() take ARG_PTR_TO_SOCK_COMMON. - check_refcount_ok() ensures is_acquire_function() cannot take arg_type_may_be_refcounted() as its argument. - The check_func_arg() can only allow one refcount-ed arg. It is guaranteed by check_refcount_ok() which ensures at most one arg can be refcounted. Hence, it is a verifier internal error if >1 refcount arg found in check_func_arg(). - In release_reference(), release_reference_state() is called first to ensure a match on "reg->ref_obj_id" can be found before scrubbing the reg states with release_reg_references(). - reg_is_refcounted() is no longer needed. 1. In mark_ptr_or_null_regs(), its usage is replaced by "ref_obj_id && ref_obj_id == id" because, when is_null == true, release_reference_state() should only be called on the ref_obj_id obtained by a acquire helper (i.e. is_acquire_function() == true). Otherwise, the following would happen: sk = bpf_sk_lookup_tcp(); /* if (!sk) { ... } */ fullsock = bpf_sk_fullsock(sk); if (!fullsock) { /* * release_reference_state(fullsock_reg->ref_obj_id) * where fullsock_reg->ref_obj_id == sk_reg->ref_obj_id. * * Hence, the following bpf_sk_release(sk) will fail * because the ref state has already been released in the * earlier release_reference_state(fullsock_reg->ref_obj_id). */ bpf_sk_release(sk); } 2. In release_reg_references(), the current reg_is_refcounted() call is unnecessary because the id check is enough. - The type_is_refcounted() and type_is_refcounted_or_null() are no longer needed also because reg_is_refcounted() is removed. Fixes: 655a51e5 ("bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock") Reported-by:
Lorenz Bauer <lmb@cloudflare.com> Signed-off-by:
Martin KaFai Lau <kafai@fb.com> Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
- Mar 10, 2019
-
-
Eric Dumazet authored
ip_mc_may_pull() must return 0 if there is a problem, not an errno. syzbot reported : BUG: KASAN: use-after-free in br_ip4_multicast_igmp3_report net/bridge/br_multicast.c:947 [inline] BUG: KASAN: use-after-free in br_multicast_ipv4_rcv net/bridge/br_multicast.c:1631 [inline] BUG: KASAN: use-after-free in br_multicast_rcv+0x3cd8/0x4440 net/bridge/br_multicast.c:1741 Read of size 4 at addr ffff88820a4084ee by task syz-executor.2/11183 CPU: 1 PID: 11183 Comm: syz-executor.2 Not tainted 5.0.0+ #14 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:131 br_ip4_multicast_igmp3_report net/bridge/br_multicast.c:947 [inline] br_multicast_ipv4_rcv net/bridge/br_multicast.c:1631 [inline] br_multicast_rcv+0x3cd8/0x4440 net/bridge/br_multicast.c:1741 br_handle_frame_finish+0xa3a/0x14c0 net/bridge/br_input.c:108 br_nf_hook_thresh+0x2ec/0x380 net/bridge/br_netfilter_hooks.c:1005 br_nf_pre_routing_finish+0x8e2/0x1750 net/bridge/br_netfilter_hooks.c:410 NF_HOOK include/linux/netfilter.h:289 [inline] NF_HOOK include/linux/netfilter.h:283 [inline] br_nf_pre_routing+0x7e7/0x13a0 net/bridge/br_netfilter_hooks.c:506 nf_hook_entry_hookfn include/linux/netfilter.h:119 [inline] nf_hook_slow+0xbf/0x1f0 net/netfilter/core.c:511 nf_hook include/linux/netfilter.h:244 [inline] NF_HOOK include/linux/netfilter.h:287 [inline] br_handle_frame+0x95b/0x1450 net/bridge/br_input.c:305 __netif_receive_skb_core+0xa96/0x3040 net/core/dev.c:4902 __netif_receive_skb_one_core+0xa8/0x1a0 net/core/dev.c:4971 __netif_receive_skb+0x2c/0x1c0 net/core/dev.c:5083 netif_receive_skb_internal+0x117/0x660 net/core/dev.c:5186 netif_receive_skb+0x6e/0x5a0 net/core/dev.c:5261 Fixes: ba5ea614 ("bridge: simplify ip_mc_check_igmp() and ipv6_mc_check_mld() calls") Signed-off-by:
Eric Dumazet <edumazet@google.com> Reported-by:
syzbot <syzkaller@googlegroups.com> Cc: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
As Eric Dumazet said, "We do not have a way to tell if the req was ever inserted in a hash table, so better play safe.". Let's remove this comment, so that nobody will be tempted to drop the WARN_ON_ONCE() line. Signed-off-by:
Guillaume Nault <gnault@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 08, 2019
-
-
Pedro Tammela authored
This patch adds missing documentation for some inline functions on linux/skbuff.h. The patch is incomplete and a lot more can be added, just wondering if it's of interest of the netdev developers. Also fixed some whitespaces. Signed-off-by:
Pedro Tammela <pctammela@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Bo YU authored
Sparse warning below: sudo make C=2 CF=-D__CHECK_ENDIAN__ M=net/bpf/ CHECK net/bpf//test_run.c net/bpf//test_run.c:19:77: warning: Using plain integer as NULL pointer ./include/linux/bpf-cgroup.h:295:77: warning: Using plain integer as NULL pointer Fixes: 8bad74f9 ("bpf: extend cgroup bpf core to allow multiple cgroup storage types") Acked-by:
Yonghong Song <yhs@fb.com> Signed-off-by:
Bo YU <tsu.yubo@gmail.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net>
-
David Howells authored
rxrpc_disconnect_client_call() reads the call's connection ID protocol value (call->cid) as part of that function's variable declarations. This is bad because it's not inside the locked section and so may race with someone granting use of the channel to the call. This manifests as an assertion failure (see below) where the call in the presumed channel (0 because call->cid wasn't set when we read it) doesn't match the call attached to the channel we were actually granted (if 1, 2 or 3). Fix this by moving the read and dependent calculations inside of the channel_lock section. Also, only set the channel number and pointer variables if cid is not zero (ie. unset). This problem can be induced by injecting an occasional error in rxrpc_wait_for_channel() before the call to schedule(). Make two further changes also: (1) Add a trace for wait failure in rxrpc_connect_call(). (2) Drop channel_lock before BUG'ing in the case of the assertion failure. The failure causes a trace akin to the following: rxrpc: Assertion failed - 18446612685268945920(0xffff8880beab8c00) == 18446612685268621312(0xffff8880bea69800) is false ------------[ cut here ]------------ kernel BUG at net/rxrpc/conn_client.c:824! ... RIP: 0010:rxrpc_disconnect_client_call+0x2bf/0x99d ... Call Trace: rxrpc_connect_call+0x902/0x9b3 ? wake_up_q+0x54/0x54 rxrpc_new_client_call+0x3a0/0x751 ? rxrpc_kernel_begin_call+0x141/0x1bc ? afs_alloc_call+0x1b5/0x1b5 rxrpc_kernel_begin_call+0x141/0x1bc afs_make_call+0x20c/0x525 ? afs_alloc_call+0x1b5/0x1b5 ? __lock_is_held+0x40/0x71 ? lockdep_init_map+0xaf/0x193 ? lockdep_init_map+0xaf/0x193 ? __lock_is_held+0x40/0x71 ? yfs_fs_fetch_data+0x33b/0x34a yfs_fs_fetch_data+0x33b/0x34a afs_fetch_data+0xdc/0x3b7 afs_read_dir+0x52d/0x97f afs_dir_iterate+0xa0/0x661 ? iterate_dir+0x63/0x141 iterate_dir+0xa2/0x141 ksys_getdents64+0x9f/0x11b ? filldir+0x111/0x111 ? do_syscall_64+0x3e/0x1a0 __x64_sys_getdents64+0x16/0x19 do_syscall_64+0x7d/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: 45025bce ("rxrpc: Improve management and caching of client connection objects") Signed-off-by:
David Howells <dhowells@redhat.com> Reviewed-by:
Marc Dionne <marc.dionne@auristor.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 07, 2019
-
-
Willem de Bruijn authored
BPF can adjust gso only for tcp bytestreams. Fail on other gso types. But only on gso packets. It does not touch this field if !gso_size. Fixes: b90efd22 ("bpf: only adjust gso_size on bytestream protocols") Signed-off-by:
Willem de Bruijn <willemb@google.com> Acked-by:
Yonghong Song <yhs@fb.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net>
-
- Mar 06, 2019
-
-
Arnd Bergmann authored
A bugfix just broke compilation of appletalk when CONFIG_SYSCTL is disabled: In file included from net/appletalk/ddp.c:65: net/appletalk/ddp.c: In function 'atalk_init': include/linux/atalk.h:164:34: error: expected expression before 'do' #define atalk_register_sysctl() do { } while(0) ^~ net/appletalk/ddp.c:1934:7: note: in expansion of macro 'atalk_register_sysctl' rc = atalk_register_sysctl(); This is easier to avoid by using conventional inline functions as stubs rather than macros. The header already has inline functions for other purposes, so I'm changing over all the macros for consistency. Fixes: 6377f787 ("appletalk: Fix use-after-free in atalk_proc_exit") Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 04, 2019
-
-
Eran Ben Elisha authored
It is possible that a reporter state will be updated due to a recover flow which is not triggered by a devlink health related operation, but as a side effect of some other operation in the system. Expose devlink health API for a direct update of a reporter status. Move devlink_health_reporter_state enum definition to devlink.h so it could be used from drivers as a parameter of devlink_health_reporter_state_update. In addition, add trace_devlink_health_reporter_state_update to provide user notification for reporter state change. Signed-off-by:
Eran Ben Elisha <eranbe@mellanox.com> Acked-by:
Jiri Pirko <jiri@mellanox.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Linus Torvalds authored
Every in-kernel use of this function defined it to KERNEL_DS (either as an actual define, or as an inline function). It's an entirely historical artifact, and long long long ago used to actually read the segment selector valueof '%ds' on x86. Which in the kernel is always KERNEL_DS. Inspired by a patch from Jann Horn that just did this for a very small subset of users (the ones in fs/), along with Al who suggested a script. I then just took it to the logical extreme and removed all the remaining gunk. Roughly scripted with git grep -l '(get_ds())' -- :^tools/ | xargs sed -i 's/(get_ds())/(KERNEL_DS)/' git grep -lw 'get_ds' -- :^tools/ | xargs sed -i '/^#define get_ds()/d' plus manual fixups to remove a few unusual usage patterns, the couple of inline function cases and to fix up a comment that had become stale. The 'get_ds()' function remains in an x86 kvm selftest, since in user space it actually does something relevant. Inspired-by:
Jann Horn <jannh@google.com> Inspired-by:
Al Viro <viro@zeniv.linux.org.uk> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Linus Torvalds authored
Al Viro root-caused a race where the IOCB_CMD_POLL handling of fget/fput() could cause us to access the file pointer after it had already been freed: "In more details - normally IOCB_CMD_POLL handling looks so: 1) io_submit(2) allocates aio_kiocb instance and passes it to aio_poll() 2) aio_poll() resolves the descriptor to struct file by req->file = fget(iocb->aio_fildes) 3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that aio_kiocb to 2 (bumps by 1, that is). 4) aio_poll() calls vfs_poll(). After sanity checks (basically, "poll_wait() had been called and only once") it locks the queue. That's what the extra reference to iocb had been for - we know we can safely access it. 5) With queue locked, we check if ->woken has already been set to true (by aio_poll_wake()) and, if it had been, we unlock the queue, drop a reference to aio_kiocb and bugger off - at that point it's a responsibility to aio_poll_wake() and the stuff called/scheduled by it. That code will drop the reference to file in req->file, along with the other reference to our aio_kiocb. 6) otherwise, we see whether we need to wait. If we do, we unlock the queue, drop one reference to aio_kiocb and go away - eventual wakeup (or cancel) will deal with the reference to file and with the other reference to aio_kiocb 7) otherwise we remove ourselves from waitqueue (still under the queue lock), so that wakeup won't get us. No async activity will be happening, so we can safely drop req->file and iocb ourselves. If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb won't get freed under us, so we can do all the checks and locking safely. And we don't touch ->file if we detect that case. However, vfs_poll() most certainly *does* touch the file it had been given. So wakeup coming while we are still in ->poll() might end up doing fput() on that file. That case is not too rare, and usually we are saved by the still present reference from descriptor table - that fput() is not the final one. But if another thread closes that descriptor right after our fget() and wakeup does happen before ->poll() returns, we are in trouble - final fput() done while we are in the middle of a method: Al also wrote a patch to take an extra reference to the file descriptor to fix this, but I instead suggested we just streamline the whole file pointer handling by submit_io() so that the generic aio submission code simply keeps the file pointer around until the aio has completed. Fixes: bfe4037e ("aio: implement IOCB_CMD_POLL") Acked-by:
Al Viro <viro@zeniv.linux.org.uk> Reported-by:
<syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Boris Pismenny authored
TLS device cannot use the sw context. This patch returns the original tls device write space handler and moves the sw/device specific portions to the relevant files. Also, we remove the write_space call for the tls_sw flow, because it handles partial records in its delayed tx work handler. Fixes: a42055e8 ("net/tls: Add support for async encryption of records for performance") Signed-off-by:
Boris Pismenny <borisp@mellanox.com> Reviewed-by:
Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Boris Pismenny authored
Cleanup the handling of partial records while fixing a bug where the tls_push_pending_closed_record function is using the software tls context instead of the hardware context. The bug resulted in the following crash: [ 88.791229] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [ 88.793271] #PF error: [normal kernel read fault] [ 88.794449] PGD 800000022a426067 P4D 800000022a426067 PUD 22a156067 PMD 0 [ 88.795958] Oops: 0000 [#1] SMP PTI [ 88.796884] CPU: 2 PID: 4973 Comm: openssl Not tainted 5.0.0-rc4+ #3 [ 88.798314] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 88.800067] RIP: 0010:tls_tx_records+0xef/0x1d0 [tls] [ 88.801256] Code: 00 02 48 89 43 08 e8 a0 0b 96 d9 48 89 df e8 48 dd 4d d9 4c 89 f8 4d 8b bf 98 00 00 00 48 05 98 00 00 00 48 89 04 24 49 39 c7 <49> 8b 1f 4d 89 fd 0f 84 af 00 00 00 41 8b 47 10 85 c0 0f 85 8d 00 [ 88.805179] RSP: 0018:ffffbd888186fca8 EFLAGS: 00010213 [ 88.806458] RAX: ffff9af1ed657c98 RBX: ffff9af1e88a1980 RCX: 0000000000000000 [ 88.808050] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9af1e88a1980 [ 88.809724] RBP: ffff9af1e88a1980 R08: 0000000000000017 R09: ffff9af1ebeeb700 [ 88.811294] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 88.812917] R13: ffff9af1e88a1980 R14: ffff9af1ec13f800 R15: 0000000000000000 [ 88.814506] FS: 00007fcad2240740(0000) GS:ffff9af1f7880000(0000) knlGS:0000000000000000 [ 88.816337] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 88.817717] CR2: 0000000000000000 CR3: 0000000228b3e000 CR4: 00000000001406e0 [ 88.819328] Call Trace: [ 88.820123] tls_push_data+0x628/0x6a0 [tls] [ 88.821283] ? remove_wait_queue+0x20/0x60 [ 88.822383] ? n_tty_read+0x683/0x910 [ 88.823363] tls_device_sendmsg+0x53/0xa0 [tls] [ 88.824505] sock_sendmsg+0x36/0x50 [ 88.825492] sock_write_iter+0x87/0x100 [ 88.826521] __vfs_write+0x127/0x1b0 [ 88.827499] vfs_write+0xad/0x1b0 [ 88.828454] ksys_write+0x52/0xc0 [ 88.829378] do_syscall_64+0x5b/0x180 [ 88.830369] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 88.831603] RIP: 0033:0x7fcad1451680 [ 1248.470626] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [ 1248.472564] #PF error: [normal kernel read fault] [ 1248.473790] PGD 0 P4D 0 [ 1248.474642] Oops: 0000 [#1] SMP PTI [ 1248.475651] CPU: 3 PID: 7197 Comm: openssl Tainted: G OE 5.0.0-rc4+ #3 [ 1248.477426] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 1248.479310] RIP: 0010:tls_tx_records+0x110/0x1f0 [tls] [ 1248.480644] Code: 00 02 48 89 43 08 e8 4f cb 63 d7 48 89 df e8 f7 9c 1b d7 4c 89 f8 4d 8b bf 98 00 00 00 48 05 98 00 00 00 48 89 04 24 49 39 c7 <49> 8b 1f 4d 89 fd 0f 84 af 00 00 00 41 8b 47 10 85 c0 0f 85 8d 00 [ 1248.484825] RSP: 0018:ffffaa0a41543c08 EFLAGS: 00010213 [ 1248.486154] RAX: ffff955a2755dc98 RBX: ffff955a36031980 RCX: 0000000000000006 [ 1248.487855] RDX: 0000000000000000 RSI: 000000000000002b RDI: 0000000000000286 [ 1248.489524] RBP: ffff955a36031980 R08: 0000000000000000 R09: 00000000000002b1 [ 1248.491394] R10: 0000000000000003 R11: 00000000ad55ad55 R12: 0000000000000000 [ 1248.493162] R13: 0000000000000000 R14: ffff955a2abe6c00 R15: 0000000000000000 [ 1248.494923] FS: 0000000000000000(0000) GS:ffff955a378c0000(0000) knlGS:0000000000000000 [ 1248.496847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1248.498357] CR2: 0000000000000000 CR3: 000000020c40e000 CR4: 00000000001406e0 [ 1248.500136] Call Trace: [ 1248.500998] ? tcp_check_oom+0xd0/0xd0 [ 1248.502106] tls_sk_proto_close+0x127/0x1e0 [tls] [ 1248.503411] inet_release+0x3c/0x60 [ 1248.504530] __sock_release+0x3d/0xb0 [ 1248.505611] sock_close+0x11/0x20 [ 1248.506612] __fput+0xb4/0x220 [ 1248.507559] task_work_run+0x88/0xa0 [ 1248.508617] do_exit+0x2cb/0xbc0 [ 1248.509597] ? core_sys_select+0x17a/0x280 [ 1248.510740] do_group_exit+0x39/0xb0 [ 1248.511789] get_signal+0x1d0/0x630 [ 1248.512823] do_signal+0x36/0x620 [ 1248.513822] exit_to_usermode_loop+0x5c/0xc6 [ 1248.515003] do_syscall_64+0x157/0x180 [ 1248.516094] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1248.517456] RIP: 0033:0x7fb398bd3f53 [ 1248.518537] Code: Bad RIP value. Fixes: a42055e8 ("net/tls: Add support for async encryption of records for performance") Signed-off-by:
Boris Pismenny <borisp@mellanox.com> Signed-off-by:
Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Heiner Kallweit authored
genphy_no_soft_reset and gen10g_no_soft_reset are both the same no-ops, one is enough. Signed-off-by:
Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by:
Andrew Lunn <andrew@lunn.ch> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Heiner Kallweit authored
gen10g_read_status is deprecated, therefore stop exporting it. We don't want to encourage anybody to use it. Signed-off-by:
Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by:
Andrew Lunn <andrew@lunn.ch> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Heiner Kallweit authored
ETHTOOL_LINK_MODE_10000baseT_Full_BIT is set anyway in the supported and advertising bitmap because it's part of PHY_10GBIT_FEATURES. And all users of gen10g_config_init use PHY_10GBIT_FEATURES. Signed-off-by:
Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by:
Andrew Lunn <andrew@lunn.ch> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Heiner Kallweit authored
phy_suspend() and phy_resume() are no-ops anyway if no callback is defined. Therefore we don't need these stubs. Signed-off-by:
Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by:
Andrew Lunn <andrew@lunn.ch> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Francesco Ruggeri authored
By default IPv6 socket with IPV6_ROUTER_ALERT socket option set will receive all IPv6 RA packets from all namespaces. IPV6_ROUTER_ALERT_ISOLATE socket option restricts packets received by the socket to be only from the socket's namespace. Signed-off-by:
Maxim Martynov <maxim@arista.com> Signed-off-by:
Francesco Ruggeri <fruggeri@arista.com> Reviewed-by:
David Ahern <dsahern@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Kevin Darbyshire-Bryant authored
Add flag 'FWMARK' to enable use of firewall connmarks as tin selector. The connmark (skbuff->mark) needs to be in the range 1->tin_cnt ie. for diffserv3 the mark needs to be 1->3. Background Typically CAKE uses DSCP as the basis for tin selection. DSCP values are relatively easily changed as part of the egress path, usually with iptables & the mangle table, ingress is more challenging. CAKE is often used on the WAN interface of a residential gateway where passthrough of DSCP from the ISP is either missing or set to unhelpful values thus use of ingress DSCP values for tin selection isn't helpful in that environment. An approach to solving the ingress tin selection problem is to use CAKE's understanding of tc filters. Naive tc filters could match on source/destination port numbers and force tin selection that way, but multiple filters don't scale particularly well as each filter must be traversed whether it matches or not. e.g. a simple example to map 3 firewall marks to tins: MAJOR=$( tc qdisc show dev $DEV | head -1 | awk '{print $3}' ) tc filter add dev $DEV parent $MAJOR protocol all handle 0x01 fw action skbedit priority ${MAJOR}1 tc filter add dev $DEV parent $MAJOR protocol all handle 0x02 fw action skbedit priority ${MAJOR}2 tc filter add dev $DEV parent $MAJOR protocol all handle 0x03 fw action skbedit priority ${MAJOR}3 Another option is to use eBPF cls_act with tc filters e.g. MAJOR=$( tc qdisc show dev $DEV | head -1 | awk '{print $3}' ) tc filter add dev $DEV parent $MAJOR bpf da obj my-bpf-fwmark-to-class.o This has the disadvantages of a) needing someone to write & maintain the bpf program, b) a bpf toolchain to compile it and c) needing to hardcode the major number in the bpf program so it matches the cake instance (or forcing the cake instance to a particular major number) since the major number cannot be passed to the bpf program via tc command line. As already hinted at by the previous examples, it would be helpful to associate tins with something that survives the Internet path and ideally allows tin selection on both egress and ingress. Netfilter's conntrack permits setting an identifying mark on a connection which can also be restored to an ingress packet with tc action connmark e.g. tc filter add dev eth0 parent ffff: protocol all prio 10 u32 \ match u32 0 0 flowid 1:1 action connmark action mirred egress redirect dev ifb1 Since tc's connmark action has restored any connmark into skb->mark, any of the previous solutions are based upon it and in one form or another copy that mark to the skb->priority field where again CAKE picks this up. This change cuts out at least one of the (less intuitive & non-scalable) middlemen and permit direct access to skb->mark. Signed-off-by:
Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk> Signed-off-by:
Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 03, 2019
-
-
Axel Lin authored
By setting curr_table, n_current_limits, csel_reg and csel_mask, the regmap users can use regulator_set_current_limit_regmap and regulator_get_current_limit_regmap for set/get_current_limit callbacks. Signed-off-by:
Axel Lin <axel.lin@ingics.com> Signed-off-by:
Mark Brown <broonie@kernel.org>
-
Axel Lin authored
The csel_reg and csel_mask fields in struct regulator_desc needs to be generic for drivers. Not just for TPS65218. Signed-off-by:
Axel Lin <axel.lin@ingics.com> Signed-off-by:
Mark Brown <broonie@kernel.org>
-
Tristram Ha authored
KSZ9893 switch is similar to KSZ9477 switch except the ingress tail tag has 1 byte instead of 2 bytes. The size of the portmap is smaller and so the override and lookup bits are also moved. Signed-off-by:
Tristram Ha <Tristram.Ha@microchip.com> Reviewed-by:
Florian Fainelli <f.fainelli@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
YueHaibing authored
KASAN report this: BUG: KASAN: use-after-free in pde_subdir_find+0x12d/0x150 fs/proc/generic.c:71 Read of size 8 at addr ffff8881f41fe5b0 by task syz-executor.0/2806 CPU: 0 PID: 2806 Comm: syz-executor.0 Not tainted 5.0.0-rc7+ #45 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xfa/0x1ce lib/dump_stack.c:113 print_address_description+0x65/0x270 mm/kasan/report.c:187 kasan_report+0x149/0x18d mm/kasan/report.c:317 pde_subdir_find+0x12d/0x150 fs/proc/generic.c:71 remove_proc_entry+0xe8/0x420 fs/proc/generic.c:667 atalk_proc_exit+0x18/0x820 [appletalk] atalk_exit+0xf/0x5a [appletalk] __do_sys_delete_module kernel/module.c:1018 [inline] __se_sys_delete_module kernel/module.c:961 [inline] __x64_sys_delete_module+0x3dc/0x5e0 kernel/module.c:961 do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x462e99 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fb2de6b9c58 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000200001c0 RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb2de6ba6bc R13: 00000000004bccaa R14: 00000000006f6bc8 R15: 00000000ffffffff Allocated by task 2806: set_track mm/kasan/common.c:85 [inline] __kasan_kmalloc.constprop.3+0xa0/0xd0 mm/kasan/common.c:496 slab_post_alloc_hook mm/slab.h:444 [inline] slab_alloc_node mm/slub.c:2739 [inline] slab_alloc mm/slub.c:2747 [inline] kmem_cache_alloc+0xcf/0x250 mm/slub.c:2752 kmem_cache_zalloc include/linux/slab.h:730 [inline] __proc_create+0x30f/0xa20 fs/proc/generic.c:408 proc_mkdir_data+0x47/0x190 fs/proc/generic.c:469 0xffffffffc10c01bb 0xffffffffc10c0166 do_one_initcall+0xfa/0x5ca init/main.c:887 do_init_module+0x204/0x5f6 kernel/module.c:3460 load_module+0x66b2/0x8570 kernel/module.c:3808 __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902 do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 2806: set_track mm/kasan/common.c:85 [inline] __kasan_slab_free+0x130/0x180 mm/kasan/common.c:458 slab_free_hook mm/slub.c:1409 [inline] slab_free_freelist_hook mm/slub.c:1436 [inline] slab_free mm/slub.c:2986 [inline] kmem_cache_free+0xa6/0x2a0 mm/slub.c:3002 pde_put+0x6e/0x80 fs/proc/generic.c:647 remove_proc_entry+0x1d3/0x420 fs/proc/generic.c:684 0xffffffffc10c031c 0xffffffffc10c0166 do_one_initcall+0xfa/0x5ca init/main.c:887 do_init_module+0x204/0x5f6 kernel/module.c:3460 load_module+0x66b2/0x8570 kernel/module.c:3808 __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902 do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff8881f41fe500 which belongs to the cache proc_dir_entry of size 256 The buggy address is located 176 bytes inside of 256-byte region [ffff8881f41fe500, ffff8881f41fe600) The buggy address belongs to the page: page:ffffea0007d07f80 count:1 mapcount:0 mapping:ffff8881f6e69a00 index:0x0 flags: 0x2fffc0000000200(slab) raw: 02fffc0000000200 dead000000000100 dead000000000200 ffff8881f6e69a00 raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8881f41fe480: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff8881f41fe500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8881f41fe580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8881f41fe600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ffff8881f41fe680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb It should check the return value of atalk_proc_init fails, otherwise atalk_exit will trgger use-after-free in pde_subdir_find while unload the module.This patch fix error cleanup path of atalk_init Reported-by:
Hulk Robot <hulkci@huawei.com> Signed-off-by:
YueHaibing <yuehaibing@huawei.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 02, 2019
-
-
Eric Dumazet authored
In the series fc8b81a5 ("Merge branch 'lockless-qdisc-series'") John made the assumption that the data path had no need to read the qdisc qlen (number of packets in the qdisc). It is true when pfifo_fast is used as the root qdisc, or as direct MQ/MQPRIO children. But pfifo_fast can be used as leaf in class full qdiscs, and existing logic needs to access the child qlen in an efficient way. HTB breaks badly, since it uses cl->leaf.q->q.qlen in : htb_activate() -> WARN_ON() htb_dequeue_tree() to decide if a class can be htb_deactivated when it has no more packets. HFSC, DRR, CBQ, QFQ have similar issues, and some calls to qdisc_tree_reduce_backlog() also read q.qlen directly. Using qdisc_qlen_sum() (which iterates over all possible cpus) in the data path is a non starter. It seems we have to put back qlen in a central location, at least for stable kernels. For all qdisc but pfifo_fast, qlen is guarded by the qdisc lock, so the existing q.qlen{++|--} are correct. For 'lockless' qdisc (pfifo_fast so far), we need to use atomic_{inc|dec}() because the spinlock might be not held (for example from pfifo_fast_enqueue() and pfifo_fast_dequeue()) This patch adds atomic_qlen (in the same location than qlen) and renames the following helpers, since we want to express they can be used without qdisc lock, and that qlen is no longer percpu. - qdisc_qstats_cpu_qlen_dec -> qdisc_qstats_atomic_qlen_dec() - qdisc_qstats_cpu_qlen_inc -> qdisc_qstats_atomic_qlen_inc() Later (net-next) we might revert this patch by tracking all these qlen uses and replace them by a more efficient method (not having to access a precise qlen, but an empty/non_empty status that might be less expensive to maintain/track). Another possibility is to have a legacy pfifo_fast version that would be used when used a a child qdisc, since the parent qdisc needs a spinlock anyway. But then, future lockless qdiscs would also have the same problem. Fixes: 7e66016f ("net: sched: helpers to sum qlen and qlen for per cpu logic") Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
brakmo authored
This patch adds a new bpf helper BPF_FUNC_skb_ecn_set_ce "int bpf_skb_ecn_set_ce(struct sk_buff *skb)". It is added to BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently can be attached to the ingress and egress path. The helper is needed because his type of bpf_prog cannot modify the skb directly. This helper is used to set the ECN field of ECN capable IP packets to ce (congestion encountered) in the IPv6 or IPv4 header of the skb. It can be used by a bpf_prog to manage egress or ingress network bandwdith limit per cgroupv2 by inducing an ECN response in the TCP sender. This works best when using DCTCP. Signed-off-by:
Lawrence Brakmo <brakmo@fb.com> Signed-off-by:
Martin KaFai Lau <kafai@fb.com> Acked-by:
Song Liu <songliubraving@fb.com> Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
Florian Fainelli authored
There are no more in tree users of the switchdev_trans_item_{dequeue,enqueue} or switchdev_trans_item structure in the kernel since commit 00fc0c51 ("rocker: Change world_ops API and implementation to be switchdev independant"). Remove this unused code and update the documentation accordingly since. Signed-off-by:
Florian Fainelli <f.fainelli@gmail.com> Acked-by:
Jiri Pirko <jiri@mellanox.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Hangbin Liu authored
For ip rules, we need to use 'ipproto ipv6-icmp' to match ICMPv6 headers. But for ip -6 route, currently we only support tcp, udp and icmp. Add ICMPv6 support so we can match ipv6-icmp rules for route lookup. v2: As David Ahern and Sabrina Dubroca suggested, Add an argument to rtm_getroute_parse_ip_proto() to handle ICMP/ICMPv6 with different family. Reported-by:
Jianlin Shi <jishi@redhat.com> Fixes: eacb9384 ("ipv6: support sport, dport and ip_proto in RTM_GETROUTE") Signed-off-by:
Hangbin Liu <liuhangbin@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 01, 2019
-
-
Roi Dayan authored
Under multipath offload scheme, as part of handling fib events, emit mlx5 port affinity event on the enabled ports which will be handled by the tc offloads code. Signed-off-by:
Roi Dayan <roid@mellanox.com> Reviewed-by:
Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by:
Saeed Mahameed <saeedm@mellanox.com>
-
Roi Dayan authored
In order to offload ecmp-on-host scheme where next-hop routes are used, we will make use of HW LAG. Add accessor function to let upper layers in the driver to realize if the lag acts in multi-path mode. Signed-off-by:
Roi Dayan <roid@mellanox.com> Reviewed-by:
Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by:
Saeed Mahameed <saeedm@mellanox.com>
-
Florian Westphal authored
The family specific masq modules are way too small to warrant an extra module, just place all of them in nft_masq. before: text data bss dec hex filename 1001 832 0 1833 729 nft_masq.ko 766 896 0 1662 67e nft_masq_ipv4.ko 764 896 0 1660 67c nft_masq_ipv6.ko after: 2010 960 0 2970 b9a nft_masq.ko Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Florian Westphal authored
before: text data bss dec hex filename 990 832 0 1822 71e nft_redir.ko 697 896 0 1593 639 nft_redir_ipv4.ko 713 896 0 1609 649 nft_redir_ipv6.ko after: text data bss dec hex filename 1910 960 0 2870 b36 nft_redir.ko size is reduced, all helpers from nft_redir.ko can be made static. Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Li RongQing authored
The proto in struct xt_match and struct xt_target is u16, when calling xt_check_target/match, their proto argument is u8, and will cause truncation, it is harmless to ip packet, since ip proto is u8 if a etable's match/target has proto that is u16, will cause the check failure. and convert be16 to short in bridge/netfilter/ebtables.c Signed-off-by:
Zhang Yu <zhangyu31@baidu.com> Signed-off-by:
Li RongQing <lirongqing@baidu.com> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
- Feb 28, 2019
-
-
Avri Altman authored
SD spec v5.1 adds discard support. The flows and commands are similar to mmc, so just set the discard arg in CMD38. A host which supports DISCARD shall check if the DISCARD_SUPPORT (b313) is set in the SD_STATUS register. If the card does not support discard, the host shall not issue DISCARD command, but ERASE command instead. Post the DISCARD operation, the card may de-allocate the discarded blocks partially or completely. So the host mustn't make any assumptions concerning the content of the discarded region. This is unlike ERASE command, in which the region is guaranteed to contain either '0's or '1's, depends on the content of DATA_STAT_AFTER_ERASE (b55) in the scr register. One more important difference compared to ERASE is the busy timeout which we will address on the next patch. Signed-off-by:
Avri Altman <avri.altman@wdc.com> Signed-off-by:
Ulf Hansson <ulf.hansson@linaro.org>
-
- Feb 27, 2019
-
-
wenxu authored
Current fib_multipath_hash_policy can make hash based on the L3 or L4. But it only work on the outer IP. So a specific tunnel always has the same hash value. But a specific tunnel may contain so many inner connections. This patch provide a generic multipath_hash in floi_common. It can make a user-define hash which can mix with L3 or L4 hash. Signed-off-by:
wenxu <wenxu@ucloud.cn> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
Now that we have converted all possible callers to using a switchdev notifier for attributes we do not have a need for implementing switchdev_ops anymore, and this can be removed from all drivers the net_device structure. Signed-off-by:
Florian Fainelli <f.fainelli@gmail.com> Reviewed-by:
Ido Schimmel <idosch@mellanox.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
In preparation for allowing switchdev enabled drivers to veto specific attribute settings from within the context of the caller, introduce a new switchdev notifier type for port attributes. Suggested-by:
Ido Schimmel <idosch@mellanox.com> Reviewed-by:
Ido Schimmel <idosch@mellanox.com> Signed-off-by:
Florian Fainelli <f.fainelli@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-