  1. Mar 13, 2019
  2. Mar 12, 2019
    • fix null pointer deref in tracepoints in back channel · f87b543a
      Olga Kornievskaia authored
      
      The backchannel doesn't have the rq_task->tk_clientid pointer set, so the
      tracepoints must check it before dereferencing it.
      
      Otherwise it can lead to the following oops:
      localhost login: [  111.385319] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
      [  111.388073] #PF error: [normal kernel read fault]
      [  111.389452] PGD 80000000290d8067 P4D 80000000290d8067 PUD 75f25067 PMD 0
      [  111.391224] Oops: 0000 [#1] SMP PTI
      [  111.392151] CPU: 0 PID: 3533 Comm: NFSv4 callback Not tainted 5.0.0-rc7+ #1
      [  111.393787] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
      [  111.396340] RIP: 0010:trace_event_raw_event_xprt_enq_xmit+0x6f/0xf0 [sunrpc]
      [  111.397974] Code: 00 00 00 48 89 ee 48 89 e7 e8 bd 0a 85 d7 48 85 c0 74 4a 41 0f b7 94 24 e0 00 00 00 48 89 e7 89 50 08 49 8b 94 24 a8 00 00 00 <8b> 52 04 89 50 0c 49 8b 94 24 c0 00 00 00 8b 92 a8 00 00 00 0f ca
      [  111.402215] RSP: 0018:ffffb98743263cf8 EFLAGS: 00010286
      [  111.403406] RAX: ffffa0890fc3bc88 RBX: 0000000000000003 RCX: 0000000000000000
      [  111.405057] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb98743263cf8
      [  111.406656] RBP: ffffa0896f5368f0 R08: 0000000000000246 R09: 0000000000000000
      [  111.408437] R10: ffffe19b01c01500 R11: 0000000000000000 R12: ffffa08977d28a00
      [  111.410210] R13: 0000000000000004 R14: ffffa089315303f0 R15: ffffa08931530000
      [  111.411856] FS:  0000000000000000(0000) GS:ffffa0897bc00000(0000) knlGS:0000000000000000
      [  111.413699] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  111.415068] CR2: 0000000000000004 CR3: 000000002ac90004 CR4: 00000000001606f0
      [  111.416745] Call Trace:
      [  111.417339]  xprt_request_enqueue_transmit+0x2b6/0x4a0 [sunrpc]
      [  111.418709]  ? rpc_task_need_encode+0x40/0x40 [sunrpc]
      [  111.419957]  call_bc_transmit+0xd5/0x170 [sunrpc]
      [  111.421067]  __rpc_execute+0x7e/0x3f0 [sunrpc]
      [  111.422177]  rpc_run_bc_task+0x78/0xd0 [sunrpc]
      [  111.423212]  bc_svc_process+0x281/0x340 [sunrpc]
      [  111.424325]  nfs41_callback_svc+0x130/0x1c0 [nfsv4]
      [  111.425430]  ? remove_wait_queue+0x60/0x60
      [  111.426398]  kthread+0xf5/0x130
      [  111.427155]  ? nfs_callback_authenticate+0x50/0x50 [nfsv4]
      [  111.428388]  ? kthread_bind+0x10/0x10
      [  111.429270]  ret_from_fork+0x1f/0x30
      
      localhost login: [  467.462259] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
      [  467.464411] #PF error: [normal kernel read fault]
      [  467.465445] PGD 80000000728c1067 P4D 80000000728c1067 PUD 728c0067 PMD 0
      [  467.466980] Oops: 0000 [#1] SMP PTI
      [  467.467759] CPU: 0 PID: 3517 Comm: NFSv4 callback Not tainted 5.0.0-rc7+ #1
      [  467.469393] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
      [  467.471840] RIP: 0010:trace_event_raw_event_xprt_transmit+0x7c/0xf0 [sunrpc]
      [  467.473392] Code: f6 48 85 c0 74 4b 49 8b 94 24 98 00 00 00 48 89 e7 0f b7 92 e0 00 00 00 89 50 08 49 8b 94 24 98 00 00 00 48 8b 92 a8 00 00 00 <8b> 52 04 89 50 0c 41 8b 94 24 a8 00 00 00 0f ca 89 50 10 41 8b 94
      [  467.477605] RSP: 0018:ffffabe7434fbcd0 EFLAGS: 00010282
      [  467.478793] RAX: ffff99720fc3bce0 RBX: 0000000000000003 RCX: 0000000000000000
      [  467.480409] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffabe7434fbcd0
      [  467.482011] RBP: ffff99726f631948 R08: 0000000000000246 R09: 0000000000000000
      [  467.483591] R10: 0000000070000000 R11: 0000000000000000 R12: ffff997277dfcc00
      [  467.485226] R13: 0000000000000000 R14: 0000000000000000 R15: ffff99722fecdca8
      [  467.486830] FS:  0000000000000000(0000) GS:ffff99727bc00000(0000) knlGS:0000000000000000
      [  467.488596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  467.489931] CR2: 0000000000000004 CR3: 00000000270e6006 CR4: 00000000001606f0
      [  467.491559] Call Trace:
      [  467.492128]  xprt_transmit+0x303/0x3f0 [sunrpc]
      [  467.493143]  ? rpc_task_need_encode+0x40/0x40 [sunrpc]
      [  467.494328]  call_bc_transmit+0x49/0x170 [sunrpc]
      [  467.495379]  __rpc_execute+0x7e/0x3f0 [sunrpc]
      [  467.496451]  rpc_run_bc_task+0x78/0xd0 [sunrpc]
      [  467.497467]  bc_svc_process+0x281/0x340 [sunrpc]
      [  467.498507]  nfs41_callback_svc+0x130/0x1c0 [nfsv4]
      [  467.499751]  ? remove_wait_queue+0x60/0x60
      [  467.500686]  kthread+0xf5/0x130
      [  467.501438]  ? nfs_callback_authenticate+0x50/0x50 [nfsv4]
      [  467.502640]  ? kthread_bind+0x10/0x10
      [  467.503454]  ret_from_fork+0x1f/0x30
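
Both traces show a read at address 0x4 with RDX equal to zero, which is what a field read through a NULL pointer looks like. A minimal userspace sketch of the kind of guard the commit title implies follows. The struct layouts, the field name `tk_client`, and the helper `trace_client_id` are assumptions made for illustration, not the actual sunrpc definitions or the actual patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the sunrpc structures. */
struct rpc_clnt {
	unsigned int cl_flags;  /* placeholder field (assumption) */
	unsigned int cl_clid;   /* client id a tracepoint would record */
};

struct rpc_task {
	struct rpc_clnt *tk_client;  /* NULL for backchannel tasks */
};

/* Return the client id to log, or -1 when the task has no client. */
static int trace_client_id(const struct rpc_task *task)
{
	assert(task != NULL);
	return task->tk_client ? (int)task->tk_client->cl_clid : -1;
}
```

With a guard like this, a backchannel task is logged with a sentinel id instead of faulting inside the tracepoint.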
      
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
  3. Mar 08, 2019
    • rxrpc: Fix client call connect/disconnect race · 930c9f91
      David Howells authored
      
      rxrpc_disconnect_client_call() reads the call's connection ID protocol
      value (call->cid) as part of that function's variable declarations.  This
      is bad because it's not inside the locked section and so may race with
      someone granting use of the channel to the call.
      
      This manifests as an assertion failure (see below) where the call in the
      presumed channel (0 because call->cid wasn't set when we read it) doesn't
      match the call attached to the channel we were actually granted (if 1, 2 or
      3).
      
      Fix this by moving the read and dependent calculations inside of the
      channel_lock section.  Also, only set the channel number and pointer
      variables if cid is not zero (ie. unset).
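
The locking pattern described above can be sketched in userspace as follows. This is only an illustration under stated assumptions: the spinlock built on `atomic_flag`, the `CHANNEL_MASK` value, the struct layouts, and the return codes are all hypothetical and do not mirror the rxrpc sources.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CHANNEL_MASK 3u  /* assumption: low bits of cid select channel 0..3 */

struct call { uint32_t cid; };  /* 0 means no channel granted yet */

struct conn {
	atomic_flag channel_lock;
	struct call *channels[4];
};

static void conn_init(struct conn *conn)
{
	atomic_flag_clear(&conn->channel_lock);
	for (int i = 0; i < 4; i++)
		conn->channels[i] = NULL;
}

static void lock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) { } }
static void unlock(atomic_flag *l) { atomic_flag_clear(l); }

/*
 * Detach a call from its channel.
 * Returns the channel number, -1 if the call never got a channel,
 * or -2 on a mismatch (where the kernel code would BUG).
 */
static int disconnect_call(struct conn *conn, struct call *call)
{
	int channel = -1;

	lock(&conn->channel_lock);
	uint32_t cid = call->cid;  /* read only while the lock is held */
	if (cid != 0) {
		channel = (int)(cid & CHANNEL_MASK);
		if (conn->channels[channel] != call) {
			unlock(&conn->channel_lock);  /* drop the lock before failing */
			return -2;
		}
		conn->channels[channel] = NULL;
	}
	unlock(&conn->channel_lock);
	return channel;
}
```

Reading `cid` before taking the lock would reintroduce the race: the value could still be zero while another thread is granting the call channel 1, 2 or 3.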
      
      This problem can be induced by injecting an occasional error in
      rxrpc_wait_for_channel() before the call to schedule().
      
      Make two further changes also:
      
       (1) Add a trace for wait failure in rxrpc_connect_call().
      
       (2) Drop channel_lock before BUG'ing in the case of the assertion failure.
      
      The failure causes a trace akin to the following:
      
      rxrpc: Assertion failed - 18446612685268945920(0xffff8880beab8c00) == 18446612685268621312(0xffff8880bea69800) is false
      ------------[ cut here ]------------
      kernel BUG at net/rxrpc/conn_client.c:824!
      ...
      RIP: 0010:rxrpc_disconnect_client_call+0x2bf/0x99d
      ...
      Call Trace:
       rxrpc_connect_call+0x902/0x9b3
       ? wake_up_q+0x54/0x54
       rxrpc_new_client_call+0x3a0/0x751
       ? rxrpc_kernel_begin_call+0x141/0x1bc
       ? afs_alloc_call+0x1b5/0x1b5
       rxrpc_kernel_begin_call+0x141/0x1bc
       afs_make_call+0x20c/0x525
       ? afs_alloc_call+0x1b5/0x1b5
       ? __lock_is_held+0x40/0x71
       ? lockdep_init_map+0xaf/0x193
       ? lockdep_init_map+0xaf/0x193
       ? __lock_is_held+0x40/0x71
       ? yfs_fs_fetch_data+0x33b/0x34a
       yfs_fs_fetch_data+0x33b/0x34a
       afs_fetch_data+0xdc/0x3b7
       afs_read_dir+0x52d/0x97f
       afs_dir_iterate+0xa0/0x661
       ? iterate_dir+0x63/0x141
       iterate_dir+0xa2/0x141
       ksys_getdents64+0x9f/0x11b
       ? filldir+0x111/0x111
       ? do_syscall_64+0x3e/0x1a0
       __x64_sys_getdents64+0x16/0x19
       do_syscall_64+0x7d/0x1a0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fixes: 45025bce ("rxrpc: Improve management and caching of client connection objects")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. Mar 04, 2019
  5. Feb 25, 2019
    • btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record · 1418bae1
      adam authored
      
      [BUG]
      Btrfs/139 will fail with a high probability if the testing machine (VM)
      has only 2G RAM.
      
      As a result, the final write succeeds when it should fail with EDQUOT,
      and the fs ends up exceeding its quota limit by 16K.
      
      The simplified reproducer will be: (needs a 2G ram VM)
      
        $ mkfs.btrfs -f $dev
        $ mount $dev $mnt
      
        $ btrfs subv create $mnt/subv
        $ btrfs quota enable $mnt
        $ btrfs quota rescan -w $mnt
        $ btrfs qgroup limit -e 1G $mnt/subv
      
        $ for i in $(seq -w  1 8); do
        	xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
        	echo "file $i written" > /dev/kmsg
          done
        $ sync
        $ btrfs qgroup show -pcre --raw $mnt
      
      The last pwrite will not trigger EDQUOT and final 'qgroup show' will
      show something like:
      
        qgroupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5             16384        16384         none         none ---     ---
        0/256      1073758208   1073758208         none   1073741824 ---     ---
      
      And 1073758208 is larger than 1073741824.
      
      [CAUSE]
      It's a bug in btrfs qgroup data reserved space management.
      
      For quota limit, we must ensure that:
        reserved (data + metadata) + rfer/excl <= limit
      
      Since rfer/excl is only updated at transaction commit time, reserved
      space needs special care.
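
The inequality above can be written as a small admission check. This is a sketch only: the function name `qgroup_can_reserve` and its parameters are hypothetical and are not the btrfs implementation; the numbers in the usage note come from the reproducer output.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Would reserving `request` more bytes keep the qgroup within its limit?
 * Checks reserved + request + accounted <= limit, where `accounted` stands
 * for rfer or excl. Written with subtractions so the sum cannot overflow.
 */
static bool qgroup_can_reserve(uint64_t reserved, uint64_t accounted,
			       uint64_t limit, uint64_t request)
{
	if (accounted > limit)
		return false;
	uint64_t room = limit - accounted;
	if (reserved > room)
		return false;
	return request <= room - reserved;
}
```

For example, with a 1073741824-byte limit, 16384 bytes already accounted and seven 128M files still reserved, an eighth 128M reservation is rejected, which is the EDQUOT the test expects.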
      
      One important part of reserved space is data, and for a new data extent
      written to disk, we still need to hold the reserved space until
      rfer/excl numbers get updated.
      
      Originally when an ordered extent finishes, we migrate the reserved
      qgroup data space from extent_io tree to delayed ref head of the data
      extent, expecting delayed ref will only be cleaned up at commit
      transaction time.
      
      However, on a machine with little RAM, memory pressure can cause dirty
      pages to be flushed back to disk without committing a transaction.
      
      The related events will be something like:
      
        file 1 written
        btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
        btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
        btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
        btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
        btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
        cleanup_ref_head: num_bytes=54947840
        cleanup_ref_head: num_bytes=5636096
        cleanup_ref_head: num_bytes=569344
        cleanup_ref_head: num_bytes=57344
        cleanup_ref_head: num_bytes=8192
        ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
        file 2 written
        ...
        file 8 written
        cleanup_ref_head: num_bytes=8192
        ...
        btrfs_commit_transaction  <<< the only transaction committed during
      				the test
      
      When file 2 is written, we have already freed 128M of reserved qgroup data
      space for ino 258. Thus later writes won't trigger EDQUOT.
      
      This allows us to write more data beyond qgroup limit.
      
      In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT.
      
      [FIX]
      By moving reserved qgroup data space from btrfs_delayed_ref_head to
      btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
      space won't be freed halfway, before the transaction commits, which
      fixes the problem.
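
A toy model of the two behaviours, old (reservation freed when the ref head is cleaned up) and fixed (reservation held until commit), may make the timeline easier to follow. Everything here is an assumption made for illustration: the `qgroup` struct, the helper names, and the 16K starting value for rfer (taken to be the empty subvolume's metadata). It models only the accounting, not btrfs itself.

```c
#include <stdbool.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

struct qgroup {
	uint64_t rfer;      /* accounted bytes, updated only at commit */
	uint64_t reserved;  /* bytes reserved for data not yet accounted */
	uint64_t pending;   /* bytes written since the last commit */
	uint64_t limit;
};

/* Try to write len bytes; returning false models EDQUOT. */
static bool qgroup_write(struct qgroup *q, uint64_t len, bool free_early)
{
	if (q->reserved + q->rfer + len > q->limit)
		return false;
	q->reserved += len;
	q->pending += len;
	if (free_early)          /* old behaviour: freed before any commit */
		q->reserved -= len;
	return true;
}

/* Commit: rfer catches up and the reservation is released. */
static void qgroup_commit(struct qgroup *q)
{
	q->rfer += q->pending;
	q->pending = 0;
	q->reserved = 0;
}

/* Write `count` files of 128M each and return how many succeeded. */
static int write_files(struct qgroup *q, int count, bool free_early)
{
	int ok = 0;

	for (int i = 0; i < count; i++)
		if (qgroup_write(q, 128 * MiB, free_early))
			ok++;
	return ok;
}
```

With a 1G limit and rfer starting at 16384, the early-free model accepts all eight files and ends at rfer = 1073758208 after commit, matching the `qgroup show` output. The held-reservation model rejects the eighth write.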
      
      Fixes: f64d5ca8 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't use global reserve for chunk allocation · 450114fc
      Josef Bacik authored
      
      We've done this forever because of the voodoo around knowing how much
      space we have.  However, we have better ways of doing this now, and on
      normal file systems we'll easily have a global reserve of 512MiB, and
      since metadata chunks are usually 1GiB that means we'll allocate
      metadata chunks more readily.  Instead use the actual used amount when
      determining if we need to allocate a chunk or not.
      
      This has a side effect for mixed block group fs'es where we are no
      longer allocating enough chunks for the data/metadata requirements.  To
      deal with this add an ALLOC_CHUNK_FORCE step to the flushing state
      machine.  This will only get used if we've already made a full loop
      through the flushing machinery and tried committing the transaction.
      
      If we have then we can try and force a chunk allocation since we likely
      need it to make progress.  This resolves issues I was seeing with
      the mixed bg tests in xfstests without the new flushing state.
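
The rule that the force step only runs after a full loop that included a transaction commit can be sketched as below. The reduced set of states, their order, and the helpers `state_enabled` and `run_flush` are assumptions for illustration; only the name ALLOC_CHUNK_FORCE comes from the commit message.

```c
#include <stdbool.h>

/* Hypothetical, reduced set of flushing states. */
enum flush_state {
	FLUSH_DELALLOC,
	ALLOC_CHUNK,
	ALLOC_CHUNK_FORCE,
	COMMIT_TRANS,
	FLUSH_STATE_COUNT
};

/* The force step is skipped until at least one commit has been tried. */
static bool state_enabled(enum flush_state state, int commit_cycles)
{
	if (state == ALLOC_CHUNK_FORCE)
		return commit_cycles > 0;
	return true;
}

/* Record the states visited over `passes` loops; returns how many ran. */
static int run_flush(int passes, enum flush_state *out, int max)
{
	int n = 0, commit_cycles = 0;

	for (int pass = 0; pass < passes; pass++) {
		for (int s = 0; s < FLUSH_STATE_COUNT; s++) {
			if (!state_enabled((enum flush_state)s, commit_cycles))
				continue;
			if (n < max)
				out[n++] = (enum flush_state)s;
			if (s == COMMIT_TRANS)
				commit_cycles++;
		}
	}
	return n;
}
```

In this model the first pass visits three states and the second pass visits all four, with the forced allocation appearing only in the second.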
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      [ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • mlxsw: spectrum_acl: Add vregion migration end tracepoint · 6375da3d
      Jiri Pirko authored
      
      Hit the new tracepoint once the vregion migration ends.
      
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. Feb 24, 2019
  7. Feb 17, 2019
  8. Feb 14, 2019
  9. Feb 13, 2019
  10. Feb 08, 2019
  11. Feb 07, 2019
    • devlink: Add health report functionality · c8e1da0b
      Eran Ben Elisha authored
      
      Upon error discovery, every driver can report it to the devlink health
      mechanism via the devlink_health_report function, using the appropriate
      reporter registered to it. The driver can pass error-specific context, which
      will be delivered to it as part of the dump / recovery callbacks.
      
      Once an error is reported, devlink health will take the following actions:
      * A log is sent to the kernel trace events buffer
      * Health status and statistics are updated for the reporter instance
      * An object dump is taken and stored at the reporter instance (as long
        as no other dump is already stored)
      * An auto recovery attempt is made, depending on:
        - Auto Recovery configuration
        - Grace period vs. time since last recovery
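
The sequence of actions listed above can be sketched in userspace as follows. The `reporter` struct, the `health_report` helper, and the reading that recovery is skipped while the grace period has not elapsed are assumptions for illustration; this is not the devlink API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reporter state. */
struct reporter {
	bool auto_recover;          /* auto recovery configuration */
	uint64_t grace_period_ms;
	uint64_t last_recovery_ms;
	unsigned int error_count;
	unsigned int recovery_count;
	bool healthy;
	bool dump_stored;
};

/* Handle one error report; returns true if a recovery was attempted. */
static bool health_report(struct reporter *r, uint64_t now_ms)
{
	/* 1. A real implementation would emit a trace event here. */

	/* 2. Update health status and statistics. */
	r->error_count++;
	r->healthy = false;

	/* 3. Take a dump only if none is already stored. */
	if (!r->dump_stored)
		r->dump_stored = true;

	/* 4. Auto recovery, gated by configuration and grace period. */
	if (!r->auto_recover)
		return false;
	if (now_ms - r->last_recovery_ms < r->grace_period_ms)
		return false;

	r->last_recovery_ms = now_ms;
	r->recovery_count++;
	r->healthy = true;  /* assume the recovery callback succeeded */
	return true;
}
```

A second error arriving shortly after a recovery is still counted, but leaves the reporter unhealthy rather than triggering another recovery.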
      
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gpu: host1x: Introduce support for wide opcodes · 5a5fccbd
      Thierry Reding authored
      
      The CDMA push buffer can currently only handle opcodes that take a
      single word parameter. However, the host1x implementation on Tegra186
      and later supports opcodes that require multiple words as parameters.
      
      Unfortunately the way the push buffer is structured, these wide opcodes
      cannot simply be composed of two regular opcodes because that could
      result in the wide opcode being split across the end of the push buffer
      and the final RESTART opcode required to wrap the push buffer around
      would break the wide opcode.
      
      One way to fix this would be to remove the concept of slots to simplify
      push buffer operations. However, that's not entirely trivial and should
      be done in a separate patch. For now, simply use a different function
      to push four-word opcodes into the push buffer. Technically only three
      words are pushed, with the fourth word used as padding to preserve the
      2-word alignment required by the slots abstraction. The fourth word is
      always a NOP opcode.
      
      Additional care must be taken when the end of the push buffer is
      reached. If a four-word opcode doesn't fit into the push buffer without
      being split by the boundary, NOP opcodes will be introduced and the new
      wide opcode placed at the beginning of the push buffer.
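
The padding and wrap-around rules can be sketched like this. The buffer size, the `OP_NOP` encoding, and the helper names are hypothetical, and the sketch ignores the trailing RESTART opcode and any tracking of free space; it only shows where the words land.

```c
#include <stddef.h>
#include <stdint.h>

#define PB_WORDS 16u           /* hypothetical push buffer size in words */
#define OP_NOP   0x20000000u   /* hypothetical NOP encoding */

struct push_buffer {
	uint32_t words[PB_WORDS];
	size_t pos;  /* next free word; always even because of 2-word slots */
};

/* Push a regular opcode occupying one 2-word slot. */
static void push_pair(struct push_buffer *pb, uint32_t op1, uint32_t op2)
{
	pb->words[pb->pos++] = op1;
	pb->words[pb->pos++] = op2;
	if (pb->pos == PB_WORDS)
		pb->pos = 0;
}

/*
 * Push a three-word opcode padded with a NOP to four words.
 * Returns the index at which the opcode starts.
 */
static size_t push_wide(struct push_buffer *pb, uint32_t op1, uint32_t op2,
			uint32_t op3)
{
	/* If it would straddle the end, fill the tail with NOPs and wrap. */
	if (pb->pos + 4 > PB_WORDS) {
		while (pb->pos < PB_WORDS)
			pb->words[pb->pos++] = OP_NOP;
		pb->pos = 0;
	}

	size_t start = pb->pos;

	pb->words[pb->pos++] = op1;
	pb->words[pb->pos++] = op2;
	pb->words[pb->pos++] = op3;
	pb->words[pb->pos++] = OP_NOP;  /* padding keeps 2-word alignment */
	if (pb->pos == PB_WORDS)
		pb->pos = 0;
	return start;
}
```

After seven regular pushes into a 16-word buffer only two words remain, so a wide push fills them with NOPs and starts at index 0.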
      
      Signed-off-by: Thierry Reding <treding@nvidia.com>
  12. Feb 06, 2019
  13. Jan 30, 2019
  14. Jan 25, 2019
  15. Jan 18, 2019
    • devlink: Add health report functionality · c7af343b
      Eran Ben Elisha authored
      
      Upon error discovery, every driver can report it to the devlink health
      mechanism via the devlink_health_report function, using the appropriate
      reporter registered to it. The driver can pass error-specific context, which
      will be delivered to it as part of the dump / recovery callbacks.
      
      Once an error is reported, devlink health will take the following actions:
      * A log is sent to the kernel trace events buffer
      * Health status and statistics are updated for the reporter instance
      * An object dump is taken and stored at the reporter instance (as long
        as no other dump is already stored)
      * An auto recovery attempt is made, depending on:
        - Auto Recovery configuration
        - Grace period vs. time since last recovery
      
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. Jan 17, 2019
    • afs: Fix race in async call refcounting · 34fa4761
      David Howells authored
      
      There's a race between afs_make_call() and afs_wake_up_async_call() in the
      case that an error is returned from rxrpc_kernel_send_data() after it has
      queued the final packet.
      
      afs_make_call() will try to clean up the mess, but the call state may have
      been moved on, thereby causing afs_process_async_call() to also try to
      delete the call.
      
      Fix this by:
      
       (1) Getting an extra ref for an asynchronous call for the call itself to
           hold.  This makes sure the call doesn't evaporate on us accidentally
           and will allow the call to be retained by the caller in a future
           patch.  The ref is released on leaving afs_make_call() or
           afs_wait_for_call_to_complete().
      
       (2) In the event of an error from rxrpc_kernel_send_data():
      
           (a) Don't set the call state to AFS_CALL_COMPLETE until *after* the
           	 call has been aborted and ended.  This prevents
           	 afs_deliver_to_call() from doing anything with any notifications
           	 it gets.
      
           (b) Explicitly end the call immediately to prevent further callbacks.
      
           (c) Cancel any queued async_work and wait for the work if it's
           	 executing.  This allows us to be sure the race won't recur when we
           	 change the state.  We put the work queue's ref on the call if we
           	 managed to cancel it.
      
           (d) Put the call's ref that we got in (1).  This belongs to us as long
           	 as the call is in state AFS_CALL_CL_REQUESTING.
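
A loose model of the reference counting on the error path is sketched below. It covers only the counting: the `call` struct, the helper names, and the split into a caller ref, an extra ref and a work-item ref are simplifying assumptions, not the afs code.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct call {
	atomic_int usage;
	bool freed;  /* stands in for actually freeing the object */
};

static void call_init(struct call *c)
{
	atomic_init(&c->usage, 1);  /* the creator's ref */
	c->freed = false;
}

static void call_get(struct call *c)
{
	atomic_fetch_add(&c->usage, 1);
}

/* Drop one ref; returns true if it was the last and the call was freed. */
static bool call_put(struct call *c)
{
	if (atomic_fetch_sub(&c->usage, 1) == 1) {
		c->freed = true;
		return true;
	}
	return false;
}

/*
 * Error path after a failed send on an async call.
 * work_cancelled: the queued async work was cancelled before it ran,
 * so its ref has to be dropped here.
 */
static void async_send_failed(struct call *c, bool work_cancelled)
{
	if (work_cancelled)
		call_put(c);  /* the work queue's ref, as in step (c) */
	call_put(c);          /* the extra ref from step (1), as in step (d) */
}
```

Because the call holds its own extra ref, the error path can drop the work item's ref without the object disappearing underneath the caller.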
      
      Fixes: 341f741f ("afs: Refcount the afs_call struct")
      Signed-off-by: David Howells <dhowells@redhat.com>
  17. Jan 16, 2019
  18. Jan 08, 2019
  19. Jan 07, 2019
  20. Jan 02, 2019
  21. Dec 28, 2018
    • sunrpc: use-after-free in svc_process_common() · d4b09acf
      Vasily Averin authored
      
      If a node has NFSv41+ mounts inside several net namespaces,
      it can lead to a use-after-free in svc_process_common().
      
      svc_process_common()
              /* Setup reply header */
              rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp); <<< HERE
      
      svc_process_common() can use an incorrect rqstp->rq_xprt:
      its caller, bc_svc_process(), takes it from serv->sv_bc_xprt.
      The problem is that serv is a global structure but sv_bc_xprt
      is assigned per net namespace.
      
      According to Trond, the whole "let's set up rqstp->rq_xprt
      for the back channel" is nothing but a giant hack in order
      to work around the fact that svc_process_common() uses it
      to find the xpt_ops, and perform a couple of (meaningless
      for the back channel) tests of xpt_flags.
      
      All we really need in svc_process_common() is to be able to run
      rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr()
      
      Bruce J Fields points out that this xpo_prep_reply_hdr() call
      is an awfully roundabout way just to do "svc_putnl(resv, 0);"
      in the tcp case.
      
      This patch does not initialize rqstp->rq_xprt in bc_svc_process();
      it now calls svc_process_common() with rqstp->rq_xprt = NULL.
      
      To adjust the reply header, svc_process_common() just checks
      rqstp->rq_prot and calls svc_tcp_prep_reply_hdr() in the tcp case.
      
      To handle the rqstp->rq_xprt = NULL case in functions called from
      svc_process_common(), the patch introduces a net namespace pointer,
      svc_rqst->rq_bc_net, and adjusts the SVC_NET() definition.
      Some other functions were also adapted to properly handle the described case.
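
The fallback described for SVC_NET() can be sketched in userspace as below. The structs are heavily reduced stand-ins, and `PROTO_TCP` and `reply_hdr_len` are hypothetical names; the sketch only shows choosing the namespace from rq_bc_net when rq_xprt is NULL and keying the reply header on rq_prot.

```c
#include <stddef.h>

#define PROTO_TCP 6  /* assumption: same numeric value as IPPROTO_TCP */

/* Reduced stand-ins for the kernel structures. */
struct net { int id; };
struct svc_xprt { struct net *xpt_net; };
struct svc_rqst {
	struct svc_xprt *rq_xprt;  /* NULL for back channel requests */
	struct net *rq_bc_net;     /* namespace to use in that case */
	int rq_prot;
};

/* Pick the namespace from the transport if there is one. */
#define SVC_NET(rqst) \
	((rqst)->rq_xprt ? (rqst)->rq_xprt->xpt_net : (rqst)->rq_bc_net)

/*
 * Bytes reserved at the start of the reply: a 4-byte placeholder in the
 * TCP case (modelling "svc_putnl(resv, 0)"), nothing otherwise.
 */
static size_t reply_hdr_len(const struct svc_rqst *rqstp)
{
	return rqstp->rq_prot == PROTO_TCP ? 4 : 0;
}
```

Neither helper touches rq_xprt when it is NULL, which is the property the back channel path needs.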
      
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Cc: stable@vger.kernel.org
      Fixes: 23c20ecd ("NFS: callback up - users counting cleanup")
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  22. Dec 21, 2018
  23. Dec 19, 2018
    • ext4: force inode writes when nfsd calls commit_metadata() · fde87268
      Theodore Ts'o authored
      
      Some time back, nfsd switched from calling vfs_fsync() to using a new
      commit_metadata() hook in export_operations().  If the file system did
      not provide a commit_metadata() hook, it fell back to using
      sync_inode_metadata().  Unfortunately, that doesn't work on all file
      systems.  In particular, it doesn't work on ext4 due to how the inode
      gets journalled --- the VFS writeback code will not always call
      ext4_write_inode().
      
      So we need to provide our own ext4_nfs_commit_metadata() method which
      calls ext4_write_inode() directly.
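
The hook-with-fallback dispatch described above can be modelled in userspace as follows. All names and structs here are hypothetical stand-ins: the generic path is modelled as one that may leave the inode unwritten, and the filesystem hook as one that always writes it.

```c
#include <stddef.h>

/* Toy inode: `dirty` is cleared only when the inode is really written. */
struct inode {
	int dirty;
	int generic_sync_calls;
};

/* Stand-in for an export operations table with an optional hook. */
struct export_ops {
	int (*commit_metadata)(struct inode *inode);
};

/* Generic fallback: modelled as not guaranteed to write the inode. */
static int generic_sync_inode_metadata(struct inode *inode)
{
	inode->generic_sync_calls++;
	return 0;
}

/* Filesystem-specific write path. */
static int fs_write_inode(struct inode *inode)
{
	inode->dirty = 0;
	return 0;
}

/* Filesystem hook: force the inode write directly. */
static int fs_commit_metadata(struct inode *inode)
{
	return fs_write_inode(inode);
}

/* Caller side: use the hook if provided, else the generic fallback. */
static int commit_inode_metadata(const struct export_ops *ops,
				 struct inode *inode)
{
	if (ops && ops->commit_metadata)
		return ops->commit_metadata(inode);
	return generic_sync_inode_metadata(inode);
}
```

Without the hook the toy inode stays dirty after the call; with it, the write is forced.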
      
      Google-Bug-Id: 121195940
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org