  1. Mar 28, 2019
    • KVM: x86: update %rip after emulating IO · 45def77e
      Sean Christopherson authored
      
      Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
      OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
      vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
      that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
      
      To avoid corrupting %rip after such a reset, commit 0967b7bf ("KVM:
      Skip pio instruction when it is emulated, not executed") changed the
      behavior of PIO handlers, i.e. today's "fast" PIO handling, to skip the
      instruction prior to exiting to userspace.  Full emulation doesn't need
      such tricks because re-emulating the instruction will naturally handle
      %rip being changed to point at the reset vector.
      
      Updating %rip prior to exiting to userspace has several drawbacks:
      
        - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
          fails it will likely yell about the wrong address.
        - Single-step exits to userspace are effectively dropped as
          KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
        - Behavior of PIO emulation is different depending on whether it
          goes down the fast path or the slow path.
      
      Rather than skip the PIO instruction before exiting to userspace,
      snapshot the linear %rip and cancel PIO completion if the current
      value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
      common scenario, the snapshot and comparison have negligible overhead
      as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
      VMREAD in this case.
      
      All other alternatives to snapshotting the linear %rip that don't
      rely on an explicit reset announcement suffer from one corner case
      or another.  For example, canceling PIO completion on any write to
      %rip fails if userspace does a save/restore of %rip, and attempting to
      avoid that issue by canceling PIO only if %rip changed then fails if PIO
      collides with the reset %rip.  Attempting to zero in on the exact reset
      vector won't work for APs, which means adding more hooks such as the
      vCPU's MP_STATE, and so on and so forth.
      
      Checking for a linear %rip match technically suffers from corner cases,
      e.g. userspace could theoretically rewrite the underlying code page and
      expect a different instruction to execute, or the guest hardcodes a PIO
      reset at 0xfffffff0, but those are far, far outside of what can be
      considered normal operation.
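
      A minimal sketch of the resulting completion path, assuming the
      snapshot is kept in vcpu->arch.pio.linear_rip and compared with the
      existing kvm_is_linear_rip() helper:

        static int complete_fast_pio_out(struct kvm_vcpu *vcpu)
        {
                vcpu->arch.pio.count = 0;

                /* Cancel completion if %rip moved, e.g. userspace reset
                 * the vCPU in response to the KVM_EXIT_IO. */
                if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.pio.linear_rip)))
                        return 1;

                /* Only now is it safe to skip the OUT instruction. */
                return kvm_skip_emulated_instruction(vcpu);
        }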
      
      Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
      Cc: <stable@vger.kernel.org>
      Reported-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init · 013cc6eb
      Vitaly Kuznetsov authored
      
      When userspace initializes guest vCPUs it may want to zero all supported
      MSRs, including Hyper-V related ones such as HV_X64_MSR_STIMERn_CONFIG/
      HV_X64_MSR_STIMERn_COUNT. With commit f3b138c5 ("kvm/x86: Update SynIC
      timers on guest entry only") we began doing stimer_mark_pending()
      unconditionally on every config change.
      
      The issue I'm observing manifests itself as follows:
      - Qemu writes 0 to STIMERn_{CONFIG,COUNT} MSRs and marks all stimers as
        pending in stimer_pending_bitmap, arms KVM_REQ_HV_STIMER;
      - kvm_hv_has_stimer_pending() starts returning true;
      - kvm_vcpu_has_events() starts returning true;
      - kvm_arch_vcpu_runnable() starts returning true;
      - when kvm_arch_vcpu_ioctl_run() gets into
        (vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) case:
        - kvm_vcpu_block() gets in 'kvm_vcpu_check_block(vcpu) < 0' and returns
          immediately, avoiding normal wait path;
        - -EAGAIN is returned from kvm_arch_vcpu_ioctl_run() immediately forcing
          userspace to retry.
      
      So instead of normal wait path we get a busy loop on all secondary vCPUs
      before they get INIT signal. This seems to be undesirable, especially given
      that this happens even when Hyper-V extensions are not used.
      
      Generally, it seems pointless to mark a stimer as pending in
      stimer_pending_bitmap and arm KVM_REQ_HV_STIMER when the only thing
      kvm_hv_process_stimers() will do is clear the corresponding bit. We can
      simply not mark disabled timers as pending instead.
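
      A simplified sketch of the idea in stimer_set_config(); the config
      bitfield layout follows the union hv_stimer_config used by this code,
      and surrounding logic is abridged:

        static int stimer_set_config(struct kvm_vcpu_hv_stimer *stimer,
                                     u64 config, bool host)
        {
                stimer_cleanup(stimer);
                stimer->config.as_uint64 = config;

                /* A disabled (e.g. zeroed) timer would only be cleared
                 * again by kvm_hv_process_stimers(), so don't mark it. */
                if (stimer->config.enable)
                        stimer_mark_pending(stimer, false);
                return 0;
        }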
      
      Fixes: f3b138c5 ("kvm/x86: Update SynIC timers on guest entry only")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrs · 2bdb76c0
      Xiaoyao Li authored
      
      MSR_IA32_ARCH_CAPABILITIES is emulated unconditionally, even if the
      host doesn't support it. Move it from the msrs_to_save array to the
      emulated_msrs array so that userspace is told the guest supports this
      MSR.
      
      Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts · 0cf9135b
      Sean Christopherson authored
      
      The CPUID flag ARCH_CAPABILITIES is unconditionally exposed to host
      userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
      regardless of hardware support under the pretense that KVM fully
      emulates MSR_IA32_ARCH_CAPABILITIES.  Unfortunately, only VMX hosts
      handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
      also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).
      
      Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
      that it's emulated on AMD hosts.
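
      A sketch of the common-code handling, modeled on the read side in
      kvm_get_msr_common() (the exact context may differ):

        case MSR_IA32_ARCH_CAPABILITIES:
                if (!msr_info->host_initiated &&
                    !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
                        return 1;
                msr_info->data = vcpu->arch.arch_capabilities;
                break;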
      
      Fixes: 1eaafe91 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
      Cc: stable@vger.kernel.org
      Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: mmu: Use range based flushing in slot_handle_level_range · f285c633
      Ben Gardon authored
      
      Replace kvm_flush_remote_tlbs with kvm_flush_remote_tlbs_with_address
      in slot_handle_level_range. When range based flushes are not enabled
      kvm_flush_remote_tlbs_with_address falls back to kvm_flush_remote_tlbs.
      
      This changes the behavior of many functions that indirectly use
      slot_handle_level_range, iff the range based flushes are enabled. The
      only potential problem I see with this is that kvm->tlbs_dirty will be
      cleared less often; however, the only caller of slot_handle_level_range
      that checks tlbs_dirty is kvm_mmu_notifier_invalidate_range_start, which
      does a kvm_flush_remote_tlbs after calling kvm_unmap_hva_range anyway.
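
      A sketch of the change inside slot_handle_level_range(), with the
      flush bounded by the gfn range being walked (variable names per the
      surrounding code):

        if (flush && lock_flush_tlb) {
                kvm_flush_remote_tlbs_with_address(kvm, start_gfn,
                                iterator.gfn - start_gfn + 1);
                flush = false;
        }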
      
      Tested: Ran all kvm-unit-tests on an Intel Haswell machine with and
      	without this patch. The patch introduced no new failures.
      
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: export <linux/kvm_para.h> and <asm/kvm_para.h> iff KVM is supported · 3d9683cf
      Masahiro Yamada authored
      
      I do not see any consistency about headers_install of <linux/kvm_para.h>
      and <asm/kvm_para.h>.
      
      According to my analysis of Linux 5.1-rc1, there are 3 groups:
      
       [1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported
      
          alpha, arm, hexagon, mips, powerpc, s390, sparc, x86
      
       [2] <asm/kvm_para.h> is exported, but <linux/kvm_para.h> is not
      
          arc, arm64, c6x, h8300, ia64, m68k, microblaze, nios2, openrisc,
          parisc, sh, unicore32, xtensa
      
       [3] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported
      
          csky, nds32, riscv
      
      This does not match the actual KVM support. At least, [2] is
      half-baked.
      
      Nor do arch maintainers appear to care about this. For example,
      commit 0add5371 ("microblaze: Add missing kvm_para.h to Kbuild")
      exported <asm/kvm_para.h> to user-space in order to fix an in-kernel
      build error.
      
      We have two ways to make this consistent:
      
       [A] export both <linux/kvm_para.h> and <asm/kvm_para.h> for all
           architectures, irrespective of the KVM support
      
       [B] Match the header export of <linux/kvm_para.h> and <asm/kvm_para.h>
           to the KVM support
      
      My first attempt was [A] because the code looks cleaner, but Paolo
      suggested [B].
      
      So, this commit goes with [B].
      
      For most architectures, <asm/kvm_para.h> was moved to kernel space.
      I changed include/uapi/linux/Kbuild so that it checks generated
      asm/kvm_para.h as well as checked-in ones.
      
      After this commit, there will be two groups:
      
       [1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported
      
          arm, arm64, mips, powerpc, s390, x86
      
       [2] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported
      
          alpha, arc, c6x, csky, h8300, hexagon, ia64, m68k, microblaze,
          nds32, nios2, openrisc, parisc, riscv, sh, sparc, unicore32, xtensa
      
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Cornelia Huck <cohuck@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region() · 4d66623c
      Wei Yang authored
      
      * nr_mmu_pages would be non-zero only if kvm->arch.n_requested_mmu_pages
        is zero.
      
      * nr_mmu_pages is always non-zero, since kvm_mmu_calculate_mmu_pages()
        never returns zero.
      
      Based on these two reasons, we can merge the two *if* clauses and use the
      return value from kvm_mmu_calculate_mmu_pages() directly. This simplifies
      the code and also eliminates the possibility of a reader believing
      nr_mmu_pages could be zero.
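
      A sketch of the merged clause in kvm_arch_commit_memory_region():

        if (!kvm->arch.n_requested_mmu_pages)
                kvm_mmu_change_mmu_pages(kvm,
                                         kvm_mmu_calculate_mmu_pages(kvm));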
      
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP fields · 711eff3a
      Krish Sadhukhan authored
      
      According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
      following check is performed on vmentry of L2 guests:
      
          On processors that support Intel 64 architecture, the IA32_SYSENTER_ESP
          field and the IA32_SYSENTER_EIP field must each contain a canonical
          address.
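
      A sketch of the added check in nested vmentry host-state validation,
      using the existing is_noncanonical_address() helper (the exact
      placement may differ):

        #ifdef CONFIG_X86_64
                if (is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu) ||
                    is_noncanonical_address(vmcs12->host_ia32_sysenter_eip, vcpu))
                        return -EINVAL;
        #endif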
      
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Workaround errata#1096 (insn_len may be zero on SMAP violation) · 05d5a486
      Singh, Brijesh authored
      
      Errata#1096:
      
      On a nested data page fault, when CR4.SMAP=1 and the guest data read
      generates a SMAP violation, the GuestInstrBytes field of the VMCB on
      VMEXIT will incorrectly return 0h instead of the correct guest
      instruction bytes.
      
      Recommended workaround:
      
      To determine what instruction the guest was executing the hypervisor
      will have to decode the instruction at the instruction pointer.
      
      The recommended workaround cannot be implemented for SEV guests
      because guest memory is encrypted with the guest-specific key and the
      instruction decoder will not be able to decode the instruction bytes.
      If we hit this erratum in an SEV guest, log a message and request a
      guest shutdown.
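
      A simplified sketch of the workaround; sev_guest() exists in svm.c,
      but the plumbing of the new hook is abridged here:

        static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
        {
                /* Non-SEV guest: apply the recommended workaround, i.e.
                 * tell the caller to decode/emulate at %rip. */
                if (!sev_guest(vcpu->kvm))
                        return true;

                /* SEV guest: the insn bytes are encrypted and cannot be
                 * decoded, so log and request a guest shutdown. */
                pr_err_ratelimited("KVM: SEV guest triggered AMD Errata 1096\n");
                kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
                return false;
        }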
      
      Reported-by: Venkatesh Srinivas <venkateshs@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size' · 47c42e6b
      Sean Christopherson authored
      
      The cr4_pae flag is a bit of a misnomer, its purpose is really to track
      whether the guest PTE that is being shadowed is a 4-byte entry or an
      8-byte entry.  Prior to supporting nested EPT, the size of the gpte was
      reflected purely by CR4.PAE.  KVM fudged things a bit for direct sptes,
      but it was mostly harmless since the size of the gpte never mattered.
      Now that a spte may be tracking an indirect EPT entry, relying on
      CR4.PAE is wrong and ill-named.
      
      For direct shadow pages, force the gpte_size to '1' as they are always
      8-byte entries; EPT entries can only be 8 bytes and KVM always uses
      8-byte entries for NPT and its identity map (when running with EPT but
      not unrestricted guest).
      
      Likewise, nested EPT entries are always 8 bytes.  Nested EPT presents a
      unique scenario as the size of the entries is not dictated by CR4.PAE,
      but neither is the shadow page a direct map.  To handle this scenario,
      set cr0_wp=1 and smap_andnot_wp=1, an otherwise impossible combination,
      to denote a nested EPT shadow page.  Use the information to avoid
      incorrectly zapping an unsync'd indirect page in __kvm_sync_page().
      
      Providing a consistent and accurate gpte_size fixes a bug reported by
      Vitaly where fast_cr3_switch() always fails when switching from L2 to
      L1 as kvm_mmu_get_page() would force role.cr4_pae=0 for direct pages,
      whereas kvm_calc_mmu_role_common() would set it according to CR4.PAE.
      
      Fixes: 7dcd5755 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
      Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Do not inherit quadrant and invalid for the root shadow EPT · 552c69b1
      Sean Christopherson authored
      
      Explicitly zero out quadrant and invalid instead of inheriting them from
      the root_mmu.  Functionally, this patch is a nop as we (should) never
      set quadrant for a direct mapped (EPT) root_mmu and nested EPT is only
      allowed if EPT is used for L1, and the root_mmu will never be invalid at
      this point.
      
      Explicitly setting flags sets the stage for repurposing the legacy
      paging bits in role, e.g. nxe, cr0_wp, and sm{a,e}p_andnot_wp, at which
      point 'smm' would be the only flag to be inherited from root_mmu.
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/cpufeature: Fix __percpu annotation in this_cpu_has() · f6027c81
      Jann Horn authored
      
      &cpu_info.x86_capability is __percpu, and the second argument of
      x86_this_cpu_test_bit() is expected to be __percpu. Don't cast the
      __percpu away and then implicitly add it again. This gets rid of 106
      lines of sparse warnings with the kernel config I'm using.
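
      A sketch of the fixed macro; the point is to keep the __percpu
      qualifier in the cast instead of casting it away (exact macro body
      per my reading of the patch):

        #define this_cpu_has(bit)                                        \
                (__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 : \
                 x86_this_cpu_test_bit(bit,                              \
                        (unsigned long __percpu *)&cpu_info.x86_capability))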
      
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190328154948.152273-1-jannh@google.com
    • x86/mm: Don't exceed the valid physical address space · 92c77f7c
      Ralph Campbell authored
      
      valid_phys_addr_range() is used to sanity check the physical address range
      of an operation, e.g., access to /dev/mem. It uses __pa(high_memory)
      internally.
      
      If memory is populated at the end of the physical address space, then
      __pa(high_memory) is outside of the physical address space because:
      
         high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
      
      For the comparison in valid_phys_addr_range() this is not an issue, but if
      CONFIG_DEBUG_VIRTUAL is enabled, __pa() maps to __phys_addr(), which
      verifies that the resulting physical address is within the valid physical
      address space of the CPU. So in the case that memory is populated at the
      end of the physical address space, this is not true and triggers a
      VIRTUAL_BUG_ON().
      
      Use __pa(high_memory - 1) to prevent the conversion from going beyond
      the end of valid physical addresses.
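
      A sketch of the fixed check (the helper lives in arch/x86/mm/mmap.c):

        int valid_phys_addr_range(phys_addr_t addr, size_t count)
        {
                /* Probe the last valid byte: __pa(high_memory) itself may
                 * lie one past the end of the physical address space. */
                return addr + count <= __pa(high_memory - 1) + 1;
        }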
      
      Fixes: be62a320 ("x86/mm: Limit mmap() of /dev/mem to valid physical addresses")
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Craig Bergstrom <craigb@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hans Verkuil <hans.verkuil@cisco.com>
      Cc: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sander Eikelenboom <linux@eikelenboom.it>
      Cc: Sean Young <sean@mess.org>
      
      Link: https://lkml.kernel.org/r/20190326001817.15413-2-rcampbell@nvidia.com
    • x86/retpolines: Disable switch jump tables when retpolines are enabled · a9d57ef1
      Daniel Borkmann authored

      Commit ce02ef06 ("x86, retpolines: Raise limit for generating indirect
      calls from switch-case") raised the limit under retpolines to 20 switch
      cases where gcc would only then start to emit jump tables, and therefore
      effectively disabling the emission of slow indirect calls in this area.
      
      After this was brought to the attention of the gcc folks [0], Martin
      Liska fixed gcc to align with clang by not generating switch jump
      tables at all under retpolines. This takes effect in gcc starting from
      stable version 8.4.0. Given the kernel supports compilation with older
      versions of gcc where the fix is not available or backported, we need
      to keep the extra KBUILD_CFLAGS around for some time and generally set
      -fno-jump-tables to align with what more recent gcc does automatically
      today.
      
      More than 20 switch cases are not expected to be fast-path critical, but
      it would still be good to align with gcc behavior for versions < 8.4.0 in
      order to have consistency across supported gcc versions. vmlinux size
      grows slightly, by 0.27%, for older gcc. This flag is only set to work
      around affected gcc; there is no change for clang.
      
        [0] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952
      
      
      
      Suggested-by: Martin Liska <mliska@suse.cz>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Björn Töpel<bjorn.topel@intel.com>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Link: https://lkml.kernel.org/r/20190325135620.14882-1-daniel@iogearbox.net
    • x86/smp: Enforce CONFIG_HOTPLUG_CPU when SMP=y · bebd024e
      Thomas Gleixner authored
      
      The SMT disable 'nosmt' command line argument is not working properly when
      CONFIG_HOTPLUG_CPU is disabled. The teardown of the sibling CPUs, which are
      required to be brought up due to the MCE issues, cannot work. The CPUs are
      then kept in a half-dead state.
      
      As the 'nosmt' functionality has become popular due to the speculative
      hardware vulnerabilities, the half-torn-down state is not a proper solution
      to the problem.
      
      Enforce CONFIG_HOTPLUG_CPU=y when SMP is enabled so the full operation is
      possible.
      
      Reported-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Konrad Wilk <konrad.wilk@oracle.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Michael Kelley <michael.h.kelley@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190326163811.598166056@linutronix.de
    • s390/cpumf: Fix warning from check_processor_id · b6ffdf27
      Thomas Richter authored
      
      Function __hw_perf_event_init() used a per-CPU variable without
      ensuring that CPU preemption was disabled. This caused the
      following warning in the kernel log:
      
        [ 7.277085] BUG: using smp_processor_id() in preemptible
                       [00000000] code: cf-csdiag/1892
        [ 7.277111] caller is cf_diag_event_init+0x13a/0x338
        [ 7.277122] CPU: 10 PID: 1892 Comm: cf-csdiag Not tainted
                       5.0.0-20190318.rc0.git0.9e1a11e0f602.300.fc29.s390x+debug #1
        [ 7.277131] Hardware name: IBM 2964 NC9 712 (LPAR)
        [ 7.277139] Call Trace:
        [ 7.277150] ([<000000000011385a>] show_stack+0x82/0xd0)
        [ 7.277161]  [<0000000000b7a71a>] dump_stack+0x92/0xd0
        [ 7.277174]  [<00000000007b7e9c>] check_preemption_disabled+0xe4/0x100
        [ 7.277183]  [<00000000001228aa>] cf_diag_event_init+0x13a/0x338
        [ 7.277195]  [<00000000002cf3aa>] perf_try_init_event+0x72/0xf0
        [ 7.277204]  [<00000000002d0bba>] perf_event_alloc+0x6fa/0xce0
        [ 7.277214]  [<00000000002dc4a8>] __s390x_sys_perf_event_open+0x398/0xd50
        [ 7.277224]  [<0000000000b9e8f0>] system_call+0xdc/0x2d8
        [ 7.277233] 2 locks held by cf-csdiag/1892:
        [ 7.277241]  #0: 00000000976f5510 (&sig->cred_guard_mutex){+.+.},
                        at: __s390x_sys_perf_event_open+0xd2e/0xd50
        [ 7.277257]  #1: 00000000363b11bd (&pmus_srcu){....},
                        at: perf_event_alloc+0x52e/0xce0
      
      The variable is now accessed in proper context: use the
      get_cpu_var()/put_cpu_var() pair to disable
      preemption during the access.
      As the hardware authorization settings apply to all CPUs, it
      does not matter which CPU is used to check the authorization setting.
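
      A sketch of the pattern; the per-CPU variable and the auth-check
      helper are named illustratively here, not verbatim from the patch:

        struct cpu_cf_events *cpuhw;

        cpuhw = &get_cpu_var(cpu_cf_events);    /* disables preemption */
        /* The auth bits are identical on every CPU, any copy will do. */
        err = check_auth(event, &cpuhw->info);
        put_cpu_var(cpu_cf_events);             /* re-enables preemption */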
      
      Remove the event->count assignment. It is not needed as function
      perf_event_alloc() allocates memory for the event with kzalloc() and
      thus count is already set to zero.
      
      Fixes: fe5908bc ("s390/cpum_cf_diag: Add support for s390 counter facility diagnostic trace")
      
      Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
      Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  2. Mar 27, 2019
    • arm64: replace memblock_alloc_low with memblock_alloc · 9e0a17db
      Chen Zhou authored

      If we use "crashkernel=Y[@X]" and the start address is above 4G, the
      arm64 kdump capture kernel's call to memblock_alloc_low() in
      request_standard_resources() may fail. Replace memblock_alloc_low()
      with memblock_alloc().
      
      [    0.000000] MEMBLOCK configuration:
      [    0.000000]  memory size = 0x0000000040650000 reserved size = 0x0000000004db7f39
      [    0.000000]  memory.cnt  = 0x6
      [    0.000000]  memory[0x0]	[0x00000000395f0000-0x000000003968ffff], 0x00000000000a0000 bytes on node 0 flags: 0x4
      [    0.000000]  memory[0x1]	[0x0000000039730000-0x000000003973ffff], 0x0000000000010000 bytes on node 0 flags: 0x4
      [    0.000000]  memory[0x2]	[0x0000000039780000-0x000000003986ffff], 0x00000000000f0000 bytes on node 0 flags: 0x4
      [    0.000000]  memory[0x3]	[0x0000000039890000-0x0000000039d0ffff], 0x0000000000480000 bytes on node 0 flags: 0x4
      [    0.000000]  memory[0x4]	[0x000000003ed00000-0x000000003ed2ffff], 0x0000000000030000 bytes on node 0 flags: 0x4
      [    0.000000]  memory[0x5]	[0x0000002040000000-0x000000207fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
      [    0.000000]  reserved.cnt  = 0x7
      [    0.000000]  reserved[0x0]	[0x0000002040080000-0x0000002041c4dfff], 0x0000000001bce000 bytes flags: 0x0
      [    0.000000]  reserved[0x1]	[0x0000002041c53000-0x0000002042c203f8], 0x0000000000fcd3f9 bytes flags: 0x0
      [    0.000000]  reserved[0x2]	[0x000000207da00000-0x000000207dbfffff], 0x0000000000200000 bytes flags: 0x0
      [    0.000000]  reserved[0x3]	[0x000000207ddef000-0x000000207fbfffff], 0x0000000001e11000 bytes flags: 0x0
      [    0.000000]  reserved[0x4]	[0x000000207fdf2b00-0x000000207fdfc03f], 0x0000000000009540 bytes flags: 0x0
      [    0.000000]  reserved[0x5]	[0x000000207fdfd000-0x000000207ffff3ff], 0x0000000000202400 bytes flags: 0x0
      [    0.000000]  reserved[0x6]	[0x000000207ffffe00-0x000000207fffffff], 0x0000000000000200 bytes flags: 0x0
      [    0.000000] Kernel panic - not syncing: request_standard_resources: Failed to allocate 384 bytes
      [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.1.0-next-20190321+ #4
      [    0.000000] Call trace:
      [    0.000000]  dump_backtrace+0x0/0x188
      [    0.000000]  show_stack+0x24/0x30
      [    0.000000]  dump_stack+0xa8/0xcc
      [    0.000000]  panic+0x14c/0x31c
      [    0.000000]  setup_arch+0x2b0/0x5e0
      [    0.000000]  start_kernel+0x90/0x52c
      [    0.000000] ---[ end Kernel panic - not syncing: request_standard_resources: Failed to allocate 384 bytes ]---
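
      A sketch of the change in request_standard_resources(); variable
      names follow arch/arm64/kernel/setup.c:

        res_size = num_standard_resources * sizeof(*standard_resources);
        standard_resources = memblock_alloc(res_size, SMP_CACHE_BYTES);
        if (!standard_resources)
                panic("%s: Failed to allocate %zu bytes\n", __func__,
                      res_size);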
      
      Link: https://www.spinics.net/lists/arm-kernel/msg715293.html
      
      
      Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    • x86/realmode: Don't leak the trampoline kernel address · b929a500
      Matteo Croce authored
      
      Since commit
      
        ad67b74d ("printk: hash addresses printed with %p")
      
      at boot "____ptrval____" is printed instead of the trampoline addresses:
      
        Base memory trampoline at [(____ptrval____)] 99000 size 24576
      
      Remove the print as we don't want to leak kernel addresses and this
      statement is not needed anymore.
      
      Fixes: ad67b74d ("printk: hash addresses printed with %p")
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190326203046.20787-1-mcroce@redhat.com
    • x86/boot: Fix incorrect ifdeffery scope · 0f02daed
      Baoquan He authored
      
      The declarations related to immovable memory handling are outside the
      BOOT_COMPRESSED_MISC_H #ifdef scope; wrap them inside it.
      
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Chao Fan <fanc.fnst@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190304055546.18566-1-bhe@redhat.com
    • RISC-V: Always compile mm/init.c with cmodel=medany and notrace · 387181dc
      Anup Patel authored
      
      The Linux RISC-V 32bit kernel is broken after we moved setup_vm() from
      kernel/setup.c to mm/init.c, because the Linux RISC-V 32bit kernel by
      default uses cmodel=medlow, which results in a non-position-independent
      setup_vm().
      
      This patch fixes Linux RISC-V 32bit kernel booting by:
      1. Forcing cmodel=medany for mm/init.c
      2. Moving the remaining MM-related state (va_pa_offset, pfn_base and
         empty_zero_page) from kernel/setup.c to mm/init.c

      Further, setup_vm() cannot handle GCC instrumentation for FTRACE, so we
      disable it for mm/init.c by not using the "-pg" compiler flag.
      
      Fixes: 6f1e9e94 ("RISC-V: Move setup_vm() to mm/init.c")
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Anup Patel <anup.patel@wdc.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
    • riscv: fix accessing 8-byte variable from RV32 · dbee9c9c
      Alan Kao authored
      
      A memory store to an 8-byte variable on RV32 is divided into two sw
      instructions in the put_user macro.  The current fixup returns
      execution flow to the second sw instead of the instruction after it.

      This patch fixes the fixup code to match the load access part.
      
      Signed-off-by: Alan Kao <alankao@andestech.com>
      Cc: Greentime Hu <greentime@andestech.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
  3. Mar 25, 2019
    • arm64: tegra: Disable CQE Support for SDMMC4 on Tegra186 · 93958742
      Jonathan Hunter authored
      
      Enabling CQE support on Tegra186 Jetson TX2 has introduced a regression
      that is causing accesses to the file-system on the eMMC to fail. Errors
      such as the following have been observed ...
      
       mmc2: running CQE recovery
       mmc2: mmc_select_hs400 failed, error -110
       print_req_error: I/O error, dev mmcblk2, sector 8 flags 80700
       mmc2: cqhci: CQE failed to exit halt state
      
      For now disable CQE support for Tegra186 until this issue is resolved.
      
      Fixes: dfd3cb6f ("arm64: tegra: Add CQE Support for SDMMC4")
      Signed-off-by: Jonathan Hunter <jonathanh@nvidia.com>
      Signed-off-by: Thierry Reding <treding@nvidia.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    • ARM: dts: nomadik: Fix polarity of SPI CS · fa946356
      Linus Walleij authored
      
      The SPI DT bindings are for historical reasons a pitfall: the ability
      to flag a GPIO line as active high/low with the second cell's flags
      was introduced later, so the SPI subsystem will only accept the
      boolean flag spi-cs-high to indicate that the line is active high.
      
      It worked by mistake, but the mistake was corrected
      in another commit.
      
      The comment in the DTS file was also misleading: this
      CS is indeed active high.
      
      Fixes: cffbb02d ("ARM: dts: nomadik: Augment NHK15 panel setting")
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    • ARM: davinci: fix build failure with allnoconfig · 2dbed152
      Sekhar Nori authored
      
      An allnoconfig build with just ARCH_DAVINCI enabled fails because
      drivers/clk/davinci/* depends on REGMAP being enabled.
      
      Fix it by selecting REGMAP_MMIO when building in
      DaVinci support.
      
      Signed-off-by: Sekhar Nori <nsekhar@ti.com>
      Reviewed-by: David Lechner <david@lechnology.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    • powerpc/64: Fix memcmp reading past the end of src/dest · d9470757
      Michael Ellerman authored
      
      Chandan reported that fstests' generic/026 test hit a crash:
      
        BUG: Unable to handle kernel data access at 0xc00000062ac40000
        Faulting instruction address: 0xc000000000092240
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE SMP NR_CPUS=2048 DEBUG_PAGEALLOC NUMA pSeries
        CPU: 0 PID: 27828 Comm: chacl Not tainted 5.0.0-rc2-next-20190115-00001-g6de6dba64dda #1
        NIP:  c000000000092240 LR: c00000000066a55c CTR: 0000000000000000
        REGS: c00000062c0c3430 TRAP: 0300   Not tainted  (5.0.0-rc2-next-20190115-00001-g6de6dba64dda)
        MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44000842  XER: 20000000
        CFAR: 00007fff7f3108ac DAR: c00000062ac40000 DSISR: 40000000 IRQMASK: 0
        GPR00: 0000000000000000 c00000062c0c36c0 c0000000017f4c00 c00000000121a660
        GPR04: c00000062ac3fff9 0000000000000004 0000000000000020 00000000275b19c4
        GPR08: 000000000000000c 46494c4500000000 5347495f41434c5f c0000000026073a0
        GPR12: 0000000000000000 c0000000027a0000 0000000000000000 0000000000000000
        GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR20: c00000062ea70020 c00000062c0c38d0 0000000000000002 0000000000000002
        GPR24: c00000062ac3ffe8 00000000275b19c4 0000000000000001 c00000062ac30000
        GPR28: c00000062c0c38d0 c00000062ac30050 c00000062ac30058 0000000000000000
        NIP memcmp+0x120/0x690
        LR  xfs_attr3_leaf_lookup_int+0x53c/0x5b0
        Call Trace:
          xfs_attr3_leaf_lookup_int+0x78/0x5b0 (unreliable)
          xfs_da3_node_lookup_int+0x32c/0x5a0
          xfs_attr_node_addname+0x170/0x6b0
          xfs_attr_set+0x2ac/0x340
          __xfs_set_acl+0xf0/0x230
          xfs_set_acl+0xd0/0x160
          set_posix_acl+0xc0/0x130
          posix_acl_xattr_set+0x68/0x110
          __vfs_setxattr+0xa4/0x110
          __vfs_setxattr_noperm+0xac/0x240
          vfs_setxattr+0x128/0x130
          setxattr+0x248/0x600
          path_setxattr+0x108/0x120
          sys_setxattr+0x28/0x40
          system_call+0x5c/0x70
        Instruction dump:
        7d201c28 7d402428 7c295040 38630008 38840008 408201f0 4200ffe8 2c050000
        4182ff6c 20c50008 54c61838 7d201c28 <7d402428> 7d293436 7d4a3436 7c295040
      
      The instruction dump decodes as:
        subfic  r6,r5,8
        rlwinm  r6,r6,3,0,28
        ldbrx   r9,0,r3
        ldbrx   r10,0,r4      <-
      
      Which shows us doing an 8 byte load from c00000062ac3fff9, which
      crosses the page boundary at c00000062ac40000 and faults.
      
      It's not OK for memcmp to read past the end of the source or
      destination buffers if that would cross a page boundary, because we
      don't know that the next page is mapped.
      
      As pointed out by Segher, we can read past the end of the source or
      destination as long as we don't cross a 4K boundary, because that's
      our minimum page size on all platforms.
      
      The bug is in the code at the .Lcmp_rest_lt8bytes label. When we get
      there we know that s1 is 8-byte aligned and we have at least 1 byte to
      read, so a single 8-byte load won't read past the end of s1 and cross
      a page boundary.
      
      But we have to be more careful with s2. So check if it's within 8
      bytes of a 4K boundary and if so go to the byte-by-byte loop.
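
      A hedged C rendering of the added guard (the real code is powerpc
      assembly in arch/powerpc/lib/memcmp_64.S):

        /* If s2 is within 8 bytes of a 4K boundary, one unaligned 8-byte
         * load could cross into an unmapped page; go byte by byte. */
        if (((unsigned long)s2 & 0xfff) > 0x1000 - 8)
                goto byte_by_byte;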
      
      Fixes: 2d9ee327 ("powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()")
      Cc: stable@vger.kernel.org # v4.19+
      Reported-by: Chandan Rajendra <chandan@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
      Tested-by: Chandan Rajendra <chandan@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  4. Mar 23, 2019
    • x86/gart: Exclude GART aperture from kcore · ffc8599a
      Kairui Song authored
      
      On machines where the GART aperture is mapped over physical RAM,
      /proc/kcore contains the GART aperture range. Accessing the GART range via
      /proc/kcore results in a kernel crash.
      
      vmcore used to have the same issue, until it was fixed with commit
      2a3e83c6 ("x86/gart: Exclude GART aperture from vmcore"), leveraging
      existing hook infrastructure in vmcore to let /proc/vmcore return
      zeroes when attempting to read the aperture region, so that it won't
      read from the actual memory.
      
      Apply the same workaround to kcore. First implement the same hook
      infrastructure for kcore, then reuse the hook functions introduced in
      the previous vmcore fix, with some minor adjustments: rename some
      functions for more general usage and simplify the hook infrastructure
      a bit, as there is no module usage yet.
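
      A sketch of the hook infrastructure added to fs/proc/kcore.c; names
      follow my reading of the patch:

        static int (*mem_pfn_is_ram)(unsigned long pfn);

        int __init register_mem_pfn_is_ram(int (*fn)(unsigned long pfn))
        {
                if (mem_pfn_is_ram)
                        return -EBUSY;
                mem_pfn_is_ram = fn;
                return 0;
        }

        static int pfn_is_ram(unsigned long pfn)
        {
                return mem_pfn_is_ram ? mem_pfn_is_ram(pfn) : 1;
        }

        /* read_kcore() then returns zeroes instead of copying whenever
         * pfn_is_ram() is false for the page being read. */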
      
      Suggested-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Kairui Song <kasong@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Jiri Bohac <jbohac@suse.cz>
      Acked-by: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Dave Young <dyoung@redhat.com>
      Link: https://lkml.kernel.org/r/20190308030508.13548-1-kasong@redhat.com
      