Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AutoBump] Merge with 6cf3e7d0 (Aug 14) (2) #355

Merged
merged 713 commits into from
Oct 4, 2024

Conversation

mgehre-amd
Copy link
Collaborator

No description provided.

DavidSpickett and others added 30 commits August 12, 2024 12:41
…ADS is not defined

When LLVM_ENABLE_THREADS is not defined, llvm::get_threadid returns 0 which
makes this test case fail.

This is a pretty niche setting, Linaro uses it to stop lld crashing our 32 bit
containers. So the test will get plenty of runs elsewhere.

In lldb's code it's not getting the current thread ID anyway, it's using
a value it got from ptrace. So even if that copy of lldb was built with
LLVM_ENABLE_THREADS off, it should still be able to debug threads.
On PlayStation, allow users to supply -static to the linker, via the
driver.

An initial step. Later changes will have the PS5 driver supply
additional options to the linker, if and when -static is passed.

SIE tracker: TOOLCHAIN-16704
We only need to see that 1 frame of the stack is in user code. No need
to carry on looking.

Doing so actually caused a test failure on Armv8 Ubuntu Jammy where
a libc function does not have a display name. I'm sure I'm going to
get stung by this elsewhere, but for this test, breaking early
sidesteps the problem.
The Mul factor was zero-extended here, resulting in incorrect
results for integers larger than 64-bit.

As we currently only multiply by 1 or -1, just split this into
two cases -- there's no need for a full multiplication here.

Fixes llvm#102597.
Implement FEAT_SME_B16B16 to enable ZA-targeting non-widening SME
BFloat16 instructions. Remove the now redundant FEAT_B16B16 which has
been replaced by FEAT_SVE_B16B16 and FEAT_SME_B16B16 (this commit), see
llvm#101480 for the details and
reasoning of this change to LLVM.

FEAT_SME_B16B16 is documented under the latest Armv9.4 feature
documentation:

https://developer.arm.com/documentation/109697/0100/Feature-descriptions/The-Armv9-4-architecture-extensio

- Changes to Clang AArch64 frontend
- Change target guard of SME2 ZA-targeting non-widening BFloat16
intrinsics to 'sme-b16b16'

- Changes to LLVM AArch64 backend
  - llvm/lib/Target/AArch64/AArch64Features.td
- Create FeatureSMEB16B16, which implies FeatureSME2 and
FeatureSVEB16B16
	- Remove FeatureB16B16
	- Fix description of FeatureSVEB16B16
  - llvm/lib/Target/AArch64/AArch64InstrInfo.td
	- Create HasSMEB16B16 predicate
  - llvm/lib/Target/AArch64/AArch64SMEInstrInfo.td
- Change predictication of SME2 ZA-targeting non-widening BFloat16
instructions to new HasSMEB16B16
  - llvm/lib/Target/AArch64/AArch64.td
- Add HasSMEB16B16 to SME2Unsupported (FEAT_SME_B16B16 implies
FEAT_SME2)
  - llvm/lib/AArch64/AsmParser/AArch64AsmParser.cpp
	- Remove flag 'b16b16' mapping to removed FeatureB16B16
	- Add flag 'sme-b16b16' mapping to new FeatureSMEB16B16

- Changes to LLVM unit tests
  - llvm/unittests/TargetParser/TargetParserTest.cpp
	- Add new sme-b16b16 flag to existing target parser tests
	- Add tests for the sme-b16b16 dependencies:
- 'sme-b16b16' should enable 'sme2', 'sve-b16b16'. - Remove 'b16b16'
from bf16 dependency test

- Added MC tests
    - llvm/test/MC/AArch64/SME2p1
- To ensure that ZA-targeting multi-vector non-widening BFloat16
instructions are enabled by +sme-b16b16, and that this feature is
removed by +nosme-b61b6.

- Modidified tests
- All CodeGen, Semantic, and MC tests that are effected by the removal
of 'b16b16', have been modified to supply and/or expect 'sme-b16b16'
where appropriate.
Include chain of ops feeding inductions in cost precomputation for
inductions, not just the induction increment. In VPlan, those
instructions will be cleaned up, as both phi and increment are generated
by VPWidenIntOrFpInductionRecipe independently.

Fixes llvm#101337.
This PR fixes emission of valid OpLifestart/OpLifestop instructions.
According to
https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpLifetimeStart:
"Size must be 0 if Pointer is a pointer to a non-void type or the
Addresses
[capability](https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#Capability)
is not declared.". The `Size` argument is set the corresponding
intrinsics arguments, so Size is not zero we must ensure that Pointer
has the required type by inserting a bitcast if needed.
…01732)

This PR contains changes in virtual register processing aimed to improve
correctness of emitted MIR between passes from the perspective of
MachineVerifier. This potentially helps to detect previously missed
flaws in code emission and harden the test suite. As a measure of
correctness and usefulness of this PR we may use a mode with expensive
checks set on, and MachineVerifier reports problems in the test suite.

In order to satisfy Machine Verifier requirements to MIR correctness not
only a rework of usage of virtual registers' types and classes is
required, but also corrections into pre-legalizer and instruction
selection logics. Namely, the following changes are introduced:
* scalar virtual registers have proper bit width,
* detect register class by SPIR-V type,
* add a superclass for id virtual register classes,
* fix Tablegen rules used for instruction selection,
* fixes of minor existed issues (missed flag for proper representation
of a null constant for OpenCL vs. HLSL, wrong usage of integer virtual
registers as a synonym of any non-type virtual register).
…ble.

Currently SLP vectorizer tries to keep only GEPs as scalar, if they are
vectorized but used externally. Same approach can be used for all scalar
values. This patch tries to keep original scalars if all its operands
remain scalar or externally used, the cost of the original scalar is
lower than the cost of the extractelement instruction, or if the number
of externally used scalars in the same entry is power of 2. Last
criterion allows better revectorization for multiply used scalars.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: llvm#100904
Now that the branches to the scalar epilogue are modeled in VPlan
directly, check the VPlan to see if a scalar epilogue is required.

Preparation for llvm#100658.
…te unitAttr. (llvm#102340)

Adds a new ComposableOpInterface for OpenMP operations that can
represent a single leaf of a composite OpenMP construct.

This is patch 1/2 in a series of patches. Patch 2 - llvm#102341.
… add verifier checks (llvm#102341)

This patch sets the omp.composite unit attr for composite wrapper ops
and also add appropriate checks to the verifiers of supported ops for
the presence/absence of the attribute.

This is patch 2/2 in a series of patches. Patch 1 - llvm#102340.
Just a simple check to ignore Inline asm fwait insertion

Fixes llvm#101613
llvm#101283)

`lldb-server platform --server` works on Windows now w/o multithreading.
The rest functionality remains unchanged.

Fixes llvm#90923, fixes llvm#56346.

This is the part 1 of the replacement of llvm#100670.

In the part 2 I plan to switch `lldb-server gdbserver` to use `--fd` and
listen a common gdb port for all gdbserver connections. Then we can
remove gdb port mapping to fiх llvm#97537.
…ss (llvm#102633)

Add missing math.atan to spirv.CL.atan and math.atan2 to spirv.CL.atan2
in MathToSPIRV.
Add math.atan to spirv.GL.atan too.
This has been flaky on our Windows on Arm bot:
https://lab.llvm.org/buildbot/#/builders/141/builds/1497

Despite passing when first landed.
…m#102824)

We were calling initialize() unconditionally when copying the union.
… boolean argument (llvm#102902)

This PR resolves a TODO in `generateGroupInst()`
(`lib/Target/SPIRV/SPIRVBuiltins.cpp`) and Issues
llvm#97311 and
llvm#97312 by implementing
support for non-const arguments in a Group builtin that requires a
boolean argument.
…lvm#101353)

This specifically handles the case of a transpose from a vector type
like `vector<8x[4]xf32>` to `vector<[4]x8xf32>`. Such transposes occur
fairly frequently when scalably vectorizing `linalg.generic`s. There is
no direct lowering for these (as types like `vector<[4]x8xf32>` cannot
be represented in LLVM-IR). However, if the only use of the transpose is
a write, then it is possible to lower the `transfer_write(transpose)` as
a VLA loop.

Example:

```mlir
%transpose = vector.transpose %vec, [1, 0]
   : vector<4x[4]xf32> to vector<[4]x4xf32>
vector.transfer_write %transpose, %dest[%i, %j] {in_bounds = [true, true]}
   : vector<[4]x4xf32>,  memref<?x?xf32>
```

Becomes:

```mlir
%c1 = arith.constant 1 : index
%c4 = arith.constant 4 : index
%c0 = arith.constant 0 : index
%0 = vector.extract %arg0[0] : vector<[4]xf32> from vector<4x[4]xf32>
%1 = vector.extract %arg0[1] : vector<[4]xf32> from vector<4x[4]xf32>
%2 = vector.extract %arg0[2] : vector<[4]xf32> from vector<4x[4]xf32>
%3 = vector.extract %arg0[3] : vector<[4]xf32> from vector<4x[4]xf32>
%vscale = vector.vscale
%c4_vscale = arith.muli %vscale, %c4 : index
scf.for %idx = %c0 to %c4_vscale step %c1 {
  %4 = vector.extract %0[%idx] : f32 from vector<[4]xf32>
  %5 = vector.extract %1[%idx] : f32 from vector<[4]xf32>
  %6 = vector.extract %2[%idx] : f32 from vector<[4]xf32>
  %7 = vector.extract %3[%idx] : f32 from vector<[4]xf32>
  %slice_i = affine.apply #map(%idx)[%i]
  %slice = vector.from_elements %4, %5, %6, %7 : vector<4xf32>
  vector.transfer_write %slice, %arg1[%slice_i, %j] {in_bounds = [true]}
    : vector<4xf32>, memref<?x?xf32>
}
```
Add small test that I missed adding to llvm#102341.
…m#102686)

The extra field in the descriptor carries multiple information and
cannot be deducted anymore when doing a reboxing. This patch updates the
codegen to retrieve the extra field value from the inboc and set it in
the new box.
This sorts DWARF op descriptions in `DWARFExpression.cpp` by opcode and version, packing the standardised ops together. A few ops also had the wrong version listed, so this fixes those versions as well. (The version does not appear to actually be used currently.)
aengelke and others added 26 commits August 14, 2024 09:24
There's only a single user (MCMachOStreamer), so it makes more sense to
move the version emission to the source of the data.
In order to guarantee that extracting 64 bits doesn't require more than
2 words, the word size would need to be 64 bits or more. If the word
size was smaller than 64, like 32, you may need to read 3 words to get
64 bits.
…lvm#102482)

This PR adds a field to the pass builder options struct, `AAPipeline`,
exposed through a C API `LLVMPassBuilderOptionsSetAAPipeline`, that is
used to set an alias analysis pipeline to be used in stead of the
default one.

x-ref https://discourse.llvm.org/t/newpm-c-api-questions/80598
)

Inside computeConstantDifference(), handle the case where both sides are
of the form `C * %x`, in which case we can strip off the common
multiplication (as long as we remember to multiply by it for the
following difference calculation).

There is an obvious alternative implementation here, which would be to
directly decompose multiplies inside the "Multiplicity" accumulation.
This does work, but I've found this to be both significantly slower
(because everything has to work on APInt) and more complex in
implementation (e.g. because we now need to match back the new More/Less
with an arbitrary factor) without providing more power in practice. As
such, I went for the simpler variant here.

This is the last step to make computeConstantDifference() sufficiently
powerful to replace existing uses of
`cast<SCEVConstant>(getMinusSCEV())` with it.
Avoid one heap allocation per function per constructed TLI. The
BitVector is never resized, so a bitset is sufficient.

Pull Request: llvm#103411
This metadata is queried quite often, so avoiding frequent lookups in
the hash map is beneficial. Therefore, cache the metadata node directly
in the module.

Pull Request: llvm#103410
This only has a single use and is equally well served by the existing
constructor -- blocks of a loop are already an array.

Pull Request: llvm#103399
Aggregate type specification doesn't have the size component.
Don't abuse LayoutAlignElem to avoid confusion.
This is a reland of llvm#96287. This change makes tests in logf128.ll ignore
the sign of NaNs for negative value tests and moves an #include <cmath>
to be blocked behind #ifndef _GLIBCXX_MATH_H.
Syntacore SCR4 is a microcontroller-class processor core that has much
in common with SCR3, but also supports F and D extensions.
Overview: https://syntacore.com/products/scr4

Syntacore SCR5 is an entry-level Linux-capable 32/64-bit RISC-V
processor core which scheduling model almost match SCR4.
Overview: https://syntacore.com/products/scr5

Co-authored-by: Dmitrii Petrov <dmitrii.petrov@syntacore.com>
Co-authored-by: Anton Afanasyev <anton.afanasyev@syntacore.com>
A truncate is considered saturated if no additional conversion is required between the target and return values. If the target is saturated when attempting to truncate from a vector, there is an opportunity to optimize it.

Previously, each architecture had its own attempt at optimization, leading to redundant code. This patch implements common logic by introducing three new ISDs:

`ISD::TRUNCATE_SSAT_S`: When the operand is a signed value and  the range of values matches the range of signed values of the  destination type.

`ISD::TRUNCATE_SSAT_U`: When the operand is a signed value and the range of values matches the range of unsigned values of the destination type.

`ISD::TRUNCATE_USAT_U`: When the operand is an unsigned value and the range of values matches the range of unsigned values of the destination type.

These ISDs indicate a saturated truncate.

Fixes llvm#85903
This patch removes the `ClauseProcessor::processDefault` method due to
it having been implemented in
`DataSharingProcessor::collectDefaultSymbols` instead.

Also, some `genXyzClauses` functions are updated to avoid triggering
TODO errors for clauses not supported by the corresponding construct and
to keep alphabetical sorting on the order in which clauses are
processed.
Add a test where diff checks are generated initial and then re-generated
when re-trying with runtime checks.

At the moment, the order doesn't match the order they are created in, as
the DiffChecks field in LAI isn't cleared as other fields holding
runtime checks.
DiffChecks will get populated twice when re-trying with runtime checks.
Without clearing it like the regular Checks vector, it will contain some
duplicates and the order the checks are created may not match the order
the checks have been queued when re-trying.
…#101472)

This patch add `void* PlatformArgs` parameter to
`__init_riscv_feature_bits`. `PlatformArgs` allows the platform to
provide pre-computed data and access it without extra effort. For
example, Linux could pass the vDSO object to avoid an extra system call.

```
__init_riscv_feature_bits()

->

__init_riscv_feature_bits(void *PlatformArgs)
```
…mission of the OpGroupBroadcast instruction (llvm#103050)

This PR addresses a TODO in
lib/Target/SPIRV/SPIRVInstructionSelector.cpp by adding implementation
of the non-const G_BUILD_VECTOR, and fix emission of the
OpGroupBroadcast instruction for the case when the `..._group_broadcast`
builtin has more than one `local_id` argument and `OpGroupBroadcast`
requires a newly constructed vector with 2 or 3 components instead of
originally passed series of `local_id` arguments.

This PR may resolve llvm#97310 if
the reason for the reported fail is an incorrectly generated
OpGroupBroadcast instruction that was definitely a case.

Existing test is hardened and a new test is added to cover this special
case of the OpGroupBroadcast instruction emission.
llvm#101449)

…ImplID

This patch 

1. remove the vendorId from `__riscv_vendor_feature_bits`
2. Define a new structure for vendorID, ArchID and ImplID
3. Update the relate init code
This reverts commit 3cab7c5.

The modified test fails on ppc64le buildbots.
…lvm#101827)

Adding a pass that is expected to run after the deallocation pipeline
and will move buffer deallocations right after their last user or
dependency, thus optimizing the allocation liveness.
This also adds a default constructor and a few uses of it.
Base automatically changed from bump_to_5855237 to feature/fused-ops October 2, 2024 07:14
@mgehre-amd mgehre-amd merged commit 79d2891 into feature/fused-ops Oct 4, 2024
6 checks passed
@mgehre-amd mgehre-amd deleted the bump_to_6cf3e7d0 branch October 4, 2024 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.