
[CIR][CUDA] Emit address space casts on lowering #1518


Open

wants to merge 2,419 commits into base `main`

Conversation

AdUhTkJm (Contributor):

This is different from the approach in OG, because we use cir.get_global to access global variables, while in OG we just directly write the name of the variable. My approach is to add another attribute to GlobalOps and emit additional casts in LoweringPrepare.

I hope the comments are clear enough about the changes, and please tell me if you find a better approach.

bcardosolopes and others added 30 commits March 17, 2025 14:26
…n to match OG"

Seems like Windows bots are now broken!

This reverts commit 9a63c50.
llvm#1180)

This removes some NYI in CIRGenModule::tryEmitBaseDestructorAsAlias and,
similar to llvm#1179, uses `assert(false)` to tell devs to add tests.

It is slightly verbose due to the difference between LLVM's and CIR's type
systems: LLVM's pointers are opaque, while CIR's pointers are typed.
So we need to handle these pointers when transforming the generated CIR.
…lvm#1186)

This PR adds a new command-line option `--target` to our tool
`cir-translate`. Its concrete behaviour also depends on the triple
and data layout in the CIR module. See the table in the code comments for
details.

The default triple is `x86_64-unknown-linux-gnu` currently.

Some tests are updated with the triple and DLTI attributes removed
(replaced by an option on the RUN line), but some tests remain
unchanged, primarily because they use `cir-opt` instead.
…nAttr (llvm#1199)

A CIR PoisonOp is needed in this context, as the alternative would be to use
VecCreateOp to prepare an arg for VecInsertElement, but VecCreateOp serves a
different purpose and [it would insert all
elements](https://github.com/llvm/clangir/blob/eacaabba76ebdbf87217fefaa77f92c45cf4509c/clang/lib/CIR/Lowering/DirectToLLVM/LowerToLLVM.cpp#L1679),
which is unnecessary in this context.

Here is the [intrinsic def
](https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&q=vqmovns_)
This would facilitate the implementation of the neon intrinsic `neon_vmax_v` and
`__builtin_elementwise_max`, and potentially future optimizations. CIR
BinOp supports vector types.
Floating point is already supported by FMaxOp.
This option creates and links all tools against a single libLLVM shared
library (and the corresponding CLANG_LINK_CLANG_DYLIB option also gets
turned on by default). In order for this to work, we need to use
LINK_COMPONENTS instead of LINK_LIBS for all LLVM dependencies, and
clang_target_link_libraries for all Clang dependencies, so that they get
rewritten to use the dylib. Remove llvm_update_compile_flags while I'm
here, since the build macros handle that internally. Before this change,
we'd link against certain LLVM libraries both statically and
dynamically, leading to test failures from duplicate singletons.

The way this works for MLIR is fragile right now: MLIR can create its
own dylib as well but doesn't have build system support for linking
against that dylib. We end up folding the MLIR libraries into
libclang-cpp.so (because all Clang dependencies get pulled into it), but
MLIR tools still link the MLIR libraries statically. It'll still work,
but BUILD_SHARED_LIBS is possibly a better alternative for development.
Distributions like Fedora build their LLVM packages with
LLVM_LINK_LLVM_DYLIB, so we'll want to eventually have good MLIR support
for that setup too.
…lvm#1203)

C/C++ functions returning void had an explicit !cir.void return type
while not having any returned value, which was breaking a lot of MLIR
invariants when the CIR dialect is used in a greater context, for
example with the inliner. Now, a C/C++ function returning void has no
return type and no return values, which does not break the MLIR
invariant about having the same number of return types and returned values.
This change keeps the same parsing/pretty-printed syntax as before for
compatibility.
Combined implementation with `neon_vaddlvq_u16`.
OG somehow implemented them separately, but they are no different except for
signedness and intrinsic name.
[OG's
neon_vaddlvq_s16](https://github.com/llvm/clangir/blob/2b1a638ea07ca10c5727ea835bfbe17b881175cc/clang/lib/CodeGen/CGBuiltin.cpp#L13483)
[OG's
neon_vaddlvq_u16](https://github.com/llvm/clangir/blob/2b1a638ea07ca10c5727ea835bfbe17b881175cc/clang/lib/CodeGen/CGBuiltin.cpp#L13449)
The module-level uwtable attribute controls the unwind tables for any
synthesized functions, and the function-level attribute controls them
for those functions. I'll add support for this attribute to the LLVM
dialect as well, but translate it from CIR directly for now to avoid
waiting on the MLIR addition and a subsequent rebase.
llvm#1232)

Match `CodeGenModule::SetLLVMFunctionAttributesForDefinition` so that we
can see what's missing and have a good base to build upon.
I am working on a clangir-based solution to improve C++'s safety
(https://discourse.llvm.org/t/rfc-a-clangir-based-safe-c/83245). This is
similar to the previous analysis-only approach I proposed, where we
may not care about the lowered code. And this is what I described as
layering problems in
https://discourse.llvm.org/t/rfc-a-clangir-based-safe-c/83245

This is similar to the other issue proposed in
llvm#1128. We'd better emit the
higher-level operations and lower/optimize them later.

This is also inspired by our handling of VAArg, where we use ABI
information to lower things during the passes. It gives me more
confidence that I am doing things right.
The PR should help us get rid of the NYI
`NYI UNREACHABLE executed at
clang/lib/CIR/CodeGen/CIRGenExprAgg.cpp:899`
[Relevant OG code
here](https://github.com/llvm/clangir/blob/7fb608d4d1b72c25a1739a1bd66c9024208819cb/clang/lib/CodeGen/CGExpr.cpp#L4767):
I put `HasExplicitObjectParameter` support as a missing feature, which
is a new C++23 feature.
…ge (llvm#1214)

#### The Problem
Let's take a look at the following code:
```cpp
struct A {
  ~A() {}
};

int foo() { return 42; }
void bar() {
  A a;
  int b = foo();
}
```
The call to `foo` guarded by the synthetic `tryOp` looks approximately
like the following:
```mlir
cir.try synthetic cleanup {
  %2 = cir.call exception @_Z3foov() : () -> !s32i cleanup {
    cir.call @_ZN1AD1Ev(%0) : (!cir.ptr<!ty_A>) -> () extra(#fn_attr1)   // call to destructor of 'A'
    cir.yield
  }
  cir.yield
} catch [#cir.unwind {
  cir.resume
}]
cir.store %2, %1 : !s32i, !cir.ptr<!s32i>  // CIR verification error
```
The result of the `foo` call is in the `try` region - and is not
accessible from the outside, so the code generation fails with
`operand #0 does not dominate its use` .

#### Solution
So we have several options for how to handle this properly.
1. We may introduce a new operation here, like `TryCall`, but probably a
more high-level one, e.g. introduce an `InvokeOp`.
2. Also, we may add the result to `TryOp`.
3. The quick fix implemented in this PR is a temporary `alloca`
where we store the call result right in the try region. The result
of the whole `emitCall` is then a `load` from the temp `alloca`.

So this PR is both a request for changes and an open discussion:
how should we handle this properly? So far I have chosen the third approach.
If it's OK, I will create one more PR with a similar fix for
aggregate results, or update this one.
This PR puts the for-loop body, while-loop body, and do-while-loop body in
nested scopes. Allocas in the loop body are now pushed down to the nested
scope.

Resolves llvm#1218.
There are two sets of intrinsics regarding Min and Max operations for
floating points

[Maximum](https://mlir.llvm.org/docs/Dialects/LLVM/#llvmintrmaximum-llvmmaximumop)
vs
[Maxnum](https://mlir.llvm.org/docs/Dialects/LLVM/#llvmintrmaxnum-llvmmaxnumop)

[Minimum](https://mlir.llvm.org/docs/Dialects/LLVM/#llvmintrminimum-llvmminimumop)
vs
[Minnum](https://mlir.llvm.org/docs/Dialects/LLVM/#llvmintrminnum-llvmminnumop)

[The difference is whether NaN should be propagated when one of the
inputs is
NaN](https://llvm.org/docs/LangRef.html#llvm-maximumnum-intrinsic)
Maxnum and Minnum return the number when one input is NaN and the
other is a number, but
Maximum and Minimum return NaN (propagating the NaN).

And they are resolved to different ASM such as
[FMAX](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/FMAX--vector---Floating-point-Maximum--vector--?lang=en)
vs
[FMAXNM](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/FMAXNM--vector---Floating-point-Maximum-Number--vector--?lang=en)

Both have use cases; we already implemented Maxnum and Minnum,
but Maximum and Minimum have use cases in the [neon intrinsic
](https://developer.arm.com/architectures/instruction-sets/intrinsics/vmax_f32
)
and [__builtin_elementwise_maximum
](https://github.com/llvm/clangir/blob/a989ecb2c55da1fe28e4072c31af025cba6c4f0f/clang/test/CodeGen/strictfp-elementwise-bulitins.cpp#L53)
…lvm#1235)

Using an iterator to visit the `std::initializer_list` field reduced readability.
…d neon_vaddvq_f64 (llvm#1238)

[Neon intrinsic
definition](https://developer.arm.com/architectures/instruction-sets/intrinsics/vaddv_f32).
They are vector-across operations, for which LLVM doesn't currently have a
generic intrinsic. As a side note for brainstorming, it might be
worthwhile in the future for CIR to introduce vector-across operations
even though the LLVM dialect doesn't have them yet. This would help expose
optimization opportunities.
E.g. a very trivial constant fold can happen if we are adding across a
constant vector.
…#1239)

This implementation is different from OG in that we chose to use a
CIR op which eventually lowers to generic LLVM intrinsics instead of
llvm.aarch64.neon intrinsics.
But down at the ASM level, [they are identical
](https://godbolt.org/z/Gbbos9z6Y).
…vm#1242)

This patch follows
llvm#1220 (comment) by
augmenting `CIR_Type` with a new field, `tbaaName`. Specifically, it
enables TableGen support for the `-gen-cir-tbaa-name-lowering` option,
allowing for the generation of `getTBAAName` functions based on the
`tbaaName`. This enhancement enables us to replace the hardcoded TBAA
names in the `getTypeName` function with the newly generated
`getTBAAName`.
This PR adds a bitcast when we rewrite a global's type. Previously we just
set a new type and it worked.
But recently I started testing ClangIR with CSmith in order to find some
runtime bugs and ran into the following problem.

```cpp
#include <stdint.h>
#include <stdio.h>

typedef struct {
    int x : 15;
    uint8_t y;
} S;

S g = {-12, 254};

int main() {
    printf("%d\n", g.y);
    return 0;
}
```
The output for this program is ... 127, not 254!
The reason is that first the global var is created with the type of struct
`S`, then a `get_member` operation is generated with index `1`,
and only after that is the type of the global rewritten, I assume because
of the anon struct created on the right side of the initialization.
But the `get_member` operation still accesses the field at
index `1` and reads the wrong byte.
If we change the type of `y` to `int`, we will fail at the verification
stage. But in the example above it's a runtime error!

This is why I suggest adding a bitcast once we have to rewrite the
global type.
We figure it would be nice to have a common place for all our known
crashes that is tracked by git and is actively verified by lit to see
whether we can now support them. It can act as our source of truth
for known failures and also provide potential good first tasks for new
developers.

Add a simple test case of a known crash that involves copying a struct
in a catch.

Reviewers: smeenai, bcardosolopes

Reviewed By: bcardosolopes

Pull Request: llvm#1243
…#1247)

Basically that is: the return value of the `=` operator for bit-field
assignment is wrong now. For example, the following function returns `7` for a
3-bit bit-field, though it should return `-1`:
```cpp
int get_a(T *t) {
  return (t->a = 7);
}
```

This PR fixes it. The bug was actually in the lowering: the integer cast
was applied in the wrong place (in comparison with the original codegen).
AmrDeveloper and others added 3 commits March 24, 2025 11:15
Sub-issue of llvm#1192. Adds
CIR_ATanOp and support for __builtin_elementwise_atan.
Part of llvm#258 .

1. Added `AddressPointAttr`
2. Change all occurrences of `VTableAddrPointOp` into using the
attribute
3. Update tests

---------

Co-authored-by: Sirui Mu <[email protected]>
Co-authored-by: Morris Hafner <[email protected]>
Co-authored-by: Morris Hafner <[email protected]>
Co-authored-by: Sharp-Edged <[email protected]>
Co-authored-by: Amr Hesham <[email protected]>
Co-authored-by: Bruno Cardoso Lopes <[email protected]>
Co-authored-by: Letu Ren <[email protected]>
AdUhTkJm and others added 11 commits March 25, 2025 11:06
Clang relies on `llvm::Intrinsic::getOrInsertDeclaration` to handle
functions marked as `ClangBuiltin` in TableGen. That function receives a
`CodeGenModule*`, so CIR can't use it directly. We need to re-implement parts
of it.
Closes llvm#1367

---------

Co-authored-by: Sirui Mu <[email protected]>
Co-authored-by: Amr Hesham <[email protected]>
Co-authored-by: Chibuoyim (Wilson) Ogbonna <[email protected]>
Co-authored-by: Yue Huang <[email protected]>
Co-authored-by: Morris Hafner <[email protected]>
Co-authored-by: Morris Hafner <[email protected]>
Co-authored-by: Sharp-Edged <[email protected]>
Co-authored-by: Bruno Cardoso Lopes <[email protected]>
Co-authored-by: Letu Ren <[email protected]>
…#1521)

- Part of llvm#70

This is not very useful now, as both branches of interest are NYI. I
will try to implement the real features in subsequent patches.
…ity and maintainability (llvm#1525)

As noted in [this
comment](llvm#1442 (comment)),
the nested if-arms in `GlobalOpLowering` are somewhat confusing and
error-prone. This PR simplifies the logic into more straightforward
components.

Since LLVM's GlobalOp accepts two types of initializers (either an
initializer value or an initializer region), we've extracted the
decision logic into a separate function called `lowerInitializer`. This
function takes two inout arguments: `mlir::Attribute &init` (for the
attribute value) and `bool useInitializerRegion` (as the decision
indicator). All code paths then converge at a common epilogue that
handles the operation rewriting.

The previous implementation for lowering `DataMemberAttr` initializers
relied on recursion between MLIR rewrite calls, which made the control
flow somewhat opaque. The new version makes this explicit by using a
clear self-recursive pattern within `lowerInitializer`.
Fixes llvm#1371.

I'm not sure whether we could remove `convertTypeForMem` completely.
Let's fix the doc first.
This change moves all declarations of emit* functions in
CIRGenFunction.h into a common location and sorts them alphabetically.
The goal of this change is to make it easier to keep upstream and
incubator code in a consistent location, making functions easier to find
for upstreaming and minimizing conflicts in the incubator when rebasing.

I did most of this sort manually, and I've probably been inconsistent in
how I treat sorting of uppercase versus lowercase. I made no attempt to
provide a rule for ordering different declarations of functions with the
same name. We can improve on that over time if anyone feels the need.

I tried very hard not to drop comments (one of the reasons I had to do
this manually), but I may have lost a few.

This change loses the grouping of some declarations that were co-located
by common purpose, but most of the declarations lacked a coherent
ordering, so I think this is a step forward overall.
This is a rebased version of the inactive PR llvm#1380.

---------

Co-authored-by: koparasy <[email protected]>
Adds implementation for ATanOp's lowering ThroughMLIR.
@AdUhTkJm force-pushed the addrspace branch 2 times, most recently from d6f21e5 to 4888f55, on March 27, 2025 05:16
ayokunle321 and others added 2 commits March 28, 2025 11:34
Adds implementation for ACosOp's lowering ThroughMLIR.
```diff
@@ -470,7 +470,7 @@ CIRGenModule::getOrCreateStaticVarDecl(const VarDecl &D,
   Name = getStaticDeclName(*this, D);

   mlir::Type LTy = getTypes().convertTypeForMem(Ty);
-  cir::AddressSpaceAttr AS =
+  cir::AddressSpaceAttr actualAS =
```
Member:

Add a comment on what you mean by "actual"

```cpp
auto definingOp = addr.getDefiningOp();
bool hasCast = isa<cir::CastOp>(definingOp);
auto getAddrOp = mlir::cast<cir::GetGlobalOp>(
    hasCast ? definingOp->getOperand(0).getDefiningOp() : definingOp);
```
Member:

I'm missing the point here: why do you need to unwrap a cast that was wrapped by createGetGlobal? If you need to unwrap it, why do you need to generate it in the first place?

Contributor (Author):

We need both the cast and the get_global op for the examination below. We will store addr in the LocalDeclMap, with potential bitcasts. Previously a bitcast was emitted whenever the global variable type and the initializer type didn't match (for example `int*` and `int[]`), but they can also mismatch because of different address spaces over the same underlying type (say `int*` in addrspace(1) and `int*`). So I think we must emit a cast to check which case we're in.

```cpp
auto oldTy = mlir::cast<cir::PointerType>(getGlobal.getType());
auto newTy =
    cir::PointerType::get(oldTy.getPointee(), /*addrspace=*/gpuAS);
auto cast = createAddrSpaceCast(loc, getGlobal, newTy);
```
Member:

Instead of emitting all these casts, have you considered changing the address space of the global for good instead of playing with two different ones? Are there uses that actually try to retrieve two different address spaces out of the same global?

Contributor (Author):

If we change the address space of the global, then it will be placed in the wrong section. For `__device__` (AS 1) as an example, OG does:

```llvm
; Global
@var: ptr addrspace(1)
; Use of the global
... (addrspacecast ptr addrspace(1) @var to ptr)
```
bcardosolopes (Member):

Sorry for the delay here, I still have a few things I want to double-check before I consider this ready; I should put more time on it at the beginning of next week. Thanks for your patience.
