Type

The foundation of the LLVM type system is the Type class.

Every kind of type is identified by the following enumeration:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Type.h#L54

enum TypeID {
    // PrimitiveTypes
    HalfTyID = 0,  ///< 16-bit floating point type
    BFloatTyID,    ///< 16-bit floating point type (7-bit significand)
    FloatTyID,     ///< 32-bit floating point type
    DoubleTyID,    ///< 64-bit floating point type
    X86_FP80TyID,  ///< 80-bit floating point type (X87)
    FP128TyID,     ///< 128-bit floating point type (112-bit significand)
    PPC_FP128TyID, ///< 128-bit floating point type (two 64-bits, PowerPC)
    VoidTyID,      ///< type with no size
    LabelTyID,     ///< Labels
    MetadataTyID,  ///< Metadata
    X86_MMXTyID,   ///< MMX vectors (64 bits, X86 specific)
    X86_AMXTyID,   ///< AMX vectors (8192 bits, X86 specific)
    TokenTyID,     ///< Tokens

    // Derived types... see DerivedTypes.h file.
    IntegerTyID,       ///< Arbitrary bit width integers
    FunctionTyID,      ///< Functions
    PointerTyID,       ///< Pointers
    StructTyID,        ///< Structures
    ArrayTyID,         ///< Arrays
    FixedVectorTyID,   ///< Fixed width SIMD vector type
    ScalableVectorTyID ///< Scalable SIMD vector type
};

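Here:

primitive types are types that have no dedicated subclass of Type
derived types are types that are represented by a subclass of Type (see DerivedTypes.h)

Structurally equivalent types are uniqued: within one LLVMContext there is exactly one object instance per type (identified struct types, described later, are the exception).

The inheritance hierarchy of the Type class is shown in the figure below.

The LLVMContext class holds a top-level const pointer to an LLVMContextImpl.

This is the classic PImpl design.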

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/LLVMContext.h#L69

LLVMContextImpl *const pImpl;

LLVMContextImpl holds the singletons for the primitive types above and for the common integer types, and initializes them in its constructor.

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.cpp#L40

LLVMContextImpl::LLVMContextImpl(LLVMContext &C)
    : DiagHandler(std::make_unique<DiagnosticHandler>()),
      VoidTy(C, Type::VoidTyID), LabelTy(C, Type::LabelTyID),
      HalfTy(C, Type::HalfTyID), BFloatTy(C, Type::BFloatTyID),
      FloatTy(C, Type::FloatTyID), DoubleTy(C, Type::DoubleTyID),
      MetadataTy(C, Type::MetadataTyID), TokenTy(C, Type::TokenTyID),
      X86_FP80Ty(C, Type::X86_FP80TyID), FP128Ty(C, Type::FP128TyID),
      PPC_FP128Ty(C, Type::PPC_FP128TyID), X86_MMXTy(C, Type::X86_MMXTyID),
      X86_AMXTy(C, Type::X86_AMXTyID), Int1Ty(C, 1), Int8Ty(C, 8),
      Int16Ty(C, 16), Int32Ty(C, 32), Int64Ty(C, 64), Int128Ty(C, 128) {
  if (OpaquePointersCL.getNumOccurrences()) {
    OpaquePointers = OpaquePointersCL;
  }
}

The Type class also provides corresponding static methods for retrieving these singletons.

Floating Point Types

primitive type

Type	Description
`half`	16-bit floating-point value
`bfloat`	16-bit “brain” floating-point value (7-bit significand). Provides the same number of exponent bits as `float`, so that it matches its dynamic range, but with greatly reduced precision. Used in Intel’s AVX-512 BF16 extensions and Arm’s ARMv8.6-A extensions, among others.
`float`	32-bit floating-point value
`double`	64-bit floating-point value
`fp128`	128-bit floating-point value (113-bit significand)
`x86_fp80`	80-bit floating-point value (X87)
`ppc_fp128`	128-bit floating-point value (two 64-bits)

In practice, float and double are used most often.

Void Type

primitive type

The void type singleton can be obtained as follows:

llvm::Type *type = llvm::Type::getVoidTy(TheContext);

The void type represents no value and has no size. It only serves as a placeholder, for example as the return type of a function:

define dso_local void @foo() #0 {
  ret void
}

Label Type

primitive type

It is used to label basic blocks. For example, a max function may correspond to the following LLVM IR:

define dso_local i32 @max(i32 noundef %0, i32 noundef %1) #0 {
  %3 = alloca i32, align 4
  %4 = alloca i32, align 4
  store i32 %0, i32* %3, align 4
  store i32 %1, i32* %4, align 4
  %5 = load i32, i32* %3, align 4
  %6 = load i32, i32* %4, align 4
  %7 = icmp sgt i32 %5, %6
  br i1 %7, label %8, label %10

8:                                                ; preds = %2
  %9 = load i32, i32* %3, align 4
  br label %12

10:                                               ; preds = %2
  %11 = load i32, i32* %4, align 4
  br label %12

12:                                               ; preds = %10, %8
  %13 = phi i32 [ %9, %8 ], [ %11, %10 ]
  ret i32 %13
}

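Note the implicit number %2 here: the unnamed entry block takes the next number after the two arguments %0 and %1, which is why the predecessors are listed as %2.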

Token Type

primitive type

The token type is used when a value is associated with an instruction but all uses of the value must not attempt to introspect or obscure it. As such, it is not appropriate to have a phi or select of type token.

The identifier ‘none’ is recognized as an empty token constant and must be of token type.

This type is not covered further here.

Metadata Type

primitive type

The metadata type represents embedded metadata. No derived types may be created from metadata except for function arguments.

LLVM IR allows metadata to be attached to instructions and global objects in the program that can convey extra information about the code to the optimizers and code generator. One example application of metadata is source-level debug information. There are two metadata primitives: strings and nodes.

Metadata does not have a type, and is not a value. If referenced from a call instruction, it uses the metadata type.

All metadata are identified in syntax by an exclamation point (‘!’).

For example:

!llvm.module.flags = !{!0, !1, !2, !3, !4}
!llvm.ident = !{!5}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 7, !"PIC Level", i32 2}
!2 = !{i32 7, !"PIE Level", i32 2}
!3 = !{i32 7, !"uwtable", i32 1}
!4 = !{i32 7, !"frame-pointer", i32 2}
!5 = !{!"clang version 14.0.6"}

Integer Type

The syntax is iN, where N is the bit width of the integer.

The i32 type singleton can be obtained as follows:

llvm::Type *type = llvm::Type::getInt32Ty(TheContext);

When the i32 type is constructed, its bit width is stored in the SubclassData field of the Type class:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Type.h#L86

  TypeID   ID : 8;            // The current base type of this type.
  unsigned SubclassData : 24; // Space for subclasses to store data.
                              // Note that this should be synchronized with
                              // MAX_INT_BITS value in IntegerType class.

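SubclassData is 24 bits wide, and IntegerType additionally caps the width at MAX_INT_BITS, which is $2^{23}$, so the bit width of an integer type lies in $[1, 2^{23}]$.

In other words, the largest unsigned value that an LLVM integer type can represent is $2^{2^{23}}-1=2^{8388608}-1$.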
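The relationship between the 24-bit field and the width limit can be checked with a small standalone sketch. It does not use LLVM: TypeBits, kMaxIntBits, kMaxSubclassData and roundTrips are hypothetical names that only mirror the bit-field layout quoted above and the assumed MAX_INT_BITS value.

```cpp
#include <cassert>

// Hypothetical stand-in for the two bit-fields of llvm::Type quoted above.
struct TypeBits {
  unsigned ID : 8;
  unsigned SubclassData : 24;
};

// Assumed to mirror IntegerType::MAX_INT_BITS (2^23) at the quoted commit.
constexpr unsigned kMaxIntBits = 1u << 23;

// Largest value that a 24-bit field can hold.
constexpr unsigned kMaxSubclassData = (1u << 24) - 1;

// Returns true if a bit width survives a round trip through the 24-bit field.
inline bool roundTrips(unsigned bitWidth) {
  TypeBits t{};
  t.SubclassData = bitWidth;  // values above 24 bits are truncated here
  return t.SubclassData == bitWidth;
}
```

The largest allowed width fits comfortably, while a width of 2^24 would already be truncated by the field.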

Note that integer types carry no signedness information.

LLVMContextImpl caches integer types of other widths in the following data structure:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1524

DenseMap<unsigned, IntegerType *> IntegerTypes;
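The uniquing idea behind such a cache can be sketched without LLVM. The sketch below is a toy model, not LLVM's implementation: ContextSketch, IntTypeSketch and getIntType are hypothetical names, and std::unordered_map stands in for DenseMap.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

// Toy stand-ins; these are not the LLVM classes.
struct IntTypeSketch {
  unsigned bitWidth;
};

struct ContextSketch {
  // Plays the role of LLVMContextImpl::IntegerTypes.
  std::unordered_map<unsigned, std::unique_ptr<IntTypeSketch>> integerTypes;
};

// Returns the unique type object for the given width, creating it on first use.
inline IntTypeSketch *getIntType(ContextSketch &ctx, unsigned bitWidth) {
  auto &slot = ctx.integerTypes[bitWidth];
  if (!slot)
    slot = std::make_unique<IntTypeSketch>(IntTypeSketch{bitWidth});
  return slot.get();
}
```

Because every request for the same width in the same context returns the same pointer, type equality reduces to pointer comparison. Two different contexts own different objects.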

Pointer Type

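A pointer type is typically used to reference an object at a given memory location.

A pointer type can also specify the number of the address space in which the pointee resides; the default is 0.

The address space (AddrSpace) is likewise stored in SubclassData.

The i32* type singleton can be obtained as follows: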

llvm::Type *type = llvm::Type::getInt32PtrTy(TheContext, 0);

This method is a wrapper around PointerType::get:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Type.cpp#L301
PointerType *Type::getInt32PtrTy(LLVMContext &C, unsigned AS) {
    return getInt32Ty(C)->getPointerTo(AS);
}
where

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Type.cpp#L776
PointerType *Type::getPointerTo(unsigned AddrSpace) const {
    return PointerType::get(const_cast<Type*>(this), AddrSpace);
}

LLVMContextImpl caches all pointer types in the following data structures:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1538

DenseMap<Type *, PointerType *> PointerTypes; // Pointers in AddrSpace = 0
DenseMap<std::pair<Type *, unsigned>, PointerType *> ASPointerTypes;

Note that a pointer type here carries the type of its pointee.

The pointee type is stored in the ContainedTys member of the Type class:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Type.h#L106

/// Keeps track of how many Type*'s there are in the ContainedTys list.
unsigned NumContainedTys = 0;

/// A pointer to the array of Types contained by this Type. For example, this
/// includes the arguments of a function type, the elements of a structure,
/// the pointee of a pointer, the element type of an array, etc. This pointer
/// may be 0 for types that don't contain other types (Integer, Double,
/// Float).
Type * const *ContainedTys = nullptr;

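The community's discussion of these explicit pointee types can be summarized as follows:

Historically, LLVM IR was something like a type-safe subset of C. Pointer types provided an extra layer of checking and made type checking in frontends more convenient.
As LLVM evolved, it became clear that pointee types do not help optimization much.
Many operations do not care about the pointee type at all and end up using an arbitrary pointer type such as i8*. The resulting pointer casts (bitcast) are no-ops at run time, but they bloat the IR and cost compile time.

Note that LLVM has no void*, as the following code shows: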

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Type.cpp#L590
bool PointerType::isValidElementType(Type *ElemTy) {
return !ElemTy->isVoidTy() && !ElemTy->isLabelTy() &&
    !ElemTy->isMetadataTy() && !ElemTy->isTokenTy() &&
    !ElemTy->isX86_AMXTy();
}

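The consensus eventually reached by the community is that the cost of explicit pointee types outweighs their benefit, so they should be removed.

LLVM therefore introduced the opaque pointer type, a pointer type that carries no information about its pointee.

For example, consider the following LLVM IR: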

load i64, i64* %p

Its opaque pointer counterpart is:

load i64, ptr %p

In the C API, the function that builds this instruction changed from LLVMBuildLoad to LLVMBuildLoad2.

Array Type

An array type has two properties:

number of elements
- The number of elements may be 0, which makes it possible to model a flexible array member
underlying data type

Some examples:

Syntax	Semantics
`[40 x i32]`	Array of 40 32-bit integer values.
`[3 x [4 x i32]]`	3x4 array of 32-bit integer values.
`[2 x [3 x [4 x i16]]]`	2x3x4 array of 16-bit integer values.

The [40 x i32] type singleton can be obtained as follows:

llvm::Type *type = llvm::ArrayType::get(llvm::Type::getInt32Ty(TheContext), 40);

As with pointer types, the underlying data type of an array type is stored in the ContainedTys member of the Type class.

LLVMContextImpl caches all array types in the following data structure:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1536

DenseMap<std::pair<Type *, uint64_t>, ArrayType *> ArrayTypes;

Vector Type

A vector type is similar to an array type but is intended for SIMD. Vectors are not considered aggregate types; they are single value types and, like aggregates, belong to the first class types:

Values of these types are the only ones which can be produced by instructions.

A vector type has three properties:

number of elements
- The number of elements must not be 0
underlying primitive data type
- Only integer, floating-point or pointer types are allowed
scalable property
- If false, the type is a FixedVectorType; otherwise it is a ScalableVectorType

Some examples:

Syntax	Semantics
`<4 x i32>`	Vector of 4 32-bit integer values.
`<vscale x 4 x i32>`	Vector with a multiple of 4 32-bit integer values.

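For a ScalableVectorType, vscale is a positive integer that is unknown at compile time; at run time it is a hardware-dependent constant.

The <vscale x 4 x i32> type singleton can be obtained as follows: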

llvm::Type *type = llvm::VectorType::get(llvm::Type::getInt32Ty(TheContext), 4, true);

LLVMContextImpl caches all vector types in the following data structure:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1537

DenseMap<std::pair<Type *, ElementCount>, VectorType *> VectorTypes;

Note the ElementCount class here. It is constructed in:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/DerivedTypes.h#L427

static VectorType *get(Type *ElementType, unsigned NumElements, bool Scalable) {
    return VectorType::get(ElementType, ElementCount::get(NumElements, Scalable));
}

This calls the following method of its base class LinearPolySize:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/DerivedTypes.h#L427

static LeafTy get(ScalarTy MinVal, bool Scalable) {
    return static_cast<LeafTy>(LinearPolySize(MinVal, Scalable ? 1 : 0));
}

The source contains the following comment:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/DerivedTypes.h#L427

/// UnivariateLinearPolyBase is a base class for ElementCount and TypeSize.
/// Like LinearPolyBase it tries to represent a linear polynomial
/// where only one dimension can be set at any time, e.g.
///   0 * scale0 + 0 * scale1 + ... + cJ * scaleJ + ... + 0 * scaleK
/// The dimension that is set is the univariate dimension.

Roughly speaking, if the scalable property is true, the corresponding dimension may be scaled differently on different hardware.
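The effect of the scalable property can be illustrated with a toy model. ElementCountSketch and elementsFor are hypothetical names; this is not llvm::ElementCount, only a sketch of the idea that a scalable count is a known minimum multiplied by vscale.

```cpp
#include <cassert>

// Toy model of a possibly scalable element count; not llvm::ElementCount.
struct ElementCountSketch {
  unsigned minVal;  // known minimum number of elements
  bool scalable;    // true: the real count is minVal * vscale

  static ElementCountSketch get(unsigned minVal, bool scalable) {
    return ElementCountSketch{minVal, scalable};
  }

  // Number of elements once the hardware-defined vscale is known.
  unsigned elementsFor(unsigned vscale) const {
    return scalable ? minVal * vscale : minVal;
  }
};
```

Under this model, <4 x i32> always has 4 elements, while <vscale x 4 x i32> has 4, 8, 16, ... elements depending on the value of vscale on the target hardware.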

In practice, the vector types that LLVM generates for a given hardware target are usually FixedVectorType.

For example, the following code uses AVX2 intrinsics to compute abs on a vector of 8 float values:

#include <immintrin.h>
__m256 _mm256_abs_ps(__m256 vec) {
  __m256 float_zero = _mm256_set1_ps(0);
  __m256 mask_lt_zero = _mm256_cmp_ps(vec, float_zero, _CMP_LT_OQ);
  __m256 vec_neg = _mm256_sub_ps(float_zero, vec);
  return _mm256_blendv_ps(vec, vec_neg, mask_lt_zero);
}

The IR generated by clang -S -emit-llvm a.cpp -O3 -march=native is:

define dso_local noundef <8 x float> @_Z13_mm256_abs_psDv8_f(<8 x float> noundef %0) local_unnamed_addr #0 {
  %2 = fcmp olt <8 x float> %0, zeroinitializer
  %3 = fsub <8 x float> zeroinitializer, %0
  %4 = select <8 x i1> %2, <8 x float> %3, <8 x float> %0
  ret <8 x float> %4
}

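Note that %0, %3 and %4 all have type <8 x float>, while the comparison result %2 has type <8 x i1>. All of them are vector values produced or consumed directly by instructions, which also shows that vector types are first class types.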

Structure Type

There are two kinds of structure types.

literal struct type

Anonymous, guaranteed to be unique within a context, and must have a body.

LLVMContextImpl caches all literal struct types in the following data structure:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1528

using StructTypeSet = DenseSet<StructType *, AnonStructTypeKeyInfo>;
StructTypeSet AnonStructTypes;

The key type of AnonStructTypeKeyInfo (its nested KeyTy) has the following members:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L94

ArrayRef<Type *> ETypes;
bool isPacked;

The { i32, i32, i32 } type singleton can be obtained as follows:

llvm::Type *i32 = llvm::Type::getInt32Ty(TheContext);
std::array<llvm::Type *, 3> elems = {i32, i32, i32};
llvm::Type *type = llvm::StructType::get(TheContext, elems, false);

The ArrayRef class provides many converting constructors, so an ArrayRef can be built from a pointer and a length, a std::vector, a std::array, a C array and more.

identified struct type

May be anonymous, is not uniqued, and may lack a body (opaque).

Prior to the LLVM 3.0 release, identified types were structurally uniqued. Only literal types are uniqued in recent versions of LLVM.

LLVMContextImpl records all named identified struct types in the following data structure:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1530

StringMap<StructType *> NamedStructTypes;
unsigned NamedStructTypesUniqueID = 0;

A %struct.A = type { i32, i32, i32 } type can be constructed as follows:

llvm::Type *i32 = llvm::Type::getInt32Ty(TheContext);
std::array<llvm::Type *, 3> elems = {i32, i32, i32};
llvm::Type *type = llvm::StructType::create(TheContext, elems, "struct.A", false);

A structure type in fact defines the following flags, which are stored in SubclassData:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/DerivedTypes.h#L216

enum {
    /// This is the contents of the SubClassData field.
    SCDB_HasBody = 1,
    SCDB_Packed = 2,
    SCDB_IsLiteral = 4,
    SCDB_IsSized = 8
};
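How these flags combine can be shown with a standalone sketch. The enum values are copied from the excerpt above; StructBits, hasBit, kOpaqueStruct and kPackedLiteral are hypothetical names introduced only for illustration, and the sketch leaves SCDB_IsSized unset.

```cpp
#include <cassert>

// Same bit values as the enum quoted above, redeclared for a standalone sketch.
enum StructBits : unsigned {
  SCDB_HasBody = 1,
  SCDB_Packed = 2,
  SCDB_IsLiteral = 4,
  SCDB_IsSized = 8
};

// Tests whether a given flag is set in a SubclassData-like word.
inline bool hasBit(unsigned subclassData, StructBits bit) {
  return (subclassData & bit) != 0;
}

// An opaque identified struct such as "%struct.A = type opaque": no bits set.
constexpr unsigned kOpaqueStruct = 0;

// A packed literal struct such as "<{ i32, i16, i8 }>" with a body.
constexpr unsigned kPackedLiteral = SCDB_HasBody | SCDB_Packed | SCDB_IsLiteral;
```

Each property occupies its own bit, so the flags can be set and queried independently with bitwise operations.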

Here are a few examples.

Example 1

struct A;
struct B {
  A* a;
};

The generated LLVM IR may be:

%struct.B = type { %struct.A* }
%struct.A = type opaque

Here struct A has no body, so it is an opaque structure type.

Opaque structure types therefore exist to support forward declarations.

For %struct.A, the bits for SCDB_HasBody and SCDB_IsSized are 0.

For the implementation of isSized, see https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Type.cpp#L554

Example 2

struct __attribute__((packed)) A {
    int i;
    short s;
    char c;
};

The generated LLVM IR may be:

%struct.A = type <{ i32, i16, i8 }>

Note the additional < and > here.

For %struct.A, the bit for SCDB_Packed is 1.

Example 3

struct A {
  struct {
    int i;
    int j;
    int k;
  } x;
  struct {
    int i;
    int j;
    int k;
  } y;
};

The generated LLVM IR may be:

%struct.A = type { %struct.anon, %struct.anon.0 }
%struct.anon = type { i32, i32, i32 }
%struct.anon.0 = type { i32, i32, i32 }

Note that the anonymous structs are still identified struct types here; LLVM handles missing and duplicate names automatically.
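The suffixes seen in the IR above (struct.anon, struct.anon.0) are consistent with a name table that appends a counter when a name is already taken. The following is a toy sketch under that assumption, not LLVM's code: NameTableSketch and registerName are hypothetical names, and std::set stands in for the StringMap.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy symbol table; stands in for NamedStructTypes plus its unique-ID counter.
struct NameTableSketch {
  std::set<std::string> names;
  unsigned uniqueID = 0;

  // Returns the name actually registered: the requested one if it is free,
  // otherwise the requested name with a ".N" suffix.
  std::string registerName(const std::string &requested) {
    std::string candidate = requested;
    while (!names.insert(candidate).second)
      candidate = requested + "." + std::to_string(uniqueID++);
    return candidate;
  }
};
```

With this scheme, two structurally identical structs that request the same name still end up as two distinct, differently named types.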

Function Type

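A function type is a function signature: it consists of the return type and the list of parameter types.

As with literal struct types, LLVMContextImpl caches all function types in the following data structure: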

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1526

using FunctionTypeSet = DenseSet<FunctionType *, FunctionTypeKeyInfo>;
FunctionTypeSet FunctionTypes;

The key type of FunctionTypeKeyInfo (its nested KeyTy) has the following members:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L142

const Type *ReturnType;
ArrayRef<Type *> Params;
bool isVarArg;

The i32 (i32) type singleton can be obtained as follows:

llvm::Type *i32 = llvm::Type::getInt32Ty(TheContext);
std::array<llvm::Type *, 1> args = {i32};
llvm::Type *type = llvm::FunctionType::get(i32, args, false);

Similarly:

isVarArg is stored in SubclassData

ReturnType and Params are stored in ContainedTys

No llvm::LLVMContext argument is passed explicitly here; the context used is the one the return type belongs to:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Type.cpp#L345

Finally, the isVarArg field indicates whether the function takes a variable number of arguments.

For example:

#include <stdio.h>
int main() { printf("hello world\n"); }

The generated LLVM IR may be:

@.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00", align 1

define dso_local i32 @main() #0 {
  %1 = call i32 (i8*, ...) @printf(i8* noundef getelementptr inbounds ([13 x i8], [13 x i8]* @.str, i64 0, i64 0))
  ret i32 0
}

Note the function signature i32 (i8*, ...) here.

Value

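The Value class is one of the most important classes in LLVM and is the base class of many core classes.

Part of the inheritance hierarchy of the Value class is shown below: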

flowchart LR
  Argument --> Value
  BasicBlock --> Value
  User --> Value
  Constant --> User
  Instruction --> User
  Operator --> User

Every Value object holds a pointer to its Type and a use list that records the users of that value:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Value.h#L74

class Value {
  Type *VTy;
  Use *UseList;
  ...

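The Value class implements the iterator pattern for its uses; the users of a value can be visited as follows:

llvm::Value *value = ...
for (auto it = value->use_begin(); it != value->use_end(); ++it) {
    // Use::get() would return the used value itself; getUser() returns the user.
    llvm::User *user = it->getUser();
    ...
}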

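When LLVM IR is transformed, a value may be replaced by another value. For example, if the result of an instruction is always a constant, the instruction can be replaced by that constant, and every user that references the old value has to be updated as well.

The following interface does this: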

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Value.h#L297

/// Change all uses of this to point to a new Value.
///
/// Go through the uses list for this definition and make each use point to
/// "V" instead of "this".  After this completes, 'this's use list is
/// guaranteed to be empty.
void replaceAllUsesWith(Value *V);
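The bookkeeping described in that comment can be modelled in a few lines. This is a toy sketch, not LLVM's intrusive use list: ValueSketch, UseSketch and setOperand are hypothetical names, a std::vector replaces the linked list, and setOperand assumes the operand slot is still empty.

```cpp
#include <cassert>
#include <vector>

struct ValueSketch;

// One operand slot of a user; it records which value is used.
struct UseSketch {
  ValueSketch *used = nullptr;
};

// Toy value: keeps pointers to every operand slot that refers to it.
struct ValueSketch {
  std::vector<UseSketch *> uses;

  // Redirect every use of this value to newValue, leaving our list empty.
  void replaceAllUsesWith(ValueSketch &newValue) {
    assert(&newValue != this && "cannot replace a value with itself");
    for (UseSketch *u : uses) {
      u->used = &newValue;
      newValue.uses.push_back(u);
    }
    uses.clear();
  }
};

// Point an empty operand slot at a value and register it in the use list.
inline void setOperand(UseSketch &slot, ValueSketch &v) {
  slot.used = &v;
  v.uses.push_back(&slot);
}
```

After the call, every operand slot that referred to the old value refers to the new one, and the old value has no uses left, which matches the guarantee stated in the comment.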

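The implementation of replaceAllUsesWith also notifies value handles through the ValueHandleBase class.

A value handle can be seen as a smart pointer to a value that triggers a specific action when the value is deleted or replaced through replaceAllUsesWith (RAUW).

Commonly used subclasses of ValueHandleBase include:

WeakVH: set to null when the referenced value is deleted or RAUW'd
WeakTrackingVH: set to null when the referenced value is deleted, and follows the new value when it is RAUW'd
CallbackVH: calls user-defined callbacks when the referenced value is deleted or RAUW'd

A Value object may have a name; the HasName field of the Value class records whether it does.

LLVMContextImpl stores all value names in the following data structure: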

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1447

DenseMap<const Value *, ValueName *> ValueNames;

where

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Value.h#L55

using ValueName = StringMapEntry<Value *>;

User & Use

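The User class derives from the Value class, because a user is itself a value that can be used by other users.

More specifically:

A value can be used by multiple users; this is the def-use chain.

An example was given above.

A user can use multiple values; this is the use-def chain.

For example, the operands of an instruction can be visited as follows: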

llvm::Instruction *ins = ...
for (auto it = ins->op_begin(); it != ins->op_end(); ++it) {
    llvm::Value *value = it->get();
    ...
}

The core job of the Use class is therefore to link values and users efficiently in both directions.

The code details are omitted here.

Constant

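The Constant class derives from the User class.

Constant is the base class of all constants; the value of a constant does not change at run time.

Functions and global variables are constants in the sense that their addresses do not change.

Structurally equivalent constants are uniqued: within one LLVMContext there is exactly one object instance per constant (global values are not uniqued this way).

Part of the inheritance hierarchy of the Constant class is shown below: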

flowchart LR
  BlockAddress --> Constant
  ConstantAggregate --> Constant
  ConstantArray --> ConstantAggregate
  ConstantStruct --> ConstantAggregate
  ConstantVector --> ConstantAggregate
  ConstantData --> Constant
  ConstantFP --> ConstantData
  ConstantInt --> ConstantData
  ConstantAggregateZero --> ConstantData
  ConstantPointerNull --> ConstantData
  ConstantDataSequential --> ConstantData
  ConstantDataArray --> ConstantDataSequential
  ConstantDataVector --> ConstantDataSequential
  ConstantExpr --> Constant
  GlobalValue --> Constant
  GlobalObject --> GlobalValue
  Function --> GlobalObject
  GlobalVariable --> GlobalObject

ConstantData

ConstantInt

Represents an integer constant of arbitrary bit width.

LLVMContextImpl caches all integer constants in the following data structure:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1449

using IntMapTy = DenseMap<APInt, std::unique_ptr<ConstantInt>, DenseMapAPIntKeyInfo>;
IntMapTy IntConstants;

The i32 100 constant singleton can be obtained as follows:

llvm::Value *value = llvm::ConstantInt::get(TheContext, llvm::APInt(32, 100, false /* isSigned */));

The isSigned argument tells APInt whether to treat the given value as signed.

An analogous transition that happened earlier in LLVM is integer signedness. Currently there is no distinction between signed and unsigned integer types, but rather each integer operation (e.g. add) contains flags to signal how to treat the integer. Previously LLVM IR distinguished between unsigned and signed integer types and ran into similar issues of no-op casts. The transition from manifesting signedness in types to instructions happened early on in LLVM’s timeline to make LLVM easier to work with.

Note the helper class APInt, which stores the raw data in a uint64_t or a uint64_t *:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/ADT/APInt.h#L1868

union {
    uint64_t VAL;   ///< Used to store the <= 64 bits integer value.
    uint64_t *pVal; ///< Used to store the >64 bits integer value.
} U;
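The choice between the two union members depends only on the bit width. A minimal sketch of that arithmetic, assuming 64-bit words as in the excerpt above (kBitsPerWord, numWords and isSingleWord are names chosen for this sketch, not a claim about APInt's exact interface):

```cpp
#include <cassert>

constexpr unsigned kBitsPerWord = 64;

// Number of 64-bit words needed to hold an integer of the given bit width.
constexpr unsigned numWords(unsigned bitWidth) {
  return (bitWidth + kBitsPerWord - 1) / kBitsPerWord;
}

// True if the value fits in the inline VAL member rather than the pVal array.
constexpr bool isSingleWord(unsigned bitWidth) {
  return bitWidth <= kBitsPerWord;
}
```

Common widths such as i1, i32 and i64 need no heap allocation, while the widest possible integer type (2^23 bits) would need 131072 words.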

LLVMContextImpl additionally keeps the singletons for the i1 boolean constants:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1510

ConstantInt *TheTrueVal = nullptr;
ConstantInt *TheFalseVal = nullptr;

They can be obtained as follows:

llvm::Value *value = llvm::ConstantInt::getTrue(TheContext);

ConstantFP

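Represents a floating-point constant of any of the floating-point types.

As with ConstantInt, LLVMContextImpl caches all floating-point constants in the following data structure: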

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1453

using FPMapTy = DenseMap<APFloat, std::unique_ptr<ConstantFP>, DenseMapAPFloatKeyInfo>;
FPMapTy FPConstants;

The float 1.1 constant singleton can be obtained as follows:

llvm::Value *value = llvm::ConstantFP::get(TheContext, llvm::APFloat(static_cast<float>(1.1)));

Floating-point values follow the IEEE standard; the implementation is encapsulated in classes such as APFloat. For example:

float foo() { return 1.1; }

The generated LLVM IR is:

define dso_local noundef float @_Z3foov() #0 {
  ret float 0x3FF19999A0000000
}

The floating-point constant is written in hexadecimal, as the bit pattern of the value extended to double.
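The hexadecimal form can be reproduced without LLVM by widening the float to double and printing the raw 64 bits. This sketch assumes an IEEE-754 binary64 double; llvmHexForFloat is a hypothetical helper name, not an LLVM function.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Widens a float to double and formats the double's bit pattern as hex.
// Assumes double is IEEE-754 binary64.
inline std::string llvmHexForFloat(float f) {
  double d = static_cast<double>(f);
  std::uint64_t bits = 0;
  std::memcpy(&bits, &d, sizeof bits);
  char buf[19];  // "0x" + 16 hex digits + terminating NUL
  std::snprintf(buf, sizeof buf, "0x%016llX",
                static_cast<unsigned long long>(bits));
  return buf;
}
```

For 1.1f this yields 0x3FF19999A0000000, the constant shown in the IR above. The trailing zeros reflect that the value only carries float precision.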

ConstantAggregateZero

Represents an aggregate zero constant, typically used for all-zero initialization.

For example:

const int arr[42] = {0};

The generated LLVM IR is:

@_ZL3arr = internal constant [42 x i32] zeroinitializer, align 16

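Here zeroinitializer is the ConstantAggregateZero of type [42 x i32]. ConstantAggregateZero only applies to struct, array and vector types, so the array type is passed rather than i32:

llvm::Value *value = llvm::ConstantAggregateZero::get(llvm::ArrayType::get(llvm::Type::getInt32Ty(TheContext), 42));

LLVMContextImpl caches all aggregate zero constants in the following data structure: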

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1478

DenseMap<Type *, std::unique_ptr<ConstantAggregateZero>> CAZConstants;

ConstantPointerNull

Represents a null pointer.

For example:

void *foo() { return nullptr; }

The generated LLVM IR is:

define dso_local noundef i8* @_Z3foov() #0 {
  ret i8* null
}

Here null is the ConstantPointerNull of type i8*:

llvm::Value *value = llvm::ConstantPointerNull::get(llvm::Type::getInt8PtrTy(TheContext));

LLVMContextImpl caches all null pointer constants in the following data structure:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1489

DenseMap<PointerType *, std::unique_ptr<ConstantPointerNull>> CPNConstants;

ConstantDataArray

Represents a constant array.

The underlying data type is restricted to simple 1/2/4/8-byte integers or float/double.

For example:

const int arr[] = { 0, 1, 2 };

The generated LLVM IR is:

@_ZL3arr = internal constant [3 x i32] [i32 0, i32 1, i32 2], align 4

It can be obtained as follows:

std::array<int, 3> elems = {0, 1, 2};
llvm::Value *value = llvm::ConstantDataArray::get(TheContext, elems);

ConstantDataVector

Represents a constant vector.

The underlying data type is restricted to simple 1/2/4/8-byte integers or float/double.

For example:

#include <immintrin.h>
__m256 foo() { return _mm256_set1_ps(1); }

The IR generated by clang -S -emit-llvm a.cpp -O3 -march=native is:

define dso_local noundef <8 x float> @_Z3foov() local_unnamed_addr #0 {
  ret <8 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>
}

It can be obtained as follows:

std::array<float, 8> elems = {1, 1, 1, 1, 1, 1, 1, 1};
llvm::Value *value = llvm::ConstantDataVector::get(TheContext, elems);

LLVMContextImpl caches all constant data arrays and constant data vectors in the following data structure:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1497

StringMap<std::unique_ptr<ConstantDataSequential>> CDSConstants;

Note that ConstantDataSequential is the base class of ConstantDataArray and ConstantDataVector.

The key of this map is a string. Taking the call above as an example:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Constants.cpp#L3030

Constant *ConstantDataVector::get(LLVMContext &Context, ArrayRef<float> Elts) {
  auto *Ty = FixedVectorType::get(Type::getFloatTy(Context), Elts.size());
  const char *Data = reinterpret_cast<const char *>(Elts.data());
  return getImpl(StringRef(Data, Elts.size() * 4), Ty);
}

The string is built from the raw bytes of the constant values:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Constants.cpp#L2891

Constant *ConstantDataSequential::getImpl(StringRef Elements, Type *Ty) {
  // If the elements are all zero or there are no elements, return a CAZ, which
  // is more dense and canonical.
  if (isAllZeros(Elements))
    return ConstantAggregateZero::get(Ty);

When all elements are zero, the ConstantDataSequential degenerates into a ConstantAggregateZero.
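The byte-string key and the all-zero check can be imitated with the standard library alone. This is a sketch of the idea rather than LLVM's code: rawBytes and isAllZeroBytes are hypothetical helpers, std::string stands in for StringRef, and sizeof(float) is assumed to be 4.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Builds a byte-string key from the raw bytes of the elements (N > 0).
template <std::size_t N>
std::string rawBytes(const std::array<float, N> &elems) {
  std::string key(N * sizeof(float), '\0');
  std::memcpy(&key[0], elems.data(), key.size());
  return key;
}

// True if every byte of the key is zero (an empty key also counts).
inline bool isAllZeroBytes(const std::string &key) {
  for (char c : key)
    if (c != 0)
      return false;
  return true;
}
```

Two arrays with identical element bytes produce identical keys, which is what makes a string-keyed map usable for uniquing. Because the check works on bytes, a vector of -0.0f would not count as all zeros.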

ConstantAggregate

ConstantStruct

Represents a struct constant.

For example:

struct A {
  int i;
  int j;
};
const A a = {1, 1};

The generated LLVM IR is:

%struct.A = type { i32, i32 }
@_ZL1a = internal constant %struct.A { i32 1, i32 1 }, align 4

It can be obtained as follows:

llvm::Type *i32 = llvm::Type::getInt32Ty(TheContext);
llvm::StructType *type = llvm::StructType::create(TheContext, {i32, i32}, "struct.A", false);
llvm::Constant *one = llvm::ConstantInt::get(TheContext, llvm::APInt(32, 1, false /* isSigned */));
std::array<llvm::Constant *, 2> consts = {one, one};
llvm::Value *value = llvm::ConstantStruct::get(type, consts);

LLVMContextImpl caches all struct constants in the following data structure:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1483

using StructConstantsTy = ConstantUniqueMap<ConstantStruct>;
StructConstantsTy StructConstants;

ConstantArray

Represents a constant array.

It is used when the underlying data type is not a simple 1/2/4/8-byte integer or float/double.

For example:

struct A {
  int i;
  int j;
};
const A a[] = {{1, 1},{1, 1}};

The generated LLVM IR is:

%struct.A = type { i32, i32 }
@_ZL1a = internal constant [2 x %struct.A] [%struct.A { i32 1, i32 1 }, %struct.A { i32 1, i32 1 }], align 16

LLVMContextImpl caches all constant arrays in the following data structure:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1480

using ArrayConstantsTy = ConstantUniqueMap<ConstantArray>;
ArrayConstantsTy ArrayConstants;

For reference:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/ConstantsContext.h#L551

template <class ConstantClass> class ConstantUniqueMap {
public:
  using ValType = typename ConstantInfo<ConstantClass>::ValType;
  using TypeClass = typename ConstantInfo<ConstantClass>::TypeClass;
  using LookupKey = std::pair<TypeClass *, ValType>;

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/ConstantsContext.h#L326

template <> struct ConstantInfo<ConstantArray> {
  using ValType = ConstantAggrKeyType<ConstantArray>;
  using TypeClass = ArrayType;
};

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/ConstantsContext.h#L339

template <class ConstantClass> struct ConstantAggrKeyType {
  ArrayRef<Constant *> Operands;

It follows that the keys of the cached map have the following form:

{ArrayType *, ArrayRef<Constant *>}

GlobalValue

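Represents a globally defined object.

To repeat, functions and global variables are constants in the sense that their addresses do not change; it is as if a top-level const pointer pointed to each of these objects.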

GlobalVariable

Represents a global variable.

For example:

int a{1};

The generated LLVM IR is:

@a = dso_local global i32 1, align 4

dso_local has the following meaning:

The compiler may assume that a function or variable marked as dso_local will resolve to a symbol within the same linkage unit. Direct access will be generated even if the definition is not within this compilation unit.

As another example, consider:

static int a{1};

The generated LLVM IR is:

@_ZL1a = internal global i32 1, align 4

internal has the following meaning:

Similar to private, but the value shows as a local symbol (STB_LOCAL in the case of ELF) in the object file. This corresponds to the notion of the ‘static’ keyword in C.

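Note the name mangling here. For a value with internal linkage, the symbol name is the same as the one in the object file.

Compare the internal constant examples earlier.

The object file here is in ELF format: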

13: 0000000000004010     4 OBJECT  LOCAL  DEFAULT   22 _ZL1a

The IR above can probably be produced with the following code:

auto *value = new llvm::GlobalVariable(llvm::Type::getInt32Ty(TheContext), false /* isConstant */, llvm::GlobalValue::LinkageTypes::InternalLinkage);
value->setInitializer(llvm::ConstantInt::get(TheContext, llvm::APInt(32, 1, false /* isSigned */)));

The full LLVM IR syntax of a global variable is:

@<GlobalVarName> = [Linkage] [PreemptionSpecifier] [Visibility]
                   [DLLStorageClass] [ThreadLocal]
                   [(unnamed_addr|local_unnamed_addr)] [AddrSpace]
                   [ExternallyInitialized]
                   <global | constant> <Type> [<InitializerConstant>]
                   [, section "name"] [, partition "name"]
                   [, comdat [($name)]] [, align <Alignment>]
                   [, no_sanitize_address] [, no_sanitize_hwaddress]
                   [, sanitize_address_dyninit] [, sanitize_memtag]
                   (, !name !N)*

The remaining attributes are not covered here.

At the source level, all global variables are stored in the current Module:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Module.h#L181

GlobalListType GlobalList;      ///< The Global Variables in the module

where

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Module.h#L69

/// The type for the list of global variables.
using GlobalListType = SymbolTableList<GlobalVariable>;

All global variables of the current module can be traversed as follows:

for (auto it = TheModule->global_begin(); it != TheModule->global_end(); ++it) {
    llvm::GlobalVariable &value = *it;
    ...
}

This works because the GlobalVariable class also derives from ilist_node<GlobalVariable>:

class GlobalVariable : public GlobalObject, public ilist_node<GlobalVariable>

As a result, the other nodes (GlobalVariable) of the list can be reached from the current node (GlobalVariable).

Function

Represents function definitions and function declarations.

Consider the following function definition:

int foo(int) { return {}; }

The LLVM IR generated by clang -S -emit-llvm a.cpp -O3 is:

; Function Attrs: mustprogress nofree norecurse nosync nounwind readnone sspstrong uwtable willreturn
define dso_local noundef i32 @_Z3fooi(i32 noundef %0) local_unnamed_addr #0 {
  ret i32 0
}

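dso_local was introduced above.
noundef is a parameter attribute (an attribute of function parameters and return values); it states that the parameter or return value is neither undef nor poison.
local_unnamed_addr states that the address of the function is not significant within the current module and only its content matters, so identical functions within the module can be merged under certain conditions.
The remaining function attributes are not covered here.

The full LLVM IR syntax of a function definition is: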

define [linkage] [PreemptionSpecifier] [visibility] [DLLStorageClass]
       [cconv] [ret attrs]
       <ResultType> @<FunctionName> ([argument list])
       [(unnamed_addr|local_unnamed_addr)] [AddrSpace] [fn Attrs]
       [section "name"] [partition "name"] [comdat [($name)]] [align N]
       [gc] [prefix Constant] [prologue Constant] [personality Constant]
       (!name !N)* { ... }

The IR above can probably be produced with the following code:

llvm::Type *i32 = llvm::Type::getInt32Ty(TheContext);
std::array<llvm::Type *, 1> args = {i32};
llvm::FunctionType *type = llvm::FunctionType::get(i32, args, false);
llvm::Value *func = llvm::Function::Create(type, llvm::GlobalValue::LinkageTypes::ExternalLinkage, 0 /* AddrSpace */);

For a function declaration, take printf as an example:

extern int printf (const char *__restrict __format, ...);

The corresponding LLVM IR is:

declare noundef i32 @_Z6printfPKcz(i8* noundef, ...) #1

The full LLVM IR syntax of a function declaration is:

declare [linkage] [visibility] [DLLStorageClass]
        [cconv] [ret attrs]
        <ResultType> @<FunctionName> ([argument list])
        [(unnamed_addr|local_unnamed_addr)] [align N] [gc]
        [prefix Constant] [prologue Constant]

At the source level, all functions are likewise stored in the current Module:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Module.h#L182

FunctionListType FunctionList;  ///< The Functions in the module

The Function class has several important members:

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Function.h#L72

using BasicBlockListType = SymbolTableList<BasicBlock>;
...
// Important things that make up a function!
BasicBlockListType BasicBlocks;             ///< The basic blocks
mutable Argument *Arguments = nullptr;      ///< The formal arguments
size_t NumArgs;
std::unique_ptr<ValueSymbolTable> SymTab;   ///< Symbol table of args/instructions
AttributeList AttributeSets;                ///< Parameter attributes

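The main one of interest here is the Argument class, i.e. a formal parameter of the function, which records the following information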

Type
ArgNo
Attributes

The Function class provides iterator interfaces for traversing its arguments and basic blocks

BlockAddress

Used to uniquely identify the address of a (Function, BasicBlock) pair

Since BasicBlock has not been introduced yet, it is skipped here

ConstantExpr

Represents a constant expression

Its core is the following method

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Constants.cpp#L2263

Constant *ConstantExpr::get(unsigned Opcode, Constant *C1, Constant *C2, unsigned Flags, Type *OnlyIfReducedTy)

This amounts to constructing a constant expression from an operator and its operands

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Constants.cpp#L2311

if (Constant *FC = ConstantFoldBinaryInstruction(Opcode, C1, C2))
    return FC;

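While constructing a constant expression, it checks whether constant folding can be applied

The folding code makes heavy use of templates such as isa<> to check whether a value is undef or poison

Here is a brief introduction to Undefined Values and Poison Values

In the class hierarchy, PoisonValue derives from UndefValue, which in turn derives from ConstantData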
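To make the isa<> checks more concrete, here is a minimal sketch of the classof-based pattern behind them. It is a toy model, not LLVM's code: the Kind enum and the simplified Value, ConstantInt, UndefValue and PoisonValue structs are stand-ins that only mirror the LLVM names. It assumes that PoisonValue is a subclass of UndefValue, so a poison value also answers true to isa<UndefValue>.

```cpp
#include <cassert>

// Simplified stand-ins for LLVM classes (toy model, not the real definitions).
struct Value {
    enum Kind { ConstantIntKind, UndefKind, PoisonKind };
    explicit Value(Kind k) : kind(k) {}
    Kind getKind() const { return kind; }

private:
    Kind kind;
};

struct ConstantInt : Value {
    ConstantInt() : Value(ConstantIntKind) {}
    static bool classof(const Value *v) { return v->getKind() == ConstantIntKind; }
};

struct UndefValue : Value {
    UndefValue() : Value(UndefKind) {}
    // A poison value counts as an undef value as well.
    static bool classof(const Value *v) {
        return v->getKind() == UndefKind || v->getKind() == PoisonKind;
    }

protected:
    explicit UndefValue(Kind k) : Value(k) {}
};

struct PoisonValue : UndefValue {
    PoisonValue() : UndefValue(PoisonKind) {}
    static bool classof(const Value *v) { return v->getKind() == PoisonKind; }
};

// isa<To>(v): ask the target class whether v is one of its instances.
template <typename To>
bool isa(const Value *v) {
    assert(v != nullptr && "isa<> used on a null pointer");
    return To::classof(v);
}

// dyn_cast<To>(v): checked downcast that yields nullptr on a mismatch.
template <typename To>
const To *dyn_cast(const Value *v) {
    return isa<To>(v) ? static_cast<const To *>(v) : nullptr;
}
```

With this model, a check for undef has to be ordered after a check for poison whenever the two cases need different handling, because the undef check alone matches both.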

These two kinds of value exist because LLVM IR has the concept of undefined behavior, for example the familiar signed integer overflow

bool foo(int a) { return a + 1 > a; }

The corresponding LLVM IR is

%4 = add nsw i32 %3, 1

Note the nsw flag, which stands for No Signed Wrap. When %3 is INT_MAX, INT_MAX + 1 causes a signed integer overflow, and %4 is then a poison value
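The nsw rule can be sketched in plain C++. This is only a toy model and not LLVM code: std::nullopt stands in for a poison result, and add_nsw, foo_model and demo_add_nsw are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Toy model, not LLVM code: std::nullopt stands in for a poison result.
std::optional<std::int32_t> add_nsw(std::int32_t lhs, std::int32_t rhs) {
    // Widen to 64 bits so the exact mathematical sum is always representable.
    const std::int64_t exact = static_cast<std::int64_t>(lhs) + rhs;
    if (exact > std::numeric_limits<std::int32_t>::max() ||
        exact < std::numeric_limits<std::int32_t>::min()) {
        return std::nullopt;  // the 32-bit add would wrap: poison
    }
    return static_cast<std::int32_t>(exact);
}

// Model of `a + 1 > a`: poison propagates through the comparison.
std::optional<bool> foo_model(std::int32_t a) {
    const std::optional<std::int32_t> sum = add_nsw(a, 1);
    if (!sum) return std::nullopt;
    return *sum > a;
}

// Hypothetical usage check.
inline void demo_add_nsw() {
    assert(foo_model(41) == std::optional<bool>(true));
    assert(!foo_model(std::numeric_limits<std::int32_t>::max()).has_value());
}
```

Every non-poison outcome of foo_model is true, which is why an optimizer is free to fold the whole comparison to true.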

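In earlier LLVM implementations, %4 in the situation above was an undefined value

Operating on an undefined value yields an undefined value rather than undefined behavior. In some cases this enables optimizations: for example, the compiler reasons that only the lowest bit of undef & 1 is undefined, so ((undef & 1) >> 1) is considered to be 0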
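One way to check the ((undef & 1) >> 1) reasoning is to treat undef as "any bit pattern" and enumerate every outcome. The sketch below does that for an 8-bit value. It is a toy model rather than how LLVM reasons internally, and possible_results, masked_only and masked_then_shifted are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Toy model, not LLVM code: an 8-bit undef is "any of the 256 bit patterns".
// Collect every value the expression can take over all of them.
template <typename Fn>
std::set<std::uint8_t> possible_results(Fn expr) {
    std::set<std::uint8_t> results;
    for (unsigned bits = 0; bits < 256; ++bits) {
        results.insert(expr(static_cast<std::uint8_t>(bits)));
    }
    return results;
}

// undef & 1: only the lowest bit is still unknown.
inline std::set<std::uint8_t> masked_only() {
    return possible_results([](std::uint8_t u) {
        return static_cast<std::uint8_t>(u & 1u);
    });
}

// (undef & 1) >> 1: the unknown bit is shifted out.
inline std::set<std::uint8_t> masked_then_shifted() {
    return possible_results([](std::uint8_t u) {
        return static_cast<std::uint8_t>((u & 1u) >> 1);
    });
}

// Hypothetical usage check.
inline void demo_undef_bits() {
    assert((masked_only() == std::set<std::uint8_t>{0, 1}));
    assert((masked_then_shifted() == std::set<std::uint8_t>{0}));
}
```

Since the second expression has a single possible outcome, replacing it with the constant 0 is valid for every choice of the undefined bits.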

A ‘poison’ value should be used instead of ‘undef’ whenever possible. Poison values are stronger than undef, and enable more optimizations. Just the existence of ‘undef’ blocks certain optimizations.

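In 2016 the LLVM community proposed deprecating undef and using only poison, but for now undef and poison still coexist

Another place where constant folding happens is when building instructions with IRBuilder, for example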

llvm::Constant *one = llvm::ConstantInt::get(TheContext, llvm::APInt(32, 1, false /* isSigned */));
llvm::Value *value = Builder.CreateAdd(one, one);

Tracing its likely call path

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/IRBuilder.h#L1242

Value *CreateAdd(Value *LHS, Value *RHS, const Twine &Name = "", bool HasNUW = false, bool HasNSW = false) {
    if (Value *V = Folder.FoldNoWrapBinOp(Instruction::Add, LHS, RHS, HasNUW, HasNSW))
        return V;
    return CreateInsertNUWNSWBinOp(Instruction::Add, LHS, RHS, Name, HasNUW, HasNSW);
}

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/ConstantFolder.h#L68

Value *FoldNoWrapBinOp(Instruction::BinaryOps Opc, Value *LHS, Value *RHS, bool HasNUW, bool HasNSW) const override {
    auto *LC = dyn_cast<Constant>(LHS);
    auto *RC = dyn_cast<Constant>(RHS);
    if (LC && RC) {
        if (ConstantExpr::isDesirableBinOp(Opc)) {
            unsigned Flags = 0;
            if (HasNUW)
                Flags |= OverflowingBinaryOperator::NoUnsignedWrap;
            if (HasNSW)
                Flags |= OverflowingBinaryOperator::NoSignedWrap;
            return ConstantExpr::get(Opc, LC, RC, Flags);
        }
        return ConstantFoldBinaryInstruction(Opc, LC, RC);
    }
    return nullptr;
}

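If both operands are constants and the opcode is a desirable constant-expression binary operator, ConstantExpr::get is called to obtain the corresponding constant expression, which performs constant folding where possible; otherwise ConstantFoldBinaryInstruction is called directly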
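The fold-before-create pattern can be sketched without LLVM. The following is a toy model under stated assumptions: Node and ToyBuilder are hypothetical types, only integer add is handled, and a constant is just an int. It shows that the builder first asks a folder for a result and only creates an instruction when no fold is available.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <vector>

// Toy IR node (hypothetical): a constant carries a value, anything else is
// either a runtime argument or an add instruction with two operands.
struct Node {
    std::optional<int> constant;
    const Node *lhs = nullptr;
    const Node *rhs = nullptr;
};

class ToyBuilder {
public:
    const Node *constant(int v) { return make(Node{v, nullptr, nullptr}); }
    const Node *argument() { return make(Node{}); }

    // Mirrors the shape of CreateAdd: try to fold first, otherwise emit.
    const Node *createAdd(const Node *l, const Node *r) {
        assert(l != nullptr && r != nullptr);
        if (const Node *folded = foldAdd(l, r)) return folded;
        ++instructions_;
        return make(Node{std::nullopt, l, r});
    }

    int instructionCount() const { return instructions_; }

private:
    // Folding succeeds only when both operands are constants.
    // (Overflow is ignored in this toy; keep the values small.)
    const Node *foldAdd(const Node *l, const Node *r) {
        if (l->constant && r->constant) return constant(*l->constant + *r->constant);
        return nullptr;
    }

    const Node *make(Node n) {
        nodes_.push_back(std::make_unique<Node>(n));
        return nodes_.back().get();
    }

    std::vector<std::unique_ptr<Node>> nodes_;
    int instructions_ = 0;
};
```

In this model, adding two constants returns a constant node and emits no instruction, while adding an argument and a constant emits exactly one add.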

LLVMContextImpl caches all constant expressions in the following data structure

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1506

ConstantUniqueMap<ConstantExpr> ExprConstants;
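The idea of such a cache is uniquing: asking twice for a structurally identical expression returns the same object. Below is a minimal sketch of that idea using std::map as the lookup table. It is not how ConstantUniqueMap is implemented, and ToyExpr and ToyUniqueMap are hypothetical names, with plain ints standing in for operands.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <tuple>

// Toy constant expression (hypothetical): an opcode and two integer operands.
struct ToyExpr {
    unsigned opcode;
    int lhs;
    int rhs;
};

// Toy uniquing table: one object per distinct (opcode, lhs, rhs) key.
class ToyUniqueMap {
    using Key = std::tuple<unsigned, int, int>;

public:
    const ToyExpr *getOrCreate(unsigned opcode, int lhs, int rhs) {
        const Key key{opcode, lhs, rhs};
        auto it = pool_.find(key);
        if (it == pool_.end()) {
            // First request for this key: allocate and remember the object.
            it = pool_.emplace(key, std::make_unique<ToyExpr>(ToyExpr{opcode, lhs, rhs})).first;
        }
        assert(it->second != nullptr);
        return it->second.get();
    }

    std::size_t size() const { return pool_.size(); }

private:
    std::map<Key, std::unique_ptr<ToyExpr>> pool_;
};
```

Because equal keys map to one object, structural equality of two expressions reduces to a pointer comparison, which is the same property the Type singletons have.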

TODO

Instruction
Operator
…
