
An Introduction to the LLVM Type System

Posted on: 2023.01.22


Type

The LLVM type system is built around the Type class.

Every kind of type is identified by one of the following enumerators

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Type.h#L54

enum TypeID {
// PrimitiveTypes
HalfTyID = 0, ///< 16-bit floating point type
BFloatTyID, ///< 16-bit floating point type (7-bit significand)
FloatTyID, ///< 32-bit floating point type
DoubleTyID, ///< 64-bit floating point type
X86_FP80TyID, ///< 80-bit floating point type (X87)
FP128TyID, ///< 128-bit floating point type (112-bit significand)
PPC_FP128TyID, ///< 128-bit floating point type (two 64-bits, PowerPC)
VoidTyID, ///< type with no size
LabelTyID, ///< Labels
MetadataTyID, ///< Metadata
X86_MMXTyID, ///< MMX vectors (64 bits, X86 specific)
X86_AMXTyID, ///< AMX vectors (8192 bits, X86 specific)
TokenTyID, ///< Tokens
// Derived types... see DerivedTypes.h file.
IntegerTyID, ///< Arbitrary bit width integers
FunctionTyID, ///< Functions
PointerTyID, ///< Pointers
StructTyID, ///< Structures
ArrayTyID, ///< Arrays
FixedVectorTyID, ///< Fixed width SIMD vector type
ScalableVectorTyID ///< Scalable SIMD vector type
};

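Within an LLVMContext, all structurally equivalent types share a single object instance (they are uniqued). Identified struct types, described later, are the exception.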

The inheritance hierarchy of the Type class is shown in the figure below

[Figure: inheritance diagram of the Type class]

The LLVMContext class holds a top-level const pointer to an LLVMContextImpl.

This is the classic PImpl idiom

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/LLVMContext.h#L69

LLVMContextImpl *const pImpl;

LLVMContextImpl holds the singletons for the primitive types above and for the common integer types, and initializes them in its constructor

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.cpp#L40

LLVMContextImpl::LLVMContextImpl(LLVMContext &C)
: DiagHandler(std::make_unique<DiagnosticHandler>()),
VoidTy(C, Type::VoidTyID), LabelTy(C, Type::LabelTyID),
HalfTy(C, Type::HalfTyID), BFloatTy(C, Type::BFloatTyID),
FloatTy(C, Type::FloatTyID), DoubleTy(C, Type::DoubleTyID),
MetadataTy(C, Type::MetadataTyID), TokenTy(C, Type::TokenTyID),
X86_FP80Ty(C, Type::X86_FP80TyID), FP128Ty(C, Type::FP128TyID),
PPC_FP128Ty(C, Type::PPC_FP128TyID), X86_MMXTy(C, Type::X86_MMXTyID),
X86_AMXTy(C, Type::X86_AMXTyID), Int1Ty(C, 1), Int8Ty(C, 8),
Int16Ty(C, 16), Int32Ty(C, 32), Int64Ty(C, 64), Int128Ty(C, 128) {
if (OpaquePointersCL.getNumOccurrences()) {
OpaquePointers = OpaquePointersCL;
}
}

The Type class also provides matching static methods for obtaining these singletons

Floating Point Types

primitive type

Type | Description
half | 16-bit floating-point value
bfloat | 16-bit “brain” floating-point value (7-bit significand). Provides the same number of exponent bits as float, so that it matches its dynamic range, but with greatly reduced precision. Used in Intel’s AVX-512 BF16 extensions and Arm’s ARMv8.6-A extensions, among others.
float | 32-bit floating-point value
double | 64-bit floating-point value
fp128 | 128-bit floating-point value (113-bit significand)
x86_fp80 | 80-bit floating-point value (X87)
ppc_fp128 | 128-bit floating-point value (two 64-bits)

In practice, float and double are the types used most often

Void Type

primitive type

The void type singleton can be obtained as follows

llvm::Type *type = llvm::Type::getVoidTy(TheContext);

The void type represents no value and has no size. It only acts as a placeholder, for example as a function's return type

define dso_local void @foo() #0 {
ret void
}

Label Type

primitive type

Labels mark basic blocks. For example, a max function might compile to the following LLVM IR

define dso_local i32 @max(i32 noundef %0, i32 noundef %1) #0 {
%3 = alloca i32, align 4
%4 = alloca i32, align 4
store i32 %0, i32* %3, align 4
store i32 %1, i32* %4, align 4
%5 = load i32, i32* %3, align 4
%6 = load i32, i32* %4, align 4
%7 = icmp sgt i32 %5, %6
br i1 %7, label %8, label %10
8: ; preds = %2
%9 = load i32, i32* %3, align 4
br label %12
10: ; preds = %2
%11 = load i32, i32* %4, align 4
br label %12
12: ; preds = %10, %8
%13 = phi i32 [ %9, %8 ], [ %11, %10 ]
ret i32 %13
}

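Note the implicit label %2 here: the entry block is unnamed, so it takes the next number after the two arguments %0 and %1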

Token Type

primitive type

The token type is used when a value is associated with an instruction but all uses of the value must not attempt to introspect or obscure it. As such, it is not appropriate to have a phi or select of type token.

The identifier ‘none’ is recognized as an empty token constant and must be of token type.

This type is not covered further here

Metadata Type

primitive type

The metadata type represents embedded metadata. No derived types may be created from metadata except for function arguments.

LLVM IR allows metadata to be attached to instructions and global objects in the program that can convey extra information about the code to the optimizers and code generator. One example application of metadata is source-level debug information. There are two metadata primitives: strings and nodes.

Metadata does not have a type, and is not a value. If referenced from a call instruction, it uses the metadata type.

All metadata are identified in syntax by an exclamation point (‘!’).

For example

!llvm.module.flags = !{!0, !1, !2, !3, !4}
!llvm.ident = !{!5}
!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 7, !"PIC Level", i32 2}
!2 = !{i32 7, !"PIE Level", i32 2}
!3 = !{i32 7, !"uwtable", i32 1}
!4 = !{i32 7, !"frame-pointer", i32 2}
!5 = !{!"clang version 14.0.6"}

Integer Type

The syntax is iN, where N is the bit width of the integer.

The i32 type singleton can be obtained as follows

llvm::Type *type = llvm::Type::getInt32Ty(TheContext);

When the i32 type is constructed, its bit width is stored in the SubclassData field of the Type class

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Type.h#L86

TypeID ID : 8; // The current base type of this type.
unsigned SubclassData : 24; // Space for subclasses to store data.
// Note that this should be synchronized with
// MAX_INT_BITS value in IntegerType class.

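Because SubclassData is only 24 bits wide, the width of an integer type is limited to the range $[1, 2^{23}]$.

The largest unsigned value LLVM can represent is therefore $2^{2^{23}} - 1 = 2^{8388608} - 1$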
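The following standalone sketch (not LLVM's code) illustrates how a 24-bit field bounds the stored width. The struct `PackedTypeHeader`, the helper `setIntegerWidth`, and the constant names are hypothetical; only the 8/24 bit split and the $2^{23}$ limit are taken from the text above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two bit-fields quoted above.
struct PackedTypeHeader {
  std::uint32_t ID : 8;            // kind of the type
  std::uint32_t SubclassData : 24; // e.g. the bit width of an integer type
};

// Largest value a 24-bit field can hold.
constexpr std::uint32_t kMaxSubclassData = (1u << 24) - 1;
// Width limit stated in the text: the largest power of two that still fits.
constexpr std::uint32_t kMaxIntBits = 1u << 23;

// Store a width in the header; returns false if the width is out of range.
inline bool setIntegerWidth(PackedTypeHeader &H, std::uint32_t Bits) {
  if (Bits < 1 || Bits > kMaxIntBits)
    return false;
  H.SubclassData = Bits;
  return true;
}
```

Under these assumptions, widths 1 through 8388608 are accepted and anything larger is rejected before it could be truncated by the bit-field.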

Note that an integer type carries no signedness information.

LLVMContextImpl caches all integer types in the following data structure

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1524

DenseMap<unsigned, IntegerType *> IntegerTypes;
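The idea behind such a cache can be sketched with the standard library alone. `ToyContext` and `ToyIntegerType` are hypothetical names, and `std::unordered_map` stands in for `DenseMap`; this is an illustration of uniquing, not LLVM's implementation.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

// Hypothetical, simplified integer type: only the bit width is recorded.
struct ToyIntegerType {
  unsigned BitWidth;
};

// Hypothetical context that owns every integer type it hands out.
class ToyContext {
  std::unordered_map<unsigned, std::unique_ptr<ToyIntegerType>> IntegerTypes;

public:
  // Return the unique type object for this width, creating it on first use.
  ToyIntegerType *getIntNTy(unsigned BitWidth) {
    std::unique_ptr<ToyIntegerType> &Slot = IntegerTypes[BitWidth];
    if (!Slot)
      Slot.reset(new ToyIntegerType{BitWidth});
    return Slot.get();
  }
};
```

Because each width maps to exactly one object per context, two types from the same context can be compared for equality by comparing pointers.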

Pointer Type

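A pointer type is typically used to refer to an object at a given memory location.

A pointer type can specify the number of the address space in which the pointee resides; the default is 0.

AddrSpace is likewise stored in SubclassData.

The i32* type singleton can be obtained as follows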

llvm::Type *type = llvm::Type::getInt32PtrTy(TheContext, 0);

This method wraps PointerType::get

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Type.cpp#L301

PointerType *Type::getInt32PtrTy(LLVMContext &C, unsigned AS) {
return getInt32Ty(C)->getPointerTo(AS);
}

where

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Type.cpp#L776

PointerType *Type::getPointerTo(unsigned AddrSpace) const {
return PointerType::get(const_cast<Type*>(this), AddrSpace);
}

LLVMContextImpl caches all pointer types in the following data structures

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1538

DenseMap<Type *, PointerType *> PointerTypes; // Pointers in AddrSpace = 0
DenseMap<std::pair<Type *, unsigned>, PointerType *> ASPointerTypes;

Note that a pointer type here carries the type of its pointee.

The pointee type is stored in ContainedTys of the Type class

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Type.h#L106

/// Keeps track of how many Type*'s there are in the ContainedTys list.
unsigned NumContainedTys = 0;
/// A pointer to the array of Types contained by this Type. For example, this
/// includes the arguments of a function type, the elements of a structure,
/// the pointee of a pointer, the element type of an array, etc. This pointer
/// may be 0 for types that don't contain other types (Integer, Double,
/// Float).
Type * const *ContainedTys = nullptr;

The community has discussed these explicit pointee types at length.

Note that LLVM has no void*, as the following code shows

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Type.cpp#L590

bool PointerType::isValidElementType(Type *ElemTy) {
return !ElemTy->isVoidTy() && !ElemTy->isLabelTy() &&
!ElemTy->isMetadataTy() && !ElemTy->isTokenTy() &&
!ElemTy->isX86_AMXTy();
}

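The consensus the community eventually reached is that explicit pointee types cost more than they are worth, so they should be retired.

LLVM therefore introduced the opaque pointer type, a pointer type that carries no information about its pointee.

For example, the LLVM IR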

load i64* %p

has the following opaque-pointer version

load i64, ptr %p

In the C API, the function that builds this instruction changed from LLVMBuildLoad to LLVMBuildLoad2, which takes the loaded type as an explicit argument

Array Type

An array type has two properties: the element type and the number of elements.

Here are some examples

Syntax | Semantics
[40 x i32] | Array of 40 32-bit integer values.
[3 x [4 x i32]] | 3x4 array of 32-bit integer values.
[2 x [3 x [4 x i16]]] | 2x3x4 array of 16-bit integer values.

The [40 x i32] type singleton can be obtained as follows

llvm::Type *type = llvm::ArrayType::get(llvm::Type::getInt32Ty(TheContext), 40);

As with pointer types, the element type of an array type is stored in ContainedTys of the Type class.

LLVMContextImpl caches all array types in the following data structure

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1536

DenseMap<std::pair<Type *, uint64_t>, ArrayType *> ArrayTypes;

Vector Type

A vector type resembles an array type but is meant for SIMD. It is not considered an aggregate type; it is a first class type.

Values of these types are the only ones which can be produced by instructions.

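A vector type has three properties: the element type, the (minimum) number of elements, and whether the vector is scalable.

Here are some examples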

Syntax | Semantics
<4 x i32> | Vector of 4 32-bit integer values.
<vscale x 4 x i32> | Vector with a multiple of 4 32-bit integer values.

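For a ScalableVectorType, vscale is a positive integer that is unknown at compile time and is fixed by the hardware at run time.

The <vscale x 4 x i32> type singleton can be obtained as follows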

llvm::Type *type = llvm::VectorType::get(llvm::Type::getInt32Ty(TheContext), 4, true);

LLVMContextImpl caches all vector types in the following data structure

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1537

DenseMap<std::pair<Type *, ElementCount>, VectorType *> VectorTypes;

Note the ElementCount class here, which is constructed in

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/DerivedTypes.h#L427

static VectorType *get(Type *ElementType, unsigned NumElements, bool Scalable) {
return VectorType::get(ElementType, ElementCount::get(NumElements, Scalable));
}

This calls the following method of its parent class LinearPolySize

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/DerivedTypes.h#L427

static LeafTy get(ScalarTy MinVal, bool Scalable) {
return static_cast<LeafTy>(LinearPolySize(MinVal, Scalable ? 1 : 0));
}

The source carries this comment

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/DerivedTypes.h#L427

/// UnivariateLinearPolyBase is a base class for ElementCount and TypeSize.
/// Like LinearPolyBase it tries to represent a linear polynomial
/// where only one dimension can be set at any time, e.g.
/// 0 * scale0 + 0 * scale1 + ... + cJ * scaleJ + ... + 0 * scaleK
/// The dimension that is set is the univariate dimension.

Roughly, when the scalable property is true, the corresponding dimension may be scaled by a different factor on different hardware
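A minimal sketch of this idea, assuming a simplified element count that stores only a minimum value and a scalable flag. `ToyElementCount` and `elementsFor` are hypothetical names and do not mirror the real ElementCount interface.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplified element count: MinVal elements, multiplied by
// vscale when Scalable is set.
struct ToyElementCount {
  std::uint64_t MinVal;
  bool Scalable;

  // Number of elements for a concrete vscale (a hardware-dependent constant).
  std::uint64_t elementsFor(std::uint64_t VScale) const {
    assert(VScale >= 1 && "vscale is a positive integer");
    return Scalable ? MinVal * VScale : MinVal;
  }
};
```

With this model, `<4 x i32>` always has 4 elements, while `<vscale x 4 x i32>` has 4 elements when vscale is 1 and 8 when vscale is 2.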

In practice, for a given hardware target, the vector types that LLVM produces are usually FixedVectorType.

For example, the following code uses AVX2 intrinsics to take the absolute value of a vector of 8 floats

a.cpp
#include <immintrin.h>
__m256 _mm256_abs_ps(__m256 vec) {
__m256 float_zero = _mm256_set1_ps(0);
__m256 mask_lt_zero = _mm256_cmp_ps(vec, float_zero, _CMP_LT_OQ);
__m256 vec_neg = _mm256_sub_ps(float_zero, vec);
return _mm256_blendv_ps(vec, vec_neg, mask_lt_zero);
}

Compiling with clang -S -emit-llvm a.cpp -O3 -march=native produces the following IR

define dso_local noundef <8 x float> @_Z13_mm256_abs_psDv8_f(<8 x float> noundef %0) local_unnamed_addr #0 {
%2 = fcmp olt <8 x float> %0, zeroinitializer
%3 = fsub <8 x float> zeroinitializer, %0
%4 = select <8 x i1> %2, <8 x float> %3, <8 x float> %0
ret <8 x float> %4
}

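Note that %0, %3, and %4 all have type <8 x float>, while the comparison result %2 has type <8 x i1>. The fact that instructions produce vector values directly also shows that vector types are first class types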

Structure Type

There are two kinds of structure type.

Literal: anonymous, guaranteed unique within a context, and must have a body.

LLVMContextImpl caches all literal struct types in the following data structure

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1528

using StructTypeSet = DenseSet<StructType *, AnonStructTypeKeyInfo>;
StructTypeSet AnonStructTypes;

The lookup key used by AnonStructTypeKeyInfo has the following members

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L94

ArrayRef<Type *> ETypes;
bool isPacked;

The { i32, i32, i32 } type singleton can be obtained as follows

llvm::Type *i32 = llvm::Type::getInt32Ty(TheContext);
std::array<llvm::Type *, 3> elems = {i32, i32, i32};
llvm::Type *type = llvm::StructType::get(TheContext, elems, false);

LLVM gives the ArrayRef class many converting constructors, so an ArrayRef can be built from a pointer, a vector, an array, a C array, and more.

Identified: may be anonymous, is not guaranteed unique, and may lack a body (opaque).

Prior to the LLVM 3.0 release, identified types were structurally uniqued. Only literal types are uniqued in recent versions of LLVM.

LLVMContextImpl caches all identified struct types in the following data structure

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1530

StringMap<StructType *> NamedStructTypes;
unsigned NamedStructTypesUniqueID = 0;

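The type %struct.A = type { i32, i32, i32 } can be constructed as follows. The name passed to create is used verbatim, so it has to be "struct.A" to match Clang's spelling

llvm::Type *i32 = llvm::Type::getInt32Ty(TheContext);
std::array<llvm::Type *, 3> elems = {i32, i32, i32};
llvm::Type *type = llvm::StructType::create(TheContext, elems, "struct.A", false);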

A structure type also defines the following flags, which are stored in SubclassData

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/DerivedTypes.h#L216

enum {
/// This is the contents of the SubClassData field.
SCDB_HasBody = 1,
SCDB_Packed = 2,
SCDB_IsLiteral = 4,
SCDB_IsSized = 8
};
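These flags combine as ordinary bit masks. The sketch below copies the flag values from the enum above; the helper functions `isOpaque`, `isPacked`, and `isLiteral` are written here for illustration on a plain `unsigned` and are not LLVM's member functions.

```cpp
#include <cassert>

// Flag values copied from the enum quoted above.
enum : unsigned {
  SCDB_HasBody = 1,
  SCDB_Packed = 2,
  SCDB_IsLiteral = 4,
  SCDB_IsSized = 8
};

// Illustrative helpers that test individual bits of a SubclassData word.
inline bool isOpaque(unsigned Data) { return (Data & SCDB_HasBody) == 0; }
inline bool isPacked(unsigned Data) { return (Data & SCDB_Packed) != 0; }
inline bool isLiteral(unsigned Data) { return (Data & SCDB_IsLiteral) != 0; }
```

For instance, a word of 0 describes an opaque struct, and `SCDB_HasBody | SCDB_Packed` describes a packed identified struct that has a body.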

Here are a few examples

struct A;
struct B {
A* a;
};

The generated LLVM IR might be

%struct.B = type { %struct.A* }
%struct.A = type opaque

Here struct A has no body, so it is an opaque structure type.

This shows that opaque structure types exist to support forward declarations.

For %struct.A, the bits for SCDB_HasBody and SCDB_IsSized are both 0.

For the implementation of isSized, see https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Type.cpp#L554

struct __attribute__((packed)) A {
int i;
short s;
char c;
};

The generated LLVM IR might be

%struct.A = type <{ i32, i16, i8 }>

Note the extra <> here.

For %struct.A, the SCDB_Packed bit is set to 1

struct A {
struct {
int i;
int j;
int k;
} x;
struct {
int i;
int j;
int k;
} y;
};

The generated LLVM IR might be

%struct.A = type { %struct.anon, %struct.anon.0 }
%struct.anon = type { i32, i32, i32 }
%struct.anon.0 = type { i32, i32, i32 }

Note that these anonymous structs are still identified struct types; LLVM automatically handles missing and duplicate names

Function Type

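A function type is a function signature: a return type plus a list of parameter types.

As with literal struct types, LLVMContextImpl caches all function types in the following data structure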

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1526

using FunctionTypeSet = DenseSet<FunctionType *, FunctionTypeKeyInfo>;
FunctionTypeSet FunctionTypes;

The lookup key used by FunctionTypeKeyInfo has the following members

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L142

const Type *ReturnType;
ArrayRef<Type *> Params;
bool isVarArg;

The i32 (i32) type singleton can be obtained as follows

llvm::Type *i32 = llvm::Type::getInt32Ty(TheContext);
std::array<llvm::Type *, 1> args = {i32};
llvm::Type *type = llvm::FunctionType::get(i32, args, false);

Similarly

  • isVarArg is stored in SubclassData
  • ReturnType and Params are stored in ContainedTys

No llvm::LLVMContext argument is passed explicitly here; the context used is the one that the return type belongs to

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Type.cpp#L345

Finally, the isVarArg field indicates whether the function takes a variable number of arguments.

For example

#include <stdio.h>
int main() { printf("hello world\n"); }

The generated LLVM IR might be

@.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00", align 1
define dso_local i32 @main() #0 {
%1 = call i32 (i8*, ...) @printf(i8* noundef getelementptr inbounds ([13 x i8], [13 x i8]* @.str, i64 0, i64 0))
ret i32 0
}

Note the function signature i32 (i8*, ...) here

Value

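The Value class is one of the most important classes in LLVM and is the base class of many core classes.

Part of the inheritance hierarchy of Value is shown below

flowchart LR
  Argument --> Value
  BasicBlock --> Value
  User --> Value
  Constant --> User
  Instruction --> User
  Operator --> User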

Every Value object holds a pointer to its Type, plus a use list that records the users of the value

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Value.h#L74

class Value {
Type *VTy;
Use *UseList;
...

Value implements the iterator pattern for its uses, so the users of a value can be visited with the following interface

llvm::Value *value = ...
for (auto it = value->use_begin(); it != value->use_end(); ++it) {
// Use::get() returns the used value itself; getUser() returns the user
llvm::User *user = it->getUser();
...
}

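When transforming LLVM IR, one value may be replaced by another. For instance, if an instruction always yields a constant, the constant can replace the instruction, and every user that refers to the old value has to be updated as well.

The following interface does exactly that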

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Value.h#L297

/// Change all uses of this to point to a new Value.
///
/// Go through the uses list for this definition and make each use point to
/// "V" instead of "this". After this completes, 'this's use list is
/// guaranteed to be empty.
void replaceAllUsesWith(Value *V);
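The contract described in that comment can be sketched with a toy def-use structure. `ToyValue`, `ToyUser`, and `addOperand` are hypothetical, and the sketch keeps uses in a `std::vector` for brevity, which is an assumption of this example rather than how LLVM stores its use list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ToyValue;

// Hypothetical user: holds operand slots that point at the values it uses.
struct ToyUser {
  std::vector<ToyValue *> Operands;
};

// Hypothetical value: remembers every (user, operand index) slot that refers to it.
struct ToyValue {
  struct UseRef {
    ToyUser *User;
    std::size_t OperandNo;
  };
  std::vector<UseRef> Uses;

  // Redirect every recorded use to New, leaving this value with no uses.
  void replaceAllUsesWith(ToyValue *New) {
    assert(New != this && "cannot replace a value with itself");
    for (const UseRef &U : Uses) {
      U.User->Operands[U.OperandNo] = New;
      New->Uses.push_back(U);
    }
    Uses.clear();
  }
};

// Append V as an operand of U and record the use on V.
inline void addOperand(ToyUser &U, ToyValue &V) {
  U.Operands.push_back(&V);
  V.Uses.push_back(ToyValue::UseRef{&U, U.Operands.size() - 1});
}
```

After the call, every operand slot that pointed at the old value points at the new one, and the old value's use list is empty, matching the guarantee stated in the comment.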

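The implementation of replaceAllUsesWith makes use of ValueHandleBase.

A value handle can be seen as a smart pointer to a value; it triggers a specific action when the value is deleted or replaced through replaceAllUsesWith (RAUW).

The ValueHandleBase class has three subclasses.

A Value object may have a name; the HasName field of the Value class records whether it does.

LLVMContextImpl stores all value names in the following data structure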

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1447

DenseMap<const Value *, ValueName *> ValueNames;

where

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Value.h#L55

using ValueName = StringMapEntry<Value *>;

User & Use

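The User class derives from Value, because a user is itself a value that other users may use.

More specifically

An example was already given above.

For example, to visit the operands of an instruction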

llvm::Instruction *ins = ...
for (auto it = ins->op_begin(); it != ins->op_end(); ++it) {
llvm::Value *value = it->get();
...
}

The core job of the Use class is therefore to link values and users in both directions efficiently.

The code details are skipped here

Constant

The Constant class derives from User.

Constant is the base class of all constants; the value of a constant does not change at run time.

Functions and global variables are constant in the sense that their addresses do not change.

Structurally equivalent constants are uniqued: only one object instance exists per LLVMContext.

Part of the inheritance hierarchy of Constant is shown below

flowchart LR
  BlockAddress --> Constant
  ConstantAggregate --> Constant
  ConstantArray --> ConstantAggregate
  ConstantStruct --> ConstantAggregate
  ConstantVector --> ConstantAggregate
  ConstantData --> Constant
  ConstantFP --> ConstantData
  ConstantInt --> ConstantData
  ConstantAggregateZero --> ConstantData
  ConstantPointerNull --> ConstantData
  ConstantDataSequential --> ConstantData
  ConstantDataArray --> ConstantDataSequential
  ConstantDataVector --> ConstantDataSequential
  ConstantExpr --> Constant
  GlobalValue --> Constant
  GlobalObject --> GlobalValue
  Function --> GlobalObject
  GlobalVariable --> GlobalObject

ConstantData

ConstantInt

Represents an integer constant of arbitrary bit width.

LLVMContextImpl caches all integer constants in the following data structure

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1449

using IntMapTy = DenseMap<APInt, std::unique_ptr<ConstantInt>, DenseMapAPIntKeyInfo>;
IntMapTy IntConstants;

The singleton for the constant i32 100 can be obtained as follows

llvm::Value *value = llvm::ConstantInt::get(TheContext, llvm::APInt(32, 100, false /* isSigned */));

The isSigned argument tells APInt whether to treat the given value as signed

An analogous transition that happened earlier in LLVM is integer signedness. Currently there is no distinction between signed and unsigned integer types, but rather each integer operation (e.g. add) contains flags to signal how to treat the integer. Previously LLVM IR distinguished between unsigned and signed integer types and ran into similar issues of no-op casts. The transition from manifesting signedness in types to instructions happened early on in LLVM’s timeline to make LLVM easier to work with.

Note the helper class APInt, which stores the raw data in either a uint64_t or a uint64_t *

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/ADT/APInt.h#L1868

union {
uint64_t VAL; ///< Used to store the <= 64 bits integer value.
uint64_t *pVal; ///< Used to store the >64 bits integer value.
} U;
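The choice between the two union members depends only on the bit width. The helpers below are a standalone sketch with hypothetical free functions (`numWords`, `isSingleWord`) that show the arithmetic implied by the comments in the union; they are not APInt's own code.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kBitsPerWord = 64;

// Number of 64-bit words needed to hold a value of the given bit width.
constexpr unsigned numWords(unsigned BitWidth) {
  return (BitWidth + kBitsPerWord - 1) / kBitsPerWord;
}

// True when the value fits in the inline VAL member of the union.
constexpr bool isSingleWord(unsigned BitWidth) {
  return BitWidth <= kBitsPerWord;
}
```

So an i64 fits inline in one word, while an i65 or i128 needs a heap-allocated array of two words.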

LLVMContextImpl additionally keeps singletons for the two i1 boolean constants

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1510

ConstantInt *TheTrueVal = nullptr;
ConstantInt *TheFalseVal = nullptr;

They can be obtained as follows

llvm::Value *value = llvm::ConstantInt::getTrue(TheContext);

ConstantFP

Represents a floating-point constant of any floating-point type.

Like ConstantInt, LLVMContextImpl caches all floating-point constants in the following data structure

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1453

using FPMapTy = DenseMap<APFloat, std::unique_ptr<ConstantFP>, DenseMapAPFloatKeyInfo>;
FPMapTy FPConstants;

The singleton for the constant float 1.1 can be obtained as follows

llvm::Value *value = llvm::ConstantFP::get(TheContext, llvm::APFloat(static_cast<float>(1.1)));

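Floating-point values here follow the IEEE standard, and the implementation is encapsulated in classes such as APFloat. For example

float foo() { return 1.1; }

generates the following LLVM IR

define dso_local noundef float @_Z3foov() #0 {
ret float 0x3FF19999A0000000
}

The floating-point constant is printed in hexadecimal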
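The hexadecimal literal is the IEEE-754 bit pattern of the float value after widening it to double. The following standalone sketch reproduces it; the helper name `floatAsDoubleBits` is made up for this example, and the code assumes a platform with IEEE-754 float and 64-bit double.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Widen a float to double and return the double's raw bit pattern.
inline std::uint64_t floatAsDoubleBits(float F) {
  double D = static_cast<double>(F); // exact: every float is representable as a double
  std::uint64_t Bits;
  static_assert(sizeof(Bits) == sizeof(D), "expects a 64-bit double");
  std::memcpy(&Bits, &D, sizeof(Bits));
  return Bits;
}
```

Calling `floatAsDoubleBits(1.1f)` yields 0x3FF19999A0000000, the same digits as in the IR; the trailing zeros show that the value carries only float precision.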

ConstantAggregateZero

Represents an aggregate zero constant, typically used for all-zero initialization.

For example

const int arr[42] = {0};

generates the following LLVM IR

@_ZL3arr = internal constant [42 x i32] zeroinitializer, align 16

Here zeroinitializer is a ConstantAggregateZero of type [42 x i32]

llvm::Value *value = llvm::ConstantAggregateZero::get(llvm::ArrayType::get(llvm::Type::getInt32Ty(TheContext), 42));

LLVMContextImpl caches all constant aggregate zeros in the following data structure

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1478

DenseMap<Type *, std::unique_ptr<ConstantAggregateZero>> CAZConstants;

ConstantPointerNull

Represents a null pointer.

For example

void *foo() { return nullptr; }

generates the following LLVM IR

define dso_local noundef i8* @_Z3foov() #0 {
ret i8* null
}

Here null is a ConstantPointerNull of type i8*

llvm::Value *value = llvm::ConstantPointerNull::get(llvm::Type::getInt8PtrTy(TheContext));

LLVMContextImpl caches all constant pointer nulls in the following data structure

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1489

DenseMap<PointerType *, std::unique_ptr<ConstantPointerNull>> CPNConstants;

ConstantDataArray

Represents a constant array.

The element type is restricted to simple 1/2/4/8-byte integers or float/double.

For example

const int arr[] = { 0, 1, 2 };

generates the following LLVM IR

@_ZL3arr = internal constant [3 x i32] [i32 0, i32 1, i32 2], align 4

It can be obtained as follows

std::array<int, 3> elems = {0, 1, 2};
llvm::Value *value = llvm::ConstantDataArray::get(TheContext, elems);

ConstantDataVector

Represents a constant vector.

The element type is restricted to simple 1/2/4/8-byte integers or float/double.

For example

#include <immintrin.h>
__m256 foo() { return _mm256_set1_ps(1); }

Compiling with clang -S -emit-llvm a.cpp -O3 -march=native produces the following IR

define dso_local noundef <8 x float> @_Z3foov() local_unnamed_addr #0 {
ret <8 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>
}

It can be obtained as follows

std::array<float, 8> elems = {1, 1, 1, 1, 1, 1, 1, 1};
llvm::Value *value = llvm::ConstantDataVector::get(TheContext, elems);

LLVMContextImpl caches all constant data arrays and constant data vectors in the following data structure

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1497

StringMap<std::unique_ptr<ConstantDataSequential>> CDSConstants;

Note that ConstantDataSequential is the parent class of ConstantDataArray and ConstantDataVector.

Also, the key of this mapping is a string. Taking the call above as an example

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Constants.cpp#L3030

Constant *ConstantDataVector::get(LLVMContext &Context, ArrayRef<float> Elts) {
auto *Ty = FixedVectorType::get(Type::getFloatTy(Context), Elts.size());
const char *Data = reinterpret_cast<const char *>(Elts.data());
return getImpl(StringRef(Data, Elts.size() * 4), Ty);
}

The string is built from the raw bytes of the constant values

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Constants.cpp#L2891

Constant *ConstantDataSequential::getImpl(StringRef Elements, Type *Ty) {
// If the elements are all zero or there are no elements, return a CAZ, which
// is more dense and canonical.
if (isAllZeros(Elements))
return ConstantAggregateZero::get(Ty);

When all elements are zero, a ConstantDataSequential degenerates to a ConstantAggregateZero
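Both steps, building a byte-string key from raw element storage and detecting the all-zero case, can be sketched with the standard library. `rawKey` is a hypothetical helper, `std::string` stands in for `StringRef`, and this `isAllZeros` is a simplified reimplementation for illustration only.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Build a byte-string key from the raw storage of the elements.
inline std::string rawKey(const std::vector<float> &Elts) {
  std::string Key(Elts.size() * sizeof(float), '\0');
  if (!Elts.empty())
    std::memcpy(&Key[0], Elts.data(), Key.size());
  return Key;
}

// True if every byte of the key is zero (also true for an empty key).
inline bool isAllZeros(const std::string &Key) {
  for (char C : Key)
    if (C != 0)
      return false;
  return true;
}
```

Two element sequences with identical bytes produce identical keys, which is what lets a string-keyed map unique them; a key consisting only of zero bytes signals that the more compact zero constant can be used instead.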

ConstantAggregate

ConstantStruct

Represents a struct constant.

For example

struct A {
int i;
int j;
};
const A a = {1, 1};

generates the following LLVM IR

%struct.A = type { i32, i32 }
@_ZL1a = internal constant %struct.A { i32 1, i32 1 }, align 4

It can be obtained as follows

llvm::Type *i32 = llvm::Type::getInt32Ty(TheContext);
llvm::StructType *type = llvm::StructType::create(TheContext, {i32, i32}, "A", false);
llvm::Constant *one = llvm::ConstantInt::get(TheContext, llvm::APInt(32, 1, false /* isSigned */));
std::array<llvm::Constant *, 2> consts = {one, one};
llvm::Value *value = llvm::ConstantStruct::get(type, consts);

LLVMContextImpl caches all constant structs in the following data structure

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1483

using StructConstantsTy = ConstantUniqueMap<ConstantStruct>;
StructConstantsTy StructConstants;

ConstantArray

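Represents a constant array.

When the element type is a simple 1/2/4/8-byte integer or float/double, a ConstantDataArray is used instead; ConstantArray covers the remaining element types.

For example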

struct A {
int i;
int j;
};
const A a[] = {{1, 1},{1, 1}};

generates the following LLVM IR

%struct.A = type { i32, i32 }
@_ZL1a = internal constant [2 x %struct.A] [%struct.A { i32 1, i32 1 }, %struct.A { i32 1, i32 1 }], align 16

LLVMContextImpl caches all constant arrays in the following data structure

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1480

using ArrayConstantsTy = ConstantUniqueMap<ConstantArray>;
ArrayConstantsTy ArrayConstants;

For reference

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/ConstantsContext.h#L551

template <class ConstantClass> class ConstantUniqueMap {
public:
using ValType = typename ConstantInfo<ConstantClass>::ValType;
using TypeClass = typename ConstantInfo<ConstantClass>::TypeClass;
using LookupKey = std::pair<TypeClass *, ValType>;

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/ConstantsContext.h#L326

template <> struct ConstantInfo<ConstantArray> {
using ValType = ConstantAggrKeyType<ConstantArray>;
using TypeClass = ArrayType;
};

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/ConstantsContext.h#L339

template <class ConstantClass> struct ConstantAggrKeyType {
ArrayRef<Constant *> Operands;

So the key of the cached mapping has the following form

{ArrayType *, ArrayRef<Constant *>}

GlobalValue

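Represents an object defined at global scope.

To repeat, functions and global variables are constant in that their addresses do not change; it is as if a top-level const pointer pointed to each of these objects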

GlobalVariable

Represents a global variable.

For example

int a{1};

generates the following LLVM IR

@a = dso_local global i32 1, align 4

dso_local here has the following meaning

The compiler may assume that a function or variable marked as dso_local will resolve to a symbol within the same linkage unit. Direct access will be generated even if the definition is not within this compilation unit.

As another example,

static int a{1};

generates the following LLVM IR

@_ZL1a = internal global i32 1, align 4

internal here has the following meaning

Similar to private, but the value shows as a local symbol (STB_LOCAL in the case of ELF) in the object file. This corresponds to the notion of the ‘static’ keyword in C.

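Note the name mangling here. For a value with internal linkage, the symbol name in the IR is the same as the one in the object file.

Compare this with the internal constant examples earlier.

The object file here is an ELF file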

13: 0000000000004010 4 OBJECT LOCAL DEFAULT 22 _ZL1a

The IR above might be built with code like the following

auto *value = new llvm::GlobalVariable(llvm::Type::getInt32Ty(TheContext), false /* isConstant */, llvm::GlobalValue::LinkageTypes::InternalLinkage);
value->setInitializer(llvm::ConstantInt::get(TheContext, llvm::APInt(32, 1, false /* isSigned */)));

The full LLVM IR syntax for a global variable is

@<GlobalVarName> = [Linkage] [PreemptionSpecifier] [Visibility]
[DLLStorageClass] [ThreadLocal]
[(unnamed_addr|local_unnamed_addr)] [AddrSpace]
[ExternallyInitialized]
<global | constant> <Type> [<InitializerConstant>]
[, section "name"] [, partition "name"]
[, comdat [($name)]] [, align <Alignment>]
[, no_sanitize_address] [, no_sanitize_hwaddress]
[, sanitize_address_dyninit] [, sanitize_memtag]
(, !name !N)*

The remaining attributes are not covered here.

At the source level, all global variables are stored in the current Module

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Module.h#L181

GlobalListType GlobalList; ///< The Global Variables in the module

where

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Module.h#L69

/// The type for the list of global variables.
using GlobalListType = SymbolTableList<GlobalVariable>;

All global variables of the current module can be traversed as follows

for (auto it = TheModule->global_begin(); it != TheModule->global_end(); ++it) {
llvm::GlobalVariable &value = *it;
...
}

This works because the GlobalVariable class also derives from ilist_node<GlobalVariable>

class GlobalVariable : public GlobalObject, public ilist_node<GlobalVariable>

so that from the current node (a GlobalVariable) the other nodes (GlobalVariables) on the list can be reached

Function

Represents function definitions and function declarations.

For the function definition

int foo(int) { return {}; }

clang -S -emit-llvm a.cpp -O3 下生成的 LLVM IR 为

; Function Attrs: mustprogress nofree norecurse nosync nounwind readnone sspstrong uwtable willreturn
define dso_local noundef i32 @_Z3fooi(i32 noundef %0) local_unnamed_addr #0 {
ret i32 0
}

The full LLVM IR syntax for a function definition is

define [linkage] [PreemptionSpecifier] [visibility] [DLLStorageClass]
[cconv] [ret attrs]
<ResultType> @<FunctionName> ([argument list])
[(unnamed_addr|local_unnamed_addr)] [AddrSpace] [fn Attrs]
[section "name"] [partition "name"] [comdat [($name)]] [align N]
[gc] [prefix Constant] [prologue Constant] [personality Constant]
(!name !N)* { ... }

The IR above might be built with code like the following

llvm::Type *i32 = llvm::Type::getInt32Ty(TheContext);
std::array<llvm::Type *, 1> args = {i32};
llvm::FunctionType *type = llvm::FunctionType::get(i32, args, false);
llvm::Value *func = llvm::Function::Create(type, llvm::GlobalValue::LinkageTypes::ExternalLinkage, 0 /* AddrSpace */);

For a function declaration, such as printf

extern int printf (const char *__restrict __format, ...);

the corresponding LLVM IR is

declare noundef i32 @_Z6printfPKcz(i8* noundef, ...) #1

The full LLVM IR syntax for a function declaration is

declare [linkage] [visibility] [DLLStorageClass]
[cconv] [ret attrs]
<ResultType> @<FunctionName> ([argument list])
[(unnamed_addr|local_unnamed_addr)] [align N] [gc]
[prefix Constant] [prologue Constant]

At the source level, all functions are likewise stored in the current Module

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Module.h#L182

FunctionListType FunctionList; ///< The Functions in the module

The Function class has several important members

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/Function.h#L72

using BasicBlockListType = SymbolTableList<BasicBlock>;
...
// Important things that make up a function!
BasicBlockListType BasicBlocks; ///< The basic blocks
mutable Argument *Arguments = nullptr; ///< The formal arguments
size_t NumArgs;
std::unique_ptr<ValueSymbolTable> SymTab; ///< Symbol table of args/instructions
AttributeList AttributeSets; ///< Parameter attributes

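The focus here is the Argument class, which represents a formal parameter of a function and records information about it.

The Function class provides iterator interfaces for traversing arguments and basic blocks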

BlockAddress

Uniquely identifies the address of a (Function, BasicBlock) pair.

BasicBlock has not been introduced yet, so this is skipped

ConstantExpr

Represents a constant expression.

Its core is the following method

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Constants.cpp#L2263

Constant *ConstantExpr::get(unsigned Opcode, Constant *C1, Constant *C2, unsigned Flags, Type *OnlyIfReducedTy)

In effect, it builds a constant expression from an opcode and its operands

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/Constants.cpp#L2311

if (Constant *FC = ConstantFoldBinaryInstruction(Opcode, C1, C2))
return FC;

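While building a constant expression, LLVM checks whether constant folding is possible.

The folding code makes heavy use of templates such as isa<> to check whether a value is undef or poison.

Here is a brief introduction to undefined values and poison values.

The relevant inheritance hierarchy is shown below

[Figure: inheritance diagram of the undef and poison value classes]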

These two kinds of value exist because LLVM IR has the concept of undefined behavior, for example the familiar signed integer overflow

bool foo(int a) { return a + 1 > a; }

The corresponding LLVM IR is

%4 = add nsw i32 %3, 1

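Note the nsw flag, which stands for No Signed Wrap. When %3 equals INT_MAX, INT_MAX + 1 overflows as a signed integer, so %4 is a poison value.

In earlier LLVM implementations, %4 was an undefined value in this situation.

An operation on an undefined value yields an undefined value rather than undefined behavior, which enables some optimizations. For example, the compiler reasons that only the lowest bit of undef & 1 is undefined, so ((undef & 1) >> 1) is treated as 0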
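The fold is sound because the result is 0 no matter which concrete value undef takes. The sketch below checks this exhaustively for an 8-bit integer; the function name is hypothetical and the brute-force loop is only a sanity check, not how the compiler reasons.

```cpp
#include <cassert>
#include <cstdint>

// Check that ((x & 1) >> 1) == 0 for every possible 8-bit value of x.
inline bool shiftedLowBitIsAlwaysZero() {
  for (unsigned X = 0; X <= 0xFF; ++X) {
    std::uint8_t V = static_cast<std::uint8_t>(X);
    if (((V & 1u) >> 1) != 0)
      return false;
  }
  return true;
}
```

Masking with 1 leaves at most the lowest bit set, and shifting right by one discards exactly that bit, so the function returns true.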

A ‘poison’ value should be used instead of ‘undef’ whenever possible. Poison values are stronger than undef, and enable more optimizations. Just the existence of ‘undef’ blocks certain optimizations.

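In 2016 the LLVM community proposed retiring undef and keeping only poison, but for now undef and poison still coexist.

Constant folding also occurs when instructions are built with IRBuilder, for example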

llvm::Constant *one = llvm::ConstantInt::get(TheContext, llvm::APInt(32, 1, false /* isSigned */));
llvm::Value *value = Builder.CreateAdd(one, one);

Tracing one possible call path

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/IRBuilder.h#L1242

Value *CreateAdd(Value *LHS, Value *RHS, const Twine &Name = "", bool HasNUW = false, bool HasNSW = false) {
if (Value *V = Folder.FoldNoWrapBinOp(Instruction::Add, LHS, RHS, HasNUW, HasNSW))
return V;
return CreateInsertNUWNSWBinOp(Instruction::Add, LHS, RHS, Name, HasNUW, HasNSW);
}

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/include/llvm/IR/ConstantFolder.h#L68

Value *FoldNoWrapBinOp(Instruction::BinaryOps Opc, Value *LHS, Value *RHS, bool HasNUW, bool HasNSW) const override {
auto *LC = dyn_cast<Constant>(LHS);
auto *RC = dyn_cast<Constant>(RHS);
if (LC && RC) {
if (ConstantExpr::isDesirableBinOp(Opc)) {
unsigned Flags = 0;
if (HasNUW)
Flags |= OverflowingBinaryOperator::NoUnsignedWrap;
if (HasNSW)
Flags |= OverflowingBinaryOperator::NoSignedWrap;
return ConstantExpr::get(Opc, LC, RC, Flags);
}
return ConstantFoldBinaryInstruction(Opc, LC, RC);
}
return nullptr;
}

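If the operands satisfy certain conditions, ConstantExpr::get is called to obtain the corresponding constant expression, which is where constant folding may happen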
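The fold-before-emit pattern can be sketched without LLVM. `ToyOperand` and `ToyBuilder` are hypothetical types invented for this example, and only addition with wrapping unsigned arithmetic is modeled.

```cpp
#include <cassert>

// Hypothetical operand: either a known constant or an opaque run-time value.
struct ToyOperand {
  bool IsConstant;
  unsigned Value; // meaningful only when IsConstant is true
};

// Hypothetical builder that folds additions of two constants.
struct ToyBuilder {
  int InstructionsEmitted = 0;

  ToyOperand createAdd(const ToyOperand &LHS, const ToyOperand &RHS) {
    // Both operands are constants: fold now and emit no instruction.
    if (LHS.IsConstant && RHS.IsConstant)
      return ToyOperand{true, LHS.Value + RHS.Value};
    // Otherwise an add instruction would be emitted; its result is unknown here.
    ++InstructionsEmitted;
    return ToyOperand{false, 0};
  }
};
```

Adding two constant operands returns a constant and emits nothing, which mirrors why a CreateAdd call on two constants can come back as a Constant rather than an Instruction.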

LLVMContextImpl caches all constant expressions in the following data structure

https://github.com/llvm/llvm-project/blob/3665da3d0091ab765d54ce643bd82d353c040631/llvm/lib/IR/LLVMContextImpl.h#L1506

ConstantUniqueMap<ConstantExpr> ExprConstants;

TODO

References

  1. https://llvm.org/docs/LangRef.html#type-system
  2. https://llvm.org/docs/LangRef.html#constants
  3. https://llvm.org/docs/LangRef.html#linkage-types
  4. https://llvm.org/docs/LangRef.html#parameter-attributes
  5. https://llvm.org/docs/LangRef.html#function-attributes
  6. https://llvm.org/docs/LangRef.html#global-variables
  7. https://llvm.org/docs/LangRef.html#functions
  8. https://www.llvm.org/docs/ProgrammersManual.html#the-isa-cast-and-dyn-cast-templates
  9. https://llvm.org/doxygen/classllvm_1_1Type.html
  10. https://llvm.org/doxygen/classllvm_1_1Value.html
  11. https://llvm.org/doxygen/classllvm_1_1Constant.html
  12. https://llvm.org/docs/OpaquePointers.html
  13. https://llvm.org/docs/CMake.html#embedding-llvm-in-your-project
  14. https://groups.seas.harvard.edu/courses/cs153/2019fa/schedule.html
  15. https://github.com/llvm/llvm-project
  16. https://github.com/ghaiklor/llvm-kaleidoscope
  17. https://github.com/PacktPublishing/Learn-LLVM-12
  18. https://blog.csdn.net/weixin_42654107/article/details/122860584
  19. https://lowlevelbits.org/type-equality-in-llvm/
  20. https://www.youtube.com/watch?v=_-3Iiads1EM
  21. https://llvm.org/devmtg/2016-11/Slides/Lopes-LongLivePoison.pdf
  22. https://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html