Compiling code with the Clang API

Have you tried Clang yet? Clang is an open-source compiler, under active development, that aims to replace GCC for compiling C, C++, and Objective-C. Compared to GCC, Clang compiles faster while generating comparably fast code, and it prints more useful error messages.

Clang is also better for developers who want to compile code programmatically. Unlike GCC, Clang is designed to be both a tool and an API. That makes Clang’s source code easier to understand and reuse. And, for those of us working on projects incompatible with GCC’s GPL license, it’s good to know that Clang is distributed under a BSD-style license (the University of Illinois/NCSA Open Source License).

Kosada is working on a cool new project that’s built on top of Clang and its underlying framework, LLVM. While using Clang for this project, I’ve been pleased to see how simple it is to write code that builds other code. Simple in retrospect, anyway! The code I wrote turned out to be simple, but it took lots of digging through the Clang source code to figure out what to write. So here’s my first contribution to the Clang community: two examples of using the Clang API to build code programmatically. The program compiled by these examples is one of the libcurl examples, getinmemory.c. I picked it because it demonstrates including and linking a library.

The examples refer to the Clang source code. You can download it here. I’m using version 3.1.

You can download the code for the examples here.

Example: Build a .c file to an executable

You have a .c file. You want to compile and link it to create an executable.

Briefly, here’s how: You create a Driver object. You give it a list of arguments — the same arguments you’d pass if you were to run clang on the command line. You tell the driver to build a Compilation object and execute it. Congratulations, you just compiled and linked your .c file.

That’s basically what is done by the code that gets invoked when you run clang on the command line. In the Clang source code, that’s tools/driver/driver.cpp. The example below (build_executable.cpp) is a super simplified version of that.

This example compiles and links a .c file. The first step is to set up the arguments to the Driver:

// Path to the C file
string inputPath = "getinmemory.c";
 
// Path to the executable
string outputPath = "getinmemory";
 
// Path to clang (e.g. /usr/local/bin/clang)
llvm::sys::Path clangPath = llvm::sys::Program::FindProgramByName("clang");
 
// Arguments to pass to the clang driver:
//    clang getinmemory.c -l curl
vector<const char *> args;
args.push_back(clangPath.c_str());
args.push_back(inputPath.c_str());
args.push_back("-l");
args.push_back("curl");
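
One thing to watch in the snippet above: args stores raw pointers, so the strings behind them (inputPath, clangPath) have to stay alive for as long as the arguments are in use. Here is a minimal sketch of one way to manage that, written in C++11. ArgList and joinArgs are hypothetical helpers, not part of Clang; ArgList owns the strings and builds the pointer list on demand.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a Clang class): owns the argument strings so the
// const char * pointers handed out later cannot outlive their storage by accident.
class ArgList {
public:
    void add(const std::string &arg) {
        assert(!arg.empty() && "an empty argument is almost certainly a bug");
        storage_.push_back(arg);
    }

    // Build the pointer list on demand. The pointers stay valid only while
    // this ArgList is alive and until the next call to add().
    std::vector<const char *> pointers() const {
        std::vector<const char *> out;
        out.reserve(storage_.size());
        for (const std::string &s : storage_)
            out.push_back(s.c_str());
        return out;
    }

private:
    std::vector<std::string> storage_;
};

// Join the arguments with spaces, e.g. for logging the equivalent command line.
std::string joinArgs(const std::vector<const char *> &args) {
    std::string line;
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (i > 0)
            line += ' ';
        line += args[i];
    }
    return line;
}
```

The result of pointers() can stand in for the args vector used in this example, provided the ArgList outlives every use of those pointers.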

The Driver needs a DiagnosticsEngine so it can report problems, so construct one of those:

clang::TextDiagnosticPrinter *DiagClient = new clang::TextDiagnosticPrinter(llvm::errs(), clang::DiagnosticOptions());
clang::IntrusiveRefCntPtr<clang::DiagnosticIDs> DiagID(new clang::DiagnosticIDs());
clang::DiagnosticsEngine Diags(DiagID, DiagClient);

Construct the Driver itself:

clang::driver::Driver TheDriver(args[0], llvm::sys::getDefaultTargetTriple(), outputPath, true, Diags);

The Driver doesn’t know how to do the grunt work of compiling or linking the code. It’s more of a project manager. It figures out which tasks need to be done and tells other parts of Clang, or other tools like ld, to do them. The list of tasks is encapsulated in a Compilation object. You need to construct a Compilation and then execute it:

// Create the set of actions to perform
clang::OwningPtr<clang::driver::Compilation> C(TheDriver.BuildCompilation(args));
 
// Carry out the actions
int Res = 0;
const clang::driver::Command *FailingCommand = 0;
if (C)
    Res = TheDriver.ExecuteCompilation(*C, FailingCommand);

If the execution failed with a negative result, meaning that a command signalled an error, you can have the Driver generate additional diagnostic information:

if (Res < 0)
    TheDriver.generateCompilationDiagnostics(*C, FailingCommand);

Bonus: Print the tasks of the Compilation

In case you’re wondering what exactly those “tasks” are in the Compilation object, you can print them like this:

TheDriver.PrintActions(*C);

The output is something like this:

0: input, "getinmemory.c", c
1: preprocessor, {0}, cpp-output
2: compiler, {1}, assembler
3: assembler, {2}, object
4: input, "curl", object
5: linker, {3, 4}, image
6: bind-arch, "x86_64", {5}, image
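
Each line names a step, the earlier steps it consumes (the numbers in braces), and the type of output it produces. To make that structure concrete, here is a toy model in C++11 that reproduces the listing from plain data. ToyAction, buildExampleActions, and formatActions are made up for illustration; they are not Clang classes, and the real Driver builds and prints its own action objects.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for one line of the listing (not a Clang class).
struct ToyAction {
    std::string kind;         // e.g. "input", "compiler", "linker"
    std::string label;        // quoted name, used by input and bind-arch lines; else empty
    std::vector<int> inputs;  // indices of the earlier actions this one consumes
    std::string outputType;   // e.g. "c", "object", "image"
};

// The seven steps from the listing, written out by hand.
std::vector<ToyAction> buildExampleActions() {
    return {
        {"input", "getinmemory.c", {}, "c"},
        {"preprocessor", "", {0}, "cpp-output"},
        {"compiler", "", {1}, "assembler"},
        {"assembler", "", {2}, "object"},
        {"input", "curl", {}, "object"},
        {"linker", "", {3, 4}, "image"},
        {"bind-arch", "x86_64", {5}, "image"},
    };
}

// Format the actions in the same layout as the listing.
std::string formatActions(const std::vector<ToyAction> &actions) {
    std::ostringstream out;
    for (std::size_t i = 0; i < actions.size(); ++i) {
        const ToyAction &a = actions[i];
        out << i << ": " << a.kind << ", ";
        if (!a.label.empty())
            out << '"' << a.label << "\", ";
        if (!a.inputs.empty()) {
            out << "{";
            for (std::size_t j = 0; j < a.inputs.size(); ++j) {
                if (j > 0)
                    out << ", ";
                out << a.inputs[j];
            }
            out << "}, ";
        }
        out << a.outputType << "\n";
    }
    return out.str();
}
```

Reading the data this way shows that the linker step depends on two independent chains: the compiled source file (steps 0 through 3) and the library input (step 4).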

Bonus: Print “verbose” information for debugging

Whether running clang on the command line or through the Clang API, you can print extra information to help you debug by passing the -v flag.

args.push_back("-v");      // verbose

The output is something like this:

clang version 3.1 (branches/release_31)
Target: x86_64-apple-darwin10.8.0
Thread model: posix
 "/usr/local/Cellar/llvm/3.1/bin/clang" -cc1 -triple x86_64-apple-macosx10.6.0 -emit-obj -mrelax-all -disable-free -main-file-name getinmemory.c -pic-level 2 -mdisable-fp-elim -masm-verbose -munwind-tables -target-cpu core2 -target-linker-version 97.17 -v -resource-dir /usr/local/Cellar/llvm/3.1/bin/../lib/clang/3.1 -fmodule-cache-path /var/folders/l0/l0JTY1yrHVyI-wLWRDrCW++++TI/-Tmp-/clang-module-cache -fdebug-compilation-dir "/Users/jaymie/kosada/fdiv/Clang API" -ferror-limit 19 -fmessage-length 111 -stack-protector 1 -mstackrealign -fblocks -fobjc-dispatch-method=mixed -fobjc-default-synthesize-properties -fdiagnostics-show-option -fcolor-diagnostics -o /var/folders/l0/l0JTY1yrHVyI-wLWRDrCW++++TI/-Tmp-/getinmemory-azlq7U.o -x c getinmemory.c
clang -cc1 version 3.1 based upon LLVM 3.1 default target x86_64-apple-darwin10.8.0
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /usr/local/Cellar/llvm/3.1/bin/../lib/clang/3.1/include
 /usr/include
 /System/Library/Frameworks (framework directory)
 /Library/Frameworks (framework directory)
End of search list.
 "/usr/bin/ld" -dynamic -arch x86_64 -macosx_version_min 10.6.0 -o getinmemory -lcrt1.10.6.o /var/folders/l0/l0JTY1yrHVyI-wLWRDrCW++++TI/-Tmp-/getinmemory-azlq7U.o -lcurl -lSystem

The output shows that the Driver is invoking clang (in -cc1 mode) to do the compiling and ld to do the linking. As you can see, the Driver adds arguments of its own to each invocation, in addition to the ones we passed in. The -v flag shows you exactly how the compiler and linker are being invoked.

Bonus: Build a C++ file

If the file you’re compiling is C++ instead of C, you can tell the Driver to act like clang++ instead of clang:

TheDriver.CCCIsCXX = true;

Conclusion

The Driver class lets your program interact with Clang in pretty much the same way that you would interact with it on the command line. Your program could accomplish exactly the same thing by forking/spawning a process that invokes command-line clang. The advantages of using the Clang API instead of command-line clang are:

You don’t have to fork/spawn a process yourself. That’s one less process, and it’s one less OS-dependent piece of code in your program.
You get more control when the build fails. You don’t just get a return code and a printout of errors and warnings. You get data structures representing the compilation, the errors and warnings, and the command that failed.
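
For comparison, here is roughly what the spawn-a-process route looks like. This is a sketch that assumes a POSIX system; runCommand is a hypothetical helper, and all it can report back is an exit status.

```cpp
#include <cstdlib>
#include <string>
#include <sys/wait.h>  // POSIX-only: WIFEXITED, WEXITSTATUS

// Hypothetical helper: run a command line through the shell and return its
// exit status, or -1 if it could not be run or did not exit normally.
int runCommand(const std::string &commandLine) {
    int status = std::system(commandLine.c_str());
    if (status == -1)
        return -1;  // the shell process could not be started
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    return -1;      // terminated by a signal, for example
}
```

A call such as runCommand("clang getinmemory.c -l curl -o getinmemory") would build the same executable, but the caller learns only whether the exit status was zero. The header and status macros are also specific to POSIX, which is the kind of OS-dependent code the Driver class lets you avoid.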

Example: Compile a .c file to a Module

This example illustrates another advantage of using the Clang API:

You get access to intermediate, in-memory representations of the program being compiled.

One intermediate, in-memory representation of a program that you’re likely to use is a Module object. A Module is LLVM’s in-memory representation of one translation unit of the input program. Basically, you get one Module per .c file that you compile.

In this example, you have a .c file. You want to create a Module object.

You could do it with the Driver class. You’d compile the .c file to an LLVM bitcode (.bc) file and then read that file back in. That would give you a Module. But it would also give you the overhead of writing and reading the .bc file. You don’t need that file; all you need is the in-memory Module.

There’s a more efficient route from .c file to Module. In the Clang source code, it’s demonstrated in examples/clang-interpreter/main.cpp — an example that uses the Clang API to implement a C interpreter. Instead of using Clang’s “driver” classes, as in our previous example, the C interpreter example uses the “frontend” classes. That’s what we’ll do in the example below (compile_to_module.cpp).

This example compiles a .c file into an in-memory Module, then prints the names of all functions defined or used in the Module. Like the previous example of compiling and linking an executable, this example begins by building a list of arguments and a DiagnosticsEngine.

// Path to the C file
string inputPath = "getinmemory.c";
 
// Arguments to pass to the clang frontend
vector<const char *> args;
args.push_back(inputPath.c_str());
 
// The compiler invocation needs a DiagnosticsEngine so it can report problems
clang::TextDiagnosticPrinter *DiagClient = new clang::TextDiagnosticPrinter(llvm::errs(), clang::DiagnosticOptions());
llvm::IntrusiveRefCntPtr<clang::DiagnosticIDs> DiagID(new clang::DiagnosticIDs());
clang::DiagnosticsEngine Diags(DiagID, DiagClient);

The arguments and DiagnosticsEngine get encapsulated in a CompilerInvocation:

// Create the compiler invocation
llvm::OwningPtr<clang::CompilerInvocation> CI(new clang::CompilerInvocation);
clang::CompilerInvocation::CreateFromArgs(*CI, &args[0], &args[0] + args.size(), Diags);

Now you need a CompilerInstance. (Yes, the Clang API has a class called Compilation, a class called CompilerInvocation, and a class called CompilerInstance.) The frontend classes, CompilerInvocation and CompilerInstance, play a role similar to that of the driver classes, Driver and Compilation, used in the previous example. Both pairs take command-line-style arguments and then compile some code. One important difference is that the driver classes can invoke other tools like ld, whereas the frontend classes can only handle tasks native to Clang. Returning to the example code, the next step is to construct the CompilerInstance and associate it with the CompilerInvocation:

clang::CompilerInstance Clang;
Clang.setInvocation(CI.take());

Set up diagnostics so the CompilerInstance can report problems:

Clang.createDiagnostics(args.size(), &args[0]);
if (!Clang.hasDiagnostics())
    return 1;

Create an action for the compiler to carry out. A frontend “action” is a little like a driver “task”, in that it’s a step to be carried out while building a program. A task is something like “compile”, “assemble”, “link”, whereas an action is something like “dump AST”, “emit assembly”, “emit bitcode”, “print preprocessed input”. In the Clang source code, you can see a list of all actions in lib/FrontendTool/ExecuteCompilerInvocation.cpp. For this example, the action is “emit LLVM only”:

llvm::OwningPtr<clang::CodeGenAction> Act(new clang::EmitLLVMOnlyAction());

Carry out the action:

if (!Clang.ExecuteAction(*Act))
    return 1;

Grab the resulting Module:

llvm::Module *module = Act->takeModule();

Just to make sure we got the Module correctly, print all functions defined or used in the Module:

for (llvm::Module::FunctionListType::iterator i = module->getFunctionList().begin(); i != module->getFunctionList().end(); ++i)
    printf("%s\n", i->getName().str().c_str());

Bonus: Return the Module from a function

What if you want to write a function that starts with a .c file and returns a Module? You could just take the example code above and wrap it up in a function, right? Actually, no. This doesn’t work:

// Bad example! Do not copy! 
llvm::Module * getModule(void)
{
    ...
 
    llvm::OwningPtr<clang::CodeGenAction> Act(new clang::EmitLLVMOnlyAction());
    if (!Clang.ExecuteAction(*Act))
        return NULL;
 
    llvm::Module *module = Act->takeModule();
 
    return module;
}

You’ll find that the returned Module doesn’t have any functions in it.

(Edit: Revised this section based on one of the comments.)

What went wrong? It turns out that CodeGenAction, because it’s wrapped in an OwningPtr, gets automatically destroyed when the OwningPtr goes out of scope at the end of getModule(). Everything owned by the CodeGenAction also gets destroyed — including the LLVMContext that was created by the constructor of the CodeGenAction and became the context for the Module. This leaves the Module without a valid context.

One fix is to construct the CodeGenAction with an LLVMContext that will still be around after the CodeGenAction is destroyed. For example:

llvm::Module * getModule(void)
{
    ...
 
    llvm::OwningPtr<clang::CodeGenAction> Act(new clang::EmitLLVMOnlyAction(&llvm::getGlobalContext()));
    if (!Clang.ExecuteAction(*Act))
        return NULL;
 
    llvm::Module *module = Act->takeModule();
 
    return module;
}
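
The ownership rule at work here can be modeled without Clang or LLVM. In the C++11 sketch below, ToyContext, ToyAction, and ToyModule are made-up stand-ins for LLVMContext, CodeGenAction, and Module, and a counter tracks how many contexts are alive. It only illustrates the lifetime issue; it is not how the real classes are implemented.

```cpp
#include <cassert>
#include <memory>

// Toy stand-in for a context. It bumps a counter so callers can observe
// whether it is still alive without touching a dangling pointer.
struct ToyContext {
    explicit ToyContext(int *liveCount) : liveCount_(liveCount) { ++*liveCount_; }
    ~ToyContext() { --*liveCount_; }
    ToyContext(const ToyContext &) = delete;
    ToyContext &operator=(const ToyContext &) = delete;
    int *liveCount_;
};

// Toy stand-in for a module: it refers to a context but does not own it.
struct ToyModule {
    const ToyContext *context;
};

// Toy stand-in for an action that either creates its own context or borrows one.
class ToyAction {
public:
    // No context supplied: create one, which dies together with the action.
    explicit ToyAction(int *liveCount)
        : owned_(new ToyContext(liveCount)), context_(owned_.get()) {}

    // Context supplied by the caller: borrow it, never destroy it.
    explicit ToyAction(ToyContext *external) : context_(external) {
        assert(external != nullptr);
    }

    // Ownership of the module passes to the caller; the context does not.
    std::unique_ptr<ToyModule> takeModule() {
        return std::unique_ptr<ToyModule>(new ToyModule{context_});
    }

private:
    std::unique_ptr<ToyContext> owned_;  // null when the context is borrowed
    ToyContext *context_;
};

// Mirrors the bad example: the returned module's context is destroyed on return.
std::unique_ptr<ToyModule> moduleWithOwnedContext(int *liveCount) {
    ToyAction act(liveCount);
    return act.takeModule();
}

// Mirrors the fix: the context belongs to the caller and outlives the action.
std::unique_ptr<ToyModule> moduleWithSharedContext(ToyContext *context) {
    ToyAction act(context);
    return act.takeModule();
}
```

After moduleWithOwnedContext() returns, the live-context counter is back at zero even though the module still holds a pointer to the context. With moduleWithSharedContext(), the context remains valid for as long as the caller keeps it.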

Bonus: Print the arguments of the CompilerInvocation

When you construct a CompilerInvocation, you give it a list of arguments — the same arguments you’d pass on the command line. The CompilerInvocation adds some arguments of its own to that list. You can print the complete list of arguments like this:

printf("clang ");
vector<string> argsFromInvocation;
CI->toArgs(argsFromInvocation);
for (vector<string>::iterator i = argsFromInvocation.begin(); i != argsFromInvocation.end(); ++i)
    printf("%s ", (*i).c_str());
printf("\n");

The output is something like this:

clang -fdiagnostics-format=clang getinmemory.c -fsyntax-only -fdollars-in-identifiers -fno-operator-names -triple x86_64-apple-darwin10.8.0

Conclusion

Using the Clang driver classes, as in the previous example, you can interact with the Clang API in pretty much the same way that you would interact with command-line clang. Using the Clang frontend classes, as in this example, you get even more control. You can access the data structures that LLVM uses internally to compile a program. Using the frontend classes, we were able to get a Module from a .c file without the overhead of writing and reading additional files.

Jaymie Strecker is a software developer at Kosada, Inc. and one of the creators of Vuo.

Comments

Helpful article. If you have an error in getinmemory.c (remove a ; for example), how do you get the error message? It is printed on the screen, but the Diag-instance shows 0 errors.

I’m very interested in Clang for creating some code that can be extended at runtime. Thanks a lot for sharing!

Regarding “Bonus: Return the Module from a function”: I think the problem is not that CodeGenAction destroys the module in its destructor (the name takeModule() indicates that ownership is transferred out of the object) but rather that CodeGenAction creates its own LLVMContext if you don’t provide one in the constructor. This new LLVMContext is passed to the module but will be destroyed in the destructor of CodeGenAction, so the module’s reference to the context becomes invalid. The solution is to pass an LLVMContext to the constructor of CodeGenAction that outlives the module:

llvm::OwningPtr<clang::CodeGenAction> Act(new clang::EmitLLVMOnlyAction(&llvm::getGlobalContext()));

Ah, that makes more sense, and explains why takeModule() didn’t seem to be working. I revised the article accordingly. Thanks, Anonymous, whoever you are :)
