Stalker
Introduction
Stalker is Frida's code tracing engine. It allows threads to be followed, capturing every function, every block, even every instruction which is executed. A good overview is provided here, and we strongly recommend reading it carefully first. The implementation is necessarily architecture specific, although there is much in common between the architectures. Stalker currently has mature support for AArch64, the architecture commonly found in mobile phones and tablets running Android or iOS, just as Intel 64 and IA-32 dominate desktops and laptops. This article goes into detail, dissecting the implementation of Stalker on AArch64 and explaining exactly how it works. Hopefully it will also prove useful to anyone porting Stalker to other hardware architectures in the future.
Disclaimer
This article covers a lot of the details of how Stalker works internally, but it does not cover the specifics of code backpatching. It is intended to help you understand the technology; Stalker is complex enough on its own. That complexity is not without reason: it is precisely what makes otherwise very expensive operations dramatically cheaper. Finally, this article walks through the key concepts and steps line by line for some of the important logic, but for some of the finer details you may still need to read the source code. In any case, hopefully it will be of help.
Table of contents
- Introduction
- Disclaimer
- Use Cases
- Following
- Basic Operation
- Options
- Terminology
- Slabs
- Blocks
- Instrumenting Blocks
- Helpers
- Context
- Context Helpers
- Reading/Writing Context
- Control flow
- Gates
- Virtualize functions
- Emitting events
- Unfollow and tidy up
- Miscellaneous
Use Cases
To understand how Stalker is implemented, we first need to understand the interface it presents to its users. Stalker can be invoked directly through Frida's native Gum interface, but most users will drive it through the JavaScript API, which in turn calls the Gum methods. The TypeScript type definitions for Gum are well worth a look.
The main interface to Stalker from JavaScript is Stalker.follow([threadId, options]): it starts stalking the thread with the given threadId, defaulting to the current thread if omitted.
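A minimal sketch of the two forms (nothing here beyond the documented JavaScript API):

```js
// Follow the thread we are currently executing on.
Stalker.follow();

// Or follow another thread in the process by its id.
const otherThread = Process.enumerateThreads()[1];
if (otherThread !== undefined)
  Stalker.follow(otherThread.id);
```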
So let's consider when these calls might be used. Usually it is because you are interested in a given thread and want to find out what it is doing. Perhaps it simply has an interesting name? Thread names can be inspected with cat /proc/PID/tasks/TID/comm. Or maybe you walked the threads using the Frida JavaScript API Process.enumerateThreads() and called a native function to query each one. That, along with Thread.backtrace() to dump thread stacks, can give you a really good picture of what a process is doing.
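A rough sketch of this kind of reconnaissance; my_interesting_function is a made-up placeholder export, the rest is the standard Frida API:

```js
// List the threads in the process.
for (const thread of Process.enumerateThreads())
  console.log(`thread ${thread.id}: ${thread.state}`);

// Dump a stack whenever a function of interest is hit.
const target = Module.getExportByName(null, 'my_interesting_function');
Interceptor.attach(target, {
  onEnter (args) {
    console.log(Thread.backtrace(this.context, Backtracer.ACCURATE)
        .map(DebugSymbol.fromAddress)
        .join('\n'));
  }
});
```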
Another scenario in which you might call Stalker.follow() is from an Interceptor or a replacement function you have applied to a target. In this scenario, you have found a function of interest and want to understand how it behaves: you want to see which functions, or even which code blocks, get executed after a given call. Perhaps you want to compare how the code behaves with different input, or you want to modify the input to see if you can make the code take a particular path. In these scenarios Stalker works a little differently under the hood, but it is driven through exactly the same interface, Stalker.follow(), as sketched below.
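A common pattern (a sketch only; target_function is a hypothetical export, and the event options are explained later under Options) is to follow the thread just for the duration of one call:

```js
const target = Module.getExportByName(null, 'target_function');
Interceptor.attach(target, {
  onEnter (args) {
    Stalker.follow(Process.getCurrentThreadId(), {
      events: { call: true },
      onReceive (events) {
        console.log(`captured ${Stalker.parse(events).length} events`);
      }
    });
  },
  onLeave (retval) {
    Stalker.unfollow();
    Stalker.flush();  // push any queued events out right away
  }
});
```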
Following
When the user calls Stalker.follow(), under the hood the JavaScript engine calls gum_stalker_follow_me() to follow the current thread, or gum_stalker_follow(thread_id) to follow another thread in the process.
gum_stalker_follow_me
In the case of gum_stalker_follow_me(), the link register determines where to start stalking from. On AArch64, the link register (LR) holds the address of the instruction at which execution should continue after a function returns. Since there is only one link register, when another function is to be called the current value of LR must be stashed away (typically on the stack); as each function returns with a RET instruction, those stashed values are loaded back from the stack in turn.
Let's look at the code of gum_stalker_follow_me(). Its prototype looks like this:
We can see that the QuickJS or V8 runtime passes three arguments when calling it. The first is the Stalker instance itself; note that there may be several of these if more than one script has been injected at the same time. The second is a transformer, which can be used to transform the instrumented code as it is being written (more on this later). The last parameter is an event sink; this is where the events generated while the Stalker engine runs are delivered.
We can see that the first instruction, STP, stores a pair of registers onto the stack. Note the expression [sp, -16]!. This is a pre-decrement, which means the stack pointer is first moved down by 16 bytes, and the two 8-byte register values are then stored there. We can see the matching instruction at the bottom of the function: ldp x29, x30, [sp], 16. It loads the two register values that were stored on the stack back into the registers (and then moves the stack pointer back up).
But what are these two registers for? X30 is the link register and X29 is the frame pointer register. Recall that before calling another function we must store the previous value of the link register onto the stack, and restore it once that call has finished and returned to its caller. The frame pointer records where the top of the stack was when the function was called; all of the arguments passed on the stack and the stack-based local variables are then addressed at fixed offsets from it. As before, we also need to save and restore this value, because every function call has its own value for it: we save it as the call begins and restore it after it returns. You can see the next instruction, mov x29, sp, setting the frame pointer to the current value of the stack pointer.
Moving on, the next instruction, mov x3, x30, copies the value of the link register into X3. On AArch64 the first eight arguments of a function call are passed in registers X0-X7, so this sets up the fourth argument. We then call (branch with link to) the function _gum_stalker_do_follow_me(). Notice that the first three arguments, passed in X0-X2, are left untouched, so _gum_stalker_do_follow_me() receives exactly the same first three parameters our function was called with, plus the original link register value as its fourth. Finally, when that function returns, rather than returning to our caller we branch to the address it returned to us (on AArch64 a function's return value is stored in X0).
gum_stalker_follow
This function has a prototype similar to gum_stalker_follow_me(), but takes an additional thread_id parameter. If that value is the id of the current thread, it simply calls the function we just looked at. So let's see what happens when another thread id is given instead.
We can see that it first calls gum_process_modify_thread(). This isn't part of Stalker, but part of Gum itself. It takes a callback function as a parameter, and that callback receives a context argument carrying the thread's context. The callback can modify the GumCpuContext structure, and gum_process_modify_thread() will then write the changes back. We will see this context structure later; it contains the values of all of the AArch64 CPU registers. We will also look at the prototype of the callback function in more detail further on.
So how does gum_process_modify_thread() actually work? Well, that depends on the platform. On Linux (and Android) it uses the ptrace API (the same one used by GDB) to attach to the thread and read and write its registers. But there are a host of complexities. On Linux you cannot ptrace your own process (or, more precisely, another process in the same process group), so Frida creates a clone of the current process, placed in its own process group but sharing the same memory space. The two communicate over a UNIX socket. The clone acts essentially as a debugger: it reads the registers of the original target thread and stores them in the shared memory space, and writes them back into the thread when required. Incidentally, the prctl() settings PR_SET_DUMPABLE and PR_SET_PTRACER are used to control the permission to ptrace the original process.
You can now see that gum_stalker_infect() does much the same job as the _gum_stalker_do_follow_me() function we mentioned earlier. Both functions carry out the same work, but _gum_stalker_do_follow_me() runs on the target thread, whereas gum_stalker_infect() does not, so it must write some code to be called by the target thread using the GumArm64Writer rather than calling functions directly. We will cover these functions in more detail soon, but first we need a little more background.
Basic Operation
Code is in essence a series of blocks of instructions. Each block typically begins where some preceding branch can land (indeed we may see two branch statements back to back), and its instructions then execute one after another in sequence until some branch instruction is encountered.
Stalker works on one block at a time. It starts with either the block at the point where the call to gum_stalker_follow_me() returns, or, when gum_stalker_follow() is used, the block at which the instruction pointer of the target thread is pointing.
Stalker copies the original block and instruments it, writing the instrumented copy into newly allocated memory. Instructions may be inserted to generate events, or to provide the other features of the Stalker engine. Where necessary, Stalker must also relocate instructions. Take the following instruction, for example:
ADR — Address of label at a PC-relative offset.
ADR Xd, label
Xd is the 64-bit name of the general-purpose destination register, in the range 0 to 31.
label is the program label whose address is to be calculated, as an offset from the address of this instruction, in the range ±1 MB.
If this instruction is copied to a different location in memory and executed there, then, because the address of the label is computed by adding an offset to the current address, the computed address will no longer match the original. Fortunately, Gum has a Relocator for exactly this purpose: it is a dedicated facility for fixing up addresses like these.
We mentioned earlier that Stalker works one block at a time. So how do we get to instrument the next block? Recall that every block ends with a branch instruction. That means that if we first note where the original branch was going and then replace it with a new branch back into the Stalker engine, we can instrument the next block; and the same process lets us keep following the thread indefinitely, block by block.
This process would be a little slow, though, so in some cases we can optimize. First, if we execute the same block repeatedly (say in a loop, or simply because it is called many times), we don't need to re-instrument it every time: we can instrument it once and re-execute the instrumented copy. We therefore keep a hash table of all the blocks we have encountered so far, and whenever we enter a block we first look there for an already-instrumented copy.
Secondly, when we encounter a call instruction, after emitting the instrumented call we also emit a landing pad, so that the call can return without having to go back into Stalker. Stalker keeps an auxiliary stack of GumExecFrame structures recording the return address (real_address) and the address of the landing pad (code_address). When the function returns, the return address is compared against the real_address entries on this auxiliary stack; if it matches, the function returns straight to the corresponding code_address without re-entering the runtime. The landing pad initially contains code to enter the Stalker engine and instrument the block following the call, but it can later be backpatched to branch directly to that instrumented block. In other words, whole sequences of function returns can be handled without the overhead of entering and leaving Stalker.
If the return address does not match the real_address stored in a GumExecFrame, or we run out of space on the auxiliary stack, we simply start building a new one. We keep the real value of LR in place while the program code executes, both so that the program cannot detect the presence of Stalker (anti-debugging), and so that unconventional uses of it still work (for example, references to inline data in the code section). Also, we want Stalker to be able to stop following the thread at any moment, so we don't want to have to go back and fix up any LR values we have modified.
Lastly, we said earlier that each branch instruction is replaced with code that returns to Stalker so that the next block can be instrumented. Depending on the Stalker.trustThreshold configuration, however, we also try to backpatch the instrumented code so that it branches directly to the instrumented copy of the next block instead. Deterministic branches (where the target is a fixed value and the branch is unconditional) are simple: we just replace the branch back into Stalker with a branch to the next instrumented block. But we can deal with conditional branches too: we instrument both possible successor blocks (the one reached when the branch is taken and the one reached when it is not), and then replace the original conditional branch with a new conditional branch that steers control flow directly to the instrumented version of the block we have already encountered, with the other edge likewise going to instrumented code. We can even partially deal with the case where the branch target isn't static. Consider a branch like this:
Instructions of this form are very common when calling function pointers or class methods. In theory the value of X0 can change, but in practice it is the same most of the time. In that case we can emit code which, when the branch is actually executed, compares the value in X0 against the function we already know about: if it matches, we branch straight to the corresponding instrumented code; if not, we branch back into the Stalker engine. So if the function pointer does change, the code still runs correctly, because we simply re-enter Stalker and instrument the new target; but if, as we expect, it stays the same, we can skip the Stalker engine altogether and run the instrumented code directly.
Options
Now let's look at the options available when stalking a thread. Whenever the stalked thread executes, Stalker generates a stream of events. These are placed in a queue which is flushed either periodically or manually by the user. This flushing is not done by Stalker itself; it is handled by the EventSink::process virtual function, which re-enters the JavaScript runtime to process the events, and that is an expensive operation. The size of the queue and the flush interval can both be configured through options. Events can be generated on a per-instruction basis, for calls, returns, or all instructions; or they can be generated on a per-block basis, either when a block is executed or when it is instrumented by the Stalker engine. A rough sketch of these knobs is shown below.
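As a sketch of those knobs (the values are arbitrary examples; the properties are part of the documented JavaScript API):

```js
// How many events may be queued before the sink has to drain them.
Stalker.queueCapacity = 32768;

// How often, in milliseconds, the queue is drained (0 disables periodic draining).
Stalker.queueDrainInterval = 250;

// A drain can also be forced at any time:
Stalker.flush();
```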
We can provide one of two callbacks, onReceive or onCallSummary. The former is the simpler of the two: it is passed a binary blob containing the raw events in the order they were emitted. (Stalker.parse() can be used to turn this blob into a JavaScript array of tuples describing each event.) The second callback instead receives aggregated results, e.g. the number of times each function was called. This is more efficient than onReceive, but the data is correspondingly less detailed.
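A sketch of the two styles side by side (in practice you would pick one or the other for a given thread; the chosen event types are just examples):

```js
const threadId = Process.getCurrentThreadId();

// Raw events, decoded with Stalker.parse().
Stalker.follow(threadId, {
  events: { call: true, ret: true },
  onReceive (events) {
    for (const e of Stalker.parse(events))
      console.log(JSON.stringify(e));   // e.g. ["call", from, target, depth]
  }
});

// Or: aggregated counts of how often each target was called.
Stalker.follow(threadId, {
  events: { call: true },
  onCallSummary (summary) {
    for (const [target, count] of Object.entries(summary))
      console.log(`${target} was called ${count} times`);
  }
});
```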
Terminology
Before we look at the implementation of Stalker in more detail, we first need to understand some key terminology and concepts which will come up later.
Probes
You may already be familiar with Interceptor.attach(). When a thread is running outside of Stalker, this lets you trigger a callback whenever a given function is called. When the thread is running inside Stalker, however, such interceptors may not work. They operate by patching the first few instructions of the target function (its prologue) to redirect control into Frida. Frida copies and relocates those first few instructions so that, once the onEnter callback has been run, it can redirect control back into the original function.
Interceptors may not work under Stalker for a simple reason: the original function is never actually executed. Each block of code is instrumented before it runs: it is copied into memory elsewhere, instrumented there, and it is that copy which executes rather than the original instructions.
Stalker therefore provides the API Stalker.addCallProbe(address, callback[, data]) to deal with this. If our Interceptor was attached before the target block was instrumented, or if Stalker's trustThreshold is configured such that the block gets re-instrumented afterwards, then the Interceptor will work fine (since the patched instructions are what get copied and instrumented). But we want function hooks to work even when those conditions don't hold, and most API users are unlikely to be familiar with this level of design detail anyway. Probes are the answer to this problem.
When a probe callback is registered, an optional data parameter can be supplied; this pointer will be passed to the callback, and it therefore needs to be stored by the Stalker engine. The address also needs to be stored, so that whenever an instruction calling the target function is encountered, the instrumented code can be made to call the probe function first. Since many different functions may call the one you added a probe to, many instrumented blocks may contain extra instructions emitted to call the probe. Consequently, whenever a probe is added or removed, the cached instrumented blocks are discarded and everything is re-instrumented. Note that the data parameter is only passed when the callback is a C callback, e.g. one implemented using CModule; from JavaScript, simply use a closure to capture whatever state you need, as sketched below.
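A minimal sketch from JavaScript (interesting_function is a made-up export name; note how state is captured with a closure rather than the data argument):

```js
const target = Module.getExportByName(null, 'interesting_function');

let hits = 0;                                // state captured by the closure
const probeId = Stalker.addCallProbe(target, args => {
  hits++;
  console.log(`hit #${hits}, arg0=${args[0]}`);
});

// ...and when it is no longer needed:
Stalker.removeCallProbe(probeId);
```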
Trust Threshold
Recall the simple optimization mentioned earlier: when the same block is run more than once, we simply re-execute the already-instrumented copy. This only works as long as the original code hasn't changed in the meantime. Self-modifying code (often found in security-sensitive code as an anti-debugging/anti-disassembly measure) frequently mutates itself, which would render our instrumented copy stale. So how do we detect whether the original code has changed? The approach is simple: we keep a copy of the original code alongside its instrumented version in our data structure. When we encounter a block again, we compare the code to be instrumented against the version we saw the last time we instrumented it; if they match, we can reuse the block. But comparing the code on every execution would slow things down, so there is a tunable for this:
Stalker.trustThreshold: an integer specifying how many times a piece of code needs to be executed before it is trusted not to mutate. Specify -1 for no trust (slow), 0 to trust code from the get-go, and N to trust code after it has been executed N times. Defaults to 1.
In practice, until a block is trusted, each re-execution compares it against the previously instrumented version. If the original code has remained unchanged after N executions, it is trusted and the comparisons stop. Note that the copy of the original code is kept even when the trust threshold is -1 or 0. With those values the saved copy serves no real purpose; keeping it anyway just keeps the code simple and consistent.
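Adjusting it from JavaScript is a one-liner, e.g. when dealing with self-modifying code:

```js
// Never trust: compare every block against the previous version on every execution.
Stalker.trustThreshold = -1;
```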
Excluded ranges
Stalker also provides the API Stalker.exclude(range), which is given a base address and size and prevents Stalker from instrumenting code within that range. Say, for example, your thread calls malloc() inside libc: you most likely don't care about the inner workings of the heap, and following that code both slows things down and generates a pile of events you have no interest in. One thing to consider, though, is that once a call is made into an excluded range, Stalker effectively stops working until that call returns. So if the thread calls a function outside that range from in there, e.g. a callback, it will not be captured by Stalker. Just as this API can be used to stop whole libraries from being followed, it can also be used to stop following individual functions, which is very useful when the target application is statically linked. There we cannot simply skip all calls into libc, but we can use Module.enumerateSymbols() to find the symbol for malloc() and ignore that single function, as sketched below.
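A sketch of both uses; the library name and the availability of a size on the symbol are assumptions that may need adjusting for a given target:

```js
// Exclude a whole library, e.g. libc on Android.
const libc = Process.getModuleByName('libc.so');
Stalker.exclude({ base: libc.base, size: libc.size });

// Or exclude a single function of a statically linked binary.
const app = Process.enumerateModules()[0];
const malloc = app.enumerateSymbols().find(s => s.name === 'malloc');
if (malloc !== undefined && malloc.size !== undefined)
  Stalker.exclude({ base: malloc.address, size: malloc.size });
```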
Freeze/Thaw
Some systems feature DEP, marking pages as either writable or executable, but not both. Frida must therefore toggle page permissions when writing instrumented code, and then make that code executable so it can run. While pages are executable we refer to them as frozen (they cannot be changed), and while they are writable we refer to them as thawed.
Call Instructions
AArch64, unlike Intel, has no single explicit CALL instruction for making function calls. Instead there is a family of instructions used to make calls in different situations. Each of them branches to a given location and updates the link register, LR, with the return address:
BL
BLR
BLRAA
BLRAAZ
BLRAB
BLRABZ
For simplicity, these will all be referred to as "call instructions" throughout the rest of this article.
Frames
Whenever Stalker encounters a call, it stores the return address and the address of the corresponding instrumented block in a structure and pushes it onto a stack which Stalker itself maintains. Stalker uses this stack for optimizations and for heuristics when emitting call and return events.
Transformer
A GumStalkerTransformer is used to generate the instrumented code. The implementation of the default transformer looks something like this:
It is called by the function responsible for generating instrumented code, gum_exec_ctx_obtain_block_for(), and its job is to generate that code. We can see that it does so in a loop, processing one instruction per iteration: it first fetches an instruction from the iterator and then tells Stalker to keep it, i.e. write it out unmodified. Both of these functions are implemented by Stalker itself. The first is responsible for parsing a cs_insn and updating the internal state; cs_insn is the data type which Capstone, the disassembler used internally, uses to represent an instruction. The second is responsible for writing out the instrumented instruction (or instructions). We will cover these in more detail later.
Instead of the default transformer, the user can also provide a custom implementation; there is a good example here, and a minimal JavaScript sketch follows.
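A minimal JavaScript sketch of a custom transform, which logs every BL it sees while keeping all instructions unmodified:

```js
Stalker.follow(Process.getCurrentThreadId(), {
  transform (iterator) {
    let instruction = iterator.next();
    do {
      if (instruction.mnemonic === 'bl')
        console.log(`call at ${instruction.address}: ${instruction}`);
      iterator.keep();                       // emit the instruction as-is
    } while ((instruction = iterator.next()) !== null);
  }
});
```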
Callouts
Transformers can also make callouts. That is, they can instruct Stalker to emit instructions which call a JavaScript function, or a plain C callback, e.g. one implemented using CModule. The callback is passed a context structure describing the CPU state, along with an optional context parameter, and it can inspect or modify registers as it sees fit. This information is stored in a GumCalloutEntry. A sketch of a JavaScript callout follows.
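A minimal sketch of a callout issued from a JavaScript transform; here it simply dumps X0 at the start of every block, but the context may also be modified:

```js
Stalker.follow(Process.getCurrentThreadId(), {
  transform (iterator) {
    let instruction = iterator.next();
    let firstInstruction = true;
    do {
      if (firstInstruction) {
        iterator.putCallout(context => {
          console.log(`x0 = ${context.x0}`);
        });
        firstInstruction = false;
      }
      iterator.keep();
    } while ((instruction = iterator.next()) !== null);
  }
});
```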
EOB/EOI
Recall that the Relocator is central to generating the instrumented code. It has two important properties which describe its state.
End of block (EOB) indicates that the relocator has reached the end of a block. This happens when any branch instruction is encountered: a branch, a call, or a return.
End of input (EOI) indicates that the relocator has not only reached the end of a block, but possibly also the end of the input, i.e. what follows the current instruction may not be an instruction at all. This is typically not the case for call instructions, since control flow usually resumes at the instruction right after the call once it returns (note, though, that compilers typically emit a plain branch for calls to functions which do not return, such as exit()). Since the bytes following such an instruction are not necessarily valid instructions, this distinction lets us apply some optimizations: if we encounter a non-conditional branch or a return instruction, it is quite possible that no more code follows.
Prologues/Epilogues
When control flow passes from the program into the Stalker engine, the CPU registers must be saved so that Stalker can run and make use of the registers itself, and restored when control is handed back, so that none of the application's context is lost.
The AArch64 procedure call standard dictates that some registers (notably X19 through X29) are callee-saved: compiler-generated code which uses them must store them first and restore them afterwards. It is therefore not strictly necessary to store these in our context structure, since they will be preserved by anything the Stalker engine itself calls. A "minimal" context is thus sufficient most of the time.
However, if the Stalker engine is going to call a probe registered with Stalker.addCallProbe(), or a callout created with iterator.putCallout(), those callbacks expect to receive the full CPU context in the layout described by GumArm64CpuContext, so in those cases a full context must be written.
Another thing to note is that the code which writes out the necessary CPU registers (the prologue) is quite long (dozens of instructions), and the code which later restores them (the epilogue) is similarly long. We don't want to emit these two long sequences at the start and end of every single instrumented block. So we instead write them into a common block of memory (much like the ones we write instrumented code into) and simply emit a call to them whenever they are needed. These common blocks of memory are referred to as helpers. The following functions create these prologues and epilogues:
One final thing to note: on AArch64, a direct branch can only reach ±128 MB from the call site, and indirect branches are more expensive (in both memory and time). As we write more and more instrumented blocks, we drift further and further away from the shared prologue and epilogue code. Once we move beyond the 128 MB range we simply write out another copy of the prologues and epilogues and use that instead. This trade-off pays for itself handsomely.
Counters
Finally, you may see counters emitted after each instrumented block, recording the number of each type of instruction encountered at the end of a block. These are only used by the unit tests and for performance tuning, to indicate which branch types require a full context switch.
Slabs
Now let's look at where Stalker stores its instrumented code: slabs. The following data structure keeps track of all the details.
Let's first look at the sizes Stalker configures during initialization:
We can see that each slab is 4 MB in size, and that a twelfth of it is reserved for the header, i.e. the GumSlab structure itself including its array of GumExecBlock entries. Note that this array is declared with a length of zero in the GumSlab structure, but the actual number of entries it can hold is calculated and stored in slab_max_blocks.
So what is the rest of the slab used for? The header holds all of the accounting information, and the remainder of the slab (the tail) is where the instrumented instructions themselves are written, inline in the slab.
So why is a 12th of the slab allocated for the header and the remainder for the
instructions? Well the length of each block to be instrumented will vary
considerably and may be affected by the compiler being used and its optimization
settings. Some rough empirical testing showed that given the average length of
each block this might be a reasonable ratio to ensure we didn’t run out of space
for new GumExecBlock
entries before we ran out of space for new instrumented
blocks in the tail and vice versa.
Now let's take a look at the code used to create these structures:
Here, we can see that the data
field points to the start of the tail where
instructions can be written after the header. The offset
field keeps track of
our offset into the tail. The size
field keeps track of the total number of
bytes available in the tail. The num_blocks
field keeps track of how many
instrumented blocks have been written to the slab.
Note that where possible we allocate the slab with RWX permissions so that we don’t have to freeze and thaw it all of the time. On systems which support RWX the freeze and thaw functions become no-ops.
Lastly, we can see that each slab contains a next
pointer which can be used to
link slabs together to form a singly-linked list. This is used so we can walk
them and dispose them all when Stalker is finished.
Blocks
Now we understand how the slabs work. Let’s look in more detail at the blocks. As we know, we can store multiple blocks in a slab and write their instructions to the tail. Let’s look at the code to allocate a new block:
The function first checks if there is space for a minimally sized block in the
tail of the slab (1024 bytes) and whether there is space in the array of
GumExecBlocks
in the slab header for a new entry. If it does then a new entry
is created in the array and its pointers are set to reference the GumExecCtx
(the main Stalker session context) and the GumSlab
. The code_begin
and
code_end
pointers are both set to the first free byte in the tail. The
recycle_count
used by the trust threshold mechanism to determine how many
times the block has been encountered unmodified is reset to zero, and the
remainder of the tail is thawed to allow code to be written to it.
Next if the trust threshold is set to less than zero (recall -1 means blocks are
never trusted and always re-written) then we reset the slab offset
(the
pointer to the first free byte in the tail) and start over. This means that any
instrumented code written for any blocks within the slab will be overwritten.
Finally, as there is no space left in the current slab and we can’t overwrite it
because the trust threshold means blocks may be re-used, then we must allocate a
new slab by calling gum_exec_ctx_add_slab()
, which we looked at above. We then
call gum_exec_ctx_ensure_inline_helpers_reachable()
, more on that in a moment,
and then we allocate our block from the new slab.
Recall, that we use helpers (such as the prologues and epilogues that save and
restore the CPU context) to prevent having to duplicate these instructions at
the beginning and end of every block. As we need to be able to call these from
instrumented code we are writing to the slab, and we do so with a direct branch
that can only reach ±128 MB from the call site, we need to ensure we can get to
them. If we haven’t written them before, then we write them to our current slab.
Note that these helper functions need to be reachable from any instrumented
instruction written in the tail of the slab. Because our slab is only 4 MB in
size, then if our helpers are written in our current slab then they will be
reachable just fine. If we are allocating a subsequent slab and it is close
enough to the previous slab (we only retain the location we last wrote the
helper functions to) then we might not need to write them out again and can just
rely upon the previous copy in the nearby slab. Note that we are at the mercy of
mmap()
for where our slab is allocated in virtual memory and ASLR may dictate
that our slab ends up nowhere near the previous one.
We can only assume that either this is unlikely to be a problem, or that this
has been factored into the size of the slabs to ensure that writing the helpers
to each slab isn’t much of an overhead because it doesn’t use a significant
proportion of their space. An alternative could be to store every location every
time we have written out a helper function so that we have more candidates to
choose from (maybe our slab isn’t allocated nearby the one previously allocated,
but perhaps it is close enough to one of the others). Otherwise, we could
consider making a custom allocator using mmap()
to reserve a large (e.g. 128
MB) region of virtual address space and then use mmap()
again to commit the
memory one slab at a time as needed. But these ideas are perhaps both overkill.
Instrumenting Blocks
The main function which instruments a code block is called
gum_exec_ctx_obtain_block_for()
. It first looks for an existing block in the
hash table which is indexed on the address of the original block which was
instrumented. If it finds one and the aforementioned constraints around the
trust threshold are met then it can simply be returned.
The fields of the GumExecBlock
are used as follows. The real_begin
is set to
the start of the original block of code to be instrumented. The code_begin
field points to the first free byte of the tail (remember this was set by the
gum_exec_block_new()
function discussed above). A GumArm64Relocator
is
initialized to read code from the original code at real_begin
and a
GumArm64Writer
is initialized to write its output to the slab starting at
code_begin
. Each of these items is packaged into a GumGeneratorContext
and
finally this is used to construct a GumStalkerIterator
.
This iterator is then passed to the transformer. Recall the default implementations is as follows:
We will gloss over the details of gum_stalker_iterator_next()
and
gum_stalker_iterator_keep()
for now. But in essence, this causes the iterator
to read code one instruction at a time from the relocator, and write the
relocated instruction out using the writer. Following this process, the
GumExecBlock
structure can be updated. Its field real_end
can be set to the
address where the relocator read up to, and its field code_end
can be set to
the address which the writer wrote up to. Thus real_begin
and real_end
mark
the limits of the original block, and code_begin
and code_end
mark the
limits of the newly instrumented block. Finally,
gum_exec_ctx_obtain_block_for()
calls gum_exec_block_commit()
which takes a
copy of the original block and places it immediately after the instrumented
copy. The field real_snapshot
points to this (and is thus identical to
code_end
). Next the slab’s offset
field is updated to reflect the space used
by our instrumented block and our copy of the original code. Finally, the block
is frozen to allow it to be executed.
Now let’s just return to a few more details of the function
gum_exec_ctx_obtain_block_for()
. First we should note that each block has a
single instruction prefixed.
This instruction is the restoration prolog (denoted by
GUM_RESTORATION_PROLOG_SIZE
). This is skipped in “bootstrap” usage – hence you
will note this constant is added on by _gum_stalker_do_follow_me()
and
gum_stalker_infect()
when returning the address of the instrumented code. When
return instructions are instrumented, however, if the return is to a block which
has already been instrumented, then we can simply return to that block rather
than returning back into the Stalker engine. This code is written by
gum_exec_block_write_ret_transfer_code()
. In a worst-case scenario, where we
may need to use registers to perform the final branch to the instrumented block,
this function stores them into the stack, and the code to restore these from the
stack is prefixed in the block itself. Hence, in the event that we can return
directly to an instrumented block, we return to this first instruction rather
than skipping GUM_RESTORATION_PROLOG_SIZE
bytes.
Secondly, we can see gum_exec_ctx_obtain_block_for()
does the following after
the instrumented block is written:
This inserts a break instruction which is intended to simplify debugging.
Lastly, if Stalker is configured to, gum_exec_ctx_obtain_block_for()
will
generate an event of type GUM_COMPILE
when compiling the block.
Helpers
We can see from gum_exec_ctx_ensure_inline_helpers_reachable()
that we have a
total of 6 helpers. These helpers are common fragments of code which are needed
repeatedly by our instrumented blocks. Rather than emitting the code they
contain repeatedly, we instead write it once and place a call or branch
instruction to have our instrumented code execute it. Recall that the helpers
are written into the same slabs we are writing our instrumented code into and
that if possible we can re-use the helper written into a previous nearby slab
rather than putting a copy in each one.
This function calls gum_exec_ctx_ensure_helper_reachable()
for each helper
which in turn calls gum_exec_ctx_is_helper_reachable()
to check if the helper
is within range, or otherwise calls the callback passed as the second argument
to write out a new copy.
So, what are our 6 helpers? We have 2 for writing prologues which store register
context, one for a full context and one for a minimal context. We will cover
these later. We also have 2 for their corresponding epilogues for restoring the
registers. The other two, the last_stack_push
and last_stack_pop_and_go
are
used when instrumenting call instructions.
Before we analyze these two in detail, we first need to understand the frame
structures. We can see from the code snippets below that we allocate a page to
contain GumExecFrame
structures. These structures are stored sequentially in
the page like an array and are populated starting with the entry at the end of
the page. Each frame contains the address of the original block and the address
of the instrumented block which we generated to replace it:
last_stack_push
Much of the complexity in understanding Stalker and the helpers in particular is that some functions – let’s call them writers – write code which is executed at a later point. These writers have branches in themselves which determine exactly what code to write, and the written code can also sometimes have branches too. The approach I will take for these two helpers therefore is to show pseudo code for the assembly which is emitted into the slab which will be called by instrumented blocks.
The pseudo code for this helper is shown below:
As we can see, this helper is actually a simple function which takes two
arguments, the real_address
and the code_address
to store in the next
GumExecFrame
structure. Note that our stack is written backwards from the end
of the page in which they are stored towards the start and that current_frame
points to the last used entry (so our stack is full and descending). Also note
we have a conditional check to see whether we are on the last entry (the one at
the very beginning of the page will be page-aligned) and if we have run out of
space for more entries (we have space for 512) then we simply do nothing. If we
have space, we write the values from the parameters into the entry and retard
the current_frame
pointer to point to it.
This helper is used when virtualizing call instructions. Virtualizing is the
name given to the replacement of an instruction typically those relating to
branching with a series of instructions which instead of executing the intended
block allow Stalker to manage the control-flow. Recall as our transformer walks
the instructions using the iterator and calls iterator.keep()
we output our
transformed instruction. When we encounter a branch, we need to emit code to
call back into the Stalker engine so that it can instrument that block, but if
the branch statement is a call instruction (BL
, BLX
etc) we also need to
emit a call to the above helper to store the stack frame information. This
information is used when emitting call events as well as later when optimizing
the return.
last_stack_pop_and_go
Now let's look at the last_stack_pop_and_go
helper. To understand this, we also
need to understand the code written by
gum_exec_block_write_ret_transfer_code()
(the code that calls it), as well as
that written by gum_exec_block_write_exec_generated_code()
which it calls. We
will skip over pointer authentication for now.
So this code is a little harder. It isn’t really a function and the actual assembly written by it is muddied a little by the need to save and restore registers. But the essence of it is this: When virtualizing a return instruction this helper is used to optimize passing control back to the caller. ret_reg contains the address of the block to which we are intending to return.
Let's take a look at the definition of the return instruction:
RET Return from subroutine, branches unconditionally to an address in a register, with a hint that this is a subroutine return.
RET {Xn} Where:
Xn Is the 64-bit name of the general-purpose register holding the address to be branched to, in the range 0 to 31. Defaults to X30 if absent.
As we can see, we are going to return to an address passed in a register.
Typically, we can predict the register value and where we will return to, as the
compiler will emit assembly code so that the register is set to the address of
the instruction immediately following the call which got us there. After
emitting an instrumented call, we emit directly after a little landing pad which
will call back into Stalker to instrument the next block. This landing pad can
later be backpatched (if the conditions are right) to avoid re-entering Stalker
altogether. We store the addresses of both the original block following the call
and this landing pad in the GumExecFrame
structure, so we can simply
virtualize our return instruction by replacing it with instructions which simply
branch to this landing pad. We don’t need to re-enter the Stalker engine each
time we see a return instruction and get a nice performance boost. Simple!
However, we must bear in mind that not all calls will result in a return. A
common technique for hostile or specialized code is to make a call in order to
use the LR
to determine the current position of the instruction pointer. This
value may then be used for introspection purposes (e.g. to validate code to
detect modification, to decrypt or unscramble code, etc.).
Also, remember that the user can use a custom transform to modify instructions as they see fit, they can insert instructions which modify register values, or perhaps a callout function which is passed the context structure which allows them to modify register values as they like. Now consider what if they modify the value in the return register!
So we can see that the helper checks the value of the return register against
the value of the real_address
stored in the GumExecFrame
. If it matches,
then all is well and we can simply branch directly back to the landing pad.
Recall on the first instance, this simply re-enters Stalker to instrument the
next block and branches to it, but at a later point backpatching may be used to
directly branch to this instrumented block and avoid re-entering Stalker
altogether.
Otherwise, we follow a different path. First the array of GumExecFrame
is
cleared; now that our control-flow has deviated, we will start building our
stack again. We accept that we will take this same slower path for any previous
frames in the call-stack we recorded so far if we ever return to them, but will
have the possibility of using the fast path for new calls we encounter from here
on out (until the next time a call instruction is used in an unconventional
manner).
We make a minimal prologue (our instrumented code is now going to have to
re-enter Stalker) and we need to be able to restore the application’s registers
before we return control back to it. We call the entry gate for return,
gum_exec_ctx_replace_current_block_from_ret()
(more on entry gates later). We
then execute the corresponding epilogue before branching to the ctx->resume_at
pointer which is set by Stalker during the above call to
gum_exec_ctx_replace_current_block_from_ret()
to point to the new instrumented
block.
Context
Let’s look at the prologues and epilogues now.
We can see that these do little other than call the corresponding prologue or
epilogue helpers. We can see that the prologue will store X19
and the link
register onto the stack. These are then restored into X19
and X20
at the end
of the epilogue. This is because X19
is needed as scratch space to write the
context blocks and the link register needs to be preserved as it will be
clobbered by the call to the helper.
The LDP and STP instructions load and store a pair of registers respectively and have the option to increment or decrement the stack pointer. This increment or decrement can be carried out either before, or after the values are loaded or stored.
Note also the offset at which these registers are placed. They are stored at
16
bytes + GUM_RED_ZONE_SIZE
beyond the top of the stack. Note that our
stack on AArch64 is full and descending. This means that the stack grows toward
lower addresses and the stack pointer points to the last item pushed (not to the
next empty space). So, if we subtract 16 bytes from the stack pointer, then this
gives us enough space to store the two 64-bit registers. Note that the stack
pointer must be decremented before the store (pre-decrement) and incremented
after the load (post-increment).
So what is GUM_RED_ZONE_SIZE
? The
redzone is a 128
byte area beyond the stack pointer which a function can use to store temporary
variables. This allows a function to store data in the stack without the need to
adjust the stack pointer all of the time. Note that this call to the prologue is
likely the first thing to be carried out in our instrumented block, we don’t
know what local variables the application code has stored in the redzone and so
we must ensure that we advance the stack pointer beyond it before we start using
the stack to store information for the Stalker engine.
Context Helpers
Now that we have looked at how these helpers are called, let us now have a look at the helpers themselves. Although there are two prologues and two epilogues (full and minimal), they are both written by the same function as they have much in common. The version which is written is based on the function parameters. The easiest way to present these is with annotated code:
Now let’s look at the epilogue:
This is all quite complicated. Partly this is because we have only a single register to use as scratch space, partly because we want to keep the prologue and epilogue code stored inline in the instrumented block to a bare minimum, and partly because our context values can be changed by callouts and the like. But hopefully it all now makes sense.
Reading/Writing Context
Now that we have our context saved, whether it was a full context, or just the minimal one, Stalker may need to read registers from the context to see what state of the application code was. For example to find the address which a branch or return instruction was going to branch to so that we can instrument the block.
When Stalker writes the prologue and epilogue code, it does so by calling
gum_exec_block_open_prolog()
and gum_exec_block_close_prolog()
. These store
the type of prologue which has been written in gc->opened_prolog
.
Therefore when we want to read a register, this can be achieved with the single
function gum_exec_ctx_load_real_register_into()
. This determines which kind of
prologue is in use and calls the relevant routine accordingly. Note that these
routines don’t actually read the registers, they emit code which reads them.
Reading registers from the full frame is actually the simplest. We can see the
code closely matches the structure used to pass the context to callouts etc.
Remember that in each case register X20
points to the base of the context
structure.
Reading from the minimal context is actually a little harder. X0
through X18
are simple, they are stored in the context block. After X18
is 8 bytes padding
(to make a total of 10 pairs of registers) followed by X29
and X30
. This
makes a total of 11 pairs of registers. Immediately following this is the
NEON/floating point registers (totaling 128 bytes). Finally X19
and X20
, are
stored above this as they are restored by the inline epilogue code written by
gum_exec_ctx_write_epilog()
.
Control flow
Execution of Stalker begins at one of 3 entry points:
_gum_stalker_do_follow_me()
gum_stalker_infect()
gum_exec_ctx_replace_current_block_with()
The first two we have already covered, these initialize the Stalker engine and
start instrumenting the first block of execution.
gum_exec_ctx_replace_current_block_with()
is used to instrument subsequent
blocks. In fact, the main difference between this function and the preceding two
is that the Stalker engine has already been initialized and hence this work
doesn’t need to be repeated. All three call gum_exec_ctx_obtain_block_for()
to
generate the instrumented block.
We covered gum_exec_ctx_obtain_block_for()
previously in our section on
transformers. It calls the transformer implementation in use, which by default
calls gum_stalker_iterator_next()
which calls the relocator using
gum_arm64_relocator_read_one()
to read the next relocated instruction. Then it
calls gum_stalker_iterator_keep()
to generate the instrumented copy. It does
this in a loop until gum_stalker_iterator_next()
returns FALSE
as it has
reached the end of the block.
Most of the time gum_stalker_iterator_keep()
will simply call
gum_arm64_relocator_write_one()
to emit the relocated instruction as is.
However, if the instruction is a branch or return instruction it will call
gum_exec_block_virtualize_branch_insn()
or
gum_exec_block_virtualize_ret_insn()
respectively. These two virtualization
functions which we will cover in more detail later, emit code to transfer
control back into gum_exec_ctx_replace_current_block_with()
via an entry gate
ready to process the next block (unless there is an optimization where we can
bypass Stalker and go direct to the next instrumented block, or we are entering
into an excluded range).
Gates
Entry gates are generated by a macro, one for each of the different instruction
types found at the end of a block. When we virtualize each of these types of
instruction, we direct control flow back to the
gum_exec_ctx_replace_current_block_with()
function via one of these gates. We
can see that the implementation of the gate is quite simple, it updates a
counter of how many times it has been called and passes control to
gum_exec_ctx_replace_current_block_with()
passing through the parameters it
was called with, the GumExecCtx
and the start_address
of the next block to
be instrumented.
These counters can be displayed by the following routine. They are only meant to be used by the test-suite rather than being exposed to the user through the API.
Virtualize functions
Let’s now look in more detail at the virtualizing we have for replacing the branch instruction we find at the end of each block. We have four of these functions:
gum_exec_block_virtualize_branch_insn()
gum_exec_block_virtualize_ret_insn()
gum_exec_block_virtualize_sysenter_insn()
gum_exec_block_virtualize_linux_sysenter()
We can see that two of these relate to syscalls (and in fact, one calls the other), we will cover these later. Let's look at the ones for branches and returns.
gum_exec_block_virtualize_branch_insn
This routine first determines whether the destination of the branch comes from
an immediate offset in the instruction, or a register. In the case of the
latter, we don’t extract the value just yet, we only determine which register.
This is referred to as the target
. The next section of the function deals with
branch instructions. This includes both conditional and non-conditional
branches. For conditional targets the destination if the branch is not taken is
referred to as cond_target
, this is set to the address of the next instruction
in the original block.
Likewise regular_entry_func
and cond_entry_func
are used to hold the entry
gates which will be used to handle the branch. The former is used to hold the
gate used for non-conditional branches and cond_entry_func
holds the gate to
be used for a conditional branch (whether it is taken or not).
The function gum_exec_block_write_jmp_transfer_code()
is used to write the
code required to branch to the entry gate. For non-conditional branches this is
simple, we call the function passing the target
and the regular_entry_func
.
For conditional branches things are slightly more complicated. Our output looks
like the following pseudo-code:
Here, we can see that we first write a branch instruction into our instrumented
block, as in our instrumented block, we also need to determine whether we should
take the branch or not. But instead of branching directly to the target, just
like for the non-conditional branches we use
gum_exec_block_write_jmp_transfer_code()
to write code to jump back into
Stalker via the relevant entry gate passing the real address we would have
branched to. Note, however that the branch is inverted from the original (e.g.
CBZ
would be replaced by CBNZ
).
Now, let’s look at how gum_exec_block_virtualize_branch_insn()
handles calls.
First we emit code to generate the call event if we are configured to. Next we
check if there are any probes in use. If there are, then we call
gum_exec_block_write_call_probe_code()
to emit the code necessary to call any
registered probe callback. Next, we check if the call is to an excluded range
(note that we can only do this if the call is to an immediate address), if it is
then we emit the instruction as is. But we follow this by using
gum_exec_block_write_jmp_transfer_code()
as we did when handling branches to
emit code to call back into Stalker right after to instrument the block at the
return address. Note that here we use the excluded_call_imm
entry gate.
Finally, if it is just a normal call expression, then we use the function
gum_exec_block_write_call_invoke_code()
to emit the code to handle the call.
This function is pretty complicated as a result of all of the optimization for
backpatching, so we will only look at the basics.
Remember earlier that in gum_exec_block_virtualize_branch_insn()
, we could
only check if our call was to an excluded range if the target was specified in
an immediate? Well if the target was specified in a register, then here we emit
code to check whether the target is in an excluded range. This is done by
loading the target register using
gum_exec_ctx_write_push_branch_target_address()
(which in turn calls
gum_exec_ctx_load_real_register_into()
which we covered earlier to read the
context) and emitting code to call
gum_exec_block_check_address_for_exclusion()
whose implementation is quite
self-explanatory. If it is excluded then a branch is taken and similar code to
that described when handling excluded immediate calls discussed above is used.
Next we emit code to call the entry gate and generate the instrumented block of
the callee. Then call the helper last_stack_push
to add our GumExecFrame
to
our context containing the original and instrumented block address. The real and
instrumented code addresses are read from the current cursor positions of the
GeneratorContext and CodeWriter respectively, and we then generate the required
landing pad for the return address (this is the optimization we covered
earlier, we can jump straight to this block when executing the virtualized
return statement rather than re-entering Stalker). Lastly we use
gum_exec_block_write_exec_generated_code()
to emit code to branch to the
instrumented callee.
gum_exec_block_virtualize_ret_insn
After looking at the virtualization of call instructions, you will be pleased to
know that this one is relatively simple! If configured, this function calls
gum_exec_block_write_ret_event_code()
to generate an event for the return
statement. Then it calls gum_exec_block_write_ret_transfer_code()
to generate
the code required to handle the return instruction. This one is simple too, it
emits code to call the last_stack_pop_and_go
helper we covered earlier.
Emitting events
Events are one of the key outputs of the Stalker engine. They are emitted by the following functions. Their implementation again is quite self-explanatory:
gum_exec_ctx_emit_call_event()
gum_exec_ctx_emit_ret_event()
gum_exec_ctx_emit_exec_event()
gum_exec_ctx_emit_block_event()
One thing to note with each of these functions, however, is that they all call
gum_exec_block_write_unfollow_check_code()
to generate code for checking if
Stalker is to stop following the thread. We’ll have a look at this in more
detail next.
Unfollow and tidy up
If we look at the function which generates the instrumented code to check if we
are being asked to unfollow, we can see it causes the thread to call
gum_exec_ctx_maybe_unfollow()
passing the address of the next instruction to
be instrumented. We can see that if the state has been set to stop following,
then we simply branch back to the original code.
A quick note about pending calls. If we have a call to an excluded range, then we emit the original call in the instrumented code followed by a call back to Stalker. Whilst the thread is running in the excluded range, however, we cannot control the instruction pointer until it returns. We therefore need to simply keep track of these and wait for the thread to exit the excluded range.
Now we can see how a running thread gracefully goes back to running normal uninstrumented code, let’s see how we stop stalking in the first place. We have two possible ways to stop stalking:
gum_stalker_unfollow_me()
gum_stalker_unfollow()
The first is quite simple, we set the state to stop following. Then call
gum_exec_ctx_maybe_unfollow()
to attempt to stop the current thread from being
followed, and then dispose of the Stalker context.
We notice here that we pass NULL
as the address to
gum_exec_ctx_maybe_unfollow()
which may seem odd, but we can see that in this
instance it isn’t used as when we instrument a block (remember
gum_exec_ctx_replace_current_block_with()
is where the entry gates direct us
to instrument subsequent blocks) we check to see if we are about to call
gum_stalker_unfollow_me()
, and if so then we return the original block from the
function rather than the address of the instrumented block generated by
gum_exec_ctx_obtain_block_for()
. Therefore we can see that this is a special
case and this function isn’t stalked. We simply jump to the real function so at
this point we have stopped stalking the thread forever. This handling differs
from excluded ranges as for those we retain the original call instruction in an
instrumented block, but then follow it with a call back into Stalker. In this
case, we are just vectoring back to an original uninstrumented block:
Let’s look at gum_stalker_unfollow()
now:
This function looks through the list of contexts looking for the one for the
requested thread. Again, it sets the state of the context to
GUM_EXEC_CTX_UNFOLLOW_PENDING
. If the thread has already run, we must wait for
it to check this flag and return to normal execution. However, if it has not run
(perhaps it was in a blocking syscall when we asked to follow it and never got
infected in the first instance) then we can disinfect it ourselves by calling
gum_process_modify_thread()
to modify the thread context (this function was
described in detail earlier) and using gum_stalker_disinfect()
as our callback
to perform the changes. This simply checks to see if the program counter was set
to point to the infect_thunk
and resets the program counter back to its
original value. The infect_thunk
is created by gum_stalker_infect()
which is
the callback used by gum_stalker_follow()
to modify the context. Recall that
whilst some of the setup can be carried out on behalf of the target thread, some
has to be done in the context of the target thread itself (in particular setting
variables in thread-local storage). Well, it is the infect_thunk
which
contains that code.
Miscellaneous
Hopefully we have now covered the most important aspects of Stalker and have provided a good background on how it works. We do have a few other observations though, which may be of interest.
Exclusive Store
The AArch64 architecture has support for exclusive load/store instructions. These instructions are intended to be used for synchronization. If an exclusive load is performed from a given address, then later attempts an exclusive store to the same location, then the CPU is able to detect any other stores (exclusive or otherwise) to the same location in the intervening period and the store fails.
Obviously, these types of primitives are likely to be used for constructs such as mutexes and semaphores. Multiple threads may attempt to load the current count of the semaphore, test whether it is already full, then increment and store the new value back to take the semaphore. These exclusive operations are ideal for just such a scenario. Consider though what would happen if multiple threads are competing for the same resource. If one of those threads were being traced by Stalker, it would always lose the race. Also these instructions are easily disturbed by other kinds of CPU operations and so if we do something complex like emit an event between a load and a store we are going to cause it to fail every time, and end up looping indefinitely. Stalker, however, deals with such a scenario:
Here, we can see that the iterator records when it sees an exclusive load and tracks how many instructions have passed since. This is continued for up to four instructions – as this was determined by empirical testing based on how many instructions would be needed to load, test, modify and store the value. This is then used to prevent any instrumentation being emitted which isn’t strictly necessary:
Exhausted Blocks
Whilst we check to ensure a minimum amount of space for our current instrumented
block is left in the slab before we start (and allocate a new one if we fall
below this minimum), we cannot predict how long a sequence of instructions we
are likely to encounter in our input block. Nor is it simple to determine exactly
how many instructions in output we will need to write the necessary
instrumentation (we have possible code for emitting the different types of
event, checking for excluded ranges, virtualizing instructions found at the end
of the block etc.). Also, trying to allow for the instrumented code to be
non-sequential is fraught with difficulty. So the approach taken is to ensure
that each time we read a new instruction from the iterator there is at least
1024 bytes of space in the slab for our output. If it is not the case, then we
store the current address in continuation_real_address
and return FALSE
so
that the iterator ends.
Our caller gum_exec_ctx_obtain_block_for()
which is walking the iterator to
generate the block then acts exactly as if there was a branch instruction to the
next instruction, essentially terminating the current block and starting the
next one.
It is as if the following instructions had been encountered in the input right before the instruction which would have not had sufficient space:
Syscall Virtualization
Syscalls are entry points from user-mode into kernel-mode. It is how
applications ask the kernel to carry out operations on their behalf, whether that be
opening files or reading network sockets. On AArch64 systems, this is carried
out using the SVC
instruction, whereas on Intel the instruction is sysenter
.
Hence the terms syscall and sysenter here are used synonymously.
Syscall virtualization is carried out by the following routine. We can see we only do anything on Linux systems:
This is required because of the clone
syscall. This syscall creates a new
process which shares execution context with the parent, such as file handles,
virtual address space, and signal handlers. In essence, this effectively creates
a new thread. But the current thread is being traced by Stalker, and clone is
going to create an exact replica of it. Given that Stalker contexts are on a
per-thread basis, we should not be stalking this new child.
Note that for syscalls in AArch64 the first 8 arguments are passed in registers
X0
through X7
and the syscall number is passed in X8
, additional arguments
are passed on the stack. The return value for the syscall is returned in X0
.
The function gum_exec_block_virtualize_linux_sysenter()
generates the
necessary instrumented code to deal with such a syscall. We will look at the
pseudo code below:
We can see that it first checks if we are dealing with a clone
syscall,
otherwise it simply performs the original syscall and that is all (the original
syscall instruction is copied from the original block). Otherwise if it is a
clone syscall, then we again perform the original syscall. At this point, we
have two threads of execution, and the syscall returns a different value in each
thread.
The original thread will receive the child’s PID as its return value, whereas
the child will receive the value of 0.
If we receive a non-zero value, we can simply continue as we were. We want to continue stalking the thread and allow execution to carry on with the next instruction. If, however, we receive a return value of 0, then we are in the child thread. We therefore carry out a branch to the next instruction in the original block ensuring that the child continues to run without any interruption from Stalker.
Pointer Authentication
Last of all, we should note that newer versions of iOS have introduced pointer authentication codes. Pointer authentication codes (PACs) make use of unused bits in pointers (the high bits of virtual addresses are commonly unused as most systems have a maximum of 48-bits of virtual address space) to store authentication values. These values are calculated by using the original pointer, a context parameter (typically the contents of another register) and a cryptographic key. The idea is that the key cannot be read or written from user-mode, and the resulting pointer authentication code cannot be guessed without having access to it.
Let’s look at the following fragment of code:
The pacia
instruction combines the values of LR
, SP
and the key to
generate a version of LR
with the authentication code LR'
and stores back
into the LR
register. This value is stored in the stack and later restored at
the end of the function. The autia
instruction validates the value of LR'
.
This is possible since the PAC in the high bits of LR
can be stripped to give
the original LR
value and the pointer authentication code can be regenerated
as it was before using SP
and the key. The result is checked against LR'
. If
the value doesn’t match then the instruction generates a fault. Thus if the
value of LR
stored in the stack is modified, or the stack pointer itself is
corrupted then the validation will fail. This is useful to prevent the building
of ROP chains which require return addresses to be stored in the stack. Since
LR'
is now stored in the stack instead of LR
, valid return addresses cannot
be forged without the key.
Frida needs to take this into account also when generating code. When reading
pointers from registers used by the application (e.g. to determine the
destination of an indirect branch or return), it is necessary to strip these
pointer authentication codes from the address before it is used. This is
achieved using the function gum_arm64_writer_put_xpaci_reg()
.