Stalker
Introduction
Stalker is Frida's code tracing engine. It allows threads to be followed, capturing every function, every block, even every instruction which is executed. A good overview is provided here, and we strongly recommend reading it carefully first. The implementation is necessarily architecture specific, although there is much in common between the architectures. Stalker currently has mature support for AArch64, the architecture commonly found in mobile phones and tablets running Android or iOS, just as Intel 64 and IA-32 dominate desktops and laptops. This article goes into detail, dissecting the implementation of Stalker on AArch64 and explaining exactly how it works. Hopefully it will also prove useful to anyone porting Stalker to other hardware architectures in the future.
Disclaimer
This article covers a lot of the details of how Stalker works internally, but it does not cover the specifics of code backpatching. It is intended to help you understand the technology; Stalker is complex enough on its own. That complexity is not without reason: it is precisely what makes otherwise very expensive operations dramatically cheaper. Finally, this article walks through the key concepts and steps line by line for some of the important logic, but for some of the finer details you may still need to read the source code. In any case, hopefully it will be of help.
Table of contents
- Introduction
- Disclaimer
- Use Cases
- Following
- Basic Operation
- Options
- Terminology
- Slabs
- Blocks
- Instrumenting Blocks
- Helpers
- Context
- Context Helpers
- Reading/Writing Context
- Control flow
- Gates
- Virtualize functions
- Emitting events
- Unfollow and tidy up
- Miscellaneous
Use Cases
To understand how Stalker is implemented, we first need to understand the interface it presents to its users. Stalker can be invoked directly through Frida's native Gum interface, but most users will drive it through the JavaScript API, which in turn calls the Gum methods. The TypeScript type definitions for Gum are well worth a look.
The main interface to Stalker from JavaScript is Stalker.follow([threadId, options]): it starts stalking the thread with the given threadId, defaulting to the current thread if omitted.
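A minimal sketch of the two forms (nothing here beyond the documented JavaScript API):

```js
// Follow the thread we are currently executing on.
Stalker.follow();

// Or follow another thread in the process by its id.
const otherThread = Process.enumerateThreads()[1];
if (otherThread !== undefined)
  Stalker.follow(otherThread.id);
```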
So let's consider when these calls might be used. Usually it is because you are interested in a given thread and want to find out what it is doing. Perhaps it simply has an interesting name? Thread names can be inspected with cat /proc/PID/tasks/TID/comm. Or maybe you walked the threads using the Frida JavaScript API Process.enumerateThreads() and called a native function to query each one. That, along with Thread.backtrace() to dump thread stacks, can give you a really good picture of what a process is doing.
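A rough sketch of this kind of reconnaissance; my_interesting_function is a made-up placeholder export, the rest is the standard Frida API:

```js
// List the threads in the process.
for (const thread of Process.enumerateThreads())
  console.log(`thread ${thread.id}: ${thread.state}`);

// Dump a stack whenever a function of interest is hit.
const target = Module.getExportByName(null, 'my_interesting_function');
Interceptor.attach(target, {
  onEnter (args) {
    console.log(Thread.backtrace(this.context, Backtracer.ACCURATE)
        .map(DebugSymbol.fromAddress)
        .join('\n'));
  }
});
```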
Another scenario in which you might call Stalker.follow() is from an Interceptor or a replacement function you have applied to a target. In this scenario, you have found a function of interest and want to understand how it behaves: you want to see which functions, or even which code blocks, get executed after a given call. Perhaps you want to compare how the code behaves with different input, or you want to modify the input to see if you can make the code take a particular path. In these scenarios Stalker works a little differently under the hood, but it is driven through exactly the same interface, Stalker.follow(), as sketched below.
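A common pattern (a sketch only; target_function is a hypothetical export, and the event options are explained later under Options) is to follow the thread just for the duration of one call:

```js
const target = Module.getExportByName(null, 'target_function');
Interceptor.attach(target, {
  onEnter (args) {
    Stalker.follow(Process.getCurrentThreadId(), {
      events: { call: true },
      onReceive (events) {
        console.log(`captured ${Stalker.parse(events).length} events`);
      }
    });
  },
  onLeave (retval) {
    Stalker.unfollow();
    Stalker.flush();  // push any queued events out right away
  }
});
```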
Following
When the user calls Stalker.follow(), under the hood the JavaScript engine calls gum_stalker_follow_me() to follow the current thread, or gum_stalker_follow(thread_id) to follow another thread in the process.
gum_stalker_follow_me
In the case of gum_stalker_follow_me(), the link register determines where to start stalking from. On AArch64, the link register (LR) holds the address of the instruction at which execution should continue after a function returns. Since there is only one link register, when another function is to be called the current value of LR must be stashed away (typically on the stack); as each function returns with a RET instruction, those stashed values are loaded back from the stack in turn.
Let's look at the code of gum_stalker_follow_me(). Its prototype looks like this:
We can see that the QuickJS or V8 runtime passes three arguments when calling it. The first is the Stalker instance itself; note that there may be several of these if more than one script has been injected at the same time. The second is a transformer, which can be used to transform the instrumented code as it is being written (more on this later). The last parameter is an event sink; this is where the events generated while the Stalker engine runs are delivered.
We can see that the first instruction, STP, stores a pair of registers onto the stack. Note the expression [sp, -16]!. This is a pre-decrement, which means the stack pointer is first moved down by 16 bytes, and the two 8-byte register values are then stored there. We can see the matching instruction at the bottom of the function: ldp x29, x30, [sp], 16. It loads the two register values that were stored on the stack back into the registers (and then moves the stack pointer back up).
But what are these two registers for? X30 is the link register and X29 is the frame pointer register. Recall that before calling another function we must store the previous value of the link register onto the stack, and restore it once that call has finished and returned to its caller. The frame pointer records where the top of the stack was when the function was called; all of the arguments passed on the stack and the stack-based local variables are then addressed at fixed offsets from it. As before, we also need to save and restore this value, because every function call has its own value for it: we save it as the call begins and restore it after it returns. You can see the next instruction, mov x29, sp, setting the frame pointer to the current value of the stack pointer.
Moving on, the next instruction, mov x3, x30, copies the value of the link register into X3. On AArch64 the first eight arguments of a function call are passed in registers X0-X7, so this sets up the fourth argument. We then call (branch with link to) the function _gum_stalker_do_follow_me(). Notice that the first three arguments, passed in X0-X2, are left untouched, so _gum_stalker_do_follow_me() receives exactly the same first three parameters our function was called with, plus the original link register value as its fourth. Finally, when that function returns, rather than returning to our caller we branch to the address it returned to us (on AArch64 a function's return value is stored in X0).
gum_stalker_follow
This function has a prototype similar to gum_stalker_follow_me(), but takes an additional thread_id parameter. If that value is the id of the current thread, it simply calls the function we just looked at. So let's see what happens when another thread id is given instead.
We can see that it first calls gum_process_modify_thread(). This isn't part of Stalker, but part of Gum itself. It takes a callback function as a parameter, and that callback receives a context argument carrying the thread's context. The callback can modify the GumCpuContext structure, and gum_process_modify_thread() will then write the changes back. We will see this context structure later; it contains the values of all of the AArch64 CPU registers. We will also look at the prototype of the callback function in more detail further on.
So how does gum_process_modify_thread() actually work? Well, that depends on the platform. On Linux (and Android) it uses the ptrace API (the same one used by GDB) to attach to the thread and read and write its registers. But there are a host of complexities. On Linux you cannot ptrace your own process (or, more precisely, another process in the same process group), so Frida creates a clone of the current process, placed in its own process group but sharing the same memory space. The two communicate over a UNIX socket. The clone acts essentially as a debugger: it reads the registers of the original target thread and stores them in the shared memory space, and writes them back into the thread when required. Incidentally, the prctl() settings PR_SET_DUMPABLE and PR_SET_PTRACER are used to control the permission to ptrace the original process.
You can now see that gum_stalker_infect() does much the same job as the _gum_stalker_do_follow_me() function we mentioned earlier. Both functions carry out the same work, but _gum_stalker_do_follow_me() runs on the target thread, whereas gum_stalker_infect() does not, so it must write some code to be called by the target thread using the GumArm64Writer rather than calling functions directly. We will cover these functions in more detail soon, but first we need a little more background.
Basic Operation
Code is in essence a series of blocks of instructions. Each block typically begins where some preceding branch can land (indeed we may see two branch statements back to back), and its instructions then execute one after another in sequence until some branch instruction is encountered.
Stalker works on one block at a time. It starts with either the block at the point where the call to gum_stalker_follow_me() returns, or, when gum_stalker_follow() is used, the block at which the instruction pointer of the target thread is pointing.
Stalker copies the original block and instruments it, writing the instrumented copy into newly allocated memory. Instructions may be inserted to generate events, or to provide the other features of the Stalker engine. Where necessary, Stalker must also relocate instructions. Take the following instruction, for example:
ADR — Address of label at a PC-relative offset.
ADR Xd, label
Xd is the 64-bit name of the general-purpose destination register, in the range 0 to 31.
label is the program label whose address is to be calculated, as an offset from the address of this instruction, in the range ±1 MB.
If this instruction is copied to a different location in memory and executed there, then, because the address of the label is computed by adding an offset to the current address, the computed address will no longer match the original. Fortunately, Gum has a Relocator for exactly this purpose: it is a dedicated facility for fixing up addresses like these.
We mentioned earlier that Stalker works one block at a time. So how do we get to instrument the next block? Recall that every block ends with a branch instruction. That means that if we first note where the original branch was going and then replace it with a new branch back into the Stalker engine, we can instrument the next block; and the same process lets us keep following the thread indefinitely, block by block.
This process would be a little slow, though, so in some cases we can optimize. First, if we execute the same block repeatedly (say in a loop, or simply because it is called many times), we don't need to re-instrument it every time: we can instrument it once and re-execute the instrumented copy. We therefore keep a hash table of all the blocks we have encountered so far, and whenever we enter a block we first look there for an already-instrumented copy.
Secondly, when we encounter a call instruction, after emitting the instrumented call we also emit a landing pad, so that the call can return without having to go back into Stalker. Stalker keeps an auxiliary stack of GumExecFrame structures recording the return address (real_address) and the address of the landing pad (code_address). When the function returns, the return address is compared against the real_address entries on this auxiliary stack; if it matches, the function returns straight to the corresponding code_address without re-entering the runtime. The landing pad initially contains code to enter the Stalker engine and instrument the block following the call, but it can later be backpatched to branch directly to that instrumented block. In other words, whole sequences of function returns can be handled without the overhead of entering and leaving Stalker.
If the return address does not match the real_address stored in a GumExecFrame, or we run out of space on the auxiliary stack, we simply start building a new one. We keep the real value of LR in place while the program code executes, both so that the program cannot detect the presence of Stalker (anti-debugging), and so that unconventional uses of it still work (for example, references to inline data in the code section). Also, we want Stalker to be able to stop following the thread at any moment, so we don't want to have to go back and fix up any LR values we have modified.
Lastly, we said earlier that each branch instruction is replaced with code that returns to Stalker so that the next block can be instrumented. Depending on the Stalker.trustThreshold configuration, however, we also try to backpatch the instrumented code so that it branches directly to the instrumented copy of the next block instead. Deterministic branches (where the target is a fixed value and the branch is unconditional) are simple: we just replace the branch back into Stalker with a branch to the next instrumented block. But we can deal with conditional branches too: we instrument both possible successor blocks (the one reached when the branch is taken and the one reached when it is not), and then replace the original conditional branch with a new conditional branch that steers control flow directly to the instrumented version of the block we have already encountered, with the other edge likewise going to instrumented code. We can even partially deal with the case where the branch target isn't static. Consider a branch like this:
Instructions of this form are very common when calling function pointers or class methods. In theory the value of X0 can change, but in practice it is the same most of the time. In that case we can emit code which, when the branch is actually executed, compares the value in X0 against the function we already know about: if it matches, we branch straight to the corresponding instrumented code; if not, we branch back into the Stalker engine. So if the function pointer does change, the code still runs correctly, because we simply re-enter Stalker and instrument the new target; but if, as we expect, it stays the same, we can skip the Stalker engine altogether and run the instrumented code directly.
Options
Now let's look at the options available when stalking a thread. Whenever the stalked thread executes, Stalker generates a stream of events. These are placed in a queue which is flushed either periodically or manually by the user. This flushing is not done by Stalker itself; it is handled by the EventSink::process virtual function, which re-enters the JavaScript runtime to process the events, and that is an expensive operation. The size of the queue and the flush interval can both be configured through options. Events can be generated on a per-instruction basis, for calls, returns, or all instructions; or they can be generated on a per-block basis, either when a block is executed or when it is instrumented by the Stalker engine. A rough sketch of these knobs is shown below.
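As a sketch of those knobs (the values are arbitrary examples; the properties are part of the documented JavaScript API):

```js
// How many events may be queued before the sink has to drain them.
Stalker.queueCapacity = 32768;

// How often, in milliseconds, the queue is drained (0 disables periodic draining).
Stalker.queueDrainInterval = 250;

// A drain can also be forced at any time:
Stalker.flush();
```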
We can provide one of two callbacks, onReceive or onCallSummary. The former is the simpler of the two: it is passed a binary blob containing the raw events in the order they were emitted. (Stalker.parse() can be used to turn this blob into a JavaScript array of tuples describing each event.) The second callback instead receives aggregated results, e.g. the number of times each function was called. This is more efficient than onReceive, but the data is correspondingly less detailed.
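A sketch of the two styles side by side (in practice you would pick one or the other for a given thread; the chosen event types are just examples):

```js
const threadId = Process.getCurrentThreadId();

// Raw events, decoded with Stalker.parse().
Stalker.follow(threadId, {
  events: { call: true, ret: true },
  onReceive (events) {
    for (const e of Stalker.parse(events))
      console.log(JSON.stringify(e));   // e.g. ["call", from, target, depth]
  }
});

// Or: aggregated counts of how often each target was called.
Stalker.follow(threadId, {
  events: { call: true },
  onCallSummary (summary) {
    for (const [target, count] of Object.entries(summary))
      console.log(`${target} was called ${count} times`);
  }
});
```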
Terminology
Before we look at the implementation of Stalker in more detail, we first need to understand some key terminology and concepts which will come up later.
Probes
You may already be familiar with Interceptor.attach(). When a thread is running outside of Stalker, this lets you trigger a callback whenever a given function is called. When the thread is running inside Stalker, however, such interceptors may not work. They operate by patching the first few instructions of the target function (its prologue) to redirect control into Frida. Frida copies and relocates those first few instructions so that, once the onEnter callback has been run, it can redirect control back into the original function.
Interceptors may not work under Stalker for a simple reason: the original function is never actually executed. Each block of code is instrumented before it runs: it is copied into memory elsewhere, instrumented there, and it is that copy which executes rather than the original instructions.
Stalker therefore provides the API Stalker.addCallProbe(address, callback[, data]) to deal with this. If our Interceptor was attached before the target block was instrumented, or if Stalker's trustThreshold is configured such that the block gets re-instrumented afterwards, then the Interceptor will work fine (since the patched instructions are what get copied and instrumented). But we want function hooks to work even when those conditions don't hold, and most API users are unlikely to be familiar with this level of design detail anyway. Probes are the answer to this problem.
When a probe callback is registered, an optional data parameter can be supplied; this pointer will be passed to the callback, and it therefore needs to be stored by the Stalker engine. The address also needs to be stored, so that whenever an instruction calling the target function is encountered, the instrumented code can be made to call the probe function first. Since many different functions may call the one you added a probe to, many instrumented blocks may contain extra instructions emitted to call the probe. Consequently, whenever a probe is added or removed, the cached instrumented blocks are discarded and everything is re-instrumented. Note that the data parameter is only passed when the callback is a C callback, e.g. one implemented using CModule; from JavaScript, simply use a closure to capture whatever state you need, as sketched below.
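A minimal sketch from JavaScript (interesting_function is a made-up export name; note how state is captured with a closure rather than the data argument):

```js
const target = Module.getExportByName(null, 'interesting_function');

let hits = 0;                                // state captured by the closure
const probeId = Stalker.addCallProbe(target, args => {
  hits++;
  console.log(`hit #${hits}, arg0=${args[0]}`);
});

// ...and when it is no longer needed:
Stalker.removeCallProbe(probeId);
```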
Trust Threshold
Recall the simple optimization mentioned earlier: when the same block is run more than once, we simply re-execute the already-instrumented copy. This only works as long as the original code hasn't changed in the meantime. Self-modifying code (often found in security-sensitive code as an anti-debugging/anti-disassembly measure) frequently mutates itself, which would render our instrumented copy stale. So how do we detect whether the original code has changed? The approach is simple: we keep a copy of the original code alongside its instrumented version in our data structure. When we encounter a block again, we compare the code to be instrumented against the version we saw the last time we instrumented it; if they match, we can reuse the block. But comparing the code on every execution would slow things down, so there is a tunable for this:
Stalker.trustThreshold: an integer specifying how many times a piece of code needs to be executed before it is trusted not to mutate. Specify -1 for no trust (slow), 0 to trust code from the get-go, and N to trust code after it has been executed N times. Defaults to 1.
In practice, until a block is trusted, each re-execution compares it against the previously instrumented version. If the original code has remained unchanged after N executions, it is trusted and the comparisons stop. Note that the copy of the original code is kept even when the trust threshold is -1 or 0. With those values the saved copy serves no real purpose; keeping it anyway just keeps the code simple and consistent.
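Adjusting it from JavaScript is a one-liner, e.g. when dealing with self-modifying code:

```js
// Never trust: compare every block against the previous version on every execution.
Stalker.trustThreshold = -1;
```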
Excluded ranges
Stalker also provides the API Stalker.exclude(range), which is given a base address and size and prevents Stalker from instrumenting code within that range. Say, for example, your thread calls malloc() inside libc: you most likely don't care about the inner workings of the heap, and following that code both slows things down and generates a pile of events you have no interest in. One thing to consider, though, is that once a call is made into an excluded range, Stalker effectively stops working until that call returns. So if the thread calls a function outside that range from in there, e.g. a callback, it will not be captured by Stalker. Just as this API can be used to stop whole libraries from being followed, it can also be used to stop following individual functions, which is very useful when the target application is statically linked. There we cannot simply skip all calls into libc, but we can use Module.enumerateSymbols() to find the symbol for malloc() and ignore that single function, as sketched below.
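A sketch of both uses; the library name and the availability of a size on the symbol are assumptions that may need adjusting for a given target:

```js
// Exclude a whole library, e.g. libc on Android.
const libc = Process.getModuleByName('libc.so');
Stalker.exclude({ base: libc.base, size: libc.size });

// Or exclude a single function of a statically linked binary.
const app = Process.enumerateModules()[0];
const malloc = app.enumerateSymbols().find(s => s.name === 'malloc');
if (malloc !== undefined && malloc.size !== undefined)
  Stalker.exclude({ base: malloc.address, size: malloc.size });
```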
Freeze/Thaw
Some systems feature DEP, marking pages as either writable or executable, but not both. Frida must therefore toggle page permissions when writing instrumented code, and then make that code executable so it can run. While pages are executable we refer to them as frozen (they cannot be changed), and while they are writable we refer to them as thawed.
Call Instructions
AArch64, unlike Intel, has no single explicit CALL instruction for making function calls. Instead there is a family of instructions used to make calls in different situations. Each of them branches to a given location and updates the link register, LR, with the return address:
BL
BLR
BLRAA
BLRAAZ
BLRAB
BLRABZ
For simplicity, these will all be referred to as "call instructions" throughout the rest of this article.
Frames
Whenever Stalker encounters a call, it stores the return address and the address of the corresponding instrumented block in a structure and pushes it onto a stack which Stalker itself maintains. Stalker uses this stack for optimizations and for heuristics when emitting call and return events.
Transformer
A GumStalkerTransformer is used to generate the instrumented code. The implementation of the default transformer looks something like this:
It is called by the function responsible for generating instrumented code, gum_exec_ctx_obtain_block_for(), and its job is to generate that code. We can see that it does so in a loop, processing one instruction per iteration: it first fetches an instruction from the iterator and then tells Stalker to keep it, i.e. write it out unmodified. Both of these functions are implemented by Stalker itself. The first is responsible for parsing a cs_insn and updating the internal state; cs_insn is the data type which Capstone, the disassembler used internally, uses to represent an instruction. The second is responsible for writing out the instrumented instruction (or instructions). We will cover these in more detail later.
Instead of the default transformer, the user can also provide a custom implementation; there is a good example here, and a minimal JavaScript sketch follows.
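A minimal JavaScript sketch of a custom transform, which logs every BL it sees while keeping all instructions unmodified:

```js
Stalker.follow(Process.getCurrentThreadId(), {
  transform (iterator) {
    let instruction = iterator.next();
    do {
      if (instruction.mnemonic === 'bl')
        console.log(`call at ${instruction.address}: ${instruction}`);
      iterator.keep();                       // emit the instruction as-is
    } while ((instruction = iterator.next()) !== null);
  }
});
```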
Callouts
Transformers can also make callouts. That is, they can instruct Stalker to emit instructions which call a JavaScript function, or a plain C callback, e.g. one implemented using CModule. The callback is passed a context structure describing the CPU state, along with an optional context parameter, and it can inspect or modify registers as it sees fit. This information is stored in a GumCalloutEntry. A sketch of a JavaScript callout follows.
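A minimal sketch of a callout issued from a JavaScript transform; here it simply dumps X0 at the start of every block, but the context may also be modified:

```js
Stalker.follow(Process.getCurrentThreadId(), {
  transform (iterator) {
    let instruction = iterator.next();
    let firstInstruction = true;
    do {
      if (firstInstruction) {
        iterator.putCallout(context => {
          console.log(`x0 = ${context.x0}`);
        });
        firstInstruction = false;
      }
      iterator.keep();
    } while ((instruction = iterator.next()) !== null);
  }
});
```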
EOB/EOI
Recall that the Relocator is central to generating the instrumented code. It has two important properties which describe its state.
End of block (EOB) indicates that the relocator has reached the end of a block. This happens when any branch instruction is encountered: a branch, a call, or a return.
End of input (EOI) indicates that the relocator has not only reached the end of a block, but possibly also the end of the input, i.e. what follows the current instruction may not be an instruction at all. This is typically not the case for call instructions, since control flow usually resumes at the instruction right after the call once it returns (note, though, that compilers typically emit a plain branch for calls to functions which do not return, such as exit()). Since the bytes following such an instruction are not necessarily valid instructions, this distinction lets us apply some optimizations: if we encounter a non-conditional branch or a return instruction, it is quite possible that no more code follows.
Prologues/Epilogues
When control flow passes from the program into the Stalker engine, the CPU registers must be saved so that Stalker can run and make use of the registers itself, and restored when control is handed back, so that none of the application's context is lost.
The AArch64 procedure call standard dictates that some registers (notably X19 through X29) are callee-saved: compiler-generated code which uses them must store them first and restore them afterwards. It is therefore not strictly necessary to store these in our context structure, since they will be preserved by anything the Stalker engine itself calls. A "minimal" context is thus sufficient most of the time.
However, if the Stalker engine is going to call a probe registered with Stalker.addCallProbe(), or a callout created with iterator.putCallout(), those callbacks expect to receive the full CPU context in the layout described by GumArm64CpuContext, so in those cases a full context must be written.
Another thing to note is that the code which writes out the necessary CPU registers (the prologue) is quite long (dozens of instructions), and the code which later restores them (the epilogue) is similarly long. We don't want to emit these two long sequences at the start and end of every single instrumented block. So we instead write them into a common block of memory (much like the ones we write instrumented code into) and simply emit a call to them whenever they are needed. These common blocks of memory are referred to as helpers. The following functions create these prologues and epilogues:
One final thing to note: on AArch64, a direct branch can only reach ±128 MB from the call site, and indirect branches are more expensive (in both memory and time). As we write more and more instrumented blocks, we drift further and further away from the shared prologue and epilogue code. Once we move beyond the 128 MB range we simply write out another copy of the prologues and epilogues and use that instead. This trade-off pays for itself handsomely.
Counters
Finally, you may see counters emitted after each instrumented block, recording the number of each type of instruction encountered at the end of a block. These are only used by the unit tests and for performance tuning, to indicate which branch types require a full context switch.
Slabs
Now let's look at where Stalker stores its instrumented code: slabs. The following data structure keeps track of all the details.
Let's first look at the sizes Stalker configures during initialization:
We can see that each slab is 4 MB in size, and that a twelfth of it is reserved for the header, i.e. the GumSlab structure itself including its array of GumExecBlock entries. Note that this array is declared with a length of zero in the GumSlab structure, but the actual number of entries it can hold is calculated and stored in slab_max_blocks.
So what is the rest of the slab used for? The header holds all of the accounting information, and the remainder of the slab (the tail) is where the instrumented instructions themselves are written, inline in the slab.
So why is a 12th of the slab allocated for the header and the remainder for the
instructions? Well the length of each block to be instrumented will vary
considerably and may be affected by the compiler being used and its optimization
settings. Some rough empirical testing showed that given the average length of
each block this might be a reasonable ratio to ensure we didn’t run out of space
for new GumExecBlock
entries before we ran out of space for new instrumented
blocks in the tail and vice versa.
Now let's take a look at the code used to create these structures:
Here, we can see that the data
field points to the start of the tail where
instructions can be written after the header. The offset
field keeps track of
our offset into the tail. The size
field keeps track of the total number of
bytes available in the tail. The num_blocks
field keeps track of how many
instrumented blocks have been written to the slab.
Note that where possible we allocate the slab with RWX permissions so that we don’t have to freeze and thaw it all of the time. On systems which support RWX the freeze and thaw functions become no-ops.
Lastly, we can see that each slab contains a next
pointer which can be used to
link slabs together to form a singly-linked list. This is used so we can walk
them and dispose them all when Stalker is finished.
Blocks
Now we understand how the slabs work. Let’s look in more detail at the blocks. As we know, we can store multiple blocks in a slab and write their instructions to the tail. Let’s look at the code to allocate a new block:
The function first checks if there is space for a minimally sized block in the
tail of the slab (1024 bytes) and whether there is space in the array of
GumExecBlocks
in the slab header for a new entry. If it does then a new entry
is created in the array and its pointers are set to reference the GumExecCtx
(the main Stalker session context) and the GumSlab
. The code_begin
and
code_end
pointers are both set to the first free byte in the tail. The
recycle_count
used by the trust threshold mechanism to determine how many
times the block has been encountered unmodified is reset to zero, and the
remainder of the tail is thawed to allow code to be written to it.
Next if the trust threshold is set to less than zero (recall -1 means blocks are
never trusted and always re-written) then we reset the slab offset
(the
pointer to the first free byte in the tail) and start over. This means that any
instrumented code written for any blocks within the slab will be overwritten.
Finally, as there is no space left in the current slab and we can’t overwrite it
because the trust threshold means blocks may be re-used, then we must allocate a
new slab by calling gum_exec_ctx_add_slab()
, which we looked at above. We then
call gum_exec_ctx_ensure_inline_helpers_reachable()
, more on that in a moment,
and then we allocate our block from the new slab.
Recall, that we use helpers (such as the prologues and epilogues that save and
restore the CPU context) to prevent having to duplicate these instructions at
the beginning and end of every block. As we need to be able to call these from
instrumented code we are writing to the slab, and we do so with a direct branch
that can only reach ±128 MB from the call site, we need to ensure we can get to
them. If we haven’t written them before, then we write them to our current slab.
Note that these helper functions need to be reachable from any instrumented
instruction written in the tail of the slab. Because our slab is only 4 MB in
size, then if our helpers are written in our current slab then they will be
reachable just fine. If we are allocating a subsequent slab and it is close
enough to the previous slab (we only retain the location we last wrote the
helper functions to) then we might not need to write them out again and can just
rely upon the previous copy in the nearby slab. Note that we are at the mercy of
mmap()
for where our slab is allocated in virtual memory and ASLR may dictate
that our slab ends up nowhere near the previous one.
We can only assume that either this is unlikely to be a problem, or that this
has been factored into the size of the slabs to ensure that writing the helpers
to each slab isn’t much of an overhead because it doesn’t use a significant
proportion of their space. An alternative could be to store every location every
time we have written out a helper function so that we have more candidates to
choose from (maybe our slab isn’t allocated nearby the one previously allocated,
but perhaps it is close enough to one of the others). Otherwise, we could
consider making a custom allocator using mmap()
to reserve a large (e.g. 128
MB) region of virtual address space and then use mmap()
again to commit the
memory one slab at a time as needed. But these ideas are perhaps both overkill.
Instrumenting Blocks
The main function which instruments a code block is called
gum_exec_ctx_obtain_block_for()
. It first looks for an existing block in the
hash table which is indexed on the address of the original block which was
instrumented. If it finds one and the aforementioned constraints around the
trust threshold are met then it can simply be returned.
The fields of the GumExecBlock
are used as follows. The real_begin
is set to
the start of the original block of code to be instrumented. The code_begin
field points to the first free byte of the tail (remember this was set by the
gum_exec_block_new()
function discussed above). A GumArm64Relocator
is
initialized to read code from the original code at real_begin
and a
GumArm64Writer
is initialized to write its output to the slab starting at
code_begin
. Each of these items is packaged into a GumGeneratorContext
and
finally this is used to construct a GumStalkerIterator
.
This iterator is then passed to the transformer. Recall the default implementations is as follows:
We will gloss over the details of gum_stalker_iterator_next()
and
gum_stalker_iterator_keep()
for now. But in essence, this causes the iterator
to read code one instruction at a time from the relocator, and write the
relocated instruction out using the writer. Following this process, the
GumExecBlock
structure can be updated. Its field real_end
can be set to the
address where the relocator read up to, and its field code_end
can be set to
the address which the writer wrote up to. Thus real_begin
and real_end
mark
the limits of the original block, and code_begin
and code_end
mark the
limits of the newly instrumented block. Finally,
gum_exec_ctx_obtain_block_for()
calls gum_exec_block_commit()
which takes a
copy of the original block and places it immediately after the instrumented
copy. The field real_snapshot
points to this (and is thus identical to
code_end
). Next the slab’s offset
field is updated to reflect the space used
by our instrumented block and our copy of the original code. Finally, the block
is frozen to allow it to be executed.
Now let’s just return to a few more details of the function
gum_exec_ctx_obtain_block_for()
. First we should note that each block has a
single instruction prefixed.
This instruction is the restoration prolog (denoted by
GUM_RESTORATION_PROLOG_SIZE
). This is skipped in “bootstrap” usage – hence you
will note this constant is added on by _gum_stalker_do_follow_me()
and
gum_stalker_infect()
when returning the address of the instrumented code. When
return instructions are instrumented, however, if the return is to a block which
has already been instrumented, then we can simply return to that block rather
than returning back into the Stalker engine. This code is written by
gum_exec_block_write_ret_transfer_code()
. In a worst-case scenario, where we
may need to use registers to perform the final branch to the instrumented block,
this function stores them into the stack, and the code to restore these from the
stack is prefixed in the block itself. Hence, in the event that we can return
directly to an instrumented block, we return to this first instruction rather
than skipping GUM_RESTORATION_PROLOG_SIZE
bytes.
Secondly, we can see gum_exec_ctx_obtain_block_for()
does the following after
the instrumented block is written:
This inserts a break instruction which is intended to simplify debugging.
Lastly, if Stalker is configured to, gum_exec_ctx_obtain_block_for()
will
generate an event of type GUM_COMPILE
when compiling the block.
Helpers
We can see from gum_exec_ctx_ensure_inline_helpers_reachable()
that we have a
total of 6 helpers. These helpers are common fragments of code which are needed
repeatedly by our instrumented blocks. Rather than emitting the code they
contain repeatedly, we instead write it once and place a call or branch
instruction to have our instrumented code execute it. Recall that the helpers
are written into the same slabs we are writing our instrumented code into and
that if possible we can re-use the helper written into a previous nearby slab
rather than putting a copy in each one.
This function calls gum_exec_ctx_ensure_helper_reachable()
for each helper
which in turn calls gum_exec_ctx_is_helper_reachable()
to check if the helper
is within range, or otherwise calls the callback passed as the second argument
to write out a new copy.
So, what are our 6 helpers? We have 2 for writing prologues which store register
context, one for a full context and one for a minimal context. We will cover
these later. We also have 2 for their corresponding epilogues for restoring the
registers. The other two, the last_stack_push
and last_stack_pop_and_go
are
used when instrumenting call instructions.
Before we analyze these two in detail, we first need to understand the frame
structures. We can see from the code snippets below that we allocate a page to
contain GumExecFrame
structures. These structures are stored sequentially in
the page like an array and are populated starting with the entry at the end of
the page. Each frame contains the address of the original block and the address
of the instrumented block which we generated to replace it:
last_stack_push
Much of the complexity in understanding Stalker and the helpers in particular is that some functions – let’s call them writers – write code which is executed at a later point. These writers have branches in themselves which determine exactly what code to write, and the written code can also sometimes have branches too. The approach I will take for these two helpers therefore is to show pseudo code for the assembly which is emitted into the slab which will be called by instrumented blocks.
The pseudo code for this helper is shown below:
As we can see, this helper is actually a simple function which takes two
arguments, the real_address
and the code_address
to store in the next
GumExecFrame
structure. Note that our stack is written backwards from the end
of the page in which they are stored towards the start and that current_frame
points to the last used entry (so our stack is full and descending). Also note
we have a conditional check to see whether we are on the last entry (the one at
the very beginning of the page will be page-aligned) and if we have run out of
space for more entries (we have space for 512) then we simply do nothing. If we
have space, we write the values from the parameters into the entry and retard
the current_frame
pointer to point to it.
This helper is used when virtualizing call instructions. Virtualizing is the
name given to the replacement of an instruction typically those relating to
branching with a series of instructions which instead of executing the intended
block allow Stalker to manage the control-flow. Recall as our transformer walks
the instructions using the iterator and calls iterator.keep()
we output our
transformed instruction. When we encounter a branch, we need to emit code to
call back into the Stalker engine so that it can instrument that block, but if
the branch statement is a call instruction (BL
, BLX
etc) we also need to
emit a call to the above helper to store the stack frame information. This
information is used when emitting call events as well as later when optimizing
the return.
last_stack_pop_and_go
Now let's look at the last_stack_pop_and_go
helper. To understand this, we also
need to understand the code written by
gum_exec_block_write_ret_transfer_code()
(the code that calls it), as well as
that written by gum_exec_block_write_exec_generated_code()
which it calls. We
will skip over pointer authentication for now.
So this code is a little harder. It isn’t really a function and the actual assembly written by it is muddied a little by the need to save and restore registers. But the essence of it is this: When virtualizing a return instruction this helper is used to optimize passing control back to the caller. ret_reg contains the address of the block to which we are intending to return.
Let's take a look at the definition of the return instruction:
RET Return from subroutine, branches unconditionally to an address in a register, with a hint that this is a subroutine return.
RET {Xn} Where:
Xn Is the 64-bit name of the general-purpose register holding the address to be branched to, in the range 0 to 31. Defaults to X30 if absent.
As we can see, we are going to return to an address passed in a register.
Typically, we can predict the register value and where we will return to, as the
compiler will emit assembly code so that the register is set to the address of
the instruction immediately following the call which got us there. After
emitting an instrumented call, we emit directly after a little landing pad which
will call back into Stalker to instrument the next block. This landing pad can
later be backpatched (if the conditions are right) to avoid re-entering Stalker
altogether. We store the addresses of both the original block following the call
and this landing pad in the GumExecFrame
structure, so we can simply
virtualize our return instruction by replacing it with instructions which simply
branch to this landing pad. We don’t need to re-enter the Stalker engine each
time we see a return instruction and get a nice performance boost. Simple!
However, we must bear in mind that not all calls will result in a return. A
common technique for hostile or specialized code is to make a call in order to
use the LR
to determine the current position of the instruction pointer. This
value may then be used for introspection purposes (e.g. to validate code to
detect modification, to decrypt or unscramble code, etc.).
Also, remember that the user can use a custom transform to modify instructions as they see fit, they can insert instructions which modify register values, or perhaps a callout function which is passed the context structure which allows them to modify register values as they like. Now consider what if they modify the value in the return register!
So we can see that the helper checks the value of the return register against
the value of the real_address
stored in the GumExecFrame
. If it matches,
then all is well and we can simply branch directly back to the landing pad.
Recall on the first instance, this simply re-enters Stalker to instrument the
next block and branches to it, but at a later point backpatching may be used to
directly branch to this instrumented block and avoid re-entering Stalker
altogether.
Otherwise, we follow a different path. First the array of GumExecFrame
is
cleared; now that our control-flow has deviated, we will start building our
stack again. We accept that we will take this same slower path for any previous
frames in the call-stack we recorded so far if we ever return to them, but will
have the possibility of using the fast path for new calls we encounter from here
on out (until the next time a call instruction is used in an unconventional
manner).
We make a minimal prologue (our instrumented code is now going to have to
re-enter Stalker) and we need to be able to restore the application’s registers
before we return control back to it. We call the entry gate for return,
gum_exec_ctx_replace_current_block_from_ret()
(more on entry gates later). We
then execute the corresponding epilogue before branching to the ctx->resume_at
pointer which is set by Stalker during the above call to
gum_exec_ctx_replace_current_block_from_ret()
to point to the new instrumented
block.
Context
Let’s look at the prologues and epilogues now.
We can see that these do little other than call the corresponding prologue or
epilogue helpers. We can see that the prologue will store X19
and the link
register onto the stack. These are then restored into X19
and X20
at the end
of the epilogue. This is because X19
is needed as scratch space to write the
context blocks and the link register needs to be preserved as it will be
clobbered by the call to the helper.
The LDP and STP instructions load and store a pair of registers respectively and have the option to increment or decrement the stack pointer. This increment or decrement can be carried out either before, or after the values are loaded or stored.
Note also the offset at which these registers are placed. They are stored at
16
bytes + GUM_RED_ZONE_SIZE
beyond the top of the stack. Note that our
stack on AArch64 is full and descending. This means that the stack grows toward
lower addresses and the stack pointer points to the last item pushed (not to the
next empty space). So, if we subtract 16 bytes from the stack pointer, then this
gives us enough space to store the two 64-bit registers. Note that the stack
pointer must be decremented before the store (pre-decrement) and incremented
after the load (post-increment).
So what is GUM_RED_ZONE_SIZE
? The
redzone is a 128
byte area beyond the stack pointer which a function can use to store temporary
variables. This allows a function to store data in the stack without the need to
adjust the stack pointer all of the time. Note that this call to the prologue is
likely the first thing to be carried out in our instrumented block, we don’t
know what local variables the application code has stored in the redzone and so
we must ensure that we advance the stack pointer beyond it before we start using
the stack to store information for the Stalker engine.
Context Helpers
Now that we have looked at how these helpers are called, let us now have a look at the helpers themselves. Although there are two prologues and two epilogues (full and minimal), they are both written by the same function as they have much in common. The version which is written is based on the function parameters. The easiest way to present these is with annotated code:
Now let’s look at the epilogue:
This is all quite complicated. Partly this is because we have only a single register to use as scratch space, partly because we want to keep the prologue and epilogue code stored inline in the instrumented block to a bare minimum, and partly because our context values can be changed by callouts and the like. But hopefully it all now makes sense.
Reading/Writing Context
Now that we have our context saved, whether it was a full context, or just the minimal one, Stalker may need to read registers from the context to see what state of the application code was. For example to find the address which a branch or return instruction was going to branch to so that we can instrument the block.
When Stalker writes the prologue and epilogue code, it does so by calling
gum_exec_block_open_prolog()
and gum_exec_block_close_prolog()
. These store
the type of prologue which has been written in gc->opened_prolog
.
Therefore when we want to read a register, this can be achieved with the single
function gum_exec_ctx_load_real_register_into()
. This determines which kind of
prologue is in use and calls the relevant routine accordingly. Note that these
routines don’t actually read the registers, they emit code which reads them.
Reading registers from the full frame is actually the simplest. We can see the
code closely matches the structure used to pass the context to callouts etc.
Remember that in each case register X20
points to the base of the context
structure.
Reading from the minimal context is actually a little harder. X0
through X18
are simple, they are stored in the context block. After X18
is 8 bytes padding
(to make a total of 10 pairs of registers) followed by X29
and X30
. This
makes a total of 11 pairs of registers. Immediately following this is the
NEON/floating point registers (totaling 128 bytes). Finally X19
and X20
, are
stored above this as they are restored by the inline epilogue code written by
gum_exec_ctx_write_epilog()
.
Control flow
Execution of Stalker begins at one of 3 entry points:
_gum_stalker_do_follow_me()
gum_stalker_infect()
gum_exec_ctx_replace_current_block_with()
The first two we have already covered, these initialize the Stalker engine and
start instrumenting the first block of execution.
gum_exec_ctx_replace_current_block_with()
is used to instrument subsequent
blocks. In fact, the main difference between this function and the preceding two
is that the Stalker engine has already been initialized and hence this work
doesn’t need to be repeated. All three call gum_exec_ctx_obtain_block_for()
to
generate the instrumented block.
We covered gum_exec_ctx_obtain_block_for()
previously in our section on
transformers. It calls the transformer implementation in use, which by default
calls gum_stalker_iterator_next()
which calls the relocator using
gum_arm64_relocator_read_one()
to read the next relocated instruction. Then it
calls gum_stalker_iterator_keep()
to generate the instrumented copy. It does
this in a loop until gum_stalker_iterator_next()
returns FALSE
as it has
reached the end of the block.
Most of the time gum_stalker_iterator_keep()
will simply call
gum_arm64_relocator_write_one()
to emit the relocated instruction as is.
However, if the instruction is a branch or return instruction it will call
gum_exec_block_virtualize_branch_insn()
or
gum_exec_block_virtualize_ret_insn()
respectively. These two virtualization
functions which we will cover in more detail later, emit code to transfer
control back into gum_exec_ctx_replace_current_block_with()
via an entry gate
ready to process the next block (unless there is an optimization where we can
bypass Stalker and go direct to the next instrumented block, or we are entering
into an excluded range).
Gates
Entry gates are generated by a macro, one for each of the different instruction
types found at the end of a block. When we virtualize each of these types of
instruction, we direct control flow back to the
gum_exec_ctx_replace_current_block_with()
function via one of these gates. We
can see that the implementation of the gate is quite simple, it updates a
counter of how many times it has been called and passes control to
gum_exec_ctx_replace_current_block_with()
passing through the parameters it
was called with, the GumExecCtx
and the start_address
of the next block to
be instrumented.
These counters can be displayed by the following routine. They are only meant to be used by the test-suite rather than being exposed to the user through the API.
Virtualize functions
Let’s now look in more detail at the virtualizing we have for replacing the branch instruction we find at the end of each block. We have four of these functions:
gum_exec_block_virtualize_branch_insn()
gum_exec_block_virtualize_ret_insn()
gum_exec_block_virtualize_sysenter_insn()
gum_exec_block_virtualize_linux_sysenter()
We can see that two of these relate to syscalls (and in fact, one calls the other), we will cover these later. Let's look at the ones for branches and returns.
gum_exec_block_virtualize_branch_insn
This routine first determines whether the destination of the branch comes from
an immediate offset in the instruction, or a register. In the case of the
latter, we don’t extract the value just yet, we only determine which register.
This is referred to as the target
. The next section of the function deals with
branch instructions. This includes both conditional and non-conditional
branches. For conditional targets the destination if the branch is not taken is
referred to as cond_target
, this is set to the address of the next instruction
in the original block.
Likewise regular_entry_func
and cond_entry_func
are used to hold the entry
gates which will be used to handle the branch. The former is used to hold the
gate used for non-conditional branches and cond_entry_func
holds the gate to
be used for a conditional branch (whether it is taken or not).
The function gum_exec_block_write_jmp_transfer_code()
is used to write the
code required to branch to the entry gate. For non-conditional branches this is
simple, we call the function passing the target
and the regular_entry_func
.
For conditional branches things are slightly more complicated. Our output looks
like the following pseudo-code:
Here, we can see that we first write a branch instruction into our instrumented
block, as in our instrumented block, we also need to determine whether we should
take the branch or not. But instead of branching directly to the target, just
like for the non-conditional branches we use
gum_exec_block_write_jmp_transfer_code()
to write code to jump back into
Stalker via the relevant entry gate passing the real address we would have
branched to. Note, however that the branch is inverted from the original (e.g.
CBZ
would be replaced by CBNZ
).
Now, let’s look at how gum_exec_block_virtualize_branch_insn()
handles calls.
First we emit code to generate the call event if we are configured to. Next we
check if there are any probes in use. If there are, then we call
gum_exec_block_write_call_probe_code()
to emit the code necessary to call any
registered probe callback. Next, we check if the call is to an excluded range
(note that we can only do this if the call is to an immediate address), if it is
then we emit the instruction as is. But we follow this by using
gum_exec_block_write_jmp_transfer_code()
as we did when handling branches to
emit code to call back into Stalker right after to instrument the block at the
return address. Note that here we use the excluded_call_imm
entry gate.
Finally, if it is just a normal call expression, then we use the function
gum_exec_block_write_call_invoke_code()
to emit the code to handle the call.
This function is pretty complicated as a result of all of the optimization for
backpatching, so we will only look at the basics.
Remember earlier that in gum_exec_block_virtualize_branch_insn()
, we could
only check if our call was to an excluded range if the target was specified in
an immediate? Well if the target was specified in a register, then here we emit
code to check whether the target is in an excluded range. This is done by
loading the target register using
gum_exec_ctx_write_push_branch_target_address()
(which in turn calls
gum_exec_ctx_load_real_register_into()
which we covered earlier to read the
context) and emitting code to call
gum_exec_block_check_address_for_exclusion()
whose implementation is quite
self-explanatory. If it is excluded then a branch is taken and similar code to
that described when handling excluded immediate calls discussed above is used.
Next we emit code to call the entry gate and generate the instrumented block of
the callee. Then call the helper last_stack_push
to add our GumExecFrame
to
our context containing the original and instrumented block address. The real and
instrumented code addresses are read from the current cursor positions of the
GeneratorContext and CodeWriter respectively, and we then generate the required
landing pad for the return address (this is the optimization we covered
earlier, we can jump straight to this block when executing the virtualized
return statement rather than re-entering Stalker). Lastly we use
gum_exec_block_write_exec_generated_code()
to emit code to branch to the
instrumented callee.
gum_exec_block_virtualize_ret_insn
After looking at the virtualization of call instructions, you will be pleased to
know that this one is relatively simple! If configured, this function calls
gum_exec_block_write_ret_event_code()
to generate an event for the return
statement. Then it calls gum_exec_block_write_ret_transfer_code()
to generate
the code required to handle the return instruction. This one is simple too, it
emits code to call the last_stack_pop_and_go
helper we covered earlier.
Emitting events
Events are one of the key outputs of the Stalker engine. They are emitted by the following functions. Their implementation again is quite self-explanatory:
gum_exec_ctx_emit_call_event()
gum_exec_ctx_emit_ret_event()
gum_exec_ctx_emit_exec_event()
gum_exec_ctx_emit_block_event()
One thing to note with each of these functions, however, is that they all call
gum_exec_block_write_unfollow_check_code()
to generate code for checking if
Stalker is to stop following the thread. We’ll have a look at this in more
detail next.
Unfollow and tidy up
If we look at the function which generates the instrumented code to check if we
are being asked to unfollow, we can see it causes the thread to call
gum_exec_ctx_maybe_unfollow()
passing the address of the next instruction to
be instrumented. We can see that if the state has been set to stop following,
then we simply branch back to the original code.
A quick note about pending calls. If we have a call to an excluded range, then we emit the original call in the instrumented code followed by a call back to Stalker. Whilst the thread is running in the excluded range, however, we cannot control the instruction pointer until it returns. We therefore need to simply keep track of these and wait for the thread to exit the excluded range.
Now we can see how a running thread gracefully goes back to running normal uninstrumented code, let’s see how we stop stalking in the first place. We have two possible ways to stop stalking:
gum_stalker_unfollow_me()
gum_stalker_unfollow()
The first is quite simple, we set the state to stop following. Then call
gum_exec_ctx_maybe_unfollow()
to attempt to stop the current thread from being
followed, and then dispose of the Stalker context.
We notice here that we pass NULL
as the address to
gum_exec_ctx_maybe_unfollow()
which may seem odd, but we can see that in this
instance it isn’t used as when we instrument a block (remember
gum_exec_ctx_replace_current_block_with()
is where the entry gates direct us
to instrument subsequent blocks) we check to see if we are about to call
gum_stalker_unfollow_me()
, and if so then we return the original block from the
function rather than the address of the instrumented block generated by
gum_exec_ctx_obtain_block_for()
. Therefore we can see that this is a special
case and this function isn’t stalked. We simply jump to the real function so at
this point we have stopped stalking the thread forever. This handling differs
from excluded ranges as for those we retain the original call instruction in an
instrumented block, but then follow it with a call back into Stalker. In this
case, we are just vectoring back to an original uninstrumented block:
Let’s look at gum_stalker_unfollow()
now:
This function looks through the list of contexts looking for the one for the
requested thread. Again, it sets the state of the context to
GUM_EXEC_CTX_UNFOLLOW_PENDING
. If the thread has already run, we must wait for
it to check this flag and return to normal execution. However, if it has not run
(perhaps it was in a blocking syscall when we asked to follow it and never got
infected in the first instance) then we can disinfect it ourselves by calling
gum_process_modify_thread()
to modify the thread context (this function was
described in detail earlier) and using gum_stalker_disinfect()
as our callback
to perform the changes. This simply checks to see if the program counter was set
to point to the infect_thunk
and resets the program counter back to its
original value. The infect_thunk
is created by gum_stalker_infect()
which is
the callback used by gum_stalker_follow()
to modify the context. Recall that
whilst some of the setup can be carried out on behalf of the target thread, some
has to be done in the context of the target thread itself (in particular setting
variables in thread-local storage). Well, it is the infect_thunk
which
contains that code.
Miscellaneous
Hopefully we have now covered the most important aspects of Stalker and have provided a good background on how it works. We do have a few other observations though, which may be of interest.
Exclusive Store
The AArch64 architecture has support for exclusive load/store instructions. These instructions are intended to be used for synchronization. If an exclusive load is performed from a given address, then later attempts an exclusive store to the same location, then the CPU is able to detect any other stores (exclusive or otherwise) to the same location in the intervening period and the store fails.
Obviously, these types of primitives are likely to be used for constructs such as mutexes and semaphores. Multiple threads may attempt to load the current count of the semaphore, test whether it is already full, then increment and store the new value back to take the semaphore. These exclusive operations are ideal for just such a scenario. Consider though what would happen if multiple threads are competing for the same resource. If one of those threads were being traced by Stalker, it would always lose the race. Also these instructions are easily disturbed by other kinds of CPU operations and so if we do something complex like emit an event between a load and a store we are going to cause it to fail every time, and end up looping indefinitely. Stalker, however, deals with such a scenario:
Here, we can see that the iterator records when it sees an exclusive load and tracks how many instructions have passed since. This is continued for up to four instructions – as this was determined by empirical testing based on how many instructions would be needed to load, test, modify and store the value. This is then used to prevent any instrumentation being emitted which isn’t strictly necessary:
Exhausted Blocks
Whilst we check to ensure a minimum amount of space for our current instrumented
block is left in the slab before we start (and allocate a new one if we fall
below this minimum), we cannot predict how long a sequence of instructions we
are likely to encounter in our input block. Nor is it simple to determine exactly
how many instructions in output we will need to write the necessary
instrumentation (we have possible code for emitting the different types of
event, checking for excluded ranges, virtualizing instructions found at the end
of the block etc.). Also, trying to allow for the instrumented code to be
non-sequential is fraught with difficulty. So the approach taken is to ensure
that each time we read a new instruction from the iterator there is at least
1024 bytes of space in the slab for our output. If it is not the case, then we
store the current address in continuation_real_address
and return FALSE
so
that the iterator ends.
Our caller gum_exec_ctx_obtain_block_for()
which is walking the iterator to
generate the block then acts exactly as if there was a branch instruction to the
next instruction, essentially terminating the current block and starting the
next one.
It is as if the following instructions had been encountered in the input right before the instruction which would have not had sufficient space:
Syscall Virtualization
Syscalls are entry points from user-mode into kernel-mode. It is how
applications ask the kernel to carry out operations on their behalf, whether that be
opening files or reading network sockets. On AArch64 systems, this is carried
out using the SVC
instruction, whereas on Intel the instruction is sysenter
.
Hence the terms syscall and sysenter here are used synonymously.
Syscall virtualization is carried out by the following routine. We can see we only do anything on Linux systems:
This is required because of the clone
syscall. This syscall creates a new
process which shares execution context with the parent, such as file handles,
virtual address space, and signal handlers. In essence, this effectively creates
a new thread. But the current thread is being traced by Stalker, and clone is
going to create an exact replica of it. Given that Stalker contexts are on a
per-thread basis, we should not be stalking this new child.
Note that for syscalls in AArch64 the first 8 arguments are passed in registers
X0
through X7
and the syscall number is passed in X8
, additional arguments
are passed on the stack. The return value for the syscall is returned in X0
.
The function gum_exec_block_virtualize_linux_sysenter()
generates the
necessary instrumented code to deal with such a syscall. We will look at the
pseudo code below:
We can see that it first checks if we are dealing with a clone
syscall,
otherwise it simply performs the original syscall and that is all (the original
syscall instruction is copied from the original block). Otherwise if it is a
clone syscall, then we again perform the original syscall. At this point, we
have two threads of execution, and the syscall returns a different value in each
thread.
The original thread will receive the child’s PID as its return value, whereas
the child will receive the value of 0.
If we receive a non-zero value, we can simply continue as we were. We want to continue stalking the thread and allow execution to carry on with the next instruction. If, however, we receive a return value of 0, then we are in the child thread. We therefore carry out a branch to the next instruction in the original block ensuring that the child continues to run without any interruption from Stalker.
Pointer Authentication
Last of all, we should note that newer versions of iOS have introduced pointer authentication codes. Pointer authentication codes (PACs) make use of unused bits in pointers (the high bits of virtual addresses are commonly unused as most systems have a maximum of 48-bits of virtual address space) to store authentication values. These values are calculated by using the original pointer, a context parameter (typically the contents of another register) and a cryptographic key. The idea is that the key cannot be read or written from user-mode, and the resulting pointer authentication code cannot be guessed without having access to it.
Let’s look at the following fragment of code:
The pacia
instruction combines the values of LR
, SP
and the key to
generate a version of LR
with the authentication code LR'
and stores back
into the LR
register. This value is stored in the stack and later restored at
the end of the function. The autia
instruction validates the value of LR'
.
This is possible since the PAC in the high bits of LR
can be stripped to give
the original LR
value and the pointer authentication code can be regenerated
as it was before using SP
and the key. The result is checked against LR'
. If
the value doesn’t match then the instruction generates a fault. Thus if the
value of LR
stored in the stack is modified, or the stack pointer itself is
corrupted then the validation will fail. This is useful to prevent the building
of ROP chains which require return addresses to be stored in the stack. Since
LR'
is now stored in the stack instead of LR
, valid return addresses cannot
be forged without the key.
Frida needs to take this into account also when generating code. When reading
pointers from registers used by the application (e.g. to determine the
destination of an indirect branch or return), it is necessary to strip these
pointer authentication codes from the address before it is used. This is
achieved using the function gum_arm64_writer_put_xpaci_reg()
.