# Cubik'cipher

Cubik'cipher was a challenge from the @HackDay Qualifications 2022 in the hardware category, with 3 solves. The challenge in itself was a mixture of hardware and crypto and was a fun introduction to VHDL, as well as an opportunity for me to learn playing with some HDL-specific tools, which was out of my comfort zone.

## Description

A month ago, a spy managed to infiltrate the secret lair of an enemy! This one was developing his own data encryptor! Since then, it seems that the enemy has finished it, so it's up to us to build the decryptor that will allow us to spy on him without him noticing! The flag is of the form HACKDAY{...}
We were given several files:
• cubik_cipher.vhd, cubik_pkg.vhd and key_randomize.vhd: VHDL sources for a cryptographic algorithm implementation, which included encryption only.
• test_vector.txt: a file containing an example case of plaintext, key and corresponding ciphertext.
• flag.txt: a file containing a ciphertext and a key.
The goal of the challenge is therefore clear: we have to implement the algorithm that will decrypt the flag. We also understand that this is not an actual crypto challenge — in the sense that, even though the algorithm may be original, it is not expected to find weaknesses in it in order to solve the challenge, as the key is given.
Globally, there are two main routes one can choose from:
• Implement decryption directly in VHDL.
• Translate the whole algorithm to another language, use the test vector to make sure encryption works correctly, and implement decryption.
Since I didn't feel comfortable enough to go for the first option, I rewrote the algorithm in Python with Sage. But first, we need to look at the sources and understand the algorithm.

## Understanding the cryptography

Prior to anything, let's take a look at test_vector.txt and flag.txt.
data : 4841434B4441590000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
data_ciphered : 065A9D041EA0ABB6A38764BA3DCB6B13EFE3FC67DF249BFF149A8EFEA0984D52AC295403103023537198AFDE0C64D7AC5D23F2F25BE941C1AA8149E9FC174995D491F150E518AF26
key : f0ae2e1abee8afbe3ea424cc71f4ce17455a21d5df15cc4f6362e3af095cfb6da7188a9777c2c875ab39145a88a2142aea7b5411607110d70cd3d37c20f259b1920031990709d8e0e8d661b1a05fe8b5719aab6569835b3e52be738982608fda36549fd1e3398c725190356fbe97998b79f84f0ef23c4dea63898b52319a47a2
Before encryption, the plaintext is padded with null bytes to have a 576-bits block (72 bytes). The ciphertext is also a 576-bits block. The key, however, is much larger (1024 bits).

### Main entity

The "main" file in the VHDL sources is cubik_cipher.vhd:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.cubik_pkg.all;
entity cubik_cipher is
generic (round_div_2 : positive := 8);
port (
resetn : in std_logic;
clk : in std_logic;
-- key
key : in std_logic_vector(1023 downto 0);
-- d in
d_v_i : in std_logic;
d_i : in std_logic_vector(data_width-1 downto 0);
-- dout
d_c : out std_logic_vector(data_width-1 downto 0);
d_v_c : out std_logic
);
end entity;
architecture rtl of cubik_cipher is
constant key_w : positive := 576;
type state is (idle, engine);
signal current_state,next_state : state;
signal ctr_round : natural range 0 to 9;
signal end_ctr,cmd_ctr : std_logic;
signal rkey : std_logic_vector(key_w-1 downto 0);
signal reg_cipher : std_logic_vector(d_i'range);
begin
process (clk, resetn) is
begin
if resetn = '0' then
ctr_round <= 0;
elsif rising_edge(clk) then
if cmd_ctr = '0' then
ctr_round <= 0;
else
ctr_round <= ctr_round + 1;
end if;
end if;
end process;
end_ctr <= '1' when ctr_round = 9 else '0';
process (clk, resetn) is
begin
if resetn = '0' then
current_state <= idle;
elsif rising_edge(clk) then
current_state <= next_state;
end if;
end process;
process (clk, resetn) is
begin
if resetn = '0' then
reg_cipher <= (others => '0');
elsif rising_edge(clk) then
if d_v_i = '1' and ctr_round = 0 then
reg_cipher <= d_i;
else
reg_cipher <= round(reg_cipher, rkey);
end if;
end if;
end process;
d_c <= reg_cipher;
process (current_state, end_ctr, d_v_i)
begin
d_v_c <= '0';
case current_state is
----------------------------
when idle =>
if d_v_i = '1' then
next_state <= engine;
end if;
cmd_ctr <= '0';
----------------------------
when engine =>
if end_ctr = '1' then
next_state <= idle;
d_v_c <= '1';
cmd_ctr <= '0';
else
cmd_ctr <= '1';
end if;
----------------------------
end case;
end process;
key_engine : entity work.key_randomize
generic map(key_w => key_w)
port map (
resetn => resetn,
clk => clk,
key => key,
key_r => rkey
);
end architecture;
Note: again, I am not well-versed in hardware and HDL, so some of my explanations or the terminology that I use may be incorrect at times.
The following part describes the inputs and outputs of the cubik_cipher entity:
entity cubik_cipher is
generic (round_div_2 : positive := 8);
port (
resetn : in std_logic;
clk : in std_logic;
-- key
key : in std_logic_vector(1023 downto 0);
-- d in
d_v_i : in std_logic;
d_i : in std_logic_vector(data_width-1 downto 0);
-- dout
d_c : out std_logic_vector(data_width-1 downto 0);
d_v_c : out std_logic
);
end entity;
We can see it needs a clock signal clk, an input key key (1024 bits), and input data d_i which is the plaintext. We will see that d_v_i dictates when the encryption process should start. The outputs are d_c, the ciphertext, and d_v_c, a bit that notifies the encryption process has terminated.
Then, the architecture for cubik_cipher is described at RTL (Register Transfer Level), which is a way to represent a circuit at a higher level of abstraction, using registers acting as signals.
architecture rtl of cubik_cipher is
constant key_w : positive := 576;
type state is (idle, engine);
signal current_state,next_state : state;
signal ctr_round : natural range 0 to 9;
signal end_ctr,cmd_ctr : std_logic;
signal rkey : std_logic_vector(key_w-1 downto 0);
signal reg_cipher : std_logic_vector(d_i'range);
begin
[...]
The architecture comprises several processes.
process (clk, resetn) is
begin
if resetn = '0' then
ctr_round <= 0;
elsif rising_edge(clk) then
if cmd_ctr = '0' then
ctr_round <= 0;
else
ctr_round <= ctr_round + 1;
end if;
end if;
end process;
end_ctr <= '1' when ctr_round = 9 else '0';
This process keeps track of a round counter incremented at each clock rising edge. We learn that when the counter reaches 9, the end_ctr signal is set to 1. This hints at the algorithm being a block cipher with around 9 rounds. However, with the way VHDL works and how difficult it is to have a clear idea in mind of how logic is sequentially unfolded, I wasn't sure at this point that the number of rounds was exactly 9, and not something like 8 or 10.
process (clk, resetn) is
begin
if resetn = '0' then
reg_cipher <= (others => '0');
elsif rising_edge(clk) then
if d_v_i = '1' and ctr_round = 0 then
reg_cipher <= d_i;
else
reg_cipher <= round(reg_cipher, rkey);
end if;
end if;
end process;
d_c <= reg_cipher;
This process performs, at each clock rising edge, a round of encryption. At the beginning (d_v_i = 1), the reg_cipher signal is loaded with the plaintext (d_i). Then, each iteration, the round function is called on the reg_cipher, with a rkey parameter. The d_c signal, which is the output of the entity, is linked to the reg_cipher, and therefore contains the output ciphertext when all rounds have been computed.
Let's see what is this rkey signal:
key_engine : entity work.key_randomize
generic map(key_w => key_w)
port map (
resetn => resetn,
clk => clk,
key => key,
key_r => rkey
);
This portion of code maps inputs and outputs for another entity in another file key_randomize.vhd. The rkey, which we understand means "round key", is an output of this key_randomize entity. Let's check it out.

### Key randomization entity

library ieee;
use ieee.std_logic_1164.all;
entity key_randomize is
generic (key_w : positive := 32);
port (
resetn : in std_logic;
clk : in std_logic;
key : in std_logic_vector(1023 downto 0);
key_r : out std_logic_vector(key_w-1 downto 0)
);
end entity;
architecture rtl of key_randomize is
signal reg : std_logic_vector(key'range);
begin
process (clk,resetn) is
begin
if resetn = '0' then
reg <= (others => '1');
elsif rising_edge(clk) then
reg <= key;
else
reg <= reg(reg'length-2 downto 0) & reg(reg'high);
reg(24) <= reg(23) xor reg(reg'high);
reg(421) <= reg(420) xor reg(reg'high);
reg(476) <= reg(475) xor reg(reg'high);
reg(545) <= reg(544) xor reg(reg'high);
reg(923) <= reg(922) xor reg(reg'high);
end if;
end if;
end process;
key_r <= reg(key_r'range);
end architecture;
At first, when load = 1 (d_v_i = 1), the key is loaded in the reg register, which acts as an internal state for a round key derivation function. At each clock rising edge, this internal state is updated by rotating it 1 bit to the left and flipping a few specific bits.
The output round key (key_r) is then a projection of the internal state, key_w being 576 bits. Therefore, even though the key is 1024 bits, the round keys are always 576 bits (least significant bits of the internal state).

### Core cryptographic entity

Let's now focus on the core part of the algorithm: the round function in cubik_pkg.vhd.
function round(data : std_logic_vector(data_width-1 downto 0) ; key : std_logic_vector(575 downto 0)) return std_logic_vector is
variable tmp : cubix;
variable data_out : std_logic_vector(data'range);
begin
tmp := slv2cubix(data);
tmp := mixcubix(tmp, (m_0,m_1,m_2,m_3));
tmp := roundcubix(tmp, key);
tmp := swap_rows(tmp);
data_out := cubix2slv(tmp);
return data_out;
end round;
In order to perform calculations, the input data vector is converted (slv2cubix) to another representation, called cubix, and converted back to normal at the end (cubix2slv).
A cubix is composed of four
$4 \times 4$
matrices which coefficients are 9-bit integers, which I will call nonets. Indeed, even though the data vectors are 72 bytes long, 576 bits is divisible by 9, which gives 64 nonets. These nonets are split in 4 groups of 16 nonets, each group populating a square matrix.
Denoting
$(m_0, \ldots, m_{63})$
the input nonets, the cubix is therefore:
$\begin{bmatrix}d_{63} & d_{62} & d_{61} & d_{60} \\ d_{59} & d_{58} & d_{57} & d_{56} \\ d_{55} & d_{54} & d_{53} & d_{52} \\ d_{51} & d_{50} & d_{49} & d_{48}\end{bmatrix} \: \ldots \: \begin{bmatrix} d_{15} & d_{14} & d_{13} & d_{12} \\ d_{11} & d_{10} & d_{9} & d_{8} \\ d_{7} & d_{6} & d_{5} & d_{4} \\ d_{3} & d_{2} & d_{1} & d_{0} \end{bmatrix}$
Then, the actual encryption part is carried through three different steps.
Mix cubix. This step uses four constant matrices
$m_0, m_1, m_2, m_3$
, defined as follows.
$\begin{bmatrix} 3 & 0 & 2 & 6 \\ 6 & 3 & 0 & 2 \\ 2 & 6 & 3 & 0 \\ 0 & 2 & 6 & 3 \end{bmatrix}, \begin{bmatrix} 6 & 4 & 3 & 0 \\ 0 & 6 & 4 & 3 \\ 3 & 0 & 6 & 4 \\ 4 & 3 & 0 & 6 \end{bmatrix}, \begin{bmatrix} 4 & 0 & 9 & 3 \\ 3 & 4 & 0 & 9 \\ 9 & 3 & 4 & 0 \\ 0 & 9 & 3 & 4 \end{bmatrix}, \begin{bmatrix} 2 & 4 & 0 & 9 \\ 9 & 2 & 4 & 0 \\ 0 & 9 & 2 & 4 \\ 4 & 0 & 9 & 2 \end{bmatrix}$
function mixcubix(c : cubix; mt : matrix_t) return cubix is
variable tmp : cubix;
begin
for k in cubix'range loop
tmp(k) := mixmatrix(c(k),shiftmt(mt,k));
end loop;
return tmp;
end mixcubix;
Each matrix in the cubix is mixed using 4 matrices. More especially, if
$(C_0, C_1, C_2, C_3)$
is the cubix:
$C_0 = \text{mixmatrix}(C_0, (m_0, m_1, m_2, m_3)) \\ C_1 = \text{mixmatrix}(C_1, (m_1, m_2, m_3, m_0)) \\ C_2 = \text{mixmatrix}(C_2, (m_2, m_3, m_0, m_1)) \\ C_3 = \text{mixmatrix}(C_3, (m_3, m_0, m_1, m_2))$
As for the mixmatrix function, it basically multiplies each row of the input matrix with the
$m_i$
matrices:
$\text{mixmatrix}(C, (m_0, m_1, m_2, m_3)) = \begin{bmatrix} C^{(0)} m_0 \\ C^{(1)} m_1 \\ C^{(2)} m_2 \\ C^{(3)} m_3 \end{bmatrix}$
function mixrow(r : row; m : matrix_c) return row is
variable tmp : row;
begin
for i in row'range loop
tmp(i) := times(m(i,0),r(0)) xor times(m(i,1),r(1)) xor times(m(i,2),r(2)) xor times(m(i,3),r(3));
end loop;
return tmp;
end mixrow;
function mixmatrix(m : matrix; mt : matrix_t) return matrix is
variable tmp : matrix;
begin
for j in matrix'range loop
tmp(j) := mixrow(m(j),mt(j));
end loop;
return tmp;
end mixmatrix;
However, the multiplications are carried out in a specific mathematical space. The function times implements a double-and-add algorithm.
function times2(n : nonaire) return nonaire is
variable tmp :nonaire;
begin
tmp := n(7 downto 4) & (n(3) xor n(8)) & n(2 downto 0) & n(8);
return tmp;
end times2;
function times(n1: nonaire; n2: nonaire) return nonaire is
variable tmp : nonaire;
begin
tmp := (others => '0');
for i in n1'range loop
if n1(i) = '1' then
tmp := tmp xor n2;
end if;
if i > 0 then
tmp := times2(tmp);
end if;
end loop;
return tmp;
end times;
If the add operation is merely a XOR, the double operation is more intricate. A comment at the beginning of the file gives out a big hint:
-- Primitive polynomial = D^9+D^4+1 -- GF(512)
Operations are actually carried out in a Galois Field,
$\text{GF}(512)$
, with the primitive polynomial
$D^9 + D^4 + 1$
. Roughly, this means a nonet can be seen as a polynomial modulo
$D^9 + D^4 + 1$
(so max degree 8), whose coefficients are in
$\mathbb{Z}/2\mathbb{Z}$
(so each coefficient is a bit of the nonet).
Matrices can be defined over this space since it is a field, and thus operations such as addition, multiplication, inverse... exist as well.
Round cubix. This second step is easier to understand. It simply performs a bitwise XOR between tmp and the round key:
function roundrow(r : row; key : std_logic_vector(35 downto 0)) return row is
variable tmp : row;
begin
for i in row'range loop
tmp(i) := r(i) xor key(9*(i+1)-1 downto 9*i);
end loop;
return tmp;
end roundrow;
function roundmatrix(m : matrix; key : std_logic_vector(143 downto 0)) return matrix is
variable tmp : matrix;
begin
for j in matrix'range loop
tmp(j) := roundrow(m(j),key(36*(j+1)-1 downto 36*j));
end loop;
return tmp;
end roundmatrix;
function roundcubix(c : cubix; key : std_logic_vector(575 downto 0)) return cubix is
variable tmp : cubix;
begin
for k in cubix'range loop
tmp(k) := roundmatrix(c(k),key(144*(k+1)-1 downto 144*k));
end loop;
return tmp;
end roundcubix;
Swap rows. The final step is also quite simple — it only permutes the coefficients of the matrix a certain way:
function swap_rows (c: cubix) return cubix is
variable tmp :cubix;
begin
tmp (0)(0) := c(1)(1);
tmp (0)(1) := c(3)(3);
tmp (0)(2) := c(2)(2);
tmp (0)(3) := c(2)(1);
tmp (1)(0) := c(3)(1);
tmp (1)(1) := c(2)(0);
tmp (1)(2) := c(0)(1);
tmp (1)(3) := c(1)(2);
tmp (2)(0) := c(1)(0);
tmp (2)(1) := c(0)(3);
tmp (2)(2) := c(2)(3);
tmp (2)(3) := c(3)(0);
tmp (3)(0) := c(3)(2);
tmp (3)(1) := c(0)(0);
tmp (3)(2) := c(1)(3);
tmp (3)(3) := c(0)(2);
return tmp;
end swap_rows;
We should have everything needed to implement the encryption algorithm. As for the Galois Field part, SageMath makes it easy.
But obviously, it didn't work at first try. As I presented it here, the algorithm seems rather unambiguous, but when you are in the process of reversing it, there are a lot of elements for which you have uncertainties. Indeed, many issues could happen at different levels:
• Actual bugs in my Python code
• Misunderstanding of the VHDL algorithm
• Uncertainty about the number of rounds
• Uncertainty about endianness at many steps
• Is the key randomization function applied before or after the round function?
• Did I understand the key randomization function properly? (especially with the fact that in VHDL, signals are updated at the end of a process)
• Do the times and times2 functions even do what I think they do?
• Is my mathematical interpretation correct?
• Is my translation to Sage correct?
• ...
With so many doubts, it felt mandatory to run the original VHDL sources with the test vector and observe the state of the signals at each iteration, in order to debug my code.

## Creating a testbench for debugging

I read some documentation and wrote a testbench to interact with the cubik_cipher component:
library ieee;
use ieee.std_logic_1164.all;
use work.cubik_pkg.all;
entity testbench is
end testbench;
architecture behavior of testbench is
component cubik_cipher is
port (
resetn : in std_logic;
clk : in std_logic;
-- key
key : in std_logic_vector(1023 downto 0);
-- d in
d_v_i : in std_logic;
d_i : in std_logic_vector(data_width-1 downto 0);
-- dout
d_c : out std_logic_vector(data_width-1 downto 0);
d_v_c : out std_logic
);
end component;
signal resetn : std_logic := '0';
signal input : std_logic_vector(data_width-1 downto 0);
signal output : std_logic_vector(data_width-1 downto 0);
signal inputkey : std_logic_vector(1023 downto 0);
signal d_v_i : std_logic := '0';
signal d_v_c : std_logic := '0';
signal clk : std_logic := '0';
constant clk_period : time := 1 ns;
begin
uut: cubik_cipher port map (
resetn => resetn,
clk => clk,
key => inputkey,
d_v_i => d_v_i,
d_i => input,
d_c => output,
d_v_c => d_v_c
);
clk_process : process
begin
clk <= '0';
wait for clk_period;
clk <= '1';
wait for clk_period;
end process;
stim_process: process
begin
wait for 1 ns;
resetn <= '1';
inputkey <= "1110011000110111111001110001010001111011001010010001000111011010011110101000000100100010011010011111001001001101101001001011101010000101001110101000110100000111000010000111101011101110101010000100110101101100010100001110001010010001010011110010111110001010110111001110101001100111001011101011111001000101110111101000111001000101100011111100111011011001101000001101101101110101010110011110101110000011101100110101010010001101110010010001101010100110000100101100110110110110000001100010111011011101011011001001101011001001100100111110110101100000011111100011111111000011100011000110111000100111110010000001110011010101011001100110101101101110101000010100011111000100011000001111001111000100011001010110010111011110100010010000010111101011110010010110010010110110100000110100001100001111110111001011000101010001111011000011110011110010000100100111010001000101101101010110111111010000000001111001110000010110000101000110011101111010100001100110111100010111100110001001110000000010000111001100010110110101001110111001011111001010";
input <= "010010000100000101000011010010110100010001000001010110010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000";
d_v_i <= '1';
wait for 1 ns;
d_v_i <= '0';
wait for 11 ns;
end process;
end;
In order to simulate the circuit, I used GHDL, an open-source compiler and simulator for VHDL.
I ran the simulation with ghdl -r testbench --vcd=out.vcd --stop-time=50ns. This generated a trace file in the VCD format, which I viewed with GTKWave. Here is what it looks like: First, the key and the plaintext are loaded. The first round key is the 576 least significant bits of the key. Next, the first round has been performed, and the next round key has been computed.
Thanks to this, I was able to determine that there were indeed 9 rounds and that the key started to be randomized after the first round. I could also debug my script step by step and compare my values to the signals, to locate where I made mistakes.
This allowed to have a working implementation of the encryption algorithm.

## Implementing decryption

The only thing that remained was to implement decryption. This should not be too hard — we only need to generate the round keys and follow the reverse steps for the rounds.
The swap rows and round cubix steps are trivial to invert (inverse permutation, and XOR the round key). As for the mix cubix step, its inverse it is actually the same operation, but with the inverse matrices for
$m_i$
(which is also trivial to do thanks to Sage).

### Solution script

Here is the full script that implements encryption and decryption in Python and Sage.
F.<D> = GF(2)[]
K.<x> = GF(512, name='x', modulus=D^9+D^4+1)
m0_ = [[3,0,2,6],[6,3,0,2],[2,6,3,0],[0,2,6,3]]
m1_ = [[6,4,3,0],[0,6,4,3],[3,0,6,4],[4,3,0,6]]
m2_ = [[4,0,9,3],[3,4,0,9],[9,3,4,0],[0,9,3,4]]
m3_ = [[2,4,0,9],[9,2,4,0],[0,9,2,4],[4,0,9,2]]
def swap_rows(c):
tmp = [ * 4 for i in range(4)]
tmp, tmp, tmp, tmp = c, c, c, c
tmp, tmp, tmp, tmp = c, c, c, c
tmp, tmp, tmp, tmp = c, c, c, c
tmp, tmp, tmp, tmp = c, c, c, c
return tmp
def inv_swap_rows(c):
tmp = [ * 4 for i in range(4)]
tmp, tmp, tmp, tmp = c, c, c, c
tmp, tmp, tmp, tmp = c, c, c, c
tmp, tmp, tmp, tmp = c, c, c, c
tmp, tmp, tmp, tmp = c, c, c, c
return tmp
def decimal2bin(d):
return [int(_) for _ in f"{d:09b}"]
def bin2nonet(d):
return K(d[::-1])
def nonet2bin(N):
L = N.polynomial().list()
while len(L) != 9:
L.append(0)
return L[::-1]
m0 = [[bin2nonet(decimal2bin(m0_[j][i])) for i in range(4)] for j in range(4)]
m1 = [[bin2nonet(decimal2bin(m1_[j][i])) for i in range(4)] for j in range(4)]
m2 = [[bin2nonet(decimal2bin(m2_[j][i])) for i in range(4)] for j in range(4)]
m3 = [[bin2nonet(decimal2bin(m3_[j][i])) for i in range(4)] for j in range(4)]
def slv2row(data):
return [bin2nonet(data[9 * i:9 * (i + 1)]) for i in range(3, -1, -1)]
def slv2matrix(data):
return [slv2row(data[36 * j:36 * (j + 1)]) for j in range(3, -1, -1)]
def slv2cubix(data):
return [slv2matrix(data[144 * k:144 * (k + 1)]) for k in range(3, -1, -1)]
def row2slv(r):
tmp = []
for i in range(4):
tmp = nonet2bin(r[i]) + tmp
return tmp
def matrix2slv(m):
tmp = []
for j in range(4):
tmp = row2slv(m[j]) + tmp
return tmp
def cubix2slv(c):
tmp = []
for k in range(4):
tmp = matrix2slv(c[k]) + tmp
return tmp
def mixrow(r, m):
return [
(m[i] * r) + (m[i] * r) + (m[i] * r) + (m[i] * r)
for i