### 5.1 Recap

In the last lecture we showed a constant-depth polynomial-size boolean circuit (of unbounded fanin) that computed the sum of two n-bit integers, implying the parallel time complexity of Integer Addition is constant. Then we wondered whether the following problems also have constant parallel time complexity: Iterated Integer Addition, Integer Multiplication and Iterated Integer Multiplication.

In the case of iterated addition, we observed that the least significant bit (LSB) of the sum of the $n$ $n$-bit inputs is exactly the parity of the LSBs of the $n$ inputs. But the parity of $n$ bits cannot be computed by any constant-depth polynomial-size boolean circuit:

Theorem 5.1 Any $\Delta$-depth circuit computing the parity of $n$ bits using $\wedge, \vee, \neg$ gates (of unbounded fanin) must have size at least $2^{n^{1 / \Delta}}$ [1].
(Proof not covered here.) Therefore, the parallel time complexity of Iterated Addition is not constant.

### 5.2 Classes $\mathrm{AC}^{0}, \mathrm{TC}^{0}$

Definition 5.2 $\mathrm{AC}^{0}$ is the class of boolean functions computed by boolean circuits having $\vee, \wedge$ and $\neg$ gates, with constant depth and polynomial size, having unbounded fanin for $\wedge, \vee$.

From the discussion above, Integer Addition $\in \mathrm{AC}^{0}$ and Iterated Integer Addition $\notin \mathrm{AC}^{0}$.

Theorem 5.3 Integer Multiplication $\notin \mathrm{AC}^{0}$

Proof: We show an $\mathrm{AC}^{0}$ reduction from PARITY to Integer Multiplication.
Let the input be $x_{1}, \ldots, x_{n}$. We consider the binary string $x_{1} \ldots x_{n}$. Between each $x_{i}$ and $x_{i+1}$ for $i \in[n-1]$, we insert $\log n$ many zeros to get another binary string $y$. The string $y$ is of length $n+(n-1) \log n$. Let $z$ be the binary string of length $|y|$ that has 1 at the positions corresponding to the $x_{i}$'s of $y$, and 0 everywhere else. An $\mathrm{AC}^{0}$ circuit can easily form $y, z$. Treating $y, z$ as integers, let $w$ be their product with bits $w_{0}$ (LSB), $w_{1}, w_{2}, \ldots$. From the naive algorithm it is clear that $w_{n-1+(n-1) \log n}=\operatorname{PARITY}\left(x_{1}, \ldots, x_{n}, c_{n-1+(n-1) \log n}\right)$, where $c_{i}$ is the $i^{\text{th}}$ carry bit. A closer inspection reveals that $c_{n-1+(n-1) \log n}=0$: each nonzero column sum is at most $n$, so it spills into at most $\log n$ higher columns, and columns $n-2+(n-1) \log n$ through $n-1+(n-2) \log n$ "absorb" all carries propagated from the lower columns. Thus $w_{n-1+(n-1) \log n}=\operatorname{PARITY}\left(x_{1}, \ldots, x_{n}\right)$.
It follows that Iterated Integer Multiplication $\notin \mathrm{AC}^{0}$.
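
The reduction in the proof of Theorem 5.3 can be checked numerically. The following is a minimal sketch, not part of the lecture: the function name is hypothetical, and it assumes the padding is $\lceil \log_2 n \rceil$ zeros so that every column sum fits inside one block.

```python
def parity_via_multiplication(bits):
    """Return PARITY(bits) by reading a single bit of the product y * z.

    Requires at least one input bit.
    """
    n = len(bits)
    # Assumption: pad with ceil(log2 n) zeros, so a block of `stride` bits
    # can hold a column sum as large as n without spilling into the next block.
    stride = 1 + (n - 1).bit_length()
    y = sum(b << (i * stride) for i, b in enumerate(bits))  # padded input
    z = sum(1 << (i * stride) for i in range(n))            # ones at the x_i positions
    w = y * z
    # Column (n-1)*stride receives every x_i exactly once and no carry.
    return (w >> ((n - 1) * stride)) & 1


print(parity_via_multiplication([1, 0, 1, 1]))
```

Only the multiplication does real work here; forming `y` and `z` is just wiring, which is why the reduction is an $\mathrm{AC}^{0}$ reduction.
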
We now consider threshold gates (slightly more powerful than the basic gates) and see if they give us constant-depth polynomial-sized circuits for Iterated Integer Addition, Integer Multiplication and Iterated Integer Multiplication.

Definition 5.4 For inputs $x_{1}, \ldots, x_{n} \in\{0,1\}$, the output of a threshold gate is

$$
\operatorname{Th}\left(x_{1}, \ldots, x_{n}\right)= \begin{cases}1 & \text { if } a_{1} x_{1}+\cdots+a_{n} x_{n} \geq \theta \\ 0 & \text { otherwise }\end{cases}
$$

where $\theta, a_{1}, \ldots, a_{n} \in \mathbb{Z}$. The parameters $\theta, a_{1}, \ldots, a_{n}$ may depend on $n$, but they do not depend on the input $x_{1}, \ldots, x_{n}$.

Remarks

1. If we set $a_{i}=1 \ \forall i \in[n]$ and $\theta=1$ then we get an $\vee$ gate.
2. If we set $a_{i}=1 \ \forall i \in[n]$ and $\theta=n$ then we get an $\wedge$ gate.
3. If we take a single input ($n=1$) and set $a_{1}=-1$ and $\theta=0$ then we get a $\neg$ gate.
4. If we set $a_{i}=1 \ \forall i \in[n]$ and $\theta=\lceil n / 2\rceil$ then we get a MAJORITY gate. A MAJORITY gate outputs 1 if at least half the (boolean) inputs are 1 and outputs 0 otherwise.
5. One of the motivations to study threshold gates comes from Artificial Neural Networks. In a simplistic view, neurons behave like threshold gates. A neuron receives signals from several neurons, and fires if the weighted sum of inputs exceeds some threshold.
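
Remarks 1 to 4 can be made concrete with a small simulation. This is only an illustrative sketch; `threshold_gate` and the wrapper names are hypothetical helpers, not notation from the lecture.

```python
def threshold_gate(weights, theta, xs):
    """Output 1 iff a_1*x_1 + ... + a_n*x_n >= theta (Definition 5.4)."""
    return int(sum(a * x for a, x in zip(weights, xs)) >= theta)


def OR(xs):
    return threshold_gate([1] * len(xs), 1, xs)


def AND(xs):
    return threshold_gate([1] * len(xs), len(xs), xs)


def NOT(x):
    return threshold_gate([-1], 0, [x])


def MAJORITY(xs):
    # theta = ceil(n / 2), written with integer arithmetic.
    return threshold_gate([1] * len(xs), (len(xs) + 1) // 2, xs)


print(OR([0, 0, 1]), AND([1, 1, 0]), NOT(0), MAJORITY([1, 0, 1, 0]))
```
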

Definition 5.5 $\mathrm{TC}^{0}$ is the class of boolean functions computed by constant-depth poly $(n)$-size circuits with threshold gates.

Observation From the remarks above it follows that $\mathrm{AC}^{0} \subseteq \mathrm{TC}^{0}$. It turns out that the containment is strict, i.e., $\mathrm{AC}^{0} \subsetneq \mathrm{TC}^{0}$, since MAJORITY $\in \mathrm{TC}^{0}$ (from Remark 4 above) and

Theorem 5.6 MAJORITY $\notin \mathrm{AC}^{0}$.
(Proof not covered here.)

### 5.3 Iterated Integer Addition $\in \mathrm{TC}^{0}$

Definition 5.7 (symmetric boolean functions) A boolean function $f:\{0,1\}^{n} \rightarrow\{0,1\}$ is symmetric if for all permutations $\sigma:[n] \rightarrow[n], f\left(x_{\sigma(1)}, x_{\sigma(2)}, \ldots, x_{\sigma(n)}\right)=f\left(x_{1}, \ldots, x_{n}\right)$.

Theorem 5.8 Every symmetric boolean function $f\left(x_{1}, \ldots, x_{n}\right) \in \mathrm{TC}^{0}$.

Proof: Since $f$ is symmetric, its value depends only on the number of ones in the input. Hence there exists $S=\left\{b_{1}, \ldots, b_{m}\right\} \subseteq\{0,1, \ldots, n\}$ such that $f\left(x_{1}, \ldots, x_{n}\right)=1 \Longleftrightarrow \sum_{i \in[n]} x_{i} \in S$, and we can rephrase $f$ as

$$
\begin{aligned}
f\left(x_{1}, \ldots, x_{n}\right) & =\bigvee_{j \in[m]}\left(\sum_{i \in[n]} x_{i}=b_{j}\right) \\
& =\bigvee_{j \in[m]}\left(\left(\sum_{i \in[n]} x_{i} \geq b_{j}\right) \wedge\left(\sum_{i \in[n]}\left(-x_{i}\right) \geq-b_{j}\right)\right)
\end{aligned}
$$

where each "$=$" inside the disjunction is a condition (a boolean value), not an assignment.

The two inequalities in the last line can be implemented by two threshold gates, one with $a_{i}=1 \ \forall i \in[n], \theta=b_{j}$ and the other with $a_{i}=-1 \ \forall i \in[n], \theta=-b_{j}$, respectively. Combining these two threshold gates with an $\wedge$ gate gives a subcircuit $C_{j}$ of size 3 and depth 1. Making $m$ such subcircuits, one for each $b_{j}$, and combining them in parallel using an $\vee$ gate results in a circuit of size $3 m+1$ and depth 2. Since $m \leq n+1$, it is a poly-size constant-depth circuit. Finally we can turn $\wedge$ and $\vee$ into threshold gates as explained in the remarks above and get a $\mathrm{TC}^{0}$ circuit.
Since PARITY is symmetric, we have

Corollary 5.9 PARITY $\in \mathrm{TC}^{0}$.
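
The construction in the proof of Theorem 5.8 can be simulated gate by gate. The sketch below is illustrative only; `threshold_gate`, `symmetric_circuit` and `parity` are hypothetical names, and the set $S$ is passed in explicitly.

```python
def threshold_gate(weights, theta, xs):
    """Output 1 iff the weighted sum of the inputs is at least theta."""
    return int(sum(a * x for a, x in zip(weights, xs)) >= theta)


def symmetric_circuit(S, xs):
    """Evaluate the symmetric function that is 1 iff sum(xs) lies in S."""
    n = len(xs)
    subcircuits = []
    for b in S:
        at_least = threshold_gate([1] * n, b, xs)    # sum x_i >= b
        at_most = threshold_gate([-1] * n, -b, xs)   # sum (-x_i) >= -b
        # AND of two bits, itself written as a threshold gate (Remark 2).
        subcircuits.append(threshold_gate([1, 1], 2, [at_least, at_most]))
    # OR over all subcircuits, again as a threshold gate (Remark 1).
    return threshold_gate([1] * len(subcircuits), 1, subcircuits)


def parity(xs):
    """PARITY is the symmetric function with S = odd numbers up to n."""
    return symmetric_circuit(list(range(1, len(xs) + 1, 2)), xs)


print(parity([1, 1, 0, 1]))
```
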

Theorem 5.10 Iterated Addition $\in \mathrm{TC}^{0}$.

Proof:

Input: $n$ integers $a_{1}, \ldots, a_{n}$ of $n$ bits each.
Output: $s=a_{1}+\cdots+a_{n}$.
Let $c_{0}, c_{1}, \ldots, c_{n}$ be the carries as shown:

| $c_{n}$ | $c_{n-1}$ | $c_{n-2}$ | $\ldots$ | $c_{2}$ | $c_{1}$ | $c_{0}=0$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | $a_{1, n-1}$ | $a_{1, n-2}$ | $\ldots$ | $a_{1,2}$ | $a_{1,1}$ | $a_{1,0}$ |
| $+$ | $\vdots$ | $\vdots$ |  | $\vdots$ | $\vdots$ | $\vdots$ |
| $+$ | $a_{n, n-1}$ | $a_{n, n-2}$ | $\ldots$ | $a_{n, 2}$ | $a_{n, 1}$ | $a_{n, 0}$ |

Just like in the algorithm for addition of two integers, we would like to first compute all the carry bits in parallel and then compute parities in parallel. But there are two differences here. First, the carries $c_{1}, \ldots, c_{n}$ are not single bits; they are small integers. So, our parity computation step now looks like $\forall\, 0 \leq i \leq n$, $s_{i}=\operatorname{PARITY}\left(c_{i, 0}, a_{1, i}, \ldots, a_{n, i}\right)$, where $c_{i, j}$ is the $j^{\text{th}}$ LSB of $c_{i}$ (and $a_{k, n}=0$ for all $k$). The second difference is that for each output bit we have to compute the parity of $n+1$ bits, not 3 bits. But this is doable in constant depth with threshold gates, as we saw earlier. We now show that the computation of carries is also possible in constant depth with threshold gates.

As an example we discuss computing $c_{n-1,0}$ and generalize it for all $c_{i, 0}$ 's later.
Let us first estimate the size of $c_{n-1}$. For this we consider $t=b_{1}+b_{2}+\cdots+b_{n}$, where each $b_{i}$ is the integer obtained by retaining the $n-1$ LSBs of $a_{i}$ and dropping the rest (here, only the bit $a_{i, n-1}$). The reasoning is that $c_{n-1}$ depends only on the $n-1$ LSBs of the $a_{i}$'s.
Since $b_{i}<2^{n-1} \ \forall i \in[n]$, clearly $t<n 2^{n-1}$, so the number of bits in $t$ is at most $\log n+n-1$. Dropping the $n-1$ LSBs of $t$ we get $c_{n-1}$, which has at most $\log n$ bits.
Thinking of binary representation of $t$,

$$
t=t_{m} \cdot 2^{m}+\cdots+t_{n-1} \cdot 2^{n-1}+\cdots+t_{1} \cdot 2+t_{0}
$$

the bits of $c_{n-1}$ are $t_{m}, t_{m-1}, \ldots, t_{n}, t_{n-1}$ (dropping the $n-1$ LSBs removes $t_{0}, \ldots, t_{n-2}$), so the LSB $c_{n-1,0}$ is $t_{n-1}$. We can safely assume $c_{n-1}$ has exactly $\log n$ bits (padding with leading zeros if necessary). Thus, $m$ is fixed for a fixed $n$.
Now we show how we can compute these bits using a small number of threshold gates. We observe that

$$
\begin{align*}
t_{m}=1 \Longleftrightarrow{} & 2^{m+1}-1 \geq t \geq 2^{m} \tag{5.1}\\
t_{m-1}=1 \Longleftrightarrow{} & \left(2^{m}-1 \geq t \geq 2^{m-1}\right) \\
& \vee\left(2^{m}+2^{m}-1 \geq t \geq 2^{m}+2^{m-1}\right) \tag{5.2}\\
t_{m-2}=1 \Longleftrightarrow{} & \left(2^{m-1}-1 \geq t \geq 2^{m-2}\right) \\
& \vee\left(2^{m-1}+2^{m-1}-1 \geq t \geq 2^{m-1}+2^{m-2}\right) \\
& \vee\left(2^{m}+2^{m-1}-1 \geq t \geq 2^{m}+2^{m-2}\right) \\
& \vee\left(2^{m}+2^{m-1}+2^{m-1}-1 \geq t \geq 2^{m}+2^{m-1}+2^{m-2}\right) \tag{5.3}\\
& \ \vdots \\
t_{n-1}=1 \Longleftrightarrow{} & \left(2^{n}-1 \geq t \geq 2^{n-1}\right) \vee \cdots \\
& \vee\left(2^{m}+\cdots+2^{n+1}+2^{n}+2^{n}-1 \geq t \geq 2^{m}+\cdots+2^{n+1}+2^{n}+2^{n-1}\right) \tag{5.4}
\end{align*}
$$

Clearly it is $t_{n-1}$ that requires the maximum number of inequalities. Specifically, $t_{n-1}$ requires $2 \cdot 2^{m-n+1} \leq 2 \cdot 2^{\log n}=O(n)$ inequalities (two for each of the $2^{m-n+1}$ intervals). Each inequality is a single threshold gate over the input bits, because $t=\sum_{i \in[n]} \sum_{k=0}^{n-2} 2^{k} a_{i, k}$ is a weighted sum of input bits. Also, the bounds in the inequalities are fixed for a fixed $n$. This means the LSB of $c_{n-1}$ can be computed by $O(n)$ threshold gates, with depth 2 (taking into account the $\vee$ gate as well).

The above procedure can be generalized to compute any $c_{i, 0}$ by letting $t$ be the sum of the integers formed by the $i$ LSBs of each input (the case above is $i=n-1$). One can verify that $c_{i}$ has at most $\log n$ bits for all $i$. Thus we can compute all $c_{i, 0}$'s in parallel using $n$ subcircuits each of size $O(n)$ and depth 2.

On top of this layer, finally, for $0 \leq i \leq n$ we can introduce $\mathrm{TC}^{0}$ parity circuits to compute $s_{i}=\operatorname{PARITY}\left(c_{i, 0}, a_{1, i}, \ldots, a_{n, i}\right)$. This makes the total size $O\left(n^{3}\right)$ and the depth 5. For $i>n$, $s_{i}$ is simply bit $i$ of the full sum $a_{1}+\cdots+a_{n}$ (a higher bit of the last carry $c_{n}$), which can also be computed by inequalities like (5.1), (5.2), (5.3) in parallel with the main computation. Thus we have a constant-depth polynomial-size threshold circuit for the problem.
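
The whole construction can be simulated on small inputs. The following is a sketch under stated assumptions: `bit_by_intervals` and `iterated_add` are hypothetical names, each interval test stands for a pair of threshold gates, and the code forms the partial sum `t` explicitly although a circuit would feed the weighted input bits directly into the gates.

```python
def bit_by_intervals(t, k, m):
    """Bit k of t, for 0 <= t < 2**(m+1), as an OR of interval tests.

    Mirrors (5.1)-(5.4): bit k is 1 iff t lies in one of the intervals
    [lo, lo + 2**k - 1] with lo = 2**k, 2**k + 2**(k+1), ...
    """
    return int(any(lo <= t <= lo + 2 ** k - 1
                   for lo in range(2 ** k, 2 ** (m + 1), 2 ** (k + 1))))


def iterated_add(nums, n):
    """Sum of the n-bit integers in nums, assembled one output bit at a time."""
    extra = len(nums).bit_length()   # every carry fits in this many bits
    out_bits = []
    for i in range(n):
        # t = sum of the i LSBs of every input; the carry c_i is t >> i.
        t = sum(a & ((1 << i) - 1) for a in nums)
        carry_lsb = bit_by_intervals(t, i, i + extra - 1)
        column = [(a >> i) & 1 for a in nums]
        out_bits.append((sum(column) + carry_lsb) % 2)   # PARITY of n + 1 bits
    full = sum(nums)
    for i in range(n, n + extra):                        # high-order output bits
        out_bits.append(bit_by_intervals(full, i, n + extra - 1))
    return sum(b << i for i, b in enumerate(out_bits))


print(iterated_add([3, 3, 3], 2))
```

Each call to `bit_by_intervals` for a carry uses at most `len(nums)` intervals, matching the $O(n)$ count in the text.
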

Remark The above $\mathrm{TC}^{0}$ circuit is not the optimal one. $\mathrm{TC}^{0}$ circuits with smaller size for iterated addition are known.

Corollary 5.11 Integer Multiplication $\in \mathrm{TC}^{0}$.
Proof: The grade school algorithm for multiplying two $n$-bit integers first performs $n^{2}$ bitwise multiplications (ANDs) to get $n$ partial products of $n$ bits each and then performs iterated addition over them (with appropriate shifts). The first step is clearly in $\mathrm{TC}^{0}$. The second step is also in $\mathrm{TC}^{0}$ by Theorem 5.10.
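
As a small illustration of the proof, the sketch below writes the product as a sum of shifted partial products. The function name is hypothetical, and Python's `sum` merely stands in for the iterated-addition circuit of Theorem 5.10.

```python
def multiply_via_iterated_addition(x, y, n):
    """Product of two n-bit integers as a sum of n shifted partial products."""
    partials = []
    for k in range(n):
        y_k = (y >> k) & 1
        # ANDing y_k with every bit of x gives either x or 0; then shift by k.
        partials.append((x if y_k else 0) << k)
    return sum(partials)   # stand-in for the iterated-addition circuit


print(multiply_via_iterated_addition(13, 11, 4))
```
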

### 5.4 Analogy between Integers and Polynomials

We informally discuss some similarities between integers and univariate polynomials as it will be helpful in the later discussions.
Consider the ring of integers $\mathbb{Z}=(\{0, \pm 1, \pm 2, \ldots\},+, \times)$ and the ring of polynomials $\mathbb{F}[X]$ where $\mathbb{F}$ is $\mathbb{R}, \mathbb{C}$, or $\mathbb{F}_{p}$ (where $p$ is prime). We can compare the two on the following aspects:

Size and Degree While the size of an integer is the number of bits required to represent it, the degree of a polynomial determines the number of field elements required to represent it.

Prime numbers and Irreducible polynomials A prime number is an integer greater than 1 that is not a product of two integers greater than 1. Similarly, an irreducible polynomial over a field is a nonconstant polynomial that is not a product of two nonconstant polynomials over that field.

Unique factorization Any integer greater than 1 can be written as a product of prime numbers in a unique way (ignoring the order). Similarly, any nonconstant polynomial over a field can be written as a product of irreducible polynomials over that field in a unique way (ignoring the order and constant factors).

Remainder is unique Given two integers $a, b$ with $b \neq 0$, there exist unique integers $q, r$ such that $a=b q+r$ and $0 \leq r<|b|$. Similarly, given two polynomials $a, b$ with $b \neq 0$ over a field, there exist unique polynomials $q, r$ such that $a=b q+r$ and either $r=0$ or $\operatorname{deg}(r)<\operatorname{deg}(b)$.

### 5.5 Iterated Integer Multiplication $\in \mathrm{TC}^{0}$

Input: $n$-bit integers $a_{1}, \ldots, a_{n}$
Output: $P=a_{1} \cdot a_{2} \cdots a_{n}$
Since $a_{i}<2^{n} \forall i \in[n]$, we have $P<2^{n^{2}}$, i.e., $P$ has at most $n^{2}$ bits.
Let us translate the problem to the world of polynomials and analyze it.

### 5.5.1 Iterated Polynomial Multiplication

Input: $n$ polynomials of degree $(n-1)$ each:

$$
\begin{aligned}
a_{1}(x) & =a_{1,0}+a_{1,1} x+\cdots+a_{1, n-1} x^{n-1} \\
& \ \ \vdots \\
a_{n}(x) & =a_{n, 0}+a_{n, 1} x+\cdots+a_{n, n-1} x^{n-1}
\end{aligned}
$$

Output: $p(x)=a_{1}(x) a_{2}(x) \cdots a_{n}(x)$
The naive approach requires in total exponentially many $\left(n^{n}\right)$ multiplications and thus does not give us a $\mathrm{TC}^{0}$ circuit.
However, we can try the interpolation technique which we encountered before. For now we assume addition, subtraction and scalar multiplication are free. Here are the steps:

1. Choose $n^{2}$ points $\alpha_{1}, \ldots, \alpha_{n^{2}}$ from the field and get evaluation lists for $a_{i}$ 's:

$$
\begin{gathered}
a_{1}(x) \equiv\left(a_{1}\left(\alpha_{1}\right), a_{1}\left(\alpha_{2}\right), \ldots, a_{1}\left(\alpha_{n^{2}}\right)\right) \\
\ldots \\
a_{n}(x) \equiv\left(a_{n}\left(\alpha_{1}\right), a_{n}\left(\alpha_{2}\right), \ldots, a_{n}\left(\alpha_{n^{2}}\right)\right)
\end{gathered}
$$

Each evaluation is essentially an addition, so if we allow addition gates with unbounded fanin then this step costs depth 1.
2. Do pointwise multiplication to get

$$
p(x) \equiv\left(\prod_{i \in[n]} a_{i}\left(\alpha_{1}\right), \prod_{i \in[n]} a_{i}\left(\alpha_{2}\right), \ldots, \prod_{i \in[n]} a_{i}\left(\alpha_{n^{2}}\right)\right)
$$

If we allow multiplication gates with unbounded fanin then this step costs depth 1 .
3. Interpolate $p$ to get the coefficients $p_{0}, \ldots, p_{n^{2}-n}$. (Since $\operatorname{deg} p=n(n-1)$, the $n^{2}$ evaluations determine $p$.) Again, allowing addition gates with unbounded fanin, this step costs depth 1.

Thus we have a depth 3 circuit with addition gates and multiplication gates of unbounded fanin, for the problem of iterated multiplication of polynomials.
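
The three steps can be sketched over a prime field $\mathbb{F}_{p}$. This is an illustration only, with hypothetical function names. It assumes Python 3.8+ (for `pow(x, -1, p)`), coefficients listed from degree 0 upward, and a prime $p$ larger than the degree of the product. It uses $\deg p + 1$ evaluation points rather than $n^{2}$, which is enough for interpolation.

```python
def poly_eval(coeffs, x, p):
    """Evaluate a polynomial at x modulo p (Horner's rule)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc


def times_linear(poly, root, p):
    """Multiply poly by (x - root) modulo p."""
    out = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        out[i + 1] = (out[i + 1] + c) % p
        out[i] = (out[i] - root * c) % p
    return out


def interpolate(points, values, p):
    """Lagrange interpolation modulo a prime p."""
    k = len(points)
    result = [0] * k
    for j in range(k):
        basis, denom = [1], 1
        for t in range(k):
            if t != j:
                basis = times_linear(basis, points[t], p)
                denom = denom * (points[j] - points[t]) % p
        scale = values[j] * pow(denom, -1, p) % p
        for i, c in enumerate(basis):
            result[i] = (result[i] + scale * c) % p
    return result


def iterated_poly_product(polys, p):
    """Product of the given polynomials over F_p via evaluate/multiply/interpolate."""
    degree = sum(len(a) - 1 for a in polys)
    points = list(range(degree + 1))              # Step 1: evaluation points
    values = []
    for alpha in points:
        v = 1
        for a in polys:                           # Step 2: pointwise products
            v = v * poly_eval(a, alpha, p) % p
        values.append(v)
    return interpolate(points, values, p)         # Step 3: interpolation


print(iterated_poly_product([[1, 1], [1, 1], [1, 1]], 101))
```
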

### 5.5.2 Chinese Remaindering Theorem

The counterpart of the above technique in the integer world is the Chinese Remainder Theorem (CRT), which we describe now. Henceforth we assume that $a_{i} \neq 0$ for all $i \in[n]$, since otherwise the product is trivially 0.

Step 1. Pick a few distinct (and small) prime numbers $\alpha_{1}, \ldots, \alpha_{m}$ such that their product exceeds $2^{n^{2}}$, the upper bound of $\prod_{i \in[n]} a_{i}=P$ (say). Let's say we pick the first $n^{2}$ prime numbers.

Fact 5.12 For each $a_{i}$, the tuple $\left(a_{i} \bmod \alpha_{1}, \ldots, a_{i} \bmod \alpha_{m}\right)$ uniquely determines $a_{i}$.

Compute the tuples for all $a_{i}$ 's.
Step 2. For each $\alpha_{j}, j \in[m]$, let $b_{j}^{\prime}:=\prod_{i \in[n]}\left(a_{i} \bmod \alpha_{j}\right)$. Compute $b_{j}:=b_{j}^{\prime} \bmod \alpha_{j}$, which equals $P \bmod \alpha_{j}$.

Fact 5.13 The tuple $\left(b_{1} \bmod \alpha_{1}, \ldots, b_{m} \bmod \alpha_{m}\right)$ uniquely determines $P$.

Step 3. For all $j \in[m]$ let $Q_{j}:=\prod_{t \in[m], t \neq j} \alpha_{t}$, and let $R_{j} \in\left[1, \alpha_{j}-1\right]$ be such that $Q_{j} R_{j} \equiv 1 \bmod \alpha_{j}$. Call $R_{j}$ the inverse of $Q_{j}$ (modulo $\alpha_{j}$).

Fact 5.14 For each $Q_{j}$ there exists exactly one inverse $R_{j}$.

Let $p^{\prime}:=\sum_{j \in[m]} b_{j} Q_{j} R_{j}$. Compute $p^{\prime} \bmod \left(\alpha_{1} \alpha_{2} \cdots \alpha_{m}\right)$, which is $P$.
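
Steps 1 to 3 can be traced on small numbers. The sketch below is illustrative: the function names are hypothetical, it assumes Python 3.8+ (`math.prod`, `pow(x, -1, m)`), and it picks the first $n \cdot |\text{nums}|$ primes so that their product exceeds the product being computed.

```python
from math import prod


def first_primes(count):
    """The first `count` primes, by trial division (fine for small counts)."""
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % q for q in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def iterated_product_crt(nums, n):
    """Product of the nonempty list of n-bit integers `nums` via CRT."""
    # Step 1: enough small primes that their product exceeds 2**(n * len(nums)).
    alphas = first_primes(n * len(nums))
    modulus = prod(alphas)
    # Step 2: residues of the product modulo each prime.
    b = [prod(a % alpha for a in nums) % alpha for alpha in alphas]
    # Step 3: recombine using Q_j and its inverse R_j modulo alpha_j.
    p_prime = 0
    for alpha, b_j in zip(alphas, b):
        Q = modulus // alpha
        R = pow(Q, -1, alpha)
        p_prime += b_j * Q * R
    return p_prime % modulus


print(iterated_product_crt([5, 7, 6], 3))
```
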

Back to analogy. Steps 1,2 and 3 of CRT are the analogues of evaluation, pointwise multiplication and interpolation respectively. Let's elaborate the analogy for Step 1.
Evaluation of a polynomial $p(x)$ at a point $\alpha$ is equivalent to taking the remainder of $p(x)$ on division by $(x-\alpha)$, i.e., $p(x) \equiv p(\alpha) \bmod (x-\alpha)$. (For example, suppose $p(x)=x^{2}+1$ and $\alpha=1$. Then $p(1)=2$. Also, since $x^{2}+1=(x-1)(x+1)+2$, we have $p(x) \equiv 2 \bmod (x-1)$.) Noticing that $x-\alpha$ is irreducible, the above operation naturally translates to taking modulo a prime number in the integer world.
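
The identity $p(x) \equiv p(\alpha) \bmod (x-\alpha)$ can be seen from synthetic division. The following sketch uses a hypothetical helper `divide_by_linear`, with coefficients listed from degree 0 upward.

```python
def divide_by_linear(coeffs, alpha):
    """Divide p(x) by (x - alpha); return (quotient coefficients, remainder)."""
    row, acc = [], 0
    for c in reversed(coeffs):      # process from the highest degree down
        acc = acc * alpha + c
        row.append(acc)
    remainder = row.pop()           # the last accumulator value is p(alpha)
    return row[::-1], remainder


# The example from the text: x^2 + 1 = (x - 1)(x + 1) + 2.
print(divide_by_linear([1, 0, 1], 1))
```

The accumulator follows Horner's rule, so the remainder it ends on is exactly the evaluation $p(\alpha)$.
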

### 5.5.3 Discrete Logarithms

We need one more trick before designing the circuit. Step 2 of CRT still involves iterated multiplications of numbers, albeit modulo small primes. The trick is to reduce this problem into iterated additions of their discrete logarithms as follows.
For a prime number $\alpha$, the set of nonzero residues $\{1, \ldots, \alpha-1\}$ forms a cyclic group under multiplication modulo $\alpha$. Thus there exists an element $g \in\{1, \ldots, \alpha-1\}$ such that for every $1 \leq a \leq \alpha-1$ there exists $0 \leq e \leq \alpha-2$ such that $g^{e} \equiv a \bmod \alpha$. ($g$ is called a generator.) Call $e$ the discrete logarithm of $a$ (to the base $g$). Now, given $1 \leq a_{1}, \ldots, a_{n} \leq \alpha-1$, their product $b$ modulo $\alpha$ is

$$
\begin{aligned}
b=a_{1} \cdots a_{n} \bmod \alpha & =g^{e_{1}} \cdots g^{e_{n}} \bmod \alpha \quad \text { where } g^{e_{i}} \equiv a_{i} \bmod \alpha, \ \forall i \in[n] \\
& =g^{e_{1}+\cdots+e_{n}} \bmod \alpha .
\end{aligned}

Thus the problem is reduced to iterated addition $e_{1}+\cdots+e_{n}$ (at the added overhead of logarithm computation and exponentiation though).
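
A quick numerical check of this reduction, as a sketch: `find_generator` and `product_via_discrete_logs` are hypothetical names, the generator is found by brute force, and all inputs are assumed nonzero modulo the prime $\alpha$.

```python
def find_generator(alpha):
    """Smallest generator of the multiplicative group modulo the prime alpha."""
    for g in range(1, alpha):
        if len({pow(g, e, alpha) for e in range(alpha - 1)}) == alpha - 1:
            return g
    raise ValueError("alpha must be prime")


def product_via_discrete_logs(nums, alpha):
    """Product of nums modulo alpha, computed by adding discrete logarithms."""
    g = find_generator(alpha)
    log_table = {pow(g, e, alpha): e for e in range(alpha - 1)}
    exponent_sum = sum(log_table[a % alpha] for a in nums)   # iterated addition
    return pow(g, exponent_sum, alpha)


print(product_via_discrete_logs([3, 5, 6, 4], 7))
```
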

### 5.5.4 The circuit

We are now ready to detail the $\mathrm{TC}^{0}$ circuit for iterated integer multiplication. Broadly, the levels in the circuit can be partitioned into 3 layers, one for each step of CRT. For simplicity, when explaining layers 1 and 2 (steps 1 and 2 of CRT) we pretend there is only one prime number $\alpha$ to work with, instead of $n^{2}$ many. The idea is that the circuitry described for $\alpha$ can be replicated $m=n^{2}$ times in parallel, once for each $\alpha_{j}$. This replication causes only a polynomial blowup in size and no change in depth.

Layer 1: Task: Given $a_{i}$, compute $a_{i} \bmod \alpha$. Let $a_{i, k}$ denote, as always, the $k^{\text{th}}$ LSB of $a_{i}$. The solution starts with having the following values precomputed: $2^{n-1} \bmod \alpha, 2^{n-2} \bmod \alpha, \ldots, 2^{1} \bmod \alpha, 2^{0} \bmod \alpha$. Then, computing the sum

$$
A_{i}=a_{i, n-1} \cdot\left(2^{n-1} \bmod \alpha\right)+a_{i, n-2} \cdot\left(2^{n-2} \bmod \alpha\right)+\cdots+a_{i, 1} \cdot(2 \bmod \alpha)+a_{i, 0} \cdot(1 \bmod \alpha)
$$

costs a standard $\mathrm{TC}^{0}$ circuit for iterated addition. Call this circuit $\phi$. Crucially, $\phi$ gives us $A_{i}$ that is much smaller than $a_{i}$ and has the property $A_{i} \equiv a_{i} \bmod \alpha$. Specifically, $A_{i} \leq n(\alpha-1)<n \alpha$, a polynomial upper bound, which lets us take the "exhaustive approach" to find $A_{i} \bmod \alpha\left(=a_{i} \bmod \alpha\right)$, as follows.

We can have the following values precomputed: $0, \alpha, 2 \alpha, \ldots,(n-1) \alpha$. We'll have small $\mathrm{TC}^{0}$ circuits $C_{0}, \ldots, C_{n-1}$ in parallel (on top of $\phi$) such that $C_{l}$ computes $A_{i}-l \alpha$. Since $0 \leq A_{i} \leq n(\alpha-1)<n \alpha$, exactly one of them outputs a value in $[0, \alpha-1]$, and that is our $a_{i} \bmod \alpha$. (There is another level above the $C_{l}$'s to do this range-checking.)
Throughout, we have used polynomially many threshold gates and incurred constant depth.
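
Layer 1 can be mimicked with precomputed tables. This is a sketch with a hypothetical function name; it includes the multiple $0 \cdot \alpha$ among the candidates so that the case $A_{i}<\alpha$ is covered, and Python's `sum` stands in for the circuit $\phi$.

```python
def reduce_mod_small_prime(a, n, alpha):
    """a mod alpha for an n-bit integer a, using only table lookups,
    one iterated addition, and n parallel subtractions."""
    powers = [pow(2, k, alpha) for k in range(n)]    # precomputed: 2^k mod alpha
    multiples = [l * alpha for l in range(n)]        # precomputed: 0, alpha, ..., (n-1)*alpha
    bits = [(a >> k) & 1 for k in range(n)]
    A = sum(bit * pw for bit, pw in zip(bits, powers))   # the circuit phi
    candidates = [A - mult for mult in multiples]        # the circuits C_l
    hits = [c for c in candidates if 0 <= c <= alpha - 1]
    assert len(hits) == 1                                # exactly one C_l is in range
    return hits[0]


print(reduce_mod_small_prime(200, 8, 7))
```
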

Layer 2: Task: Implement the discrete logarithm technique described in Section 5.5.3. The input is $a_{1} \bmod \alpha, a_{2} \bmod \alpha, \ldots, a_{n} \bmod \alpha$. If $a_{i} \bmod \alpha=0$ for some $i$, then $b=0$ trivially, so we assume $a_{i} \bmod \alpha \neq 0$ for all $i \in[n]$.

We will have the following values precomputed: $g^{0}=1$, $g$ (a generator), $g^{2} \bmod \alpha, g^{3} \bmod \alpha, \ldots, g^{\alpha-2} \bmod \alpha$. (Precomputation is possible because $g$ depends only on $\alpha$, which is fixed.) We'll have small $\mathrm{TC}^{0}$ circuits $D_{0}, \ldots, D_{\alpha-2}$ in parallel such that $D_{l}$ computes $\left(a_{i} \bmod \alpha\right)-\left(g^{l} \bmod \alpha\right)$. Exactly one of them, say $D_{l^{\prime}}$, outputs 0. Then $l^{\prime}$ is our $e_{i}$, i.e., $g^{e_{i}} \equiv a_{i} \bmod \alpha$. (There is one more level above the $D_{l}$'s to "select" $l^{\prime}$ and present it as $e_{i}$ to the upper level.)
The depth of layer 2 so far is a constant. The size of layer 2 so far is roughly $O(\operatorname{poly}(\alpha))$. In the worst case $\alpha$ is the $n^{2}$-th prime number, which is less than $n^{3}$. Thus the size of layer 2 so far is $O(\operatorname{poly}(n))$.

We now have $e_{1}, \ldots, e_{n}$ as outputs. We simply place a standard iterated addition $\mathrm{TC}^{0}$ circuit over them to compute $s=\sum_{i \in[n]} e_{i}$.
The last step is to compute $g^{s} \bmod \alpha$ (which equals $b \bmod \alpha$). Let $s^{\prime}=s \bmod (\alpha-1)$. Notice that $g^{s^{\prime}} \bmod \alpha=g^{s} \bmod \alpha$, since $g^{\alpha-1} \equiv g^{0} \equiv 1 \bmod \alpha$. Thus it suffices to compute $g^{s^{\prime}} \bmod \alpha$.

We need to find $s^{\prime}$ from $s$. For this we observe that $s \leq n(\alpha-1)$, and hence decide to have the following multiples of $\alpha-1$ precomputed: $0, \alpha-1,2(\alpha-1), \ldots, n(\alpha-1)$. Now, very similar to the $C_{l}$'s mentioned before, there will be a setup of $n+1$ $\mathrm{TC}^{0}$ circuits that find the right $s^{\prime}$ by subtracting each precomputed multiple from $s$ and checking which difference lies in $[0, \alpha-2]$.

Now that we know $s^{\prime}$, computing $g^{s^{\prime}} \bmod \alpha$ is a matter of merely selecting the right element from the lot $\left\{g^{0}, g^{1}, g^{2}, \ldots, g^{\alpha-2}\right\}$ (all taken modulo $\alpha$), all of which are precomputed. A layer of $\operatorname{poly}(\alpha)$ many threshold gates wired in parallel can do this. Thus we have computed $g^{s^{\prime}} \bmod \alpha$, which is $b \bmod \alpha$.
One can verify that layer 2 in total costs polynomial size and constant depth. This is true even after taking into account the fact that the above procedure has to be replicated (in parallel) for all $n^{2}$ prime numbers.
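
Layer 2 uses nothing but table lookups, one iterated addition, and a reduction by trying precomputed multiples. The sketch below follows that outline for a single prime. The function name is hypothetical, the generator `g` is supplied by the caller, and every residue is assumed to lie in $[1, \alpha-1]$.

```python
def layer2_product_mod(residues, alpha, g):
    """Product of the residues modulo alpha via lookup tables only."""
    n = len(residues)
    antilog = [pow(g, e, alpha) for e in range(alpha - 1)]   # g^0, ..., g^(alpha-2)
    dlog = {value: e for e, value in enumerate(antilog)}     # role of the D_l circuits
    s = sum(dlog[r] for r in residues)                       # iterated addition
    # Reduce s modulo (alpha - 1) by trying the precomputed multiples.
    multiples = [l * (alpha - 1) for l in range(n + 1)]
    s_reduced = next(s - mult for mult in multiples if 0 <= s - mult <= alpha - 2)
    return antilog[s_reduced]                                # select g^(s') mod alpha


# 2 is a generator modulo 11.
print(layer2_product_mod([3, 7, 10, 5, 9], 11, 2))
```
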

Layer 3: Input: $b_{1} \bmod \alpha_{1}, \ldots, b_{m} \bmod \alpha_{m}$. We borrow the notation from (Step 3 of) Section 5.5.2. The $\alpha_{j}$ are fixed, so for all $j \in[m]$ the product $Q_{j} R_{j}$ can be precomputed. Further, we can have the following multiples of $Q_{j} R_{j}$ precomputed: $Q_{j} R_{j}, 2 Q_{j} R_{j}, \ldots,\left(\alpha_{j}-1\right) Q_{j} R_{j}$. Then computing $b_{j} Q_{j} R_{j}$ amounts to selecting the appropriate multiple, which can be done using a $\mathrm{TC}^{0}$ circuit. As usual we have $m$ such circuits in parallel, one for each $j$. On top of them we have an iterated addition circuit to compute $p^{\prime}:=\sum_{j \in[m]} b_{j} Q_{j} R_{j}$. It remains to compute $P=p^{\prime} \bmod \left(\prod_{j \in[m]} \alpha_{j}\right)$. This task is essentially the same as what we did with $\alpha$ in layer 1, hence we replicate that here. Of course, this time the divisor is larger: $\prod_{j \in[m]} \alpha_{j} \leq \alpha_{m}^{m} \leq m^{2 m}=\left(n^{2}\right)^{2 n^{2}}<2^{n^{3}}$ for large enough $n$, implying that it has at most $n^{3}$ bits. Accordingly we have to make a few changes, such as working with numbers of about $n^{3}$ bits and precomputing a larger (but still polynomial) number of multiples of the divisor.

One can verify that for layer 3 , and thus altogether, the size is polynomial in $n$ and the depth is constant.

### 5.6 References

[1] M. Furst, J. B. Saxe, M. Sipser, Parity, circuits, and the polynomial-time hierarchy, Mathematical Systems Theory 17(1) (1984), 13-27.
[2] Neil Immerman, Susan Landau, The complexity of iterated multiplication, Information and Computation 116(1) (1995), 103-116.
[3] Paul W. Beame, Stephen A. Cook, H. James Hoover, Log Depth Circuits for Division and Related Problems, SIAM Journal on Computing 15(4) (1986).
[4] John Reif, On Threshold Circuits and Polynomial Computation, Second Annual Structure in Complexity Theory Symposium (1987), 118-123.
[5] David Cox, John Little, Donal O'Shea, Ideals, Varieties and Algorithms, Second Edition, Springer (1996).

