# ADEPT-Z: Zero-Shot Automated Circuit Topology Search for Pareto-Optimal Photonic Tensor Cores

Ziyang Jiang<sup>1</sup>, Pingchuan Ma<sup>1</sup>, Meng Zhang<sup>2</sup>, Rena Huang<sup>2</sup>, Jiaqi Gu<sup>1\*</sup>

<sup>1</sup>Arizona State University, <sup>2</sup>Rensselaer Polytechnic Institute

\*jiaqigu@asu.edu

## ABSTRACT

Photonic tensor cores (PTCs) are essential building blocks for optical artificial intelligence (AI) accelerators based on programmable photonic integrated circuits. Most PTC designs today are manually constructed, with low design efficiency and unsatisfactory solution quality. This makes it challenging to meet various hardware specifications and keep up with rapidly evolving AI applications. Prior work has explored gradient-based methods to learn a good PTC structure differentiably. However, it suffers from slow training speed and optimization difficulty when handling multiple non-differentiable objectives and constraints. Therefore, in this work, we propose ADEPT-Z, a more flexible and efficient zero-shot multi-objective evolutionary topology search framework that explores Pareto-optimal PTC designs with advanced devices in a larger search space. Multiple objectives can be co-optimized while honoring complicated hardware constraints. With less than 3 hours of search, we can obtain tens of diverse Pareto-optimal solutions, $100 \times$ faster than the prior gradient-based method, outperforming prior manual designs with $2 \times$ higher accuracy-weighted area-energy efficiency. The code of ADEPT-Z is available at link.

#### **ACM Reference Format:**

Ziyang Jiang<sup>1</sup>, Pingchuan Ma<sup>1</sup>, Meng Zhang<sup>2</sup>, Rena Huang<sup>2</sup>, Jiaqi Gu<sup>1\*</sup>, <sup>1</sup>Arizona State University, <sup>2</sup>Rensselaer Polytechnic Institute, \**jiaqigu@asu.edu* . 2025. ADEPT-Z: Zero-Shot Automated Circuit Topology Search for Pareto-Optimal Photonic Tensor Cores . In *30th Asia and South Pacific Design Automation Conference (ASPDAC '25), January 20–23, 2025, Tokyo, Japan.* ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3658617.3697708

## 1 INTRODUCTION

Photonic tensor cores (PTCs) offer significant advantages in artificial intelligence (AI) acceleration in terms of speed and energy efficiency over traditional electronic processors. Various integrated PTC designs have been demonstrated for speed-of-light matrix multiplication [1-12]. Coherent PTCs leverage the phase of light to encode more information and perform linear transformations via interference. The transfer matrix of a coherent PTC is usually a complex-valued matrix with stronger expressivity than real-valued tensor cores [11, 12]. Based on their expressivity, coherent PTCs can be separated into universal PTCs, which can realize arbitrary matrices, and subspace PTCs, whose implementable matrices are a subset of them. Clements/Reck-style Mach-Zehnder interferometer (MZI) meshes based on singular value decomposition belong to universal PTCs. Many subspace coherent PTCs have been proposed to increase efficiency and scalability. Butterfly-style PTCs [4, 13, 14] reduce the high cost of unitary matrices by using a logarithmic-depth butterfly mesh. An interlacing MZI mesh based on repeated phase shifters and multi-port couplers [10] has been proposed for more robust programmable PTCs.

ASPDAC '25, January 20-23, 2025, Tokyo, Japan

© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 979-8-4007-0635-6/25/01

https://doi.org/10.1145/3658617.3697708

However, almost all PTCs available today are manually designed based on matrix decomposition, intuition, or inspiration from signal processing, which covers only a few points in the enormous design space. Universal MZI arrays have maximum expressivity but suffer from large area and insertion loss. The butterfly mesh is very compact, but its expressivity is limited when it scales to larger arrays. Apart from these two extreme points, most of the space has been left unexplored. Designing Pareto-optimal PTCs that honor multiple constraints remains a significant challenge due to the complex trade-offs among performance metrics, especially at large circuit sizes. Even an experienced researcher often needs substantial design effort to create a photonic circuit that simultaneously delivers high matrix expressivity, high machine learning (ML) task accuracy, low latency, small area, and low power. An automated circuit topology search methodology is therefore promising for exploring the PTC design space and pushing the Pareto front in the accuracy/area/efficiency space with fast design closure.

Prior work has formulated the PTC topology search as a differentiable optimization problem and used a gradient-based method for one-shot topology search. Parameters and architecture variables are co-optimized on a certain model and dataset. This method successfully finds PTC designs with higher accuracy and smaller device footprint than MZI arrays and butterfly meshes. However, it has several key limitations. (1) The gradient-based circuit search method is limited to differentiable objectives. It takes considerable effort to mathematically relax the combinatorial optimization problem to its continuous equivalent. Moreover, not all objectives, such as the longest path, bounding box, or sequence distance, can be converted to a differentiable version, and finding an accurate approximation or an effective proxy takes non-trivial effort. In addition, the differentiable formulation restricts the search space such that it cannot consider multi-port couplers or arbitrary coupler placements. (2) It is difficult to handle multiple constraints. Since many constraints are non-differentiable in nature, they often need to be gradually enforced using penalty or Lagrangian methods. Too many penalty terms make it difficult to balance their gradients and thus hard to converge to a high-quality solution. (3) Exploring the Pareto front incurs a high search cost. One-shot gradient-based PTC search uses a weighted sum to optimize a single objective, which converges to one solution after hours of training. The search process needs to be relaunched every time the constraints or the objective (i.e., the weighting coefficients for the metrics) change.

Motivated by the above limitations, in this work, we propose an efficient and flexible zero-shot PTC topology search framework ADEPT-Z based on gradient-free multi-objective evolutionary search, co-exploring the Pareto frontier with multiple objectives and hardware constraints in a larger design space. Our main contributions are as follows:

- We introduce a zero-shot topology search framework to explore Pareto-optimal photonic tensor core designs automatically.
- **Larger Design Space**: We expand the design space to include advanced multi-port devices with arbitrary placements for more efficient information interaction.
- **Multi-Objective Optimization**: We create a compact gene encoding with customized mutation/crossover operators for evolutionary search with balanced exploration and exploitation, which generates diverse Pareto-optimal solutions in the accuracy-density-efficiency space while honoring area, power, and latency constraints.
- **Efficient Zero-Shot Performance Evaluation**: To avoid the high training cost, we introduce a comprehensive accuracy proxy based on efficient circuit trainability and expressivity evaluation. Rigorous layout area and circuit power calculations are used to evaluate efficiency and compute density.
- Extensive evaluations on various benchmarks and circuit scales demonstrate that our searched PTC topologies show superior accuracy, compute density, energy efficiency, and generalizability with >100× lower search runtime compared to manual designs and the prior auto-design method.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

## 2 BACKGROUND

### 2.1 Automated PTC Design

Previously, a differentiable PTC design method, ADEPT [15], was proposed to formulate the combinatorial circuit topology search as a continuous probabilistic optimization problem and solve it with gradient descent. Discrete device placement and routing problems are reformulated as binarization-aware training and permutation matrix learning. The expressivity of the PTC is optimized by training the constructed optical neural network (ONN) on a small dataset. During training, hardware constraints and device footprint constraints are gradually enforced by the penalty method and the augmented Lagrangian method. This method takes 6-10 hours to converge to a single feasible solution with carefully balanced objectives and penalty terms. It therefore lacks the flexibility to handle multiple objectives and complicated constraints, as well as the search efficiency needed to explore the Pareto frontier.

## 3 ZERO-SHOT AUTOMATIC PHOTONIC TENSOR CORE DESIGN FRAMEWORK

### 3.1 Search Space Specification

For a complex-valued weight matrix $W \in \mathbb{C}^{M \times N}$, we can partition it into $P \times Q$ sub-matrices of size $K \times K$. Each sub-matrix block can be mapped to a size-$K$ PTC. Our goal is to search for the Pareto-optimal topology of this $K \times K$ PTC. As shown in Fig. 1, we adopt $U\Sigma V$ as the design skeleton. Both unitaries $U_{pq}^{\alpha}$ and $V_{pq}^{\alpha}$ follow a pre-defined block-wise structure, with each block containing a column of phase shifters $\mathcal{R}$, couplers $\mathcal{T}$, and waveguide crossings $\mathcal{P}$. The diagonal matrix $\Sigma$ is simply a column of modulators. Their transfer matrices can be formulated as

$$U_{pq}^{\alpha} = \prod_{b=1}^{B^{U}} \mathcal{P}_{b} \mathcal{T}_{b} \mathcal{R} \left( \Phi_{pq}^{b} \right), \quad V_{pq}^{\alpha} = \prod_{b=B^{U}+1}^{B^{U}+B^{V}} \mathcal{P}_{b} \mathcal{T}_{b} \mathcal{R} \left( \Phi_{pq}^{b} \right) \tag{1}$$

For simplicity, we only discuss U and use B instead of  $B^U$ .

The first stage of each block is one column of *K* phase shifters (PS), which applies a diagonal matrix to the input vector, $\mathcal{R}\left(\Phi_{pq}^{b}\right) = \text{diag}\left(e^{-j\phi_{1}}, \cdots, e^{-j\phi_{K}}\right)$. The second stage consists of multi-port couplers (DC) for all-to-all information mixing via diffraction and interference. Specifically, the multi-port couplers are multi-mode interference (MMI) couplers. The transmission from the *k*-th input port to the *l*-th output port of an *N<sub>c</sub>*-port general MMI [16] is

$$\begin{aligned} M_{lk} ={}& (-1)^{l+k} j \exp\left(j\frac{\pi}{4}\right) \sqrt{\frac{1}{N_c}} \\ &\times \exp\left(-j\left((l-1/2) - (-1)^{l+k}(k-1/2)\right)^2 \pi/(4N_c)\right), \end{aligned} \tag{2}$$

where $N_c$ is the number of input/output ports. An array of DCs (or waveguides) is expressed as a block-diagonal matrix $\mathcal{T}_b$. Our design space for each DC layer is equivalent to partitioning the integer $K$ into an ordered sum of $n \in [1, K]$ positive integers, which is considerably larger than only densely placing 2-port DCs to fill all $K$ wires as in prior work [15]. Since there is no analytical form for the number of combinations, we denote the exponential solution space for each DC layer as $F_K$. **The last stage** in the block is the waveguide crossing layer, which can be expressed as a permutation matrix $\mathcal{P}_b$. With $K!$ possible combinations in one block, $\mathcal{P}_b$ contributes most of the design space.

Figure 1: Illustration of PTC search space of ADEPT-Z.
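To make Eq. (2) and the block-diagonal coupler layer concrete, the following minimal Python sketch builds the MMI transfer matrix entry by entry and assembles $\mathcal{T}_b$ from a DC array. The function names `mmi_matrix` and `coupler_layer` are illustrative, and treating a 1-port entry as an ideal waveguide with unit transmission is an assumption made here, not something specified above.

```python
import cmath
import math


def mmi_matrix(n_ports):
    """Transfer matrix of an n_ports x n_ports MMI coupler following Eq. (2)."""
    if n_ports == 1:
        # Assumption: a 1-port entry is a plain waveguide with unit transmission.
        return [[1 + 0j]]
    prefactor = 1j * cmath.exp(1j * math.pi / 4) / math.sqrt(n_ports)
    matrix = []
    for l in range(1, n_ports + 1):        # output port index (1-based)
        row = []
        for k in range(1, n_ports + 1):    # input port index (1-based)
            sign = (-1) ** (l + k)
            arg = ((l - 0.5) - sign * (k - 0.5)) ** 2 * math.pi / (4 * n_ports)
            row.append(sign * prefactor * cmath.exp(-1j * arg))
        matrix.append(row)
    return matrix


def coupler_layer(dc_array):
    """Block-diagonal coupler-layer matrix for a DC array such as [1, 4, 1, 2]."""
    size = sum(dc_array)
    layer = [[0j] * size for _ in range(size)]
    offset = 0
    for n_ports in dc_array:
        block = mmi_matrix(n_ports)
        for r in range(n_ports):
            for c in range(n_ports):
                layer[offset + r][offset + c] = block[r][c]
        offset += n_ports
    return layer
```

Every entry of an $N_c$-port block has magnitude $1/\sqrt{N_c}$, which reflects the equal power splitting of the coupler; entries outside the diagonal blocks are zero because ports of different devices do not interact within one DC layer.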

**Design Space**. In summary, a photonic mesh contains *B* blocks, each comprising a PS layer, a DC layer, and a CR layer. The topology  $\alpha$  includes the number of blocks  $B^U$  and  $B^V$ , the waveguide connections  $\mathcal{P}$ , and the placements of couplers as specified by  $\mathcal{T}$ . The total design space is  $O\left((F_K \cdot K!)^{B_{\text{max}}}\right)$ .
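As a rough illustration of how quickly this space grows, the sketch below counts the DC-layer placements $F_K$ by dynamic programming and evaluates the bound $(F_K \cdot K!)^{B_{\max}}$. It assumes that a DC layer is an ordered sequence of allowed port counts summing to $K$ (with 1 denoting a waveguide) and ignores any crossing limit; `count_dc_layouts` and `design_space_size` are hypothetical helper names.

```python
import math
from functools import lru_cache


def count_dc_layouts(k, port_sizes):
    """Number of ordered sequences drawn from port_sizes that sum to k."""
    sizes = tuple(sorted(set(port_sizes)))

    @lru_cache(maxsize=None)
    def count(remaining):
        if remaining == 0:
            return 1
        return sum(count(remaining - s) for s in sizes if s <= remaining)

    return count(k)


def design_space_size(k, port_sizes, max_blocks):
    """Upper bound (F_K * K!)^B_max on the number of topologies."""
    f_k = count_dc_layouts(k, port_sizes)
    return (f_k * math.factorial(k)) ** max_blocks
```

For example, `count_dc_layouts(4, [1, 2])` counts the five layouts `[1,1,1,1]`, `[2,1,1]`, `[1,2,1]`, `[1,1,2]`, and `[2,2]`.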

### 3.2 Problem Formulation

Our goal is to explore the Pareto-front of coherent PTC designs to deliver high expressivity, area efficiency, and energy efficiency while honoring the area, power consumption, and latency constraints. We formulate the constrained multi-objective problem as follows,

$$\begin{aligned}
\max_{\alpha \in \mathcal{A}} \quad & \left\{ S_1(gW^{*\alpha}),\ CD(\alpha),\ EE(\alpha) \right\}, \quad \alpha = \left( B^U, B^V, \mathcal{P}, \mathcal{T} \right) \\
\text{s.t.} \quad & W^* = \underset{W}{\operatorname{argmin}}\ \mathcal{L}(W^{\alpha}; \mathcal{D}^{\operatorname{trn}}), \quad C_{\min_i} \leq C_i(\alpha) \leq C_{\max_i},\ i \in \mathbb{N}^+, \\
& W^{\alpha} \in \mathbb{C}^{M \times N} = \left\{ W_{pq}^{\alpha} \right\}_{p=1,q=1}^{p=P,q=Q} = \left\{ U_{pq}^{\alpha} \Sigma_{pq} V_{pq}^{\alpha} \right\}_{p=1,q=1}^{p=P,q=Q}, \\
& B^U, B^V \in [B_{\min}/2, B_{\max}/2], \quad W_{pq} \in \mathbb{C}^{K \times K}, \\
& \mathcal{P} = \left( \cdots, \mathcal{P}_b, \cdots, \mathcal{P}_{B^U+B^V} \right), \quad \mathcal{T} = \left( \cdots, \mathcal{T}_b, \cdots, \mathcal{T}_{B^U+B^V} \right),
\end{aligned} \tag{3}$$

where $W^* \in \mathbb{C}^{M \times N}$ is the trained ONN weight matrix for accuracy evaluation. There are multiple hardware constraints $C_i$ that need to be honored. The main optimization variables are the architecture parameters $\alpha$ that determine the structure of the U and V circuits, which contain the circuit block counts $B^{U/V}$, the coupler layer transmissions $\mathcal{T}$, and the waveguide crossings $\mathcal{P}$. The diagonal $\Sigma$ belongs to the ONN weights, not the circuit topology. The PTC topology $\alpha$ **is shared** by all matrices in the neural network layers, not layer-specific.

Figure 2: Spearman coefficients for different accuracy proxies.

### 3.3 Multi-Objective Evolutionary PTC Search

Algorithm 1 Evolutionary PTC topology search algorithm

| Line   | Statement |
|--------|-----------|
| Input  | Maximum iteration count $I_{max}$, second-phase iteration count $I_{phase2}$, population size $P_0$, initial mutation rate $p_{mu}$, constant crossover rate $p_{co}$, baseline topologies $\alpha_0$ |
| Output | Optimal solution set $S$ |
| 1:     | $S \leftarrow \alpha_0 + \text{randomInit}(P_0)$ |
| 2:     | **for** $i \leftarrow 1 \cdots I_{max}$ **do** |
| 3:     | $\quad OffSpring \leftarrow \text{mutateAndCrossover}(S, p_{mu}, p_{co})$ |
| 4:     | $\quad OffSpring \leftarrow \text{checkConstraints}(OffSpring)$ |
| 5:     | $\quad S \leftarrow S + OffSpring$ |
| 6:     | $\quad S.\text{assessSortSelect}()$ |
| 7:     | $\quad$ **if** $i \leq I_{max} - I_{phase2}$ **then** |
| 8:     | $\qquad p_{mu} \leftarrow \text{cosineDecayScheduler.step}()$ |

To explore the *highly discrete* and *multi-objective* PTC topology design space, a suitable search engine needs to meet the following requirements: **①** it can handle multiple (non-)differentiable objectives; **②** it should ensure sufficient exploration with balanced exploitation to cover the huge design space and generate multiple Pareto-optimal candidate solutions; **③** it must ensure feasibility under multiple constraints; **④** it should be efficient, especially avoiding high model training cost to find the trained weights  $W^*$ . To satisfy all the above requirements, we propose a zero-shot gradient-free framework based on a customized multi-objective evolutionary search algorithm to efficiently solve the **constrained combinatorial optimization problem with multiple objectives** as shown in Alg. 1.

3.3.1 Objective Definition. Multiple metrics must be considered in a balanced fashion when assessing a PTC topology. We employ three scores simultaneously to evaluate the performance, including task accuracy, compute density, and energy efficiency.

Accuracy Score. The accuracy impact of a PTC needs to be measured by mapping it onto an ONN and training and evaluating the model on a dataset. To avoid training costs and expedite the search process, we design a comprehensive Accuracy Score as a training-free proxy, which evaluates both the trainability and expressivity of PTC topology.

We select 5 candidate scores to construct the proxy: (1) Param-Score, (2) Sparsity-Score, (3) Zico-Score [17], (4) Zen-Score [18], and (5) Gradient Norm [19]. The first two scores are motivated by the insight that a matrix's **expressivity** is typically related to its number of independent parameters and its sparsity. For Param-Score, the phases $\Phi$ on programmable PS are the only trained parameters in the unitaries. We count the total number of PS (directly connected phase shifters can be merged into one), normalized to the matrix size $K^2$, as the Param-Score. Even with many parameters, if there is insufficient cross-channel interaction via couplers, the matrix can be a block-diagonal matrix with high sparsity. Hence, as a complementary score, we use the Sparsity-Score of W (higher means denser) to evaluate its expressivity.

![](_page_2_Figure_12.jpeg)

Figure 3: Gene-to-circuit mapping.

The last three are commonly used accuracy proxies from zero-shot neural architecture search (NAS), all of which focus on gradient/Lipschitz-related properties to evaluate **trainability**. Figure 2 shows the Spearman correlation between each score and the test accuracy obtained from extensive training, which measures how accurately each score predicts the *relative accuracy ranking* of different PTC topologies. *Zico-Score, Param-Score, and Sparsity-Score* have the highest Spearman coefficients, indicating more accurate predictions. We define an Accuracy Score $\mathcal{S}$ as a linear weighted combination of these three scores. The optimal combination coefficients are found by solving the following optimization problem to maximize the Spearman correlation,

$$\begin{aligned} &\max_{c_i}\ \text{Spearman}\left(\mathcal{S}(\alpha), Acc(W^{*\alpha})\right), \quad \mathcal{S}(\alpha) = \sum_i c_i S_i, \\ &S_i \in \{\text{Zico-Score}, \text{Param-Score}, \text{Sparsity-Score}\}, \end{aligned} \tag{4}$$

in which $Acc(W^{*\alpha})$ is the actual test accuracy of a trained ONN model. The final Accuracy Score $\mathcal{S}(\alpha)$ is given by:

$$\mathcal{S}(\alpha) = 0.015 \cdot S_{\text{Zico}}(\alpha) + 0.561 \cdot S_{\text{Param}}(\alpha) + 0.175 \cdot S_{\text{Sparsity}}(\alpha) \tag{5}$$

**Compute Density (CD)**. Compute density (CD) is a commonly used performance metric for AI accelerators, which measures the computing speed per unit chip area, typically in units of TOPS/mm<sup>2</sup>. A higher CD means better area efficiency. For a given topology $\alpha$, the compute density $CD(\alpha)$ is as follows:

$$CD(\alpha) = 2K^2 / (A(\alpha) \times \tau(\alpha)), \tag{6}$$

where $A(\alpha)$ represents the estimated area of $\alpha$ and $\tau(\alpha)$ represents the estimated latency of the PTC. The area and latency estimates are detailed later.

**Energy Efficiency (EE)**. Energy efficiency (EE), typically in the unit of TOPS/Watt, is a vital objective for efficient AI hardware. The formula to calculate energy efficiency  $EE(\alpha)$  is as follows:

$$EE(\alpha) = 2K^2 / (P(\alpha) \times \tau(\alpha)), \tag{7}$$

where $P(\alpha)$ represents the estimated power, which is detailed later.
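As a worked example of Eqs. (6) and (7), the Python sketch below converts the $2K^2$ operations per latency period into TOPS/mm<sup>2</sup> and TOPS/W. The function names and the example numbers (a 16×16 PTC with 20 mm<sup>2</sup> area, 0.5 W power, and 100 ps latency) are illustrative assumptions, not measured results.

```python
def compute_density(k, area_mm2, latency_s):
    """Eq. (6): 2K^2 operations per latency per area, in TOPS/mm^2."""
    return 2 * k * k / (area_mm2 * latency_s) / 1e12


def energy_efficiency(k, power_w, latency_s):
    """Eq. (7): 2K^2 operations per latency per watt, in TOPS/W."""
    return 2 * k * k / (power_w * latency_s) / 1e12


# Hypothetical 16x16 PTC: 512 operations every 100 ps is 5.12 TOPS.
cd = compute_density(16, area_mm2=20.0, latency_s=100e-12)    # about 0.256 TOPS/mm^2
ee = energy_efficiency(16, power_w=0.5, latency_s=100e-12)    # about 10.24 TOPS/W
```

Both metrics share the latency term, so a topology that is too deep to finish within one clock cycle is penalized in compute density and energy efficiency at the same time.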

For each potential solution, we evaluate these three objectives independently. The solutions are then ranked and selected based on their combined performance across Accuracy Score, Compute Density, and Energy Efficiency. Instead of a weighted sum of those three objectives with heuristic preference, we **simultaneously maximize three scores** to obtain multiple Pareto-optimal points, from which designers can further select suitable designs by only searching once.
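The difference from a weighted sum can be illustrated with a small Pareto-dominance check. The sketch below treats each solution as a tuple (Accuracy Score, CD, EE) to be maximized; the function names and the sample values are illustrative only.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives maximized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (Accuracy Score, CD, EE) triples.
candidates = [(0.9, 0.2, 5.0), (0.8, 0.3, 6.0), (0.7, 0.1, 4.0), (0.9, 0.2, 4.0)]
front = pareto_front(candidates)   # keeps the first two candidates
```

The first two candidates trade accuracy against density and efficiency, so neither dominates the other and both survive; a single weighted sum would have returned only one of them for any fixed choice of weights.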

3.3.2 Gene Encoding. To facilitate evolutionary search, we create a compact gene representation to encode the topology of a PTC, as shown in Fig. 3. The gene starts with a number *B* to indicate that the first *B* blocks are active, followed by $B_{\max}$ block encodings, each carrying multi-port DC placement information and waveguide permutation indices.

Table 1: Mutation operators for different components.

| Type  | Mutation Ops | Description                                                                                         |
|-------|--------------|-----------------------------------------------------------------------------------------------------|
| DC    | R2A1         | Remove 2 and add 1 DC. E.g., $[2,4,2] \rightarrow [1,1,1,1,1,1,2] \rightarrow [1,4,1,2]$            |
| DC    | A2R1         | Add 2 and remove 1 DC. E.g., $[1,1,2,1,1,1,1] \rightarrow [1,1,2,2,2] \rightarrow [1,1,2,1,1,2]$    |
| DC    | Move         | Move one DC to another position. E.g., $[1,2,2,2,1] \rightarrow [1,2,2,1,2]$                        |
| DC    | RS           | Resample a DC array                                                                                 |
| CR    | AddCR        | Add a random number of CRs. E.g., add 2 CRs: $[0,1,2,3] \rightarrow [0,3,1,2]$                      |
| CR    | ReduceCR     | Reduce a random number of CRs. E.g., reduce 1 CR: $[0,2,1,3] \rightarrow [0,1,2,3]$                 |
| Block | AddBlock     | Copy a random number of blocks from the front of the gene and add them to the end                   |
| Block | ReduceBlock  | Remove a random number of blocks from the end of the gene                                           |

**Block Encoding**. The first integer in a gene is the number of effective blocks in $\alpha$. The first B/2 blocks construct U, while the rest are for V.

**DC Array Encoding**. For the DC array, we use a sequence of positive integers $\{(N_c^1, \dots, N_c^n) \mid \sum_{i=1}^n N_c^i = K, N_c^i \in \mathbb{N}_+\}$, where each integer corresponds to a specific type of multi-port DC ($N_c > 1$) or a waveguide ($N_c = 1$). For instance, a 3-port DC is denoted as 3. This representation is easy to parse and manipulate during the evolutionary process.

**CR Array Encoding**. The permutation indices, i.e., the positions of '1' in the permutation matrix $\mathcal{P}$, are compact representations of waveguide routing solutions. All feasible solutions can be efficiently accessed by re-ordering the indices.
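One possible in-memory form of this gene is sketched below in Python. The `Block` and `Gene` classes, the `is_valid` check, and the `split_u_v` helper are hypothetical names chosen for illustration; the validity rules simply restate the encoding above (DC entries sum to $K$, CR entries form a permutation of the $K$ wire indices).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    dc: List[int]   # DC array: port counts, 1 denotes a plain waveguide
    cr: List[int]   # CR array: permutation indices of the crossing layer


@dataclass
class Gene:
    num_active: int       # B: only the first B blocks are active
    blocks: List[Block]   # all stored block encodings


def is_valid(gene, k, allowed_ports):
    """Check the structural rules of the encoding for a size-k PTC."""
    if not 0 < gene.num_active <= len(gene.blocks):
        return False
    for blk in gene.blocks:
        if sum(blk.dc) != k:
            return False
        if any(n != 1 and n not in allowed_ports for n in blk.dc):
            return False
        if sorted(blk.cr) != list(range(k)):
            return False
    return True


def split_u_v(gene):
    """The first B/2 active blocks build U; the remaining active blocks build V."""
    active = gene.blocks[:gene.num_active]
    half = gene.num_active // 2
    return active[:half], active[half:]
```

Keeping inactive blocks in the gene means that changing *B* only toggles how many stored blocks are used, without discarding their DC and CR contents.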

3.3.3 Population Initialization. The population size is $P_0$, and we randomly sample the number of active blocks *B* and the DC placements for each individual. Randomly permuted indices for CR layers produce too many crossings. Thus, we heuristically limit the number of crossings in each CR layer so that it does not exceed the maximum number of crossings in a butterfly mesh, i.e., K(K/2-1)/4. Initialized populations honor all hardware constraints. Manual designs are also added as initial solutions.
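A minimal sketch of crossing-limited CR initialization is given below. It assumes that the number of crossings of a CR layer equals the number of inversions of its permutation indices, and it samples a layer by starting from the identity and adding one inversion at a time; both the inversion-count assumption and the sampling procedure are illustrative choices rather than the exact procedure used in the search.

```python
import random


def count_crossings(perm):
    """Number of inversions, assumed here to equal the crossing count."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])


def max_crossings(k):
    """Crossing limit K(K/2 - 1)/4 quoted for the butterfly mesh."""
    return k * (k // 2 - 1) // 4


def random_cr_layer(k, rng):
    """Sample permutation indices whose crossing count stays within the limit."""
    perm = list(range(k))
    target = rng.randint(0, max_crossings(k))
    while count_crossings(perm) < target:
        # Swapping an adjacent in-order pair adds exactly one inversion.
        candidates = [i for i in range(k - 1) if perm[i] < perm[i + 1]]
        i = rng.choice(candidates)
        perm[i], perm[i + 1] = perm[i + 1], perm[i]
    return perm
```

For example, `random_cr_layer(8, random.Random(0))` returns a permutation of the 8 wire indices with at most `max_crossings(8) == 6` inversions, whereas a uniformly random permutation of 8 wires has 14 inversions on average.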

*3.3.4 Mutation & Crossover.* We customize global/local mutation and crossover operators to ensure better global coverage of the large design space while facilitating better convergence with local search.

**Mutation Operator**. We designed three types of mutation operators based on the type of devices they apply to, summarized in Table 1. For **DC Mutation**, we design 4 operators: (A2R1, R2A1, Move) for local adjustment, and RS to escape the local optima. For **CR Mutation**, we design 2 operators: AddCR and ReduceCR. Applying  $\Delta N_c$  steps of bubble sort (descending) to the CR-indices is equivalent to reducing the same amount of crossings. Ascending sort has the opposite effect of increasing crossings. The maximum crossings are still limited by the value in butterfly meshes, as explained during population initialization. For **Block Mutation**, we have 2 operators: AddBlock and ReduceBlock, which mainly adjust the circuit depth. For DC and CR arrays, before applying mutation, we first perform a **legality check** to ensure that the operator can be successfully applied to the gene segment. Then, with a mutation probability  $p_{mu}$ , we randomly select one legal operator from the operator set and apply it to the gene.
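The DC operators and their legality check can be sketched on the integer DC-array encoding as follows. The helper names are hypothetical; removing a DC is modeled as replacing it with plain waveguides (1s), and adding an $n$-port DC is only legal where $n$ consecutive waveguides are available.

```python
def remove_dc(dc_array, index):
    """Replace the multi-port DC at `index` with plain waveguides (1s)."""
    n_ports = dc_array[index]
    if n_ports == 1:
        raise ValueError("entry is already a waveguide")
    return dc_array[:index] + [1] * n_ports + dc_array[index + 1:]


def legal_add_positions(dc_array, n_ports):
    """Start indices where n_ports consecutive waveguides can host a new DC."""
    return [i for i in range(len(dc_array) - n_ports + 1)
            if all(v == 1 for v in dc_array[i:i + n_ports])]


def add_dc(dc_array, start, n_ports):
    """Merge n_ports consecutive waveguides starting at `start` into one DC."""
    if start not in legal_add_positions(dc_array, n_ports):
        raise ValueError("illegal placement")
    return dc_array[:start] + [n_ports] + dc_array[start + n_ports:]


def r2a1(dc_array, remove_indices, start, n_ports):
    """R2A1: remove two DCs, then add one."""
    out = list(dc_array)
    # Removing right-to-left keeps the remaining indices valid.
    for idx in sorted(remove_indices, reverse=True):
        out = remove_dc(out, idx)
    return add_dc(out, start, n_ports)
```

With these helpers, `r2a1([2, 4, 2], [0, 1], 1, 4)` passes through `[1, 1, 1, 1, 1, 1, 2]` and returns `[1, 4, 1, 2]`, matching the R2A1 example in Table 1. Every operator preserves the sum of the array, so the mutated layer still covers all $K$ wires.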

**Crossover Operator**. We customize crossover operators for solution interpolation, as illustrated in Fig. 4. For **DC Crossover**, we identify all potential cutting points to divide parent genes into *border-aligned* segments, which avoids cutting through multi-port couplers. Sliced segments are swapped with a probability of 0.5. For **CR Crossover**, to ensure *legal* indices while *preserving the relative order* in parent genes, we select even-sized disjoint indices from two parents, as shown in Fig. 4b. Then, we insert the selected indices from the other parent into the empty slots of one parent and generate two offspring. For **Block Crossover**, we randomly swap two active (effective) blocks at the same position with a probability of 0.5 to avoid generating illegal genes.
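The border-aligned DC crossover can be sketched in Python as follows. Cutting points are taken to be the wire positions at which both parents have a device border, i.e., the common prefix sums of the two DC arrays; the function names and the `swap_prob` parameter are illustrative.

```python
import random
from itertools import accumulate


def cut_points(parent_a, parent_b):
    """Wire positions where both DC arrays have a device border."""
    return sorted(set(accumulate(parent_a)) & set(accumulate(parent_b)))


def slice_by_borders(dc_array, borders):
    """Split a DC array into segments that end exactly at the given borders."""
    segments, current, covered = [], [], 0
    targets = iter(borders)
    target = next(targets)
    for n_ports in dc_array:
        current.append(n_ports)
        covered += n_ports
        if covered == target:
            segments.append(current)
            current = []
            target = next(targets, None)
    return segments


def dc_crossover(parent_a, parent_b, rng, swap_prob=0.5):
    """Swap border-aligned segments between two parents of equal total width."""
    borders = cut_points(parent_a, parent_b)
    child_a, child_b = [], []
    for seg_a, seg_b in zip(slice_by_borders(parent_a, borders),
                            slice_by_borders(parent_b, borders)):
        if rng.random() < swap_prob:
            seg_a, seg_b = seg_b, seg_a
        child_a += seg_a
        child_b += seg_b
    return child_a, child_b
```

For parents `[2, 4, 2]` and `[1, 1, 2, 2, 2]`, the common borders are at wires 2, 6, and 8, so the 4-port coupler is exchanged only as a whole against `[2, 2]`. Because aligned segments span the same wires, both offspring still sum to $K$.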

![](_page_3_Figure_8.jpeg)

![](_page_3_Figure_9.jpeg)

Figure 5: Layout for area/latency estimation. A compact crossing array layout that occupies the leftmost slots is pre-defined.

*3.3.5 Cost Estimation.* Here, we explain a detailed estimation of hardware cost, including area, power, and latency.

**Area Estimation**. For a $K \times K$ PTC, we estimate the area cost of all its electrical and optical components as follows:

$$A(\alpha) = A_U(\alpha) + A_V(\alpha) + A_{\Sigma} + K(A_{\text{TIA}} + A_{\text{PD}} + A_{\text{MZM}} + A_{\text{DAC}} + A_{\text{ADC}}), \tag{8}$$

where $A_{\text{TIA}}$, $A_{\text{PD}}$, $A_{\text{MZM}}$, $A_{\text{ADC}}$, and $A_{\text{DAC}}$ are the area costs of the transimpedance amplifier (TIA), photodetector (PD), high-speed Mach-Zehnder modulator (MZM), analog-to-digital converter (ADC), and digital-to-analog converter (DAC). The area of the photonic part is:

$$\begin{aligned} A_{U/V}(\alpha) &= L_{PS}(W_{PS} + (K-1)\Delta W) + L_{DC}(K-1)\Delta W \\ &+ \left(N_c L_{CR} + (N_c - 1)\Delta L_{CR}\right) \left(N_r W_{CR} + (N_r - 1)\Delta W_{CR}\right) \\ &+ \left(3(K-1)\Delta W\Delta L + W_{PS}\Delta L\right), \\ A_{\Sigma} &= \left((2K-1)\Delta W + W_{PS}\right) \cdot (L_{PS} + 2\Delta L) \\ &+ \left((K-1)\Delta W + L_Y\right) (2L_Y + \Delta W), \end{aligned} \tag{9}$$

where *L* and *W* represent the device length and width of the phase shifter (PS), coupler (DC), crossing (CR), and y-branch (Y). $\Delta L$ and $\Delta W$ are spacings. $N_c$ and $N_r$ represent the number of columns and rows occupied by our predefined compact triangular crossing array layout, which fills the leftmost column first and expands to the right. Figure 5 shows the details for estimating the hardware cost of the unitary matrices. **Our area estimation considers the actual chip layout and practical spacing, which is much more accurate than simply summing up all device footprints as in prior work [15].** All dimensions for optical components can also be obtained from the GF foundry PDK. We set $\Delta L = 20\,\mu m$, $\Delta W = 100\,\mu m$, and $W_{CR} = L_{CR} = 10\,\mu m$.


![](_page_4_Figure_1.jpeg)

Figure 6: Population size of 40 and Iteration time of 80 achieve a balance between exploration and efficiency.

![](_page_4_Figure_3.jpeg)

Figure 7: Initial mutation rate of 0.1 gives a good exploration.

**Power Estimation**. The PTC power is estimated as follows [4]:

$$P(\alpha) = P_{\text{laser}} + K \cdot (P_{\text{MZM}} + P_{\text{DAC}} + P_{\text{ADC}} + P_{\text{TIA}} + P_{\text{PD}}). \tag{10}$$

The formula for laser power is $P_{\text{laser}} = \frac{2^{b} \cdot 10^{(S_{PD} + IL)/10}}{\eta}$, where $\eta$ is the wall-plug efficiency, $S_{PD}$ is the PD sensitivity, $IL$ is the insertion loss of the circuit, and *b* refers to the ADC bit resolution. Given working frequency *f* and input bitwidth *b*, the power for the DAC is derived by $P_{DAC} = \frac{b_0 2^b f}{b 2^{b_0} f_s} \cdot P_{DAC0}$, where $P_{DAC0}$ is the power of the DAC at sampling rate $f_s$ and $b_0$-bit precision. The ADC power is derived by $P_{ADC} = \frac{b_0 f}{b f_s} \cdot P_{ADC0}$, where $P_{ADC0}$ is the ADC power at sampling rate $f_s$ and $b_0$-bit precision. $P_{\Sigma}$ and $P_{PS}$ consider the static power of all phase shifters.

**Latency Estimation**. The PTC latency is determined by the optical path delay, input modulation, and readout delay as follows:

$$\begin{aligned} \tau(\alpha) &= \max\left(f^{-1},\ n_{\rm g}L_{\rm path}/c_0 + \tau_{DAC} + \tau_{PD}\right), \\ L_{\text{path}} &= \sum_{b=1}^{B} \left(L_{PS}^{b} + L_{DC}^{b,max} + L_{CR}^{b,max} + 3\Delta L\right), \end{aligned} \tag{11}$$

where the clock rate is set to $f=10$ GHz, $c_0$ is the speed of light, $n_g$ is the group index, $L_{\text{path}}$ is the longest optical path length, and $\tau_{DAC}$ and $\tau_{PD}$ are the delays of the DAC and PD, both set to 10 ps. For the estimation of the longest path length $L_{\text{path}}$, we consider the worst-case scenario. We sum up the PS length, largest coupler length, longest waveguide routing path, and $\Sigma$ matrix length to get $L_{\text{path}}$. If the optical path delay of a very deep circuit cannot be hidden within one cycle $(f^{-1})$, the clock frequency will be reduced to accommodate the latency accordingly [20].
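The laser power expression and the latency rule of Eq. (11) reduce to a few lines of arithmetic, sketched below in Python. The function names, the choice of micrometers for lengths, the interpretation of $S_{PD}$ in dBm (which yields milliwatts), and all example values are assumptions for illustration; the per-block sum follows Eq. (11) as written.

```python
C0 = 299_792_458.0  # speed of light in vacuum, m/s


def laser_power_mw(bits, pd_sensitivity_dbm, insertion_loss_db, wall_plug_eff):
    """P_laser = 2^b * 10^((S_PD + IL) / 10) / eta; in mW if S_PD is in dBm."""
    return 2 ** bits * 10 ** ((pd_sensitivity_dbm + insertion_loss_db) / 10) / wall_plug_eff


def longest_path_um(blocks, spacing_um):
    """Eq. (11): per block, PS length + longest DC + longest CR route + 3 * spacing."""
    return sum(l_ps + l_dc + l_cr + 3 * spacing_um for l_ps, l_dc, l_cr in blocks)


def latency_s(path_um, group_index, clock_hz=10e9, tau_dac=10e-12, tau_pd=10e-12):
    """Latency is one clock cycle unless the optical path delay exceeds it."""
    optical_delay = group_index * path_um * 1e-6 / C0
    return max(1.0 / clock_hz, optical_delay + tau_dac + tau_pd)
```

With a hypothetical group index of 4 and a 480 µm path, the optical delay is only a few picoseconds, so the latency stays at one 100 ps clock cycle; a path of several centimeters would exceed the cycle and lower the effective clock rate.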

3.3.6 Two-stage Pareto Front Search Strategy: NSGA-II. We adopt the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [21] as the search engine to handle multi-objective optimization. Each search iteration doubles the population size by generating new legal solutions via crossover and mutation, and then selects solutions from the superior Pareto fronts that contribute the most to solution diversity to maintain a constant population size.
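The selection step can be sketched as a simplified NSGA-II-style routine in Python: solutions are split into successive non-dominated fronts, whole fronts are kept while they fit, and the last front is truncated by crowding distance to favor diversity. This is a compact illustration of the general technique with a quadratic-time sort, not the exact implementation used in the search; all objectives are maximized.

```python
def dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_sort(objs):
    """Split solution indices into fronts; the first front is the Pareto front."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts


def crowding_distance(objs, front):
    """Larger distance means a less crowded region of the front."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[0])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        lo, hi = objs[ordered[0]][m], objs[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")  # keep the extremes
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (objs[nxt][m] - objs[prev][m]) / (hi - lo)
    return dist


def select(objs, size):
    """Return the indices of the `size` solutions kept for the next iteration."""
    chosen = []
    for front in non_dominated_sort(objs):
        if len(chosen) + len(front) <= size:
            chosen += front
        else:
            dist = crowding_distance(objs, front)
            front = sorted(front, key=lambda i: dist[i], reverse=True)
            chosen += front[:size - len(chosen)]
            break
    return chosen
```

Boundary solutions of a front receive infinite distance, so the extreme trade-offs in each objective are always retained before interior points are considered.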

To prioritize exploration at the beginning and gradually focus on local exploitation, we divide the search process into two phases. In the first phase, we use all mutations for DC, CR, and Blocks with a cosine-decayed mutation rate, which allows us to explore better genes across the search space while ensuring that high-quality genes do not undergo significant mutations. In the second phase, we set all mutation rates to 0.02 and remove the two block-level mutation operators and the RS operator for DC to avoid major gene changes. This ensures that only minor mutations occur for local search toward optimal solutions.
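The two-phase schedule can be sketched as a function of the iteration index. The initial rate of 0.1 and the phase-2 rate of 0.02 come from the text; the end value of the cosine decay and the length of the second phase are not stated here, so the sketch assumes that the rate decays to the phase-2 value and uses a hypothetical `phase2_iters` argument.

```python
import math

ALL_OPERATORS = {"A2R1", "R2A1", "Move", "RS", "AddCR", "ReduceCR",
                 "AddBlock", "ReduceBlock"}


def mutation_rate(iteration, max_iters, phase2_iters, initial_rate=0.1, final_rate=0.02):
    """Cosine-decayed rate in phase 1, constant final_rate in phase 2 (1-based iteration)."""
    phase1_iters = max_iters - phase2_iters
    if iteration > phase1_iters:
        return final_rate
    progress = (iteration - 1) / max(phase1_iters - 1, 1)
    return final_rate + 0.5 * (initial_rate - final_rate) * (1 + math.cos(math.pi * progress))


def active_operators(iteration, max_iters, phase2_iters):
    """Phase 2 drops the block-level operators and the DC resampling operator."""
    if iteration > max_iters - phase2_iters:
        return ALL_OPERATORS - {"RS", "AddBlock", "ReduceBlock"}
    return set(ALL_OPERATORS)
```

With 80 iterations and an assumed second phase of 20 iterations, the rate starts at 0.1, reaches 0.02 at iteration 60, and stays there while only the local DC and CR operators remain active.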

![](_page_4_Figure_12.jpeg)

Figure 8: Validate each mutation operator for (a) DC and (b) CR array. We remove one mutation operator at a time and observe the distribution of the objective values of the final populations.

![](_page_4_Figure_14.jpeg)

Figure 9: Compare the search performance of a constant scheduler and our two-stage cosine decay scheduler.

## 4 EXPERIMENTAL RESULTS

#### 4.1 Experiment Setup

**Datasets**. During the search phase, we use the MNIST dataset to estimate test accuracy. The solutions found are then evaluated on the MNIST [22], FMNIST [23], SVHN [24], and CIFAR10 [25] datasets.

**NN Models**. The PTC topology is searched on a 2-layer CNN model and the MNIST dataset without extensive training. The searched topology is then applied to other models/datasets.

**Searching Settings**. We set the population size to 40, the maximum number of iterations to 80, and the initial mutation rate to 0.1. For 16×16 PTCs, we use (2-port, 8-port) DCs during search. For K=8 and 32, we use (2-port, 4-port) DCs for 8×8 PTCs and (4-port, 16-port) DCs for 32×32 PTCs. We apply an area constraint of [18.31, 24.02] mm<sup>2</sup> (from 80% of the butterfly optical area up to 50% of the MZI array optical area, plus the electrical area), a power constraint of [50, 1000] mW, and a latency constraint of [100, 1000] ps to the search process. We set f=10 GHz and the resolution to 4-bit. For device cost, we use the GF foundry PDK [26] and a customized PDK [27]. Electrical devices are the same as in [27].
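The area window can be reproduced from the 16×16 baseline areas reported in Table 2 (butterfly optical area 2.43 mm<sup>2</sup>, MZI array optical area 15.32 mm<sup>2</sup>, electrical area 16.36 mm<sup>2</sup>). The sketch below computes the window and checks a candidate against all three constraints. The function names are hypothetical, and treating the ranges as inclusive hard limits is an assumption.

```python
def area_bounds(butterfly_optical, mzi_optical, electrical,
                lo_frac=0.8, hi_frac=0.5):
    """Area window in mm^2: 80% of butterfly optical area up to 50% of MZI
    optical area, each with the electrical area added."""
    return (lo_frac * butterfly_optical + electrical,
            hi_frac * mzi_optical + electrical)


def within(value, bounds):
    """Inclusive range check (ASSUMPTION: bounds are hard, inclusive limits)."""
    lo, hi = bounds
    return lo <= value <= hi


def is_feasible(area, power, latency, area_range,
                power_range=(50.0, 1000.0), latency_range=(100.0, 1000.0)):
    """Check area (mm^2), power (mW), and latency (ps) against the search constraints."""
    return (within(area, area_range)
            and within(power, power_range)
            and within(latency, latency_range))


# 16x16 PTC, GF PDK, baseline areas taken from Table 2.
AREA_16 = area_bounds(butterfly_optical=2.43, mzi_optical=15.32, electrical=16.36)
```

With the rounded table values, `AREA_16` evaluates to roughly (18.30, 24.02), which matches the stated window up to rounding. A design such as ADEPT-Z-a0 (2.37+16.36 mm<sup>2</sup>, 282.19 mW, 100 ps) passes the check, while the MZI array (15.32+16.36 mm<sup>2</sup>) exceeds the area limit.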

## 4.2 Ablation Studies

**Population Size and Search Steps**. We use the product of the three objectives to reflect solution quality, and $P_{avg}$ is its average value across the current population. Figure 6 shows that a population of 40 evolved for 80 steps gives the best balance between quality and runtime.

**Mutation Rate and Operators**. Figure 7 shows that an initial mutation rate of 0.1 best balances exploration and convergence. To verify the impact of the designed mutation operators, we remove one operator at a time; Fig. 8a and Fig. 8b show that all operators positively affect solution quality for both DC and CR.

![](_page_5_Figure_0.jpeg)

Figure 10: Comparison of the observed solution points between random search and evolutionary search.

![](_page_5_Figure_2.jpeg)

Figure 11: Visualization of ADEPT-Z-a0 for K=8, 16, and 32.

**Two-Stage Search**. Our two-stage search using the cosine mutation rate scheduler, as shown in Fig. 9, converges faster to higher-quality solutions than the single-stage search with a constant mutation rate.

#### 4.3 Main Results

In Fig. 10, our solutions are Pareto-optimal as they dominate random designs and prior manual designs in the accuracy-density-efficiency space. To better evaluate the performance, we introduce two comprehensive metrics: **Area-Energy Efficiency (AEE) and Accuracy-weighted AEE (AAEE), i.e., Accuracy-AEE product**. From the final Pareto front, we select four designs for each PTC size, named ADEPT-Z-a0 to ADEPT-Z-a3, and compare them to manual designs, i.e., MZI array [1], Butterfly mesh [4], and interlaced MMI array [10] in Table 2.
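To make the metrics concrete, the sketch below derives CD, EE, AEE, and AAEE from area, power, latency, and accuracy. It assumes that one K×K matrix-vector product counts as 2K<sup>2</sup> operations per cycle and that the effective clock is limited by the latency when it exceeds one 10 GHz cycle. Both are assumptions that are not stated explicitly here, although they reproduce the Table 2 entries that were spot-checked (for example, the 32×32 Butterfly mesh). The function name is hypothetical.

```python
def ptc_metrics(k, area_mm2, power_mw, latency_ps, accuracy_pct, f_hz=10e9):
    """Return (CD, EE, AEE, AAEE) for a KxK photonic tensor core.

    ASSUMPTIONS: throughput = 2 * K^2 * f_eff operations per second, and
    f_eff = min(f, 1 / latency), i.e. the clock slows down for deep circuits.
    """
    f_eff = min(f_hz, 1e12 / latency_ps)       # effective clock in Hz
    tops = 2 * k * k * f_eff / 1e12            # tera-operations per second
    power_w = power_mw / 1e3
    cd = tops / area_mm2                       # compute density, TOPS/mm^2
    ee = tops / power_w                        # energy efficiency, TOPS/W
    aee = tops / (power_w * area_mm2)          # area-energy efficiency, TOPS/W/mm^2
    aaee = accuracy_pct / 100.0 * aee          # accuracy-weighted AEE
    return cd, ee, aee, aaee


# 32x32 Butterfly mesh, GF PDK (inputs taken from Table 2).
cd, ee, aee, aaee = ptc_metrics(k=32, area_mm2=6.11 + 32.72, power_mw=487.36,
                                latency_ps=122.13, accuracy_pct=97.88)
```

Under these assumptions the call above yields values that agree with the corresponding Table 2 column (CD 0.432, EE 34.407, AEE 0.886, AAEE 0.867) to within rounding.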

ADEPT-Z-a0 is the best solution in terms of AAEE on all three PTC sizes. Our solutions balance expressivity and hardware cost better than MZI and MMI arrays, showing an overall  $2.47 \times$  higher AAEE. We observe that butterfly solutions lie roughly on the Pareto front: our solutions improve on Butterfly by  $1.03 \times$  in AAEE, while our method provides much more diverse designs that cover various accuracy/power/area/latency requirements.
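The aggregate ratios quoted above can be checked against the AAEE rows of Table 2. The sketch below assumes that "overall" means the arithmetic mean of the per-size, per-baseline AAEE ratios of ADEPT-Z-a0. This averaging convention is an assumption, chosen because it matches the reported figures.

```python
# AAEE values copied from Table 2 (GF PDK), keyed by PTC size K.
AAEE = {
    8:  {"MZI": 0.729, "Butterfly": 0.978, "MMI": 0.760, "ADEPT-Z-a0": 0.994},
    16: {"MZI": 0.509, "Butterfly": 0.944, "MMI": 0.470, "ADEPT-Z-a0": 0.948},
    32: {"MZI": 0.283, "Butterfly": 0.867, "MMI": 0.184, "ADEPT-Z-a0": 0.927},
}


def mean_gain(baselines, ours="ADEPT-Z-a0", table=AAEE):
    """Arithmetic mean of ours/baseline over all sizes and listed baselines.

    ASSUMPTION: this is the averaging convention behind the quoted ratios.
    """
    ratios = [row[ours] / row[b] for row in table.values() for b in baselines]
    return sum(ratios) / len(ratios)


gain_vs_arrays = mean_gain(["MZI", "MMI"])       # compare with the quoted 2.47x
gain_vs_butterfly = mean_gain(["Butterfly"])     # compare with the quoted 1.03x
```

Under this convention, the mean gain over the MZI and MMI arrays comes out at about 2.48 and the gain over Butterfly at about 1.03, in line with the reported numbers.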

We visualize ADEPT-Z-a1 in Fig. 11. Multi-port DCs are frequently used for efficient cross-channel interaction. As a result, waveguide crossings are used sparingly, mixing signals only when necessary, which reduces hardware cost. Note that we do not compare to gradient-based ADEPT [15] as **ADEPT cannot handle multi-port couplers or non-differentiable latency/area objectives**. Moreover, our method can find 40 Pareto-optimal solutions within 2.7 hours, 100× faster than ADEPT, which would require 40×8=320 hours even if it were applicable.
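For reference, the speedup follows directly from the two runtimes quoted above, taking 8 hours per ADEPT run as implied by the 40×8 product:

$$\frac{40 \times 8\ \text{h}}{2.7\ \text{h}} = \frac{320}{2.7} \approx 118.5,$$

so the quoted 100× is a conservative round figure.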

**Adapt PTCs to Different Foundry PDKs**. Our method can flexibly adapt to different device PDKs. We replaced the GF PDK with a customized PDK [27] in Table 3. For a 16×16 PTC, we applied a new area constraint of [2.208, 15.197] mm<sup>2</sup>. The best-performing searched solution, ADEPT-Z-a0, shows 8.26× higher AAEE than MZI and MMI arrays and 1.04× higher AAEE than the Butterfly mesh. Our solutions use multi-port couplers to enhance information mixing while having fewer blocks and crossings, thus achieving a better balance between expressivity and efficiency.

Table 2: Evaluate PTCs with different sizes using GF PDK in terms of area (optical+electrical) (mm<sup>2</sup>), power (mW), and latency (ps). We also show compute density (CD) (TOPS/mm<sup>2</sup>), energy efficiency (EE) (TOPS/W), area-energy efficiency (AEE) (TOPS/W/mm<sup>2</sup>), and accuracy-weighted AEE (AAEE).

| K  | Metrics   | MZI [1]     | Butterfly [4] | MMI [10]     | ADEPT-Z-a0 | ADEPT-Z-a1 | ADEPT-Z-a2 | ADEPT-Z-a3 |
|----|-----------|-------------|---------------|--------------|------------|------------|------------|------------|
| 8  | Area(O+E) | 3.79+8.18   | 0.92+8.18     | 3.57+8.18    | 0.73+8.18  | 0.83+8.18  | 0.94+8.18  | 1.57+8.18  |
|    | Power     | 141.09      | 141.48        | 141.45       | 141.92     | 141.32     | 141.98     | 142.74     |
|    | Latency   | 100.69      | 100.00        | 100.00       | 100.00     | 100.00     | 100.00     | 100.00     |
|    | Accuracy  | 98.68       | 98.33         | 98.72        | 98.22      | 98.13      | 98.30      | 98.38      |
| 16 | Area(O+E) | 15.32+16.36 | 2.43+16.36    | 21.16+16.36  | 2.37+16.36 | 2.39+16.36 | 3.38+16.36 | 3.44+16.36 |
|    | Power     | 209.83      | 283.15        | 266.64       | 282.19     | 282.20     | 284.18     | 284.48     |
|    | Latency   | 147.25      | 100.00        | 107.38       | 100.00     | 100.00     | 100.00     | 100.00     |
|    | Accuracy  | 98.74       | 98.16         | 98.58        | 97.83      | 97.69      | 98.27      | 98.24      |
| 32 | Area(O+E) | 61.56+32.72 | 6.11+32.72    | 142.58+32.72 | 5.62+32.72 | 6.46+32.72 | 8.53+32.72 | 9.77+32.72 |
|    | Power     | 316.34      | 487.36        | 394.52       | 563.50     | 563.52     | 563.69     | 563.71     |
|    | Latency   | 240.37      | 122.13        | 158.26       | 100.00     | 100.00     | 100.00     | 100.00     |
|    | Accuracy  | 98.85       | 97.88         | 98.65        | 97.77      | 97.62      | 97.73      | 97.98      |
| 8  | CD        | 0.106       | 0.141         | 0.109        | 0.144      | 0.142      | 0.140      | 0.131      |
|    | EE        | 9.010       | 9.047         | 9.049        | 9.019      | 9.058      | 9.016      | 8.967      |
|    | AEE       | 0.753       | 0.995         | 0.770        | 1.012      | 1.005      | 0.988      | 0.920      |
|    | AAEE      | 0.729       | 0.978         | 0.760        | 0.994      | 0.986      | 0.971      | 0.905      |
| 16 | CD        | 0.110       | 0.272         | 0.127        | 0.273      | 0.273      | 0.259      | 0.259      |
|    | EE        | 16.570      | 18.081        | 17.883       | 18.144     | 18.143     | 18.017     | 17.997     |
|    | AEE       | 0.523       | 0.962         | 0.477        | 0.969      | 0.968      | 0.913      | 0.909      |
|    | AAEE      | 0.509       | 0.944         | 0.470        | 0.948      | 0.946      | 0.897      | 0.893      |
| 32 | CD        | 0.090       | 0.432         | 0.074        | 0.534      | 0.523      | 0.497      | 0.482      |
|    | EE        | 26.934      | 34.407        | 32.802       | 36.344     | 36.343     | 36.332     | 36.332     |
|    | AEE       | 0.286       | 0.886         | 0.187        | 0.948      | 0.928      | 0.881      | 0.855      |
|    | AAEE      | 0.283       | 0.867         | 0.184        | 0.927      | 0.906      | 0.861      | 0.838      |

Table 3: 16×16 PTCs on customized PDKs with MNIST accuracy.

| Metrics   | MZI [1]   | Butterfly [4] | MMI [10]   | ADEPT-Z-a0 | ADEPT-Z-a1 | ADEPT-Z-a2 | ADEPT-Z-a3 |
|-----------|-----------|---------------|------------|------------|------------|------------|------------|
| Area(O+E) | 7.51+0.33 | 1.25+0.33     | 16.90+0.33 | 1.19+0.33  | 1.22+0.33  | 1.21+0.33  | 1.71+0.33  |
| Power     | 223.59    | 219.93        | 219.45     | 218.37     | 218.54     | 218.30     | 218.93     |
| Latency   | 100.00    | 100.00        | 100.00     | 100.00     | 100.00     | 100.00     | 100.00     |
| Accuracy  | 98.74     | 98.16         | 98.58      | 97.67      | 98.02      | 98.08      | 98.18      |
| CD        | 0.653     | 3.242         | 0.297      | 3.370      | 3.301      | 3.322      | 2.507      |
| EE        | 22.899    | 23.280        | 23.330     | 23.446     | 23.429     | 23.454     | 23.387     |
| AEE       | 2.919     | 14.744        | 1.354      | 15.434     | 15.105     | 15.215     | 11.450     |
| AAEE      | 2.882     | 14.473        | 1.336      | 15.074     | 14.806     | 14.923     | 11.242     |

Table 4:  $16 \times 16$  ADEPT-Z-a0 is searched on CNN-MNIST and adapted to new benchmarks with GF PDKs. AAEE is shown.

| Model    | Dataset | MZI [1] | Butterfly [4] | MMI [10] | ADEPT-Z-a0 |
|----------|---------|---------|---------------|----------|------------|
| CNN      | FMNIST  | 0.472   | 0.852         | 0.432    | 0.853      |
| VGG8     | CIFAR10 | 0.429   | 0.744         | 0.382    | 0.769      |
| ResNet20 | SVHN    | 0.491   | 0.884         | 0.445    | 0.890      |

**Generalizability to New ONNs and Datasets**. It is important that our searched topology can be **generalized** to new ONNs and datasets other than the ones used for the search. We train our searched PTC structures on various new benchmarks in Table 4. Though ADEPT-Z-a0 is searched on a 2-layer CNN and MNIST, it *maintains superior performance and efficiency* on more complicated models and datasets, showing an average of 1.6× higher AAEE than manual baselines.

# 5 CONCLUSION

In this work, we propose ADEPT-Z, a zero-shot multi-objective evolutionary circuit topology search framework that explores Pareto-optimal photonic tensor core designs. In an augmented design space with multi-port couplers, our customized evolutionary algorithm simultaneously optimizes accuracy, compute density, and efficiency, honoring various hardware constraints with balanced exploration and exploitation. With a search cost of less than 3 hours, our method obtains tens of diverse Pareto-optimal circuit topologies, outperforming state-of-the-art manual designs with 2× higher accuracy-weighted area-energy efficiency, with great flexibility and generalizability to more complicated applications and new hardware specifications.

#### REFERENCES

- [1] Yichen Shen, Nicholas C. Harris, Scott Skirlo, et al. Deep learning with coherent nanophotonic circuits. *Nature Photonics*, 2017.
- [2] Q. Cheng, J. Kwon, M. Glick, M. Bahadori, L. P. Carloni, and K. Bergman. Silicon Photonics Codesign for Deep Learning. *Proceedings of the IEEE*, 2020.
- [3] Bhavin J. Shastri, Alexander N. Tait, T. Ferreira de Lima, Wolfram H. P. Pernice, Harish Bhaskaran, C. D. Wright, and Paul R. Prucnal. Photonics for Artificial Intelligence and Neuromorphic Computing. *Nature Photonics*, 2021.
- [4] Chenghao Feng, Jiaqi Gu, Hanqing Zhu, Zhoufeng Ying, Zheng Zhao, et al. A compact butterfly-style silicon photonic-electronic neural chip for hardware-efficient deep learning. ACS Photonics, 9(12):3906–3916, 2022.
- [5] Zhihao Xu, Tiankuang Zhou, Muzhou Ma, ChenChen Deng, Qionghai Dai, and Lu Fang. Large-scale photonic chiplet taichi empowers 160-tops/w artificial general intelligence. *Science*, 384(6692):202–209, 2024.
- [6] Alexander N. Tait, Thomas Ferreira de Lima, Ellen Zhou, et al. Neuromorphic photonic networks using silicon photonic weight banks. Sci. Rep., 2017.
- [7] Xingyuan Xu, Mengxi Tan, Bill Corcoran, Jiayang Wu, Andreas Boes, Thach G. Nguyen, Sai T. Chu, Brent E. Little, Damien G. Hicks, Roberto Morandotti, Arnan Mitchell, and David J. Moss. 11 TOPS photonic convolutional accelerator for optical neural networks. *Nature*, 2021.
- [8] Johannes Feldmann, Nathan Youngblood, Maxim Karpov, Helge Gehring, Xuan Li, Maik Stappers, Manuel Le Gallo, Xin Fu, Anton Lukashchuk, Arslan Raja, Junqiu Liu, David Wright, Abu Sebastian, Tobias Kippenberg, Wolfram Pernice, and Harish Bhaskaran. Parallel convolutional processing using an integrated photonic tensor core. *Nature*, 2021.
- [9] H.H. Zhu, J. Zou, H. Zhang, et al. Space-efficient optical computing with an integrated chip diffractive neural network. *Nature Commun.*, 2022.
- [10] Kevin Zelaya, Matthew Markowitz, and Mohammad-Ali Miri. The goldilocks principle of learning unitaries by interlacing fixed operators with programmable phase shifters on a photonic chip. *Scientific Reports*, 14(1):10950, 2024.
- [11] H. Zhang, M. Gu, X. D. Jiang, J. Thompson, H. Cai, S. Paesani, R. Santagati, A. Laing, Y. Zhang, M. H. Yung, Y. Z. Shi, F. K. Muhammad, G. Q. Lo, X. S. Luo, B. Dong, D. L. Kwong, L. C. Kwek, and A. Q. Liu. An optical neural chip for implementing complex-valued neural network. *Nature Communications*, 2021.
- [12] Jiaqi Gu, Hanqing Zhu, Chenghao Feng, Zixuan Jiang, Ray T. Chen, and David Z. Pan. M3icro: Machine learning-enabled compact photonic tensor core based on programmable multi-operand multimode interference. APL Machine Learning, 2024.
- [13] Jiaqi Gu, Zheng Zhao, Chenghao Feng, et al. Towards area-efficient optical neural networks: an FFT-based architecture. In Proc. ASPDAC, 2020.

- [14] Jiaqi Gu, Zheng Zhao, Chenghao Feng, et al. Towards Hardware-Efficient Optical Neural Networks: Beyond FFT Architecture via Joint Learnability. IEEE TCAD, 2020.
- [15] Jiaqi Gu, Hanqing Zhu, Chenghao Feng, Zixuan Jiang, Mingjie Liu, Shuhan Zhang, Ray T. Chen, and David Z. Pan. ADEPT: Automatic Differentiable DEsign of Photonic Tensor Cores. In Proc. DAC, 2022.
- [16] Junhe Zhou and Philippe Gallion. Operation principles for optical switches based on two multimode interference couplers. *IEEE Journal of Lightwave Technology*, 30(1), January 2012.
- [17] Guihong Li, Yuedong Yang, Kartikeya Bhardwaj, and Radu Marculescu. ZiCo: Zero-shot NAS via inverse coefficient of variation on gradients. arXiv preprint arXiv:2301.11300, 2023.
- [18] Ming Lin, Pichao Wang, Zhenhong Sun, Hesen Chen, Xiuyu Sun, Qi Qian, Hao Li, and Rong Jin. Zen-NAS: A zero-shot NAS for high-performance deep image recognition. In Proc. ICCV, 2021.
- [19] Mohamed S Abdelfattah, Abhinav Mehrotra, Łukasz Dudziak, and Nicholas D Lane. Zero-cost proxies for lightweight NAS. arXiv preprint arXiv:2101.08134, 2021.
- [20] Hanqing Zhu, Jiaqi Gu, Hanrui Wang, Zixuan Jiang, Zhekai Zhang, Rongxin Tang, Chenghao Feng, Song Han, et al. Lightening-transformer: A dynamically-operated photonic tensor core for energy-efficient transformer accelerator. In Proc. HPCA, 2024.
- [21] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. *IEEE Transactions on Evolutionary Computation*, 6(2):182–197, 2002.
- [22] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- [23] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv, 2017.
- [24] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, et al. Reading Digits in Natural Images with Unsupervised Feature Learning. In Proc. NIPS, 2011.
- [25] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- [26] Michal Rakowski, Colleen Meagher, Karen Nummy, Abdelsalam Aboketaf, Javier Ayala, Yusheng Bian, Brendan Harris, Kate Mclean, Kevin McStay, Asli Sahin, Louis Medina, Bo Peng, Zoey Sowinski, Andy Stricker, Thomas Houghton, Crystal Hedges, Ken Giewont, Ajey Jacob, Ted Letavic, Dave Riggs, Anthony Yu, and John Pellerin. 45nm CMOS – silicon photonics monolithic technology (45CLO) for next-generation, low power and high speed optical interconnects. In 2020 Optical Fiber Communications Conference and Exhibition (OFC), pages 1–3, 2020.
- [27] Meng Zhang, Dennis Yin, Nicholas Gangi, et al. Tempo: efficient time-multiplexed dynamic photonic tensor core for edge AI with compact slow-light electro-optic modulator. *Journal of Applied Physics*, 135(22), 2024.