





### FPGA-based real-time track reconstruction for the CMS phase-2 tracker upgrade

Luis Ardila

INSTITUTE FOR DATA PROCESSING AND ELECTRONICS (IPE)



KIT – The Research University in the Helmholtz Association

### www.kit.edu

### Outlook

Introduction: High Luminosity Large Hadron Collider (HL-LHC)

- CMS Tracker Upgrade
- Track Finder Architecture

#### **Track Finding Algorithms**

- Tracklet
- Time-Multiplexed Track Trigger (TMTT)
- Hybrid

#### Hardware R&D

- Hardware Prototyping Platforms
- Integrated Board Management Module

#### Summary






Luis Ardila - Feb 20, 2020

# High Luminosity LHC – CMS

By **2026** the LHC will be upgraded to a luminosity of $5$–$7 \times 10^{34}\ \text{cm}^{-2}\,\text{s}^{-1}$

#### Silicon strip tracker will be replaced

Challenging high occupancy conditions. ~10,000 charged particles per bunch crossing

Necessary to **include tracking** information at **first level** of **triggering** 





# CMS Tracker Upgrade



#### p<sub>T</sub> discrimination provided by use of special modules

- Pairs of closely spaced silicon sensors, separated 1.6 - 4 mm
- Signals from each sensor are correlated
- Only hit pairs compatible with $p_T > 2$–$3\ \text{GeV/c}$ ("stubs") are forwarded off-detector
- Factor ~10 data reduction: ~15,000 stubs per bunch crossing






### Tracker $\rightarrow$ Trigger Data Flow





Average 15,000 stubs every 25 ns (200 PU) $\rightarrow$ **Stub bandwidth O(20) Tb/s**
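As a rough cross-check of the quoted bandwidth, the stub rate and the implied size per stub can be worked out from the two numbers on this slide. The per-stub size below is only inferred from the O(20) Tb/s order of magnitude; it is not a figure quoted in the talk.

```python
# Rough arithmetic only; the per-stub size is inferred, not quoted in the slides.
STUBS_PER_BX = 15_000        # average stubs per bunch crossing at 200 PU (from the slide)
BX_PERIOD_S = 25e-9          # bunch-crossing period (from the slide)
STUB_BANDWIDTH_BPS = 20e12   # O(20) Tb/s, order of magnitude (from the slide)

stub_rate_hz = STUBS_PER_BX / BX_PERIOD_S
implied_bits_per_stub = STUB_BANDWIDTH_BPS / stub_rate_hz

print(f"stub rate: {stub_rate_hz:.1e} stubs/s")
print(f"implied size: ~{implied_bits_per_stub:.0f} bits per stub (incl. any overhead)")
```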

L1 hardware trigger reduces event rate from **40 MHz to < 750 kHz** using calorimeter, muon and tracker primitives

- Tracker primitives are all tracks (pT > 2-3 GeV/c) from Outer Tracker
- L1-Accept triggers all front-end buffers to read out to  $DAQ \rightarrow HLT$  farm

FE L1 latency buffers limited to 12.5 µs

| Latency budget item                             | Time   |
|-------------------------------------------------|--------|
| Transmission of stubs to back-end electronics   | 1 µs   |
| Correlation of trigger primitives (inc. tracks) | 3.5 µs |
| Broadcast of L1-Accept to front-end buffers     | 1 µs   |
| Safety margin                                   | 3 µs   |

$\rightarrow$ Track finding from stubs must be performed in 4 µs
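The 4 µs figure follows from subtracting the fixed items from the 12.5 µs front-end buffer depth. A minimal sketch of that arithmetic (the dictionary keys are shorthand labels, and the 25 ns bunch spacing is the nominal value used elsewhere in the talk):

```python
FE_BUFFER_US = 12.5  # front-end L1 latency buffer depth

fixed_budget_us = {
    "stub transmission to back-end": 1.0,
    "trigger primitive correlation": 3.5,
    "L1-Accept broadcast": 1.0,
    "safety margin": 3.0,
}

# What remains for track finding once the fixed items are subtracted
track_finding_us = FE_BUFFER_US - sum(fixed_budget_us.values())

# Same budget expressed in bunch crossings (nominal 25 ns spacing)
bx_available = round(track_finding_us * 1000 / 25)

print(track_finding_us, "us =", bx_available, "bunch crossings")
```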






# **Track Finder Architecture**





Two stages of data processing

- DAQ, Trigger and Control (DTC) layer
- Track Finding Processor (TFP) layer
- All-FPGA processing system
- ATCA form factor; CMS standard dual-star backplane

### Outer Tracker cabled into nonants

Use of time-multiplexing to increase parallelization

- Time-multiplexing directs data from multiple sources to a single processing node
- 1 event per processing node
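A minimal sketch of the time-multiplexing idea, assuming a simple round-robin assignment of bunch crossings to nodes. The function name and the period of 18 are illustrative assumptions (18 is one of the time-slice counts quoted for the algorithms in this talk), not the firmware's actual routing logic.

```python
def processing_node(bx: int, tm_period: int) -> int:
    """Index of the node that receives all data of bunch crossing `bx`.

    Round-robin assumption: crossing n goes to node n mod tm_period.
    """
    return bx % tm_period


TM_PERIOD = 18  # assumed time-multiplexing period (illustrative)
BX_NS = 25      # nominal bunch spacing

for bx in (0, 1, 17, 18, 19):
    print(f"BX {bx:2d} -> node {processing_node(bx, TM_PERIOD)}")

# Each node receives a new event only every TM_PERIOD crossings,
# so it has this long to ingest one complete event:
time_per_event_ns = TM_PERIOD * BX_NS
print(time_per_event_ns, "ns per event per node")
```

The point of the scheme is that every node sees whole events, so nodes need no communication with each other.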

Processors are independent entities  $\rightarrow$  simplifies commissioning and operation

Spare nodes available for redundancy

# Track Finder Architecture – DTC



Two stages of data processing

- DAQ, Trigger and Control (DTC) layer
- Track Finding Processor (TFP) layer
- All-FPGA processing system
- ATCA form factor; CMS standard dual-star backplane




#### DTC card must handle

- ≤ 72 modules (5G/10G lpGBT opto-links)
- Control/Readout for each module
- Direct L1 stream to central DAQ (16G/25G)
- Direct stub stream to TFPs (25G)

Stub pre-processing includes:

- **Local**→ **Global** look up, position calibration
- Sort and pre-duplication
- Time-multiplexing

 $\rightarrow$  216 DTC boards, 18 crates, 1 rack/nonant
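The board and crate counts can be broken down per nonant (one ninth of the detector). The last line is only an upper bound, assuming every DTC were fully loaded with 72 modules, which the slide does not claim.

```python
N_NONANTS = 9      # a nonant is one ninth of the detector
DTC_BOARDS = 216   # from the slide
CRATES = 18        # from the slide
MAX_MODULES_PER_DTC = 72

boards_per_nonant = DTC_BOARDS // N_NONANTS   # 24
crates_per_nonant = CRATES // N_NONANTS       # 2 (in one rack)
boards_per_crate = DTC_BOARDS // CRATES       # 12

# Upper bound only: assumes every board serves the maximum of 72 modules
max_modules_served = DTC_BOARDS * MAX_MODULES_PER_DTC

print(boards_per_nonant, crates_per_nonant, boards_per_crate, max_modules_served)
```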


# Track Finder Architecture – TFP



- **Track Finding Processor (TFP) layer**
- All-FPGA processing system
- ATCA form factor; CMS standard dual-star backplane






#### TFP card must handle

- Up to 48 DTCs (25G optical links)
- Track Finding from stubs
- Track Fitting
- Transmission to L1 Correlator Trigger

High bandwidth processing card

- ~1 Tb/s processing bandwidth
- Rate to L1 Correlator much lower < 30 Gb/s
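The ~1 Tb/s figure is consistent with the maximum input link count. A quick sketch, using raw line rates (payload rates after encoding overhead would be somewhat lower):

```python
N_INPUT_LINKS = 48    # up to 48 DTCs per TFP (from the slide)
LINK_GBPS = 25        # raw line rate per optical link
OUTPUT_GBPS_MAX = 30  # rate to the L1 Correlator is below this

input_tbps = N_INPUT_LINKS * LINK_GBPS / 1000

# Lower bound on the data reduction inside a fully loaded TFP
min_reduction = N_INPUT_LINKS * LINK_GBPS / OUTPUT_GBPS_MAX

print(f"max input: {input_tbps} Tb/s, reduction factor > {min_reduction:.0f}")
```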

# **Track Finding Algorithms**



Two main algorithms for reconstructing tracks, plus a number of hybrids, variations and options



#### **TRACKLET + CHI2 FIT APPROACH**

- Combinatorial approach using **pairs of stubs as seeds**
- **Extrapolation** to other layers  $\rightarrow$  hit matching
- Linearized $\chi^2$ fit on candidates
- Uses full resolution stubs at earliest stage of processing
- N time-slices x M regions  $\rightarrow$  6 x 24, 9 x 18
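The seeding and extrapolation steps can be illustrated with plain circle geometry in the transverse plane. This is a floating-point sketch under stated assumptions, not the firmware implementation: the track is assumed to come from the beamline (origin), the field value of 3.8 T, the layer radii, the function names and the sign convention are all illustrative.

```python
import math

B_TESLA = 3.8  # assumed solenoid field


def tracklet_from_seed(r1, phi1, r2, phi2):
    """Circle through the origin and two stubs; r in metres, phi in radians.

    Returns (rinv, phi0) with rinv = 1/R (signed) and the convention
    phi(r) = phi0 - asin(r * rinv / 2).
    """
    dphi = phi1 - phi2
    # Distance between the two stubs (law of cosines)
    chord = math.sqrt(r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * math.cos(dphi))
    # Law of sines in the triangle (origin, stub1, stub2): chord = 2 R sin(dphi)
    rinv = 2.0 * math.sin(dphi) / chord
    phi0 = phi1 + math.asin(0.5 * r1 * rinv)
    return rinv, phi0


def project_phi(rinv, phi0, r):
    """Expected phi of the track at radius r (extrapolation to another layer)."""
    return phi0 - math.asin(0.5 * r * rinv)


def pt_gev(rinv):
    """Transverse momentum from curvature: pT [GeV] ~ 0.3 * B [T] * R [m]."""
    return 0.3 * B_TESLA / abs(rinv)


# Toy track: pT = 3 GeV (rinv = 0.38 / m), phi0 = 0.1 rad
true_rinv, true_phi0 = 0.38, 0.1
stub = lambda r: (r, true_phi0 - math.asin(0.5 * r * true_rinv))

(r1, p1), (r2, p2), (r3, p3) = stub(0.25), stub(0.35), stub(0.50)
rinv, phi0 = tracklet_from_seed(r1, p1, r2, p2)
residual = project_phi(rinv, phi0, r3) - p3  # a stub matches if this is inside a window
print(f"pT = {pt_gev(rinv):.2f} GeV, residual at r3 = {residual:.2e} rad")
```

In a hit-matching step, a stub on the third layer would be accepted when the residual falls inside a chosen window.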



#### HOUGH TRANSFORM + KALMAN FILTER APPROACH

- Uses a **Hough Transform** to detect coarse candidates
- Candidates are filtered and fitted in a single subsequent step using a Kalman Filter
- Combinatorial problem pushed to latter stages of processing
- N time-slices x M regions  $\rightarrow$  18 x 9
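A minimal sketch of a Hough transform in the transverse plane, to show how coarse candidates appear as peaks in a (curvature, φ0) accumulator. The binning, layer radii, field value and function names are illustrative assumptions; the real firmware uses fixed-point arithmetic and its own array dimensions.

```python
import math


def hough_transform(stubs, rinv_values, n_phi0_bins, phi0_min, phi0_max):
    """Fill a (rinv, phi0) accumulator; each stub votes once per rinv column."""
    acc = [[0] * n_phi0_bins for _ in rinv_values]
    width = (phi0_max - phi0_min) / n_phi0_bins
    for r, phi in stubs:
        for i, rinv in enumerate(rinv_values):
            # Small-angle form of phi0 = phi + asin(r * rinv / 2): a straight line
            phi0 = phi + 0.5 * r * rinv
            j = int((phi0 - phi0_min) // width)
            if 0 <= j < n_phi0_bins:
                acc[i][j] += 1
    return acc


def find_peak(acc):
    """Return (count, rinv_index, phi0_index) of the fullest cell."""
    return max((count, i, j) for i, row in enumerate(acc) for j, count in enumerate(row))


# Toy track: rinv = 0.38 / m (pT ~ 3 GeV at an assumed 3.8 T), phi0 = 0.105 rad
layer_radii_m = [0.25, 0.35, 0.50, 0.68, 0.86, 1.10]  # illustrative radii
stubs = [(r, 0.105 - math.asin(0.5 * r * 0.38)) for r in layer_radii_m]

# +-0.57 / m corresponds to pT ~ 2 GeV under the same field assumption
rinv_values = [-0.57, -0.38, -0.19, 0.0, 0.19, 0.38, 0.57]
acc = hough_transform(stubs, rinv_values, n_phi0_bins=20, phi0_min=0.0, phi0_max=0.2)

count, i, j = find_peak(acc)
print(f"{count} stubs in cell rinv={rinv_values[i]}, phi0 bin {j}")
```

A cell would typically be promoted to a track candidate when it collects stubs from enough distinct layers; that threshold is a tuning choice and is not specified here.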

# Hybrid Algorithm





### Hybrid Comparison





Efforts have started to merge the two approaches

- Working on defining a **reference algorithm** 


# Hybrid Performance

- Average track-finding efficiency for $t\bar{t}$ tracks > 95% (> 3 GeV)
- z<sub>0</sub> resolution ~1 mm (barrel)
- **p**<sub>T</sub> resolution ~1% (barrel)
- Per-event average: ~60 tracks (3 GeV), ~200 tracks (2 GeV) ($t\bar{t}$ at 200 PU)



[Plot: efficiency]

# Hardware R&D: ATCA

Advanced Telecommunications Computing Architecture (ATCA)



2x redundant radial Internet Protocol-capable transport





280 x 322 mm board size

#### All CMS Phase-2 back-end electronics will be ATCA-based

- Dual-star backplane

Standardized use of the backplane for clocks, timing, and throttling signals

- LHC bunch-crossing clock (40.08 MHz)
- Precision crossing clock (320.64 MHz)
- TTC2 trigger and fast-control stream (from DTH to back-ends)
- TTS2 throttling stream (from back-ends to DTH)
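The two backplane clocks are related by a simple integer factor, which a short check makes explicit (values taken from the list above):

```python
BX_CLOCK_MHZ = 40.08          # LHC bunch-crossing clock
PRECISION_CLOCK_MHZ = 320.64  # precision crossing clock

ratio = PRECISION_CLOCK_MHZ / BX_CLOCK_MHZ  # 8 precision-clock ticks per bunch crossing
bx_period_ns = 1e3 / BX_CLOCK_MHZ           # just under the nominal 25 ns

print(f"ratio = {ratio:.3f}, bunch-crossing period = {bx_period_ns:.2f} ns")
```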

# HW - R&D

### ATCA infrastructure

- Systematic thermal studies of air cross-section and impact on optics lifetime
- Backplane signal integrity  $\rightarrow$  important for DAQ/timing

### Use of interposer technology

- Flexibility (e.g. FPGA)
- Mitigate losses/costs due to yield issues
- Modularity; separate complex and simpler part of the board design

On-board computing and control variety

- Standard on-board PC (COM Express mini)
- Zynq US+ SoC
- Intelligent Platform Management Controller (IPMC)



Figure labels: Samtec Z-RAY interposer; clock test daughtercard; **KU15P** and VU9P daughtercards; COM Express; Samtec Firefly x12 RX/TX pairs

**Bristol University**, Imperial College, RAL, SACLAY, TIFR





### HW - R&D

APOLLO uses coplanar PCBs with Back-Plane Connectors in between

- Flexibility (e.g. FPGA+Optics)
- Modularity; separate complex and simpler part of the board design

On-board computing and control variety

- Zynq SoC
- CERN-IPMC or UW-IPMC

PCB Characteristics:

- 16 layers / Megtron-6 / 1.8 mm
- Apollo analogy: Split into "Command" and "Service" modules









Boston University, Cornell University, Rutgers University, Ohio State University, University of Notre Dame, Northwestern University, University of Colorado




# Proof Of Concept: Adapter On Serenity





Serenity v1.0 by Imperial College





☑ Test and debug interfaces (ETH, I2C, UART, JTAG)

Pigeon Point IPMC standalone software compiled for the ARM-R5 processors

- ☑ IPMC boot
- ☑ Communication with Shelf Manager
- ☑ Board activation/deactivation
- ☑ Power-up/power-down sequence
- ☑ Read of IPMC sensors
- ☑ Cold reset
- ☑ Initiating boot of Linux
- ☑ Coexistence of Linux & IPMC
- ☑ JTAG on Linux (XVC, not integrated)

Petalinux & CentOS running on the ARM-A53 Processors

☑ SSH to Linux on Zynq US+

Artix 7

- Eth
- I2C
- TCDS

# ZynqMP-IPMC ATCA Test Board

- Power tree tailored to examine different configurations (i.e. always active, only low-power domain, partial reconfiguration)
- Independent Ethernet interfaces, one for management control and one for the Linux OS
- Front-panel access to UART and JTAG interfaces via FTDI
- Mechanical mounts for Xilinx VCU118 evaluation board
- PCB assembled this week, currently undergoing bring-up tests






# KIT – ATCA Board & Management Module





- One FPGA design VU9P / VU13P
- No inter-FPGA communication
- Monolithic heat sink for all optics
- Clean Optical Cable management (one type of module)
- TCDS backplane signals directly routed to main FPGA
- All Firefly connectors populated with 12x RX or TX signals
- 16 Gbps Firefly Y cable available today
- 25 Gbps Firefly Y cable expected to be available in Q2 2020



### **Integrated Management Module**



- Interface defined with FMC+ Form factor
- ZU4EG-B900 device with 16 MGTs @ 16 Gbps
- 2GB DDR4
- CMS Clock distribution with Si5397 chip
- Host USB PHY
- RGMII ETH Phy + SGMII
- SATA x1 lane
- PL and PS independent power supply
- IPMC functionality
- Schematic & Placement at 90%



### Summary



- CMS needs tracks at L1 trigger to cope with HL-LHC pileup conditions
  - p<sub>T</sub> modules provide first layer of efficient data reduction
- **Highly flexible** track-finder/pattern recognition algorithms were demonstrated in hardware
- Highly scalable, time/physical segmentation could be as large/small as required based on data rates
- Proven with currently available hardware that a Level-1 track trigger based on FPGAs is feasible
- Lots of flexibility with an all-FPGA solution

### Common infrastructure R&D

- Current prototypes showed very good optical performance
  - Evaluation of other optical drivers is planned in the future
- Realizing all management functionality in a single MPSoC is an **exciting** solution for next generation boards in the DAQ chain
- Proof of principle successfully built and tested for the unified controller architecture

# **Track Finding Algorithms**






### Hardware Demonstrators - 2016



Half a barrel demonstrator in hardware, verified using emulation software

Hardware demonstrator has been built to validate the algorithm and measure latency

- 4 CTP7 boards with Virtex-7 FPGAs: 3 CTP7s cover 3 φ sectors, 1 CTP7 emulates the DTC
- 1 AMC13 card for clock and synchronization
- 240 MHz internal fabric speed
- Measured latency of 3.33 µs in agreement with latency model – without duplicate removal step



Demonstrator in hardware and emulation

samples from PU  $0 \rightarrow 200$ 

- One per time multiplexing and detector nonant
- Each box is one MP7 board with Virtex-7 FPGA
- Can compare hardware output directly with software
- 240 MHz internal fabric speed
- Latency verified to be 3.5 µs

### **Integrated Board Management**

Low Power Domain – always active:

- Intelligent Platform Management Interface (IPMI) application running in one of the R5 cores.
- It uses the On-chip-Memory and the I2C peripherals
- IPMC memory region protected through system configuration
- SPI and PMBus

**High Power Domain** – active upon request of full power from the crate:

- Runs Yocto/CentOS based Linux
- FPGA configuration and monitoring
- Slow-control to FPGAs
- Test patterns to firmware





Figure 2-1: Zynq UltraScale+ MPSoC Device Hardware Architecture

### Integrated Board Management



**Integrate** IPMC, GPP-based slow-control functionality and FPGA in a **single heterogeneous MPSoC** (Zynq UltraScale+)

- Intelligent Platform Management Interface (IPMI) in ARM-R5 processor running FreeRTOS
- Timing and Control Distribution System (TCDS) in PL-FPGA
- Xilinx Virtual Cable (XVC) JTAG
- AXI Chip2Chip slow control capable



### **Displaced Track Finding**

#### Motivation

- Lots of interesting physics with displaced tracks, e.g. rare Higgs decay to a long-lived (dark matter) φ (~no background)
- Alternative to expensive dedicated experiments



#### Challenges

- No beam point constraint -> higher (but manageable) fake rates
- Increased processing requirements: truncation vs FPGA resources
- Adaptations
  - Seed with stub triplets
  - Fit with 5 param fit (d<sub>0</sub>)

http://cds.cern.ch/record/2647987




### **Displaced Track Finding**

#### Efficiency

- Up to 5–10 cm
- Limitation is bend cut on FE in inner layers

#### Rate

- 1.2x increase in rate w.r.t. 5-parameter prompt
- 1.4x increase w.r.t. 4-parameter prompt




# **High-Speed Optical Evaluation**

- FMC+ sized board for evaluation of the Finisar BOA 25 Gb/s transceiver
- 12 TX and 12 RX integrated in the same package
- 4 Electrical loop-back channels capacitively coupled with different features
- Skew < 20 µm
- MT ferrule optical interface
- Performance of capacitively coupled lanes looks good











# **Thermal Simulation And Tests**

#### Simulation setup

- PCB imported from PADS
- Placed in a 33 mm deep tunnel
- 4 m/s airflow from bottom (20 °C) to top

### **Placed components**

- KU15P (50 W) doubled  $\theta_{JB}$  to take interposer into account
- Firefly banks 25 G (30 W) and 16 G (12 W)
- Total power 205.4 W

### Test setup

- Two heat-pads 45 mm x 45 mm and 12 mm x 70 mm
- Only one mock-up board is present; it will soon be placed between two additional boards
- ~11 W for 6x block of 16 Gbps optics
- ~10 W for 6x block of 25 Gbps optics





Test 1 (temperatures in °C): 4x fan-block speed = 50%, exhaust temperature ~17 °C (~ambient), power on FPGA heaters = 86 W, power on optics heaters = 41 W

| Sensor | Top (°C) | Bottom (°C) |
|--------|----------|-------------|
| X1F    | 60.7     | 59.1        |
| X1OR   | **50.8** | **49.7**    |
| X1OF   | 43.1     | 41.7        |
| X0F    | 53.7     | 50.1        |
| X0OF   | 35.8     | 28.2        |
| X0OR   | 37.2     | 31.1        |



## THERMAL & MECHANICAL TESTS





Thermal simulations

Physical thermal studies at CERN

Mechanical component design studies into stress on FPGA solder balls and stress on PCBs at IC





# DTC Firmware in VU9P

- Target a single FPGA (VU9P) to avoid sideways (inter-FPGA) communication
- Use of EMP framework with 64-bit frames at 320 MHz
- Input packet structure of 58 data frames + 6 gap frames
- 5 Gb/s modules send up to 16 stubs per CIC in 8 BX packets
- 10 Gb/s modules send up to 35 stubs per CIC in 8 BX packets
- One 64-bit word contains a stub from CIC0 and a stub from CIC1
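The framing numbers above fit together: 58 + 6 frames at 320 MHz span exactly 8 bunch crossings at the nominal 40 MHz rate. The sketch below checks that arithmetic and shows one possible way to pack two stubs into a 64-bit word. The 32/32-bit split and the function names are purely illustrative assumptions; the actual bit layout is not given in the slides.

```python
FRAME_BITS = 64
FABRIC_MHZ = 320
BX_MHZ = 40  # nominal bunch-crossing rate
DATA_FRAMES, GAP_FRAMES = 58, 6
MAX_STUBS_PER_CIC = 35  # 10 Gb/s modules, per 8 BX packet

frames_per_packet = DATA_FRAMES + GAP_FRAMES       # 64 frames
frames_per_bx = FABRIC_MHZ // BX_MHZ               # 8 frames per crossing
packet_bx = frames_per_packet // frames_per_bx     # 8 BX, matching the 8 BX packets
link_payload_gbps = FRAME_BITS * FABRIC_MHZ / 1000 # 20.48 Gb/s of frame data

# One word carries a CIC0 stub and a CIC1 stub, so 35 stubs per CIC need 35 words
words_fit = MAX_STUBS_PER_CIC <= DATA_FRAMES


def pack_word(stub_cic0: int, stub_cic1: int) -> int:
    """Pack two stubs into one 64-bit word (assumed layout: CIC0 low, CIC1 high)."""
    for stub in (stub_cic0, stub_cic1):
        if not 0 <= stub < (1 << 32):
            raise ValueError("stub does not fit in the assumed 32-bit field")
    return (stub_cic1 << 32) | stub_cic0


def unpack_word(word: int):
    """Inverse of pack_word: returns (stub_cic0, stub_cic1)."""
    return word & 0xFFFFFFFF, word >> 32


print(packet_bx, link_payload_gbps, words_fit)
print(unpack_word(pack_word(0x1234, 0xABCD)))
```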



Taken from Thomas Schuh








# Integrated Management Module Architecture





- IPMI (standalone/RTOS) on ARM-R5 processor
- Slow Control (Linux) on ARM-A53 Cores
- Xilinx Virtual Cable (XVC) JTAG
- 2 links to main FPGA via PL-MGTs (AXI C2C)
- I2C-SPI to configure Optics/Clocks
- **PMBus** to configure Power Supplies
- Eth and I2C backplane connection

# **ATCA Layout**

- One FPGA design VU9P / VU13P
  - No inter-FPGA communication
- TCDS backplane signals directly routed to main FPGA
- Integrated IPMC slow control solution
- Monolithic heat sink for all optics
- Clean Optical Cable management
- Only one type of Firefly cable x10






