Speaker
Description
The explosive growth of transformer-based AI models and the push toward adaptive intelligence at the edge have exposed fundamental limits of conventional von Neumann hardware, where data movement, rather than computation, dominates energy and latency. This talk presents recent progress from our group on memory-centric co-design spanning devices, circuits, and architectures to address these challenges for two key AI primitives: attention score computation and in-situ weight updates for fine-tuning.
First, we reformulate attention score computation as massively parallel in-memory similarity search using Flash-based Content-Addressable Memory (FlashCAM). High-uniformity amorphous oxide semiconductor Flash devices (>95% yield, 4 V memory window) with optimized speed–retention–endurance characteristics have been realized and integrated into 16×16 CAM arrays. A custom PCB measurement platform with Arduino/Jetson control has been developed to demonstrate matchline discharge dynamics that directly encode similarity scores.
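To make the idea of attention as in-memory similarity search concrete, here is a minimal software sketch. It is only a toy model under stated assumptions: keys and queries are binarized, each CAM row holds one key, and the matchline discharge is abstracted as a count of mismatching cells. The function names `cam_match_scores` and `softmax` are hypothetical, and the talk's actual encoding and readout may differ.

```python
import math


def cam_match_scores(stored_keys, query):
    """Toy model of a CAM lookup: one similarity score per stored row.

    Assumption: each mismatching cell contributes equally to matchline
    discharge, so the score is simply the number of matching bits.
    """
    scores = []
    for row in stored_keys:
        if len(row) != len(query):
            raise ValueError("key and query must have the same width")
        mismatches = sum(1 for k, q in zip(row, query) if k != q)
        scores.append(len(query) - mismatches)
    return scores


def softmax(scores, temperature=1.0):
    """Numerically stable softmax that turns scores into attention weights."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three binarized keys, one per CAM row (hypothetical data).
keys = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
]
query = [1, 0, 1, 0]

scores = cam_match_scores(keys, query)  # all rows compared to the query
weights = softmax(scores)
print(scores, weights)
```

In hardware, every row is compared against the query simultaneously, so the loop over rows in this sketch stands in for a single parallel search operation.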
Second, we introduce a family of CMOS-compatible non-filamentary memristors (ranging from graphene-based to metal-insulator-metal stacks) engineered for BEOL monolithic 3D integration and edge continual learning. The latest devices achieve 100 ns switching at 2.5 V while maintaining >100 s retention, high uniformity through via-hole structures, and low cycle-to-cycle variation that enables verification-free programming. We experimentally validate a deterministic outer-product parallel programming scheme on 6×6 subarrays within 32×32 crossbars, achieving O(1) weight updates. Supported by generalizable compact models and macro architectures that emulate floating-point operations for BF16-quantized LoRA adapters, these primitives enable in-situ LLM fine-tuning with minimal accuracy loss.
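The outer-product update can be illustrated with a short numerical sketch. It assumes an idealized, linear device response, which the abstract does not claim: the change in each cell is the product of its row signal and its column signal, scaled by a learning rate. The function name `outer_product_update` and the parameter `lr` are hypothetical.

```python
def outer_product_update(weights, row_signal, col_signal, lr=1.0):
    """Return weights + lr * outer(row_signal, col_signal).

    Assumption: an ideal crossbar in which cell (i, j) changes by
    lr * row_signal[i] * col_signal[j]. Real devices are nonlinear
    and have a bounded conductance range.
    """
    if len(weights) != len(row_signal):
        raise ValueError("row_signal must have one entry per row")
    if any(len(row) != len(col_signal) for row in weights):
        raise ValueError("col_signal must have one entry per column")
    return [
        [w + lr * r * c for w, c in zip(row, col_signal)]
        for row, r in zip(weights, row_signal)
    ]


# Rank-1 update of a 2x3 array (hypothetical values).
w0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w1 = outer_product_update(w0, row_signal=[1.0, 2.0], col_signal=[1.0, 0.0, -1.0], lr=0.5)
print(w1)
```

The software version visits every cell in turn. In a crossbar, the row and column pulses are applied once and all cells respond together, which is why the number of programming steps does not grow with array size.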
Together, these results demonstrate practical hardware pathways that dramatically reduce data movement for attention mechanisms and enable efficient on-device adaptation, offering a cohesive device-to-architecture framework for next-generation AI accelerators.