FPT ’10, Day 1 – Debuggin' Everyday

Friday pickup bus at 8:30.
[ Opening ]
– Received total 163
– Regular oral 32
– Special oral 5
– Poster 52
– Demo 9
– Accepted total 98
– JP submission 23, regular 6, poster 7, demo 1.
– Asia submission 63, accept 37
– Design competition: reversi opponent.
[ Keynote 1: Reconfigurable Computing – Evolution of Von Neumann Architecture ]
Prof. ShaoJun Wei at Tsinghua University.
He got his master's here, then his PhD in Belgium, and now teaches at his alma mater. Cool.
– Gordon Moore: semiconductor, scaling-down rule
– Von Neumann: computer, Von Neumann architecture
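Power density was supposed to stay roughly constant, but it has grown rapidly as process scaling continued, because leakage current can no longer be ignored. The era when scaling down meant cost down is also ending: cost/gate changes by only 3% from 32nm → 22nm.
The Von Neumann architecture has advanced in many ways, but the Von Neumann bottleneck (fetch an instruction, fetch the operands, compute, store) has not been fundamentally solved.
The pitch: to solve this, separate the datapath from the controller and build a reconfigurable processor. The talk ranged from the operating system (multitasking is the problem!) and high-level synthesis down to power gating…
The datapath is an ALU array, and the controller is built from a RISC-based programmable FSM.
Design partitioning matters for this kind of dynamically reconfigurable array. If you are careless about how you split up the datapath, it can deadlock.
China's semiconductor imports are 138 billion USD/year. Seriously… That figure includes parts that are assembled and re-exported; domestic consumption seems to be about 1/6.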
[ Keynote 2: FPGA Platforms leading the way in the apps of ‘More than Moore’s’ technology ]
Dr. Ivo Bolsens, Senior Vice President and CTO, Xilinx.
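Design cost challenge → making a chip means taking on risk.
Investment around LSI seems to be falling rapidly.
As a result, the application boundary between conventional ASIC/ASSP and FPGA shifts, and the range where an FPGA is the better choice widens.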
More than Moore: stacked silicon interconnect.
– Chip-to-chip via standard I/Os and serdes: more gates but… 🙁
– Xilinx's approach lines up FPGA slices on top of a silicon interposer. This is far better for performance than going through standard I/O. The interposer appears to be made by TSMC.
– Chip-package co-design. In-package power plane, on chip decoupling caps…
Programmable platform
– legacy: CPU – North bridge – south bridge – PCI – FPGA = I/O extension
– current: CPU – south bridge – PCI – FPGA = co-processing
– new: CPU – HT/QPI – FPGA = peer-computing (cache coherent!)
As long as we are still hand-writing DMA code every single time, I guess the world is not going to change.
[ An FPGA architecture supporting dynamically controlled power gating ]
University of British Columbia, Canada.
Turn off regions at run-time with on-chip control.
ASIC designers do this regularly.
But in FPGA:
– routing for control signals
– handling rush current in a programmable way.
Power is becoming a real problem on high-end FPGAs (absolutely true), so something has to be done.
Proposed architecture:
– Divide FPGA device into power-controlled regions
– Use general-purpose routing fabric for control signals
Logic blocks and routing channels (the part where an LB drives onto the wiring) can be power-controlled. Switches are a problem (apparently not handled).
How large a region should share one sleep transistor? A larger region saves area, but making it too large makes the design difficult.
Rush current: limit how much can be turned on at once.
1) expose it to the user: usual ASIC way
2) expose it to the CAD tools
3) dedicated architectural support: i.e., programmable delay elements in turn-on circuits so they don’t turn on at once.
Current solution is (1).
The evaluation appears to have been done with SPICE.
– Area overhead: static gating > dynamic gating by 33%, but less than 1% overhead compared to ungated version.
– Leakage: dynamic gating > static by 11%. dynamic / static < ungated by 40+%.
– Delay overhead is 10%.
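Isn't an isolation block needed? Is its cost included? → It is handled at the output buffers: the outputs of a block that is turned off are simply forced to stop (is that OK?)
What about the switches? → That would require a complete redesign.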
[ A tiled programmable fabric for quantum-dot cellular automata ]
A student from IIT Delhi. Quantum dots, really!?
4 quantum dots in each cell, 2 mobile electrons – represents binary 0, 1, and NULL. Wires and various gates can be built from these.
The clock is expressed in 4 cycles.
– LUTs, CLBs, Switches – NOT NECESSARY
– Selective clocking: let the unused cells relax
– Reduce defects: use clock based scheme
Ah, I think I get it. Gates and wires are built from the same mechanism…
I did not really follow the programming / clocking part.
How long until this can be realized in silicon? → Still quite a while…
[ Phase-change-memory-based storage elements for configurable logic ]
Non-volatile FPGA is expensive… New technological opportunities?
Phase-change RAM principle:
– Material with 2 stable phases: polycrystal (high conductivity) and amorphous (low conductivity).
– requires heater electrode + contact
– non-volatile, small size, low delay and cost friendly!
Write time is around 50ns, I think.
Area: SRAM 115 > Flash 46 > PCM 30.
Area reduction up to 13%, delay reduction up to 51%!
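PCM, i.e. resistor memory, is 4kΩ, whereas a pass transistor is 9kΩ. Lower resistance means lower delay.
Apparently there are not many concrete manufacturing problems.
The write cycle is the issue: it cannot be used the way SRAM cells are used today, so the right way to think of it is as a Flash replacement.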
[Dynamic Reconfigurable Bit-Parallel architecture for large-scale regular expression matching]
Yusaku Kaneta @ Hokkaido University (graduate school)
Massive regex matching in apps such as NIDS (Network intrusion detection system) etc.
Static compilation approach: fast, but hard to change the regex at runtime.
Dynamic reconfiguration approach: the regex can be changed at runtime, but worst-case performance is not guaranteed.
Proposal:
– Dynamic BP-NFA architecture
– Dynamic reconfiguration by bit-parallel NFA simulation
– Extended patterns
Dynamic BP-NFA on Virtex-5 FPGA.
– BP-NFA for string pattern: 54 slices, 2.9Gbps
– BP-NFA for extended patterns: 123 slices, 1.6Gbps.
– It’s FAST!
– Worst-case performance is GUARANTEED, while others are not.
– Fast reconfiguration.
Impressive.
Can process 256 patterns in parallel.
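A minimal sketch of what "bit-parallel NFA simulation" means for a plain string pattern, using the classic Shift-And technique in Python. This is not the authors' Dynamic BP-NFA hardware: the function name `shift_and` and the software setting are assumptions for illustration only. Each bit of `state` stands for one NFA state, so a whole NFA step is one shift, one OR, and one AND against a per-character mask.

```python
def shift_and(pattern, text):
    """Return start indices of all occurrences of a non-empty pattern in text."""
    # masks[ch] has bit i set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (len(pattern) - 1)  # bit of the final NFA state

    state = 0  # bit i set means pattern[0..i] matches ending at this position
    hits = []
    for pos, ch in enumerate(text):
        # Advance every active state by one and re-arm the start state,
        # then keep only the states whose character matches.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - len(pattern) + 1)
    return hits
```

In this software version, swapping in a new pattern only means rebuilding the mask table; the matching loop itself does not change, and the work per input character is fixed.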
[ Impact of Reconfigurable Hardware on Accelerating MPI_Reduce() ]
Already implemented MPI_Barrier() in previous research and got promising results.
Testbed: Xilinx ML410 Board x 64 + bidirectional SATA cable.
PowerPC 300MHz + reduce core + 16 local link interfaces.
They say it improves on a commodity cluster when lots of small messages are flying around. Is that because RDMA shows its strength for large messages?
The improved scalability looks good.
[Accelerating HMMER on FPGA using Parallel Prefixes and Reductions]
Writing Viterbi and DP.
[Multiple dataset reduction on FPGAs ]
No-show?
[ Accelerating FPGA Development Through the Automatic Parallel Application of Standard Implementation Tools ]
Pain for large-scale FPGA implementations:
– No software-like linkage allowing concurrent module implementation
– Global implementation changes when adding or changing signal probe
– P&R algorithm is mostly single threaded and memory eating
Implement each major block as a partial module
– Simplified PR design flow without reconfiguration
– Automatic floorplanning, including bus macro insertion
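So each module is placed and routed on its own, and only the inter-module net delay is considered when stitching them together.
The automatic floorplanning part is cool.
Making good use of incremental design seems to cut P&R time considerably.
That should shorten design verification too!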
[Parallelizing FPGA Placement Using Transactional Memory]
Parallelizing CAD is important – simulated annealing based placement
1. start with random placement of blks
2. randomly pick a pair of blks to swap
3. evaluate and loop
There are many independent trials, so that part can be parallelized.
Whether a swap is accepted or rejected maps nicely onto whether a transaction is executed or aborted.
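The three steps above can be sketched as a small sequential annealer in Python. This is only an illustration, not VPR or the tinySTM-based version from the talk: the half-perimeter wirelength cost, the cooling schedule, and all names (`wirelength`, `anneal`, the parameters) are assumptions. The swap is applied tentatively and then either kept or undone, which is the part that maps onto a transaction.

```python
import math
import random

def wirelength(pos, nets):
    """Sum of half-perimeter bounding boxes over all nets."""
    total = 0
    for net in nets:
        xs = [pos[b][0] for b in net]
        ys = [pos[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(blocks, nets, width, height, seed=0,
           temp=5.0, cooling=0.95, moves_per_temp=50):
    """Place blocks (a list) on a width x height grid; return (best, cost)."""
    rng = random.Random(seed)
    sites = [(x, y) for x in range(width) for y in range(height)]
    rng.shuffle(sites)
    pos = dict(zip(blocks, sites))           # 1. random initial placement
    cost = wirelength(pos, nets)
    best, best_cost = dict(pos), cost
    while temp > 0.01:
        for _ in range(moves_per_temp):
            a, b = rng.sample(blocks, 2)     # 2. pick a random pair
            pos[a], pos[b] = pos[b], pos[a]  # tentative swap
            new_cost = wirelength(pos, nets)
            delta = new_cost - cost
            # 3. evaluate: always keep improvements, sometimes keep worse moves
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost              # "commit"
                if cost < best_cost:
                    best, best_cost = dict(pos), cost
            else:
                pos[a], pos[b] = pos[b], pos[a]  # "abort": undo the swap
        temp *= cooling
    return best, best_cost
```

In this sketch every trial reads and writes the shared `pos` and `cost`, so concurrent trials that touch overlapping blocks would conflict; that is the situation a TM system detects and resolves by aborting one of them.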
STM (software based transactional memory) has high overhead, but no HTM (hardware TM) yet.
– New software transactional memory (tinySTM)
– potential easier parallelization with TM.
– based on VPR (Versatile P&R) 5.0
– Platform: 8 CPUs
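A student pulled it off in one month, so the implementation is fairly easy. P&R speeds up linearly, but the QoR degradation is severe (30%). The abort rate is also high at 60%.
VPR itself contains code for redoing work partway through, so they tried improving that area and similar things. QoR degradation went from worst 35% to 8% and avg 7% to 2%, a considerable improvement.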
[A Message-Passing Multi-Softcore Architecture on FPGA for Breadth-First Search]
Breadth-first search in graph.
It needs a global buffer and barrier sync. I did not quite follow this one.
[Deterministic Multi-Core Parallel Routing for FPGAs]
A talk about parallelizing routing.
PathFinder: alongside VPR, the basis of the Xilinx/Altera tools. It uses maze routing.
1. Route all signals (allow shorts)
2. Increase penalties for shorts
3. Route all signals
3.1 rip-up and re-route next signal
3.2 update congestion
3.3 return to 3.1 if more signals remaining
4. return to 2 if shorts remain
– Fine-grained: maze routing of a single net in parallel
— using pthreads, parallelize calculation of forward cost & adding corresponding nodes to the priority queue
— for N procs, maintain N separate priority queues to avoid need of locks
– Coarse-grained: each node routes different net
— Step 3 is parallelized as a whole and connected via MPI
Fine-grained is slow on Core2Quad (shared FSB) but works on Core i5 (shared L3). Coarse-grained works on either.
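A rough sequential sketch of the PathFinder loop listed above (steps 1 to 4), in Python. It is heavily simplified and not the speakers' parallel router: Dijkstra over node costs stands in for maze routing, the cost form (base + history) × present-sharing penalty and the 0.5 penalty step are assumptions, and the toy graph and all names are made up for illustration. It also assumes every sink is reachable.

```python
import heapq

def route_net(graph, base, hist, occ, cap, pres_fac, src, dst):
    """Cheapest src->dst path by Dijkstra (stand-in for maze routing)."""
    def node_cost(n):
        over = max(0, occ[n] + 1 - cap[n])  # overuse if this net takes n too
        return (base[n] + hist[n]) * (1.0 + pres_fac * over)

    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale queue entry
        for v in graph[u]:
            nd = d + node_cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def pathfinder(graph, base, cap, nets, max_iters=20):
    hist = {n: 0.0 for n in graph}
    occ = {n: 0 for n in graph}
    routes, pres_fac = {}, 0.0               # step 1: shorts are free at first
    for it in range(max_iters):
        for name, (src, dst) in nets.items():
            for n in routes.pop(name, []):   # step 3.1: rip up this net
                occ[n] -= 1
            path = route_net(graph, base, hist, occ, cap, pres_fac, src, dst)
            for n in path:                   # step 3.2: update congestion
                occ[n] += 1
            routes[name] = path
        shorted = [n for n in graph if occ[n] > cap[n]]
        if not shorted:                      # step 4: done when no shorts remain
            return routes, it + 1
        for n in shorted:
            hist[n] += occ[n] - cap[n]       # remember chronically shared nodes
        pres_fac += 0.5                      # step 2: raise the short penalty
    raise RuntimeError("did not converge")

# Toy example: both nets prefer the cheap middle node "m".
GRAPH = {"s1": ["m", "a"], "s2": ["m", "b"], "m": ["s1", "t1", "s2", "t2"],
         "a": ["s1", "t1"], "b": ["s2", "t2"], "t1": ["m", "a"], "t2": ["m", "b"]}
BASE = {"s1": 1.0, "s2": 1.0, "t1": 1.0, "t2": 1.0, "m": 1.0, "a": 2.0, "b": 3.0}
CAP = {n: 1 for n in GRAPH}
NETS = {"net1": ("s1", "t1"), "net2": ("s2", "t2")}
```

On the toy graph both nets share `m` in the first pass; once the penalty is raised, `net1` moves to its detour through `a` and `net2` keeps `m`. The coarse-grained scheme in the talk would distribute the per-net work inside the inner loop across processes.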
[The TransC Process Model and Interprocess Communication]
TransC language
– C-like
– Supports parallel processes: communication via data streams
– Multiple return values (!)
[Comparing Performance and Energy Efficiency of FPGAs and GPUs for High Productivity Computing]
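A comparative evaluation across several applications.
FFT on an FPGA is fast. The flops/W advantage is overwhelming.
Monte-Carlo seems to be faster on the FPGA than on the GPU, but hmm. I am curious what architecture they used.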
[Local-and-Global Stall Mechanism for systolic Computational-Memory Array on Extensible Multi-FPGA System]
Wang from Tohoku University.
A system for synchronizing systolic-array-like PEs that sit in different clock domains. It uses local stall signals generated from things like FIFO empty signals, plus a global stall signal that is the OR of all of them.
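A toy Python model of the two signals, just to pin down the OR relationship. How the real design combines local and global stalls is not in these notes, so the lock-step policy below (nobody advances while the global stall is high) and the names `stall_signals` and `clock_tick` are assumptions.

```python
from collections import deque

def stall_signals(fifos):
    """Local stall per PE (its input FIFO is empty) and their OR."""
    local = [len(f) == 0 for f in fifos]
    return local, any(local)

def clock_tick(fifos):
    """Advance all PEs by one item, or nobody if the global stall is high."""
    _, global_stall = stall_signals(fifos)
    if global_stall:
        return None  # every PE holds its state this cycle
    return [f.popleft() for f in fifos]
```

With input FIFOs holding `[1, 2]`, `[3]`, and `[4, 5]`, the first tick consumes one item from each; on the second tick the middle FIFO is empty, so its local stall raises the global stall and all three PEs wait.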
