December 2010 – Debuggin' Everyday

FPT ’10, Day 3

[Obstacle-free Two-dimensional Online-Routing for Run-time Reconfigurable FPGA-based Systems]
State-of-the-art “bus macro”.
Reserve wires in advance so that they are available when needed.
– Virtex-II and 4: some restrictions in the switch
– V5: heterogeneous routing fabric (not very suitable)
– V6: suitable, but only very few wires available.
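An implementation example on Virtex-II.
Register at input and output of wire.
The worked example is Pong: the game itself plus things like an input module.
Suitable for lower throughput, but 100MBaud – 70 audio channels or 41 CIF video frames.
Ah, this might be a good fit for practical applications rather than for compute workloads.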
[The Effect of Multi-bit Based Connections on the Area Efficiency of FPGAs Utilizing Unidirectional Routing Resources]
Missed this one…
[ATB: Area-Time Response Balancing Algorithm for Scheduling Real-Time Hardware Tasks]
Dynamic Partial Reconfiguration is going to matter from now on.
[Dynamic Scheduling Monte-Carlo Framework for Multi-Accelerator Heterogeneous Clusters]
Apparently they won the FPT best paper award last year with American option pricing using Monte-Carlo.
Collaborative FPGAs+GPUs+CPUs in a cluster.
dynamic sub-task scheduling.
example: GARCH asset simulation. 33.8x speedup. FPGA + GPU + 2 CPUs
Hmm. Is this better than what Yoshimi-san was doing (algorithmically)? I can't really tell.

FPT ’10, Day 2

[ Keynote 3: Bringing FPGA Design to Application Domain Experts ]
Dr. James Truchard @ National Instruments
NI LabVIEW: enables graphical system design & verification in engineering, doing what the spreadsheet does in finance.
You can design anything from Mindstorms NXT to the CERN accelerator with it.
Long tail for real-time applications: there are a great many low-volume real-time apps. High-volume ones justify the effort of building them by hand, but low-volume ones should be built with LabVIEW.
Compact RIO, a LabVIEW FPGA module.
Re-use drives IP abstraction levels.
Upgrade to new FPGA, board, chassis.
IP Plug’n’Play is required to accelerate innovation.
[Technology Issues Facing the World’s Largest Integrated Circuits]
Apart from Stratix V delivering 12.5Gbps or even 28Gbps, nothing particularly new. Apparently it will carry a 100G Ethernet MAC…
TSMC 28nm process, Power budget 2-20W for high-end FPGAs.
Wait, programmable power voltage?
Apparently Quartus sets the voltage automatically. Seriously?
Partial reconfiguration, based on existing incremental design & floorplanning tools. Can be controlled by soft logic or an external device.
Is the high-k metal gate what drives the power savings?
[Floating-point exponential functions for DSP-enabled FPGAs]
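The FloPoCo guy.
In single precision they use a trick to do the table lookup cleverly, needing just one BlockRAM. Even in double precision it is 9 address bits x 95 data bits, so three 36×512 blocks.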
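The block count for the double-precision table follows from tiling it with 36×512 BlockRAMs. A minimal sketch of that arithmetic, assuming every BlockRAM is used in a 512-deep, 36-bit-wide configuration (the helper name `brams_needed` is made up for illustration):

```python
import math

def brams_needed(addr_bits, data_bits, bram_depth=512, bram_width=36):
    """Blocks of bram_depth x bram_width needed to hold a table with
    2**addr_bits entries of data_bits bits each."""
    depth_blocks = math.ceil(2 ** addr_bits / bram_depth)
    width_blocks = math.ceil(data_bits / bram_width)
    return depth_blocks * width_blocks

# 9 address bits fit one 512-deep block; 95 data bits need three 36-bit slices.
print(brams_needed(9, 95))
```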
The main messages of this talk:
– FPGA computing should be done the FPGA way and not by mimicking what processors do.
– Do I really need to compute this bit?
Ha.
FloPoCo is nice.
[Modular Design of Fully Pipelined Accumulators]
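Rather than an ordinary accumulator, this takes inputs in parallel and runs a reduction over them in one go. The previous work was a cascade of adders. Getting this to actually run as a pipeline is quite a puzzle.
Hmm, I don't think I fully understood this one.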
[Efficient implementation of Parallel BCD Multiplication in LUT-6 FPGAs]
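Not BCD2bin + binary mult + bin2BCD, but directly in BCD.
1. Recode each digit 0-9 into Y^U (0, 5 or 10) and Y^L (-2, -1, 0, 1 or 2) and process those.
2. Compute the partial products
3. Sum them with a BCD carry-ripple adder
This is kind of interesting. Steps 1 and 2 take one stage each.
They eventually want to put it into FloPoCo.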
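The digit recoding in step 1 can be checked in software. The sketch below only verifies the arithmetic identity digit = Y^U + Y^L and that summing the recoded partial products reproduces the product; it uses plain integers, not BCD hardware, and the function names are hypothetical. How the actual design generates and adds the partial products is not covered by these notes.

```python
def recode_digit(d):
    """Split a decimal digit 0-9 into (upper, lower) with
    upper in {0, 5, 10}, lower in {-2, ..., 2} and d == upper + lower."""
    if not 0 <= d <= 9:
        raise ValueError("not a decimal digit")
    upper = 5 * ((d + 2) // 5)
    return upper, d - upper

def multiply_recoded(x, y):
    """Multiply x by a non-negative integer y, one recoded digit of y at a time."""
    total = 0
    for position, ch in enumerate(reversed(str(y))):
        upper, lower = recode_digit(int(ch))
        # Two partial products per digit: x*upper (0, 5x or 10x) and x*lower.
        total += (x * upper + x * lower) * 10 ** position
    return total

print([recode_digit(d) for d in range(10)])
print(multiply_recoded(1234, 5678) == 1234 * 5678)
```

The appeal of this split is that every partial product is a small multiple of x (at most 2x in magnitude, plus 5x or 10x).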
[Lightweight DPA Resistant Solution on FPGA to Counteract Power Models]
Differential Power Analysis, then. They used AES as the example.
– Random inversion against hamming weight model
— All intermediate results are randomly inverted
— requires 1 bit RNG
– Random register renaming against hamming distance model
[An FPGA-Based Text Search Engine for Approximate Regular Expression Matching]
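Approximate regex match, then.
For string matching there are things like Smith-Waterman systolic cells, but apparently no existing implementations for regex.
Edit distance is used to compute the approximate match.
It looks like they lay out as many modules as the DP table is wide. Is that OK…?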
# of cells limits the pattern length. Max pattern length is 250 on current FPGA.
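As a software reference for the edit-distance idea, here is a minimal sketch of approximate matching for a plain string pattern (not a regex, which is what the talk handles). It is the textbook dynamic program with one table cell per pattern character, which is consistent with the number of cells limiting the pattern length; the function name and interface are assumptions.

```python
def approx_find(pattern, text, max_errors):
    """Return end positions j (exclusive) such that some substring of text
    ending at j is within max_errors edits of pattern."""
    m = len(pattern)
    # column[i] = edit distance between pattern[:i] and the best substring
    # ending at the current text position; a match may start anywhere.
    column = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text, start=1):
        prev_diag = column[0]
        column[0] = 0
        for i in range(1, m + 1):
            prev_row = column[i]
            cost = 0 if pattern[i - 1] == ch else 1
            column[i] = min(prev_diag + cost,   # match or substitute
                            prev_row + 1,       # extra character in text
                            column[i - 1] + 1)  # missing pattern character
            prev_diag = prev_row
        if column[m] <= max_errors:
            hits.append(j)
    return hits

print(approx_find("abc", "xxabcxx", 0))
print(approx_find("abc", "xxabdxx", 1))
```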
[Real-time Detection of Line Segments on FPGA]
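Detects all the corners of a room from a photo of the room, or picks out elements such as road center lines in one sweep. Amazing, really amazing. Maruyama lab.
The basic approach is to find ELSs (elementary line segments) and then merge them progressively.
How do they verify quality, and how do they decide that detection is finished?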
[True Random Number Generation in Block Memories of Reconfigurable Devices]
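Generic TRNG module: cause write collisions in a 512×36 BRAM.
Oh, this is quite interesting. Depending on the post-processing method, throughput reaches 7Mbps to 105Mbps!
Pretty good throughput compared with ring oscillators and the like.
Robustness is verified too. So randomness can be tested by looking at the entropy distribution?
Placement may matter. Does it? Hmm. It surely does. A lot.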

FPT ’10, Day 1

Friday pickup bus at 8:30.
[ Opening ]
– Received total 163
– Regular oral 32
– Special oral 5
– Poster 52
– Demo 9
– Accepted total 98
– JP submission 23, regular 6, poster 7, demo 1.
– Asia submission 63, accept 37
– Design competition: reversi opponent.
[ Keynote 1: Reconfigurable Computing – Evolution of Von Neumann Architecture ]
Prof. ShaoJun Wei at Tsinghua University.
He took his master's here, then his PhD in Belgium, and now teaches at his alma mater. Cool.
– Gordon Moore: semiconductor, scaling-down rule
– Von Neumann: computer, Von Neumann architecture
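Power density was supposed to stay roughly constant, but it has grown rapidly as process scaling advanced, because leakage current can no longer be ignored. The era in which scaling down meant cost down is also ending: cost/gate changes by only 3% from 32nm to 22nm.
The Von Neumann architecture has advanced in many ways, but the Von Neumann bottleneck (fetch the instruction, fetch the operands, compute, store) has not been fundamentally solved.
The talk's answer is to separate the datapath from the controller and build a reconfigurable processor. It covered everything from the operating system (multitasking is the problem!) and high-level synthesis to power gating…
The datapath is an ALU array, and the controller is built as a RISC-based programmable FSM.
Design partitioning is important for this kind of dynamically reconfigurable array. Careless datapath partitioning can lead to deadlock.
China imports 138 billion USD/year of semiconductors. Seriously… That includes parts that are assembled and re-exported; domestic consumption looks like about 1/6.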
[ Keynote 2: FPGA Platforms leading the way in the apps of ‘More than Moore’s’ technology ]
Dr. Ivo Bolsens, Senior Vice President and CTO, Xilinx.
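Design cost challenge → making a chip means taking on risk.
Investment around LSI seems to be shrinking rapidly.
As a result, the application boundary between conventional ASIC/ASSP and FPGA moves, widening the range where an FPGA is the better choice.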
More than Moore: stacked silicon interconnect.
– Chip-to-chip via standard I/Os and serdes: more gates but… 🙁
– The Xilinx approach places FPGA slices side by side on a silicon interposer, which is far better for performance than going through standard I/O. TSMC apparently makes the interposer.
– Chip-package co-design. In-package power plane, on chip decoupling caps…
Programmable platform
– legacy: CPU – North bridge – south bridge – PCI – FPGA = I/O extension
– current: CPU- south bridge – PCI – FPGA = co-processing
– new: CPU – HT/QPI – FPGA = peer-computing (cache coherent!)
As long as we keep hand-writing DMA and the like every time, the world probably won't change.
[ An FPGA architecture supporting dynamically controlled power gating ]
University of British Columbia, Canada.
Turn off regions at run-time with on-chip control.
ASIC designers do this regularly.
But in FPGA:
– routing for control signals
– handling rush current in a programmable way.
Power is becoming painful on high-end FPGAs (absolutely true), so something has to be done.
Proposed architecture:
– Divide FPGA device into power-controlled regions
– Use general-purpose routing fabric for control signals
Logic blocks and routing channels (where the LB drives onto the wiring) can be power controlled. Switches are a problem (apparently not done).
How large a region should share one sleep transistor? A larger region saves area, but making it too large makes the design difficult.
Rush current: limit how much can be turned on at once.
1) expose it to the user: usual ASIC way
2) expose it to the CAD tools
3) dedicated architectural support: i.e., programmable delay elements in turn-on circuits so they don’t turn on at once.
Current solution is (1).
The evaluation appears to be done with SPICE.
– Area overhead: static gating > dynamic gating by 33%, but less than 1% overhead compared to ungated version.
– Leakage: Dynamic gating > static by 11%. dynamic / static < ungated by 40+%.
– Delay overhead is 10%.
Isn't an isolation block needed? Is its cost included? → It is done at the output buffers: the outputs of a powered-off block are forcibly held (is that OK?)
What about the switches? → That needs a full redesign.
[ A tiled programmable fabric for quantum-dot cellular automata ]
A student from IIT Delhi. Quantum dots, really!?
4 quantum dots in each cell, 2 mobile electrons: represents binary 0, 1 and NULL. Wires and various gates can be built.
The clock is expressed as a 4-cycle scheme.
– LUTs, CLBs, Switches – NOT NECESSARY
– Selective clocking: let the unused cells relax
– Reduce defects: use clock based scheme
Ah, I think I get it. Gates and wires are built from the same mechanism…
I don't really understand the programming / clocking part.
How long until a silicon implementation? → Still quite a while…
[ Phase-change-memory-based storage elements for configurable logic ]
Non-volatile FPGA is expensive… New technological opportunities?
Phase-change RAM principle:
– Material with 2 stable phases: polycrystal (high conductivity) and amorphous (low conductivity).
– requires heater electrode + contact
– non-volatile, small size, low delay and cost friendly!
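Write time is around 50ns, I think.
Area: SRAM 115 > Flash 46 > PCM 30.
Area reduction up to 13%, delay reduction up to 51%!
PCM, i.e. the resistor memory, is 4kΩ, whereas a pass transistor is 9kΩ. Lower resistance means lower delay.
Apparently there are no major concrete manufacturing problems.
Write cycles are the problem: it cannot be used like today's SRAM cell, so the right way to see it is as a Flash replacement, they said.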
[Dynamic Reconfigurable Bit-Parallel architecture for large-scale regular expression matching]
Yusaku Kaneta @ Hokkaido University graduate school
Massive regex matching in apps such as NIDS (Network intrusion detection system) etc.
Static compilation approach: fast but hard to change regex in runtime.
Dynamic reconfiguration approach: suitable for dynamic reconfiguration, but worst-case performance is not guaranteed.
Proposal:
– Dynamic BP-NFA architecture
– Dynamic reconfiguration by bit-parallel NFA simulation
– Extended patterns
Dynamic BP-NFA on Virtex-5 FPGA.
– BP-NFA for string pattern: 54 slices, 2.9Gbps
– BP-NFA for extended patterns: 123 slices, 1.6Gbps.
– It’s FAST!
– Worst-case performance is GUARANTEED, while others are not.
– Fast reconfiguration.
Impressive.
Can process 256 patterns in parallel.
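For intuition on bit-parallel NFA simulation, here is a minimal software sketch using the classic Shift-And algorithm for a plain string pattern. It is an analogue of the idea, not the architecture from the talk: the NFA state set is one bit vector, changing the pattern only changes the per-character masks, and each input character costs one shift/OR/AND step regardless of the input, which is why the worst case is predictable. The function name is an assumption.

```python
def shift_and_search(pattern, text):
    """Return the index of the last character of every occurrence of pattern."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    # masks[c] has bit i set when pattern[i] == c. Loading a new pattern
    # means rewriting this table only; the update rule below never changes.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)
    state = 0  # bit i set: pattern[:i + 1] matches a suffix of the text so far
    ends = []
    for j, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            ends.append(j)
    return ends

print(shift_and_search("aba", "ababa"))
```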
[ Impact of Reconfigurable Hardware on Accelerating MPI_Reduce() ]
Already implemented MPI_Barrier() in previous research and got promising results.
Testbed: Xilinx ML410 Board x 64 + bidirectional SATA cable.
PowerPC 300MHz + reduce core + 16 local link interfaces.
With many small messages in flight it improves over a commodity cluster, they say. For large messages, is that because RDMA shows its strength?
The scalability improvement looks good.
[Accelerating HMMER on FPGA using Parallel Prefixes and Reductions]
Writing Viterbi and DP.
[Multiple dataset reduction on FPGAs ]
No-show?
[ Accelerating FPGA Development Through the Automatic Parallel Application of Standard Implementation Tools ]
Pain for large-scale FPGA implementations:
– No software-like linkage allowing concurrent module implementation
– Global implementation changes when adding or changing signal probe
– P&R algorithm is mostly single threaded and memory eating
Implement each major block as a partial module
– Simplified PR design flow without reconfiguration
– Automatic floorplanning, including bus macro insertion
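So each module is placed and routed separately, and only inter-module net delay is considered when stitching them together.
The automatic floorplanning part is cool.
Good use of incremental design apparently cuts P&R time considerably.
It should shorten design verification too!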
[Parallelizing FPGA Placement Using Transactional Memory]
Parallelizing CAD is important – simulated annealing based placement
1. start with random placement of blks
2. randomly pick a pair of blks to swap
3. evaluate and loop
There are many trials, so that part can be parallelized.
It works out nicely if accepting or rejecting a swap can be expressed as executing or aborting a transaction.
STM (software based transactional memory) has high overhead, but no HTM (hardware TM) yet.
– New software transactional memory (tinySTM)
– potential easier parallelization with TM.
– based on VPR (Versatile P&R) 5.0
– Platform: 8 CPUs
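A student pulled it off in one month, so the implementation is fairly easy. P&R speeds up linearly, but QoR degradation is severe (30%). The abort rate is also high at 60%.
VPR itself has code for redoing work partway through, and improving around that took QoR degradation from 35% to 8% worst case and from 7% to 2% on average, a substantial improvement.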
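The three-step annealing loop can be sketched sequentially as follows. This is a toy under stated assumptions, not VPR and not the transactional version from the talk: the cost is half-perimeter wirelength on a small grid, the cooling schedule and all parameter defaults are arbitrary, and the function names are made up. The accept/reject branch is the part the talk maps onto committing or aborting a transaction.

```python
import math
import random

def wirelength(placement, nets):
    """Half-perimeter wirelength summed over all nets."""
    total = 0
    for net in nets:
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal_placement(blocks, nets, grid_width, seed=0,
                     temperature=5.0, cooling=0.95, moves_per_temp=100):
    rng = random.Random(seed)
    slots = [(i % grid_width, i // grid_width) for i in range(len(blocks))]
    rng.shuffle(slots)
    placement = dict(zip(blocks, slots))          # 1. random initial placement
    cost = wirelength(placement, nets)
    best, best_cost = dict(placement), cost
    while temperature > 0.01:
        for _ in range(moves_per_temp):
            a, b = rng.sample(blocks, 2)          # 2. pick a pair of blocks
            placement[a], placement[b] = placement[b], placement[a]
            new_cost = wirelength(placement, nets)
            delta = new_cost - cost
            # 3. evaluate: always accept improvements, sometimes accept worse
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                cost = new_cost                   # accept ("commit")
                if cost < best_cost:
                    best, best_cost = dict(placement), cost
            else:                                 # reject ("abort"): undo swap
                placement[a], placement[b] = placement[b], placement[a]
        temperature *= cooling
    return best, best_cost

blocks = list("abcdefgh")
nets = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f"), ("f", "g"), ("g", "h")]
best, best_cost = anneal_placement(blocks, nets, grid_width=4)
print(best_cost, best)
```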
[A Message-Passing Multi-Softcore Architecture on FPGA for Breadth-First Search]
Breadth-first search in graph.
Needs a global buffer and barrier sync. I don't quite follow.
[Deterministic Multi-Core Parallel Routing for FPGAs]
A talk on parallelizing routing.
PathFinder: alongside VPR, it is the basis of the Xilinx/Altera tools. Uses maze routing.
1. Route all signals (allow shorts)
2. Increase penalties for shorts
3. Route all signals
3.1 rip-up and re-route next signal
3.2 update congestion
3.3 return to 3.1 if more signals remaining
4. return to 2 if shorts remain
– Fine-grained: maze routing of a single net in parallel
— using pthreads, parallelize calculation of forward cost & adding corresponding nodes to the priority queue
— for N procs, maintain N separate priority queues to avoid need of locks
– Coarse-grained: each node routes different net
— Step 3 is parallelized as a whole and connected via MPI
Fine-grained is slow on Core2Quad (shared FSB) but works on Core i5 (shared L3). Coarse-grained works on either.
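Steps 1 to 4 of the negotiated-congestion loop can be sketched sequentially on a toy graph. This is a simplification under stated assumptions, not PathFinder's real cost function and not the parallel scheme from the talk: every node has capacity 1, each net has one source and one sink, the sink is assumed reachable, and the cost formula and penalty schedule are made up for illustration.

```python
import heapq

def shortest_path(graph, source, sink, node_cost):
    """Dijkstra over node costs; returns the node list from source to sink."""
    dist = {source: 0.0}
    parent = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == sink:
            break
        if d > dist[node]:
            continue
        for nxt in graph[node]:
            nd = d + node_cost(nxt)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                parent[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    path = [sink]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return path[::-1]

def negotiate_routes(graph, nets, max_iterations=20):
    occupancy = {n: 0 for n in graph}
    history = {n: 0.0 for n in graph}
    routes = {}
    penalty = 0.5
    for _ in range(max_iterations):
        for name, (source, sink) in nets.items():
            for n in routes.get(name, []):        # 3.1 rip up this net
                occupancy[n] -= 1
            cost = lambda n: (1.0 + history[n]) * (1.0 + penalty * occupancy[n])
            routes[name] = shortest_path(graph, source, sink, cost)
            for n in routes[name]:                # 3.2 update congestion
                occupancy[n] += 1
        shorted = [n for n in graph if occupancy[n] > 1]
        if not shorted:                           # 4. stop when no shorts remain
            return routes
        for n in shorted:                         # 2. increase penalties
            history[n] += occupancy[n] - 1
        penalty *= 2
    return None

# Both nets prefer the short path through "m"; one of them has to detour.
graph = {"s1": ["m", "u1"], "s2": ["m", "v1"], "m": ["t1", "t2"],
         "u1": ["u2"], "u2": ["t1"], "v1": ["v2"], "v2": ["t2"],
         "t1": [], "t2": []}
print(negotiate_routes(graph, {"A": ("s1", "t1"), "B": ("s2", "t2")}))
```

In the first pass both nets are allowed to share "m"; the growing penalty on the shorted node then pushes one of them onto its longer private path.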
[The TransC Process Model and Interprocess Communication]
TransC language
– C-like
– Supports parallel processes: communication via data streams
– Multiple return values (!)
[Comparing Performance and Energy Efficiency of FPGAs and GPUs for High Productivity Computing]
Comparative evaluation across several applications.
FFT on FPGA is fast. flops/W is overwhelming.
Monte-Carlo seems faster on FPGA than on GPU, but hmm. I wonder what architecture they used.
[Local-and-Global Stall Mechanism for systolic Computational-Memory Array on Extensible Multi-FPGA System]
Wang-san from Tohoku University.
A system for synchronizing PEs arranged as a systolic array across different clock domains. It uses local stall signals generated from things like FIFO empty signals, and a global stall signal that ORs all of them together.
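The stall rule itself is simple enough to model in a few lines. The sketch below is a cycle-level toy, assuming each PE has one input FIFO, the local stall is just "FIFO empty", and all PEs advance only when the OR of the local stalls is low; clock-domain crossing, which is the hard part of the real system, is not modelled, and the function name and arrival format are made up.

```python
def run_lockstep(arrivals, num_pes):
    """arrivals[c] lists the PEs whose FIFO receives one item in cycle c.
    Returns (steps every PE has executed, number of stalled cycles)."""
    fifo_level = [0] * num_pes
    steps = 0
    stalled_cycles = 0
    for arriving in arrivals:
        for pe in arriving:
            fifo_level[pe] += 1
        local_stall = [level == 0 for level in fifo_level]  # FIFO empty
        global_stall = any(local_stall)                     # OR of all local stalls
        if global_stall:
            stalled_cycles += 1      # nobody advances, so the array stays aligned
            continue
        for pe in range(num_pes):
            fifo_level[pe] -= 1      # every PE consumes one item together
        steps += 1
    return steps, stalled_cycles

# PE 1 has no data in cycle 1, so the whole array waits for one cycle.
print(run_lockstep([[0, 1], [0], [1], [0, 1]], num_pes=2))
```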

Design Gaia 2010, Day 3

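[Power Consumption Evaluation of a Pattern Recognition Device That Implements the Data Directly as Circuits]
Uses kernel methods with probability density functions. Cool.
(I took notes, but ended up deleting them...)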
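[Prototyping and Evaluation of the Power-Reconfigurable Flex Power FPGA in a Low-Power Process]
Static power is roughly:
– V5: 65%
– V6: 62%
– V7: 45%
This time it is a Flex Power FPGA built in a low-power process.
The off-current varies over a wide range. With mixed high/low Vt, the improvement in off-current reduction was small.
What about design tools? → The issue is how to map thresholds during place and route. A collaborator is developing the tool.
Is there data on which parts of the 16-bit counter use how much power? → No.
How is the speed difference between high/low Vt evaluated? → The on-current drops, which changes the speed. Mixing them can offset that.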
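[Side-Channel Evaluation Experiments Using Near-Field Magnetic Measurement]
How does information leak through the electromagnetic channel of an FPGA during cryptographic processing?
Are there hot spots of radiated emission?
AES on a V5, with magnetic field measurement.
First build a field strength map, then place a magnetic probe at those spots and measure.
The 10-round waveform is clearly visible!
Confirmed that the key can be broken with correlation analysis.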
[ Effective Processing of Parallel Loops on a PCI-Express-Attached FPGA ]
Aimed at people doing high-level synthesis with Impulse C.
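[A Study of Probability Density Function Estimation Methods and MIA Success Rate]
CPA: Correlation power analysis
MIA: Mutual information analysis
MIA with the histogram method, kernel density estimation, and maximum likelihood estimation.
It derives the key from fewer acquired traces than CPA, but it does not seem to converge as the number of traces grows. They said the bin width and bandwidth may need some tuning.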
[Performance Evaluation for PUF-based Authentication Systems with Shift Post-processing]
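Hori-san again.
This is about being physically unclonable.
39% of companies have seen counterfeits of their own products.
Around 5% of products on the market are counterfeit!? Chips are especially at risk.
Identifies individual chips using biometrics-style techniques.
False accept rate / false reject rate are the issue.
Haven't seen an ROC in a while…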
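For reference, here is how false accept and false reject rates fall out of a distance threshold. This is a minimal sketch assuming chip responses are compared by Hamming distance and accepted at or below a threshold; the distances are invented for illustration, not data from the talk, and the shift post-processing from the title is not modelled. Sweeping the threshold and plotting the pairs gives the ROC curve.

```python
def hamming(a, b):
    """Number of differing bits between two responses given as integers."""
    return bin(a ^ b).count("1")

def far_frr(genuine, impostor, threshold):
    """genuine: distances between readings of the same chip.
    impostor: distances between readings of different chips.
    A reading is accepted when its distance is <= threshold."""
    false_accepts = sum(d <= threshold for d in impostor)
    false_rejects = sum(d > threshold for d in genuine)
    return false_accepts / len(impostor), false_rejects / len(genuine)

# Made-up distances, for illustration only.
genuine = [1, 2, 3, 10]
impostor = [9, 20, 30, 40]
for threshold in (5, 9, 15):
    print(threshold, far_frr(genuine, impostor, threshold))
```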
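[ Development of a 3D-FPGA Using TFT SRAM ]
Up to 90nm CMOS + 9 Cu layers it is fabricated normally. Vias are then planted on top, amorphous Si is deposited… and transistors are made as usual. Alignment is critical.
TFT transistor performance is the issue. a-Si has low mobility, and lifetime is a concern too. Not being able to go above 400 degrees during fabrication also hurts.
But moving the configuration SRAM out of the way shrinks the chip considerably, and despite various process constraints the desired characteristics are obtained. For volume production, replacing it with a ROM made with a metal mask is also conceivable.
Since it's an FPGA, slow reads shouldn't end up on the critical path, but what about leakage current? (Prof. Usami) → The final product will use metal, so that is not a problem there. The TFT has not been measured yet, and reliability concerns remain, so shipping is still some way off.
Is the FPGA architecture Xilinx-like? (Prof. Nagoya) → Yes. → What are the squares visible in the photo? → Configuration SRAM blocks? → Capacity? → About 22Mb for 100k LUTs.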
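[ FPGA Implementation of a Low-Cost Automotive Face Orientation Recognition System Using Impulse C ]
No need for more than 25fps. Low cost above all.
They also want a short TAT, hence ImpulseC.
Performance is decent, but it apparently did not fit in a Spartan3...
Letting Impulse C generate the tedious interface parts and hand-writing part of the core in HDL also seems like a good approach.
Infrared is used because it is more robust than image processing and costs less.
[ Accelerating a Default Intensity Model with FPGAs ]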