FPL ’08 – Debuggin' Everyday

FPL ’08: Day 3

[ Synthesis ]
#1: Floating Point Datapath Synthesis for FPGAs
Altera の人のプレゼンテーション。
Floating Point Compiler
データパス上で必要な値域や例外は予測することができるので、すべてのノードを IEEE754 に従わせる必要はない。新しい浮動小数点型を定義して、それを使う。正規化と例外処理はしない (例外検出は行い、それは次のノードに送る)。正規化とかしないのは、バレルシフタがやばいから。
– 200+ MHz double precision: 50GFLOPS – 3SE260
– 250+ MHz single precision: 100GFLOPS – 3SE260
やっぱ演算器は作らなきゃいけない、ということだね。ちょっと参考になりそうだ。
#2: Automatic Generation of Run-time Parameterizable Configurations
TMAP: Tunable LUT mapping
FIR フィルタとかの係数をレジスタに書いて演算器で処理する形式から、決めうちにした回路に reconfiguration して使う方法。
– 1999LUTs → 1008LUTs
– 8.4MHz → 11.9MHz
いいんだけど、configuration に時間が掛かる上に、構成情報のデータベースが大きくなるのが問題。これを、うまく coefficient に関係するところだけランタイムで作ってなんとかする話。
– 1301LUTs, 11.5MHz
VHDL に annotation を入れて、それを合成するようなツールを開発。
#3: Generation of partial FPGA configurations at run-time
動的再構成するモジュール間を Bus macro みたいなので、つなぎかたをきめてしまうのではなくて、route table をもつ共通のインタフェイスで接続することで簡単にする。
Route table がもつのは、
– connection ID: relative coordinates of the endpoints
– route specification: list of frame and bit indices of all sw points in the route
– incompatibility information
そんな感じ。
[ Optimization ]
#3: Memory access parallelization in high-level compilation for reconfigurable adaptive computers
Adaptive computing system: CPU + reconfigurable hardware
高位言語からの合成に関する研究の多くは科学技術計算 = regular memory access, loops with constant bounds, perfectly nested loops がターゲット。そうじゃないものを扱うのに問題になるのは、ポインタ。
control flow graph を作って、activate token と cancel token を使う。Token を使うことで依存関係を壊さずにメモリアクセスを並列に行ったり、serialize したり、speculation したりする。

FPL ’08: Day 2

Keynote 3 は 8:30 から… 無理です。
[ Compilers for Reconfigurable Architectures ]
#2: CHiMPS: A C-Level Compilation Flow for Hybrid CPU/FPGA Architectures
Xeon のソケットに挿す FPGA をターゲットに考えている。
Spatial DFG にして並列性を抽出してハードウェア化。
どこを HW に落とすかは pragma を使って指定する。あと、ポインタをほげほげする restrict というキーワードがある以外は、ふつうの C 言語。
メモリからとってくるデータをキャッシュするためのメモリをたくさん用意する。これでレイテンシを下げたり、競合を減らしたりする。キャッシュの一貫性を維持するプロトコルはちゃんとやるとコストが大きいので、書き込みができるのは一度にひとつだけ。Direct mapped cache.
うーん。C から design entry する限りは結局、キャッシュとかそういう Von Neumann の呪縛からは逃げられないのかなー。
小さいキャッシュがたくさんあると、それをメモリコントローラにつなぐところが大変じゃない? →そうでもないぜー
マルチコアとの関係は？→ それもやりたいね。あと、浮動小数点関係とかね。
#3: Combining Data Reuse Exploitation with Data-Level Parallelization for FPGA Targeted Hardware Compilation: A Geometric Programming
Memory subsystem と loop level parallelism の co design.
どの loop nest level で使われるデータ化で、off-chip にするか on-chip にするかをきめる？
むー。こういうことは考えているのだが、この研究のように定式化するところが難しそう…
メモリの使い方でクロック周波数はかわる？ → 変わるけど、サイクル数のほうが実行時間に対して支配的。
[ Random # Generation & PLL ]
#1: Sampling from the Exponential Distribution using Independent Bernoulli Variates
固定小数点の、指数分布の乱数を作る方法。exp() を使わないで高速にやる。
よしみさんが cite されてた。
将来的には fixed-floating conversion を使わずに、直接に浮動小数点の値を出せるようにしたい。
#2: Enhancing security of ring oscillator-based TRNG implemented in FPGA
ring osc を乱数生成に使う場合、電源からのノイズとかがけっこう問題。気をつけましょう。気をつけて実装すれば deterministic jitter はけっこう減らせる。
[ Tutorial: Virtex-5 FXT: A new FPGA platform ]
Virte-5 FXT: has PPC440 instead of PPC405. Super-scaler, larger caches, deeper pipeline.
Not just a PPC440, but with: 5×2 crossbar connection, memory interface, 32bit DMA, user defined instruction (by APU), single/double IEEE-754 compliant 128-bit FPU (soft core) …
GTX high-performance transceivers: 150Mbps to 6.5Gbps
Power dissipation = 250mW / channel or less
TXT Family: FX without PPC, but twice # of GTX transceivers. ES 4Q08, MP 1Q09.
8 to 24 transceivers for FXT family, 40 to 48 in TXT family.
Device availability の保守についてはどう考えている?
XC3000 は 15 年前の製品で、まだ売ってるけど、いまからデザインするならば新しいのを使いましょう。1年の FPGA の進歩は人間の15歳分だよ!
でも、binary compatible にすることは考えていません。
[ High performance Computing for Financial and Biological Modeling ]
#1: FPGA Acceleration of Monte-Carlo Based Credit Derivative Pricing
single/double precision floating point の hybrid design とかをしているところが面白そう。構成のおもしろさでは 443 さんの仕事のほうがずっと上だな。
#3: Acceleration of a Production Rigid Molecure Docking Code

FPL ’08: Day 1

[ Opening Keynote ]
FPGA is the programmable platform for Transforming, transporting and computing digital data.
Triple play opportunity:
– DSP
– Packet processing
– Tera computing
Network Trends:
– Global IP traffic will reach 44bn gigabytes / month in 2012, while less than 7bn in 2007.
– Video goes 22% of consumer traffic in 2007, will be 90% in 2012.
– Mobile data traffic will roughly double each year from 2008-2012.
→ focus on reducing power consumption to reduce operating expense.
Stanford Open Slate Program
– Programmable Open Mobile Internet
– 25 Stanford inter-disciplinay faculty + Cisco, Deutsche Telekom, DoCoMo, NEC, Xilinx
Virtual operating system is available: next step is virtual network!
– High speed / reliability / security : various demands
– packet level reconfiguration !?
そのほかいろいろ
– Packet based on-board interconnection
– MIMO, Multi mode radio とかは信号処理の量がヤバい
– Heterogeneous multiprocessing: Intel + FPGA (on FSB)
University Program:
– XUPV5-LX110T
– 周辺回路もりだくさん。PCIe もついてる。やべえ、楽しそう
– OpenSPARC evaluation kit も…
[ NoC session ]
#2
Coarse grain reconfigurable architecture + multistage interconnection
Good old MIN!
2-input 2-output だけど、大きいのは考えなかった?　→ そのうちねー
#3
many core なシステムへ向けたお話。
programmable router より NoC のほうがいいよ、みたいな。
なんか、両方ともわりと普通の話だったので、ちょっと残念。
[ Keynote #2, FPGAs at CERN ]
CERN では加速器のまわりの、大量の検出器からの信号を処理するのに FPGA を使っている。加速器なので、当然ヤバい粒子がいっぱい飛び出すわけで、radiation がすごいところは古いチップ (Virtex-II Pro とか) を使ってエラー率を下げるみたいな話しも。
[ Image & Video ]
How fast is an FPGA in image processing?
この間の RECONF で聴いたお話とだいたい同じ。ソフトウェアは Intel C++ 使って評価してるのかー。いいかも。
SIMD はプログラミングが大変だといっていたが、使わなかったら speedup はどうなるか？ → 測定してない。プログラミングの得意な学生が実装しました。
DSP とか GPU を使わないのはなぜ? → intel のプロセッサはみんなが持っているから。
FPGA Hardware Acceleration for H.264 Motion Estimation
フレームを16×16 のブロックに区切って、SAD が最小になるところを見つけ、motion vector を求める。
256 subtraction+accumulation / SAD → 96giga / sec.
H.264 ではブロックのサイズをフレーム内の場所ごとに選べることと、複数のフレームを reference として選べる (鳥が羽ばたいたりとか、そういう periodic motion をうまく扱えるようになる。たいていの実装では 4 つの reference frame を使う) ことが違う。計算量は 4 倍ちょっと。
1.5 giga macro-reference block access / sec。メモリアクセスも最適化しないとやばい。
Real Time Image Super Resolution on an FPGA
Imperial College のひと。うおー超解像だ!
複数の低解像度のフレームから高解像度のフレームを合成する。
Frequency domain approach と spacial domain approach がある。やったのは後者 (だと思う)。
1. Motion Estimation
2. Image Reconstruction
3. Image Quality: Deblurring, Denoising
ハードウェア処理に適した SR アルゴリズムを開発した。2, 3 を実装している。
XC2V6000 で 1280×720, 61fps 出る。いまのところ、ふつうの HDTV にはこれで充分だし、FPGA のリソースを全部使い切っているわけでは全然ないので、もっと小さいのでも載りそう。一番必要な資源は multiplier. 固定小数点だが、浮動小数点による実装と比べて目に見える違いはない。