https://landley.net/talks/lca-2017.txt
https://linux.conf.au/schedule/presentation/29/

- Rob Landley (@landley on twitter, https://patreon.com/landley, etc).

- intro
  - first public release announced at Linuxcon Japan 2015.
    lkml June 18, 2015
    http://lkml.iu.edu/hypermail/linux/kernel/1506.2/02538.html
    - BSD licensed VHDL. Open processor, Build and install on FPGA.
      - walkthrough on j-core.org (yes I need to update the docs)
    - kernel patch that booted Linux to a shell prompt.
    - toolchain that could build kernel and userspace.
    - simple userspace build (initramfs).
  - That's our baseline. Already gave a talk on that:
    https://www.youtube.com/watch?v=lZGHbMS882w
    http://j-core.org/talks/ELC-2016.pdf
  - Since then we've made improvements along more than one axis:
    - hardware, kernel, "userspace" (toolchain/libc/rootfs)
  - Optimizing _for_what_?
    - can't optimize in a vacuum, lots of tradeoffs
    - ratios (price/performance = x86, power/performance = arm)
    - absolute performance, price, power budget (power==heat),
      size (gate count), soft/hard realtime, I/O bandwidth (streaming)...
  - Why j-core
    - Why do a new architecture? We didn't, old one (1990's) out of patent.
      - fresh implementation, simple classic 5-stage harvard pipeline
        https://en.wikipedia.org/wiki/Classic_RISC_pipeline#The_classic_five_stage_RISC_pipeline
        - although going to 7 stage is on todo list (more MHz)
          - tradeoff: clocks faster, but more transistors = bigger chip
          - same with cache, spending transistors for performance
      - called j-core because trademarks haven't expired
  - Who's "we", Kemosabe?
    - https://se-instruments.com
      Founder/CEO is Jeff Dionne, the founder of uClinux way back when
      - moved to japan in 2003, went back to hardware world
    - http://j-core.org
      Yes I need to update http://j-core.org/news.html
      (I miss kernel-traffic, Jon Masters' kernel podcast... time consuming!)
  - Why "Open Hardware?"
    - We're trying to recruit software developers into hardware world.
      VHDL with "two process method" is closer to software dev than most,
      FPGA dev is normal compile/install/test cycle.
    - Moore's Law has mostly ended. (You may have missed it.)
    - Yes we're doing an ASIC. You can do one for about a $60k kickstarter.
      - but don't have time for that here, see previous talk and ping us
  - Why superh?
    - Original SuperH paper from Hitachi (circa 1994):
      http://www.hotchips.org/wp-content/uploads/hc_archives/hc06/2_Mon/HC6.S4/HC6.4.2.pdf
    - used in Sega Saturn, Dreamcast, auto industry, shmobile (phones)
    - existing support in kernel, compiler, libc, gdb, strace...
    - 1997 asian economic crisis, finished existing chips but no new ones,
      hitachi spun-off Renesas but kept engineers, Renesas couldn't make
      sh5 matter, then developed new chips in-house and "not invented here"
      inherited tech, lost interest when patents expired
    - Our new use case we need a chip for: synchrophasors
      - hard realtime, high bandwidth, low radio noise, low power/heat,
        low cost, vibration resistant, integrate stuff like GPS correlators...
      - first tried to make Leon Sparc work (from the ESA, also patent expired)
      - then looked at other out-of-patent chips (arm, x86, m68k, mips, ppc...)
        - wanted low power, high clock speed, parallelization (multi-issue, SMP),
          ability to add coprocessors (DSP, GPU, etc)...
    - But the deciding factor was instruction set density. Why:
      http://web.eece.maine.edu/~vweaver/papers/iccd09/iccd09_density.pdf
      - memory bus, cache. (One 32 byte cache line is 16 instructions.)
        Above hotchips paper section 4.2.4 graph
        Code density is an emergent property. 16-bit instruction that does a
        lot vs 8 bit instruction that needs 3 argument bytes...
        (Load 32 bit constants?)
      - Hitachi designed instruction set via statistical analysis of compiler
        output. What does a compiler _want_ when building code? Provide that.
      - mostly-RISC design. Fixed length instructions, but not pure 1 cycle/ins
        - microcoded. One instruction telling processor to do more work shrinks code size.
          But fixed length makes decoding and multi-issue easy.
    - This is a nice chip that was sidelined by the 1997 Asian Economic Crisis,
      hitachi spun off its chip design to Renesas but _kept_the_engineers_.
      - Renesas inherited sh4, tried to do sh5. Didn't really work out.
  - first j-core drop was 33 mhz, UP, no dma, prefetch. And nommu.
    - http://j-core.org/roadmap.html
      - j3 adds mmu (sh3), j4 multi-issue (sh4), j64
      - someday Jeff Dionne should do a talk about j64.
      - but customer excitement so far is for j1 (going _down_)
        - fabs want to offer this as a library to their customers
    - We're doing a fully open source toolchain for this
      http://lists.j-core.org/pipermail/j-core/2016-June/000184.html
  - This talk isn't about what we're planning to do, but what we already did
    - and before we go too deep into the hardware...

- basic kernel stuff
  - prehistory: port to current (3.4->4.3)
    - 4.3 was first release, 3.4 code never published (it really sucked)
    - 4.8 was first upstream merge of cleaned up stuff, 4.9 was first _usable_
      - because the interrupt controller driver was bikeshedded to death
      - luckily tglx picked it out of the maintainer's tree and pushed in his
        or we'd probably _still_ be waiting.
      - Rich is a much, much better maintainer than I am. He doesn't say
        stuff like that publicly.
    - Yoshinori Sato did first Linux sh2 support (commit 9d4436a6fbc8 Nov 2006)
      - note: sh4 support in linux 2.3.16 June 30 1999, sh2 both older & newer
        - because Dreamcast
        - Why do people think GPL action needed to run on linksys, but not
          for dreamcast 5 years earlier? Or xbox, or post-update ps3, or...
          from 2010: https://www.youtube.com/watch?v=PR9tFXz4Quc
          - @7 min: hardware that can't run linux to it runs linux,
            ~12 months if it's extensively locked down, a few days if it isn't.
          - but that's assuming game-console sized userbase.
        - Ahem. Tangent.
      - Sato-san lives in tokyo! We have a tokyo office!
        We bought him lunch twice
      - so sh2 already supported, but not this board, not this processor,
        and almost nobody had ever actually used it
      - Sato-san suggested we rewrite everything to be less ugly
        - the code from the russian contractors was difficult to describe politely.
  - port to device tree
    - arch/sh had never used device tree anywhere in the architecture!
    - Rich made new device-tree-only board file with no platform devices
    - created new bindings. Checking in the documentation was 90% of the work
      - months of bikeshedding. Nice to get review but at some point you have
        to shoot the engineers and go into production.
  - sdcard driver: generic infrastructure had a bug. On hardware with no card
    change notification interrupt, it probed for card change with no locking.
    Probes randomly stomped reads and writes, corrupting card.
    - Lots of arm systems also hit it, but we were the ones to find/fix it. (why?)
    - affected spi mmc host with no media detect pin. (I.E. the cheap stuff)
      mmc is the slower way to access an sdcard,
      faster protocol requires $$$ software patent license (sigh)
  - community: no nommu linux water cooler
    - Used to be uClinux.org, but Jeff moved on in 2003, new maintainers didn't
    - repo still in CVS until they had a hard drive crash and lost everything
      - new releases start from previous binary release and swap out files
    - distro full of stale packages and place to learn about nommu mixed
      - impression was nommu for linux unsupported. Despite cortex-m!
    - We'll come back to this.
  - add fdpic and PIE support
    - nommu needs relocatable code because fixed addresses conflict
      - ideally load 4 elf sections (text, rodata, data, bss) independently
        - share ro data, fit into fragmented memory
        - requires an extra register to track, compiler changes
    - kernel has 2 loaders for this: binflt is a.out based, fdpic is elf based
      - binflt was no longer maintained! Forked to arch-specific versions.
        - upstream had been on uclinux.org but their CVS was lost in disk crash
        - buildroot was using blackfin fork on a developer's personal website
        - Yoshinori Sato had an sh2 version on sourceforge.jp
        - we got people talking and https://github.com/uclinux-dev/elf2flt
      - but Linux moved from a.out to ELF in 1996. binflt->fdpic long overdue
    - Note: community problem again, no nommu linux water cooler (see above).
    - fdpic loader can load conventional elf, ala ext4 driver reading ext2
      - can do relocatable code in old elf: Position Independent Executable (PIE)
        - build everything -fpic and then use different start code
        - but can't share code/rodata between instances, needs 1 contiguous alloc
      - Need to enable fdpic loader to load ELF (even PIE) on nommu at _all_
        for superh: https://gcc.gnu.org/ml/gcc/2008-02/msg00619.html
        for cortex-m: http://www.slideshare.net/linaroorg/sfo15406-arm-fdpic-toolset-kernel-libraries-for-cortexm-cortexr-mmuless-cores
        PIE in fdpic on sh2: http://www.openwall.com/lists/musl/2015/06/16/5
    - Note, move from
  - unified syscall numbers (politics between sh2 and sh3 devs over wince)
    http://lkml.iu.edu/hypermail/linux/kernel/1508.3/01702.html
  - new maintainer for arch/sh: Rich Felker (musl-libc maintainer)
    @richfelker https://patreon.com/musl
    - we'd already hired him (part-time) to add nommu to musl for us, so...
  - largest amount of work was port to SMP
    - recent example: new interrupt controller broke (RCU stalls) because
      the handle_irq function with "simple" in the name was the more
      complicated one that did extra work.
      patch: https://lkml.org/lkml/2016/10/13/580
    - Rich Felker said the kernel has "elaborate, well-designed interfaces,
      none of which are documented. You have to guess their interface
      contract from how they're used."
    - Our hardware did not match what the kernel assumed it was doing
      - kernel: per-cpu timers have different IRQ number
      - us: same IRQ delivered to appropriate processor?
      - kernel: but that's what arm does, all the world's an arm
        - was x86
          - was vax
  - most kernel changes due to what the hardware was doing, so...

- hardware
  - as said above, first drop was 33 mhz, UP, no dma, prefetch. And nommu.
  - second drop 50 mhz, third is 62.5 mhz on same Spartan 6 FPGA.
    http://lists.j-core.org/pipermail/j-core/2016-April/000031.html
    - 87 mhz on Kintex 7 (a more expensive FPGA)
      150nm asic ~125mhz without layout work, ~250 with.
      at 45nm that's 450-800?
    - cutting edge fab process we can do gigahertz+, but that's way off $$$
    - This is layout/routing. Shorter wires take signal less time to travel down
      - higher resolution fab process naturally shorter wires
        - giant wall-sized drawing, giant lens, focus light on silicon wafer
          like frying ants with magnifying glass. First coat silicon wafer
          in toxic chemicals, then dip in a photographic developer that gives
          your ancestors cancer. Lays down wires and dopes silicon to bias
          P/N junctions. Rinse repeat.
        - Alas, Moore's Law ended. (Most of us didn't notice, we'll get to that.)
      - If we can be faster (shorter wires connecting the bits) on same FPGA
        (made with same manufacturing process), our chip should be better on
        all processes. (Modulo each FPGA manufacturer and ASIC fab having
        its own proprietary code generation backend in its VHDL toolchain,
        so micro-optimizing isn't portable. Sigh.)
  - 8k icache, 8k dcache (calculated 1.25 slower than perfect, testing ~1.18)
    @50 mhz, dhrystone: neither=3340, d=3983, i=12376, di=36496
    @62.5 mhz off=3641.7/sec (247.6us) on=48309.2 dhry/sec (20.7 us)
    - note cache replaced prefetch unit.
      Combining them is todo item
    - direct mapped, each cache line is (address>>5)&255 (8k / 32 byte lines
      = 256 lines) and the way to evict is load an aliased cache line
      (addr + x*8k).
      - 2-way associative is more circuitry. Probably will in j3.
    - kernel has "evict cache line" and "clear cache" functions you need to
      implement. I did "evict by aliasing" but it was slow. Asked hardware
      engineers for a "clear cache bit", and had both just do that. Faster.
      http://lkml.iu.edu/hypermail/linux/kernel/1506.2/02540.html
    - Cache is tricksy.
      - memcpy can alias. Page aligned memcpy has 50% chance of 8k aliasing,
        each cache line is 8 32-bit words, so 16 cache line evictions if alias.
        - Rich wrote a memcpy that loads 8 registers, writes from 8 registers.
          but would need to change it in musl, glibc's inlined memcpy, and
          kernel, and for loops would still get it wrong.
        - TODO: make writes bypass cache if they would evict a cache line?
      - Original sh2 not perfect: constants are loaded PC-relative, meaning
        you pollute data cache with cache lines from code segment...
        But compatibility. We implemented the existing standard.
        - TODO: special case pc-relative loads to use icache? Or bypass cache?
  - DMA engine. sdcard is using it, ethernet is TODO
    - 8 channels, no allocation mechanism yet (each use hard-coded to a channel)
  - new lpddr DRAM controller (aside on FPGA libraries)
    http://lists.j-core.org/pipermail/j-core/2016-April/000038.html
    - mostly for ASIC, but FPGA is using it because dogfooding
      - explain fpga "libraries" (existing special-purpose circuitry in FPGA)
        - phase-locked loop, timing crystal, dram controller, etc.
    - 256 megs ram per controller instance (because that's biggest lpddr chip)
      - numato has 128m (smaller chip)
  - SMP
    - 1994 chip predated rise of single-chip SMP by a decade
      - moore's law s-curve started flattening in 2000 with 1.13ghz P3 recall
        http://www.eetimes.com/document.asp?doc_id=1126505
      - had to go "wide" instead
        2004: pentium M with 2 megs of l2 cache.
        2005 core duo: smp
    - currently 2-way, caches talk to each other
      - nommu SMP, Linux kernel hadn't done that before
      - TODO: higher-way smp requires a bus (we have "bitlink")
    - cmpxchg instruction (from IBM 360, waaaaay out of patent)
      - llsc is not what futexes expect, interrupt disabling UP only
    - New interrupt controller
      - each processor needs its own timer interrupt to drive scheduler
      - need IPI (interrupt to tell other processor "you've got mail")
    - started with one processor running linux, other a bare metal test program
      validated spinlock impl, then got linux using second processor
  - Boards:
    - originally avnet http://www.em.avnet.com/s6microboard (see archive.org)
      - expensive, no built-in sdcard adapter (boot media!), availability
    - Replaced with numato as entry level board: $49.95 (USD) with free shipping.
      http://numato.com/mimas-v2-spartan-6-fpga-development-board-with-ddr-sdram
      Xilinx "Spartan 6" lx9 fpga, sdcard, 128 megs ram.
      - but no built-in ethernet, and it can't do SMP.
        - Anybody want to do VHDL to hook up the VGA to a framebuffer?
      - lx9 = 9k cells, lx25 = 25k, lx45 = 45k.
        - j2 SOC takes up about 40% of an lx9, which means we can't do SMP here
      - low barrier to entry, but not far to go from there.
        - internal testing of SMP can't be reproduced by community
    - Turtle prototypes: https://twitter.com/jcoreeng/status/730330848306700288
      Raspberry PI 2b with lx25 FPGA instead of ARM. (And it's blue.)
      - sdcard, ethernet, hdmi, 1 upstream usb 4 downstream, audio...
        - We run all I/O to FPGA except atmel boot chip and usb 2 hub.
        - We have VHDL or Verilog code to test all these devices, but not
          clean implementations checked in to the repo yet.
      - some glitches in prototype, know how to fix 'em now
        - usb hub chip has 2 versions (enable lo, enable hi), bought wrong one
        - PCB too thick (pi 80% of ours), some pi cases don't snap closed
        - don't pull "up" on micro-usb connector. It comes off. (Add rivet.)
      - someone else is manufacturing this (ask Jeff Dionne)
        - This is for the community. We have our own $$$ lx45 boards.

- userspace
  - laying uClinux to rest: the _community_ had bitrotted (see earlier)
    - triage https://landley.net/toybox/roadmap.html#uclinux
      - buildroot, https://github.com/landley/mkroot
    - To replace the "water cooler" aspect, we did http://nommu.org
      with mailing list.
      - but maybe list should be on vger.kernel.org and contents of page
        should be in kernel Documentation directory...
      - and it doesn't explain fdpic yet
      - should have survey of nommu targets (coldfire, cortex-m, armv7r...)
  - executable format: binflt vs fdpic (above)
  - toolchain
    - first toolchain was code sourcery from 2010 (pre-mentor graphics)
      - then buildroot, aboriginal linux, then Rich did musl-cross-make
        - build instructions in
          http://lists.j-core.org/pipermail/j-core/2017-January/000478.html
          yes I need to update http://j-core.org
    - fix superh bit-rot (Rich's musl-cross-make patches for 5.2)
      https://github.com/richfelker/musl-cross-make/tree/master/patches/gcc-5.2.0
      http://lists.nommu.org/pipermail/0pf/2015-August/000012.html
      - The basic regression test for superh has been "QEMU runs sh4".
        - I was doing that, that's why the j-core guys approached me
      - configure wasn't enabling tls, even though sh had support for it
      - spec file didn't know how to make pie executables static
        - not sh-specific. No code change, just link different (existing!)
          crt1.o and crtbegin.o (for static pie it's rcrt1.o and crtbeginT.o)
          and feed --no-dynamic-linker to ld
      - added --enable-default-pie to config, again tweaking specfile
      - fix for an "sh sibcall bug" (ask Rich, his checkin comment sucks)
        The perennial "new gcc wouldn't build old gcc" bug category:
        https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00044.html
    - add fdpic support
      - compiler changes are (by far) the most intrusive part of adding fdpic.
        for sh it looks like
        https://github.com/richfelker/musl-cross-make/blob/master/patches/gcc-5.2.0/0007-fdpic.diff
        for cortex-m see https://github.com/mickael-guene/fdpic_manifest
    - add -mj2 machine type
      - 2 backported barrel shift instructions from sh3, and cmpxchg from s360
  - uClibc->musl
    - uclibc died. I was there. (Yes there's a uClibc-ng but 10 yr tech debt.)
      http://lists.busybox.net/pipermail/buildroot/2016-December/180102.html
    - musl-libc is new replacement, but it didn't have nommu support.
      - so we paid Rich Felker to add nommu support (and sh2) to musl.
  - toybox nommu support (ask me offline)
  - Root filesystem
    I had http://landley.net/aboriginal/about.html
    now replacing it with https://github.com/landley/mkroot
    - also poking at buildroot support, debian support
      - debian not big in nommu but they're intently waiting for j3

- summary

- questions

Because somebody's going to ask, why not RISC-V:
- want design from industry, not academia
  development driven by users, not abstract "this would be good" and users
  come along years later.
  Unix vs Multics: everything for everybody vs this for us (and them).
  C vs Pascal: demonstration of proper principles vs "make it work"
  Linux vs GNU: Hurd="there should be a free OS". Yeah, but why _this_ one?
  Linux: dial in to comp.os.minix on modem, oops I fried my minix setup,
  a year later everybody's running a webserver on their old 386
  Projects driven by use cases, what they need next.
- 4004 for busicom, 8008 for "glass tty" customer, 8080 incorporated user
  feedback to a respin for new fab, 8086 was 16 bit extension of 8080...
- compare with i432 and Itanium, "let's start over and do it right!"

Infrastructure in search of a user is a bad idea in software _and_ hardware.
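
Backup sketch for the cache notes in the hardware section (not from the talk, just a toy model): it assumes the 8k direct-mapped cache with 32 byte lines described there, so 256 slots, and shows how "addr + x*8k" lands on the same slot and why a page aligned copy either aliases on every line or on none. The function names and the example addresses are made up for illustration.

```python
CACHE_SIZE = 8 * 1024                 # 8k cache, from the notes
LINE_SIZE = 32                        # 8 32-bit words per cache line
NUM_LINES = CACHE_SIZE // LINE_SIZE   # 256 slots in a direct mapped cache


def cache_index(addr):
    """Slot a byte address maps to: drop the 5 offset bits, keep 8 index bits."""
    return (addr >> 5) & (NUM_LINES - 1)


def aliases(a, b):
    """True when two different memory lines compete for the same slot."""
    return cache_index(a) == cache_index(b) and (a // LINE_SIZE) != (b // LINE_SIZE)


def evicting_address(addr, x=1):
    """An address x*8k away maps to the same slot, so loading it evicts addr."""
    return addr + x * CACHE_SIZE


def aliased_lines(src, dst, length):
    """Count cache lines in a copy where source and destination share a slot."""
    return sum(
        1
        for off in range(0, length, LINE_SIZE)
        if aliases(src + off, dst + off)
    )


if __name__ == "__main__":
    # Hypothetical page aligned buffers: 8k apart aliases on every line,
    # 4k apart never does (the "50% chance" for page aligned copies).
    print(aliased_lines(0x10000, 0x12000, 4096))
    print(aliased_lines(0x10000, 0x11000, 4096))
```

This only models slot selection. It says nothing about timing, write policy, or what the real j-core cache controller does on a miss.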