Bash is just sad at times.
$ alias banana='$BANANA'; BANANA='ls -l'; banana
bash: banana: command not found
$ banana
total 1
drwxr-xr-x 4 landley landley 4096 Mar 22 00:33 bms-c
$
The answer to "what order does alias perform operations in" is basically Munch's "The Scream". (Best I can tell: the line defining the alias was already parsed before the alias existed, so the first "banana" wasn't expanded at all, and the second time the alias body '$BANANA' got substituted in and THEN underwent variable expansion.)
I've been deterred from just creating a new gitlab account because when I go to the issue link the top bar has "log in" and "free trial" buttons: all accounts are either paid for or time out. It is NOT a free web hosting service with any interest in open source projects, it is proudly a paid proprietary web hosting service only.
I remember when I was researching github alternatives (back when they were taking my web login away) I looked at gitlab as one of the options and rejected it, but the options I _didn't_ choose all bleed together. I mostly remember gitlab as "that place you pull qemu from these days" and there were some other historical projects that migrated to it when other servers went down, I think? They're not sourceforge, not kernel.org, not gnu/savannah... remember when "google code" was relevant? Those were the days...
Alas if I _do_ have a historical gitlab account grandfathered in before this venture late stage capitalism nonsense, I don't seem to have my notes about it on the machine I brought with me to Tokyo. Sigh, last visit the looming personal server administration issues were migrating my email from gmail to dreamhost, and figuring out what to do about losing my github web login. Always something...
Felt largely ok, went out, ate lunch with people (very slowly), crashed HARD, had to go home. Whee.
(It's like my blood pressure is really low, and digestion lowering it further means I just collapse.)
Managing to stay upright for over an hour at a time. Yay.
Sigh, nobody's upvoted the gitlab issue. Multiple people have now reported it to me, and even retweeted my mastodon posts about it, but despite posting it to the toybox list nobody's engaged with gitlab.
I may have to make a new gitlab account (whether I already have one or not). Also, if their "virus" detection is literally "a document containing the string landley.net" (which it seems to be), there's a number of wikipedia pages linking to my site (especially to the computer history mirror), and links from lwn.net and so on, that you'd think would trigger. Mastodon's syndicated several URLs with that string in it. A bunch of linux kernel archive sites...
But I'm not trying to engage with this while still loopy from saturday's gastric attack.
Spent the day in bed. Running a fever, for some reason. Starting to suspect this was an allergic reaction or something rather than food poisoning.
Food poisoning is never fun, but japan's "bathroom and shower are in separate rooms, and even small solid chunks can't go down the shower drain" adds an extra layer of fun to the proceedings. On the bright side, I did not aspirate anything! Which is pretty much what I was focused on. For 8 hours.
(I think the turning point was being able to keep water down around sunrise. It was an unpleasant night.)
It's good to know that renewables have passed coal in the US so despite the Recidivist's thrashing, market forces are strongly arrayed against fossil fuels. And the actuarial tables against fossil politicians. (Not that both won't cheat on all cylinders, the question is who provides groveling compliance and who at least slow walks it until the octogenarian's dead.)
Someone was kind enough to open a gitlab issue. It's most likely the same nonsense I argued with dreamhost about a few months back. If I have a gitlab account the login info isn't on the machine I have with me in tokyo, so hopefully other people can upvote the issue...
(I remember looking around for places to flee github to when they took away my web login, and gitlab somehow managed to be MORE corporate than microsoft.)
Got an email that ublock is blocking landley.net. I do not have the spoons to deal with this right now.
I got a couple of bug reports about selinux support in tar, which collectively imply that Elliott never tested archive creation, only extraction. I do not have a machine with selinux in it, I ran fedora in a VM to test this stuff way back when.
Anyway, I think they're both fixed now, but I pestered the reporter to give me a pair of test files (which are actually .tar files despite the .zip extension, because Microsoft Github allows you to upload files with a .zip extension but will not allow you to upload files with a .tar extension, no the actual format or contents of the files don't matter, thanks Microsoft Github).
The file produced by red hat and the file produced by toybox have some differences, but after looking through them (diff -u <(hd gnu.tar) <(hd toybox.tar) | less) I'm sort of leaning towards NOT regression testing xattr creation, because A) I still don't personally care about this feature I don't use, B) ew.
The first change is that the header's filename (which is basically a comment) says "./PaxHeaders/a" and "./PaxHeaders/b" for the two entries in the gnu one, and "././@PaxHeaders" for both in toybox. It seems to be accepting it anyway, because the important thing is that the header is type 'x', not the name. An x record applies to the next header entry after this one, so the name in the x record isn't used and doesn't matter. My code is creating those comment-like names via sprintf(tmp.name, "././@%s", type=='x' ? "PaxHeaders" : "LongLink"); which goes into a fixed length buffer padded with NUL bytes so the shorter name doesn't actually save any space. I got that from somewhere, and it looks like gnu maybe has version skew? Making the output binary identical is silly when what they produce changes each version upgrade.
The next change is that various internal header length counters are different, because the payload is different. Seems like an effect, not a cause.
The next hunk of diff is that gnu's header has "ustar\000\0" (I.E. ustar, a null byte, two ascii zeroes, and a null byte), and mine has "ustar  " (with two trailing spaces) and "root" with null terminators. My ustar has two trailing spaces of padding (to match what was there at the time!) and I'm using the name instead of the UID by default. Which is WHAT IT WAS DOING, and while there's a --numeric-owner flag to tell it to use numbers, there isn't any sort of --non-numeric-owner flag to tell it to use names: it's the default behavior. If it STOPS being the default behavior, we lose that capability. So either gnu broke and lost a capability, or Red Hat is being nuts with aliases or something? Plus, the fields are at fixed offsets, so their 00 starting two bytes earlier than it should is funky (and why TWO zeroes?), and their actual "uname" field seems to be entirely NULL. And then they populated the gname field, which it looks like mine didn't (dunno why, but it's an x record so who cares: none of this gets used?). The actual "I can't parse their output, they can't parse my output" bug reports got fixed, these are differences that haven't resulted in actual problems... Concerning, but I dunno what's going on here.
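For reference, since "the fields are at fixed offsets" is the whole point, the standard ustar header layout looks roughly like this (sketched from memory as a C struct, double check the offsets against the posix spec before trusting them):

struct ustar_header {
  char name[100];    // offset 0: filename, basically a comment for 'x' records
  char mode[8], uid[8], gid[8];    // offsets 100, 108, 116: octal ascii
  char size[12], mtime[12];        // offsets 124, 136
  char chksum[8];    // offset 148
  char typeflag;     // offset 156: 'x' = pax extended record for next entry
  char linkname[100];              // offset 157
  char magic[6];     // offset 257: "ustar\0" (posix) vs "ustar " (old gnu)
  char version[2];   // offset 263: "00" (posix) vs " \0" (old gnu)
  char uname[32], gname[32];       // offsets 265, 297
  char devmajor[8], devminor[8];   // offsets 329, 337
  char prefix[155];  // offset 345, rest of the 512 byte block is padding
};

The magic+version difference is exactly the "ustar null zero zero" versus "ustar space space" hunk in the diff above.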
And then the next difference is that an x record is basically a string with a bunch of "%d keyword=value\n" records concatenated together, where the %d is length in bytes of the record... Except of course it's not the length of the string, it's the length of the LINE including the newline, the number itself, and the space between the number and the string. (Sigh. You have to print it, work out how many digits that number is, and then work out if adding that increases the length of the number by one ala 9->10 and this is VERY GNU. They did not have to do this. It's entirely self-inflicted.)
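In case the self-counting number isn't clear, a minimal sketch of the dance (pax_record_len() is a hypothetical helper, not what's in toybox):

#include <stdio.h>
#include <string.h>

// Return the total "%d keyword=value\n" record length: the leading decimal
// number counts itself, the space after it, and the trailing newline.
int pax_record_len(char *keyword, char *value)
{
  // bytes for " keyword=value\n" (space, '=', newline)
  int rest = strlen(keyword)+strlen(value)+3, digits = 1, need;

  // adding the digit count can grow the number (9->10 gains a digit),
  // so iterate until the guess stops changing
  for (;;) {
    need = snprintf(0, 0, "%d", rest+digits);
    if (need == digits) return rest+digits;
    digits = need;
  }
}

So pax_record_len("a", "bc") returns 7, matching the record "7 a=bc\n" where the 7 counts itself.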
$ alias ls='ls -l'
$ ls
sh: ls -l: No such file or directory
I was TRYING to confirm that alias isn't recursive, but... Sigh.
Busy in Tokyo with other things (Jeff's stuff), haven't been using my laptop and thus haven't been blogging. Tokyo remains very nice. Being on the other side of the planet from Putin's pet tangerine is also very nice.
Trying to glue the cleaned up blake3 implementation into toybox, which would be easy to do if there weren't two codepaths. The library codepath has magic string names for each hash type corresponding with some enum out of a header, but openssl doesn't seem to have blake3 yet? I cd'd over to the boringssl source, listed the top level contents, noticed a "rust" directory, and noped right out of there.
If the rust devs want to write new implementations, fine. The go, swift, zig, and oberon people are not trying to contaminate every existing project with internal language domain crossings to mark their territory. Nor do they insist that they are "owed" all those other projects. If you want to create a replacement for the linux kernel, do that. If you want to implement a new drop-in libssl replacement in a different language, do that. But DON'T BREAK THE EXISTING ONE YOU FUCKING ASSHOLES. But no, the current existing codebase must BECOME riddled with rust, not be replaced by it and outcompeted in the marketplace of ideas. And of course it doesn't remove C (because they can't, they're not actually load bearing), it just adds more layers of complexity.
Do a new implementation in a new language if you want, but STOP CONTAMINATING C PROJECTS. One project written in two languages with binary domain transitions at runtime is a BAD THING. Lua is designed to be extended with C, rust is a parasite that infects C.
This is why I refuse to have a rust toolchain in any of the systems I build. Any package that can't build without specks of rust weakening its infrastructure is broken, and I stay at the last version until I find a dropbear or bearssl or similar package that ISN'T CRAZY.
I have plenty of practice avoiding systemd, and abandoning KDE when it got toxic, and avoiding windows and facebook in the first place. "Not being part of that ecosystem" is fine with me. I am willing to be convinced, but not coerced.
I have a very nice room in a "monthly mansion", through the 27th. They emailed me about a resident meet-and-greet (sakura bloom viewing, it's basically Japan's pumpkin spice) on the 28th.
I'd love to live in Tokyo, and planned to do so while Fade was getting her doctorate, but now she's graduated and has a job in Minneapolis. (And her response to the election was that she grew up in Ecuador where the government collapsing was a regular occurrence. I am less sanguine, but staying with her. Happy to spend some time on the other side of the planet, though.)
(It's really still the same day, but international dateline.)
I'm trying to debug the "wget http://10.0.2.2/blah.tgz | tar xv" bug, which is like 3 different bugs. I should have this codepath in the test suite, but autodetecting compression types from nonseekable input isn't something the host version does, so it would be a toyonly test anyway. (It works fine if I pipe to tar xvz, but it TRIES to autodetect...)
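For what it's worth, the general shape of autodetecting from a pipe (a sketch of the technique, not the actual toybox code): you can't lseek() backwards on a pipe, so you read the magic bytes once and the caller has to feed them to whichever decompressor wins.

#include <string.h>
#include <unistd.h>

// Peek at the magic bytes of a nonseekable stream. Returns 'g' for gzip,
// 'b' for bzip2, 'x' for xz, 0 for raw. The bytes land in save[] because
// a pipe can't rewind: the decompressor must be handed them manually.
int guess_compression(int fd, char *save, int *savelen)
{
  *savelen = read(fd, save, 6);
  if (*savelen>=2 && !memcmp(save, "\x1f\x8b", 2)) return 'g';
  if (*savelen>=3 && !memcmp(save, "BZh", 3)) return 'b';
  if (*savelen>=6 && !memcmp(save, "\xfd""7zXZ", 6)) return 'x';

  return 0;
}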
Got another request for "nologin". Still don't see the point, but it's in debian's default install and in busybox, so...
I'm trying to get together some of the info I'll need for the paperwork at the airport, which includes the address of the place I'm staying. Instead of Google Maps (which turned into pure advertising) I've been using the "Organic Maps" app, which is an open source Android app using the Open Street Map data. Like Gmail, Google Maps started life as a web version of a 30 year old open source project. (Microsoft's approach with Encarta was to try to put Wikipedia out of business. Google's approach is to embrace and extend open source projects. They were net contributing back until about 2019, but stuff like "Google Amp" is about interposing their services so nobody uses the original, and the AI summary stuff is even more of that... Google Maps has satellite view and street view, which the open street map data doesn't, but highlighting advertised businesses three zoom levels before where they'd otherwise show up, and refusing to show me local black owned businesses even when I zoom all the way in? That wasn't cool.)
Let's just handwave, for the moment, the difficulty of open source projects surviving in Google's proprietary Play Store. (I trust Debian's repositories to have good code. The play store, noticeably less so.)
Anyway, the hiccup I hit is that the app works offline, so does not dynamically download map data. Instead when you zoom in enough it prompts you to "download tokyo prefecture (120 mb)", which is a thing I should have done before getting on the plane. (No wonder searching for tokyo street addresses didn't provide any hits, I'd only downloaded minnesota...)
The plane's completely full (I know because the checkin kiosk did this "bid to be bumped" pop-up thing, $500 was the max of the anchoring options and they didn't bite when I hit it), but I have an aisle seat so managed to do a little work on my laptop. (I have elbow room on one side, anyway.)
But it's in bursts, and after fixing a thing and queueing up three more things I need to do (in a "before I can fix THAT bug I hit THIS bug, oh yeah I remember that issue I haven't fixed yet..." way), I took a break to try to watch one of the in-flight movies on the seatback screen.
I had high hopes for "Deadpool and Wolverine", but no. I had to pause at 12 minutes in. And then again about once a minute since. Through him being rejected by the avengers, through failing as a used car salesman, through the painful birthday party where he's broken up with Vanessa for some reason? Even through the TVA which is where you'd THINK the movie would perk up but it's still just sort of... It's not quite embarrassment squick, but it's emotionally hard to watch in a way I don't remember previous deadpool movies being. They were cathartic and funny. There was tragedy and drama but not a lot of "waiting for the other shoe to drop" for more than like 30 seconds at a time.
Maybe it's just been a long time and I'm not remembering. Maybe I got spoiled by too many clips online. But it's paused at just under 23 minutes in and other than the credits sequence (which was great), this movie has been clearing its throat and waiting to start.
The concept of "anchor being" is... sigh. A billion galaxies all depending on one dude they've never met? The fabric of reality breaking down because Jesus died? I'm not buying the physics. They could have at least come up with some more convincing horseshit about "it's an echo of Thanos doing that snap, one guy killing half his universe resonated across parallel realities that are currently either living or dying based on the fate of an individual because snap." Give me a fscking FIG LEAF here. (How did this universe survive long enough for the TVA to be founded?) Yes I expect fourth wall breaks (which in-universe are treated as Wade being mental because head full of tumors, but here they're... not?), but I've been completely pulled out of this movie a half dozen times and we're not CLOSE to half an hour in. "I am aware of voyeuristic extradimensional entities who mimic my reality for their entertainment" is not the same as "I'm going to grab the microphone and pull it into frame because this is not real even to its own characters". I can't make the airplane playback go 2x to speed through it more tolerably. I'm glad I didn't see this in theatres because I might have walked out.
Darn it, I was hoping to like this one. (Unlike Moana 2 which I'm just not bothering with. At least Aladdin II was forwards looking not backwards looking. That was trying to be the PILOT for a TV series, rather than "we cancelled the TV series and frankensteined the corpses of a dozen episodes into a single theatrical release." This time the "sucky direct to video plot" is recycled leftovers made from something that already explicitly failed and will not be happening. They're showing it to us because they already paid for it, not because they LIKED it. They explicitly DIDN'T like it enough to finish the series, but hey, sit through it in theatres! Kids are too dumb to know better! That's... Ouch. No. Do not sully the memory of the original like that. Yes I have the option to watch that on this plane. Or "Mufasa" which FUCK no, a PREQUEL to the LIVE ACTION REMAKE??? Which WASN'T LIVE ACTION BECAUSE CGI ANIMALS WITH NO EXPRESSIONS OR BODY LANGUAGE AND... AAAAAAAHHHHHHH!!!!!!)
Ooh goddess, Paradox's villain rant at 25 minutes is... Couldn't they at least get whatsisname, the Butler from Clue and Sweet Transsexual From Transylvania to do it? He went full muppet in Muppet Treasure Island, he could pull this off. The random empty suit they have here is failing to ham it up OR be convincing. It's neither serious nor camp, it's just sad.
At 26 minutes it's trying to provide motivation, and just isn't. He literally established that "everyone I care about is in this room" around 10 minutes ago, but they won't transplant the contents of THAT ROOM. Why? No reason! Ryan Reynolds actually emoted a little bit (maybe 10 seconds), but the guy in the suit is just nothing. He's neither Darth Vader nor a punchclock villain, his motivation is READING A SCRIPT.
Possibly "kill your family to join post-endgame marvel alongside Quantumania and The Eternals" is not a coherent pitch for a villain to even make. It keeps showing clips of The Avengers from 2012, but they already did Endgame, that's over. And Deadpool already asked to join The Avengers like 5 minutes ago, this is supposed to take place AFTER that. The first movie did nonlinear storytelling, but you could piece it back together pretty easy. Deadpool 1 asked "why did this happen" and then backed up to show you. Why does he want to join The Avengers here instead of the X-Men? Yeah yeah meta Disney but it makes no sense IN UNIVERSE. The emotional stakes are BACKWARDS, Disney thinks it's hot shit and that the audience cares more about behind-the-scenes making-of drama than the story ON THE SCREEN. This movie is failing to tell a coherent story.
I remember how, to me at least, far and away the weakest part of the Dr. Who 50th anniversary special was the meeting with The Curator. Because it was all nods and winks that made no sense in-universe. If they'd established "this was one of the leisure hive clones of the 4th Doctor who survived that episode and will eventually decay into The Watcher and merge back into The Doctor when the 4th doctor regenerates between Logopolis and Castrovalva (as we saw on screen, thus ANSWERING a question instead of asking one), and in the meantime the clone gets a couple hundred years of scurrying around behind the scenes pulling strings to balance the fallout from the Logopolis entropy wave destabilizing the universe, which was why the Key to Time had to be assembled to save most of the universe from that particular disaster, and as long as he was cleaning THAT up he took a quick swipe at the time war on his way out"... they could easily have made a fantastic story out of that. Tom Baker's elderly character getting increasingly pale and frail as his time runs out, adding "the watcher" makeup and racing against the clock to finish his tasks until the current Doctor drops him off near where Tegan's Aunt Vanessa's car broke down at the end. But "all these nods and winks mean NOTHING in-universe" was self-indulgent nonsense that I found painful to watch. It served no story purpose. The story needs to emerge from the motivations of the characters. If that's not what's driving the plot then there are no emotional stakes and I have no investment in what's going on.
Seriously, this is freshman level writing 101.
And half the point of the Loki series was apparently that the "sacred timeline" read like a cult to everybody outside the TVA, so now there are other timelines the TVA allows to exist (they no longer prune everything)... but one of them is still sacred? How does that work? I'm sorry, what did the Loki series accomplish exactly? "We no longer prune" THEN THERE ISN'T ONE SACRED TIMELINE. Especially since going forward past Endgame even Disney's audience doesn't know what "the" timeline is, it's branching all over the place. That whole Doctor Strange and The Olsen Twins' Mommy movie, Spiderman Across The Universe's Live Action Remake with Toby and Andrew, and didn't that forgettable "Ms. Marvel having a three-way" movie involve Monica Rambeau winding up exiled to X-Cheers where Blue Beastle is played by Frasier? (I remember Monica's name because I used to read the comics, I remember her getting her silver costume from a mardi gras rack: Binary was off with the Starjammers and Inflation Fetish Lass wasn't a thing yet.) Which of all those timelines is the "sacred" one, exactly? Didn't most of them fork off a common base, and then interact with lots of OTHER timelines? "We visited this other timeline, came back to ours, you pruned the other one so it never existed but we were there for quite a while breathing its air and interacting with people so why are we still here now if part of our personal past no longer exists..."
They're not being consistent even _within_ the Tennessee Valley Authority. The timeline was sacred because that actor they hyped up to replace T'Challa (who then got fired for domestic violence) pruned all but one timeline, because he invented a flawed time machine that forked the universe to death and was going to smear everything into a lifeless fog otherwise, or some such? And then there were two seasons of plot where Loki won the Game of Thrones so yggdrasil could grow out of his ass and now there can be lots of timelines going in parallel without spaghettification... but one of them is still sacred? (I didn't see the series, I don't have Google+ and I'm not planning to buy Google Glass to watch it, but I saw a bunch of clips on youtube because Loki's actor remains engaging and Tall Round is adorable as OB. I should not have to be up to speed on the minutia of TV series to follow THEATRICAL offerings, but from what little I know everything this movie is saying is nonsense even WITHIN the context of what they'd already established about the TVA.)
And hang on, the movie showed Ryan traveling to Earth 616 to talk to Happy Hogan about joining the 2018 Avengers (long before the TVA showed up, outside of the opening credits which was a flash-forward). It SAID "Earth 616". How did he cross timelines to do that? Cable's time machine was "forward and backwards" not sideways. (The TVA guy said Deadpool made a mess of HIS timeline. The single timeline Deadpool is from, number ten thousand something. Cable's thingy was not interdimensional travel. As with the Tardis, it can't navigate SIDEWAYS. There's no coordinate settings for other universes, it doesn't inherently know how to go that way, it only accidentally ever winds up in pre-existing alternate universes like Inferno or E-space or the new Cybermen's universe due to external factors dragging it off course, and you then have to VERY CAREFULLY BACK OUT through the hole you came in to get home.) And wouldn't going to work for The Avengers in another universe have involved leaving his family behind to do it? Since that wasn't his universe? I'M CONFUSED.
28:30: did they ever establish that transplanting a Logan would work? If so, why couldn't the TVA just do that? (I WANT TO LIKE THIS MOVIE. PLEASE STOP FAILING AT STORYTELLING! I WOULD VERY MUCH LIKE IMMERSION while confined to an uncomfortable chair for twelve and a half hours.)
Sigh. It picked up a bit once Huge Ackman got a chance to act, but I made it a little past the 50 minute mark and just stopped. The evil bald lady from Star Trek The Motionless Picture starting a cult is just not my problem. I do not care. I cannot BRING myself to care. Not after Human Torch America broke his neck falling, but was then resurrected so she could kill him again, because she's so dumb Deadpool could trivially manipulate her into killing a random stranger he just met who had tried to be kind to them. For no in-universe reason. What do any of these people EAT? How are they protected from this giant all-devouring smoke monster from "Lost" when they have a large visible base in a fixed location on the surface? Deadpool viscerally murdered like a hundred TVA agents, and then comes back to have a civil conversation with their boss. Instead of killing him (which you'd think a disintegrator stick that looks like that COULD DO) they put him in a prison that Loki ALREADY ESCAPED FROM back in that TV series that explicitly took place before this movie. They still put people in there, with a giant death monster that CAN kill them, but won't RELIABLY do so, and otherwise leave them unguarded. Why?
This is just random unconnected things happening on screen, I'm out.
Preparing for my flight to Tokyo tomorrow morning: full backup of my laptop less than a week ago is probably good enough. I did a git fetch on repositories I might want to poke at on the flight, and linux-kernel had stuff but busybox hasn't been updated since February 9th and musl since February 12th. I've been feeling really guilty about not going faster on toybox, but DUDE...
As with last trip, I'm stress cooking. Trying to leave Fade with All The Food Boxes for work lunches and dinners while I'm away.
I had to go to Target to get more jeans. I feel guilty about spending any money there until their DEI cowardice/appeasement gets undone (or never if it doesn't, still not on Faceboot, still not using Windows), but at the moment I can't think of a better place to buy clothing in Minneapolis. (Well I dunno what's around.) I bought everything else I usually get at Target at Cub instead, which is more expensive and has worse selection, but is not the subject of an active boycott I'm aware of. (They may be terrible, but if so they were QUIETLY terrible. They did not publicly preemptively appease a would-be dictator as a show of performative fealty.)
The FSF remains surprisingly incompetent:
$ tar xvf gdb-6.12.tar.xz
$ cd gdb-6.12
$ tar xvf ../gmp-6.2.1.tar.xz
$ mv gmp-* gmp
$ tar xvf ../mpfr-4.2.0.tar.xz
$ mv mpfr-* mpfr
$ ./configure --target=sh2eb-elf
$ make
...
target-float.c:1160:10: fatal error: mpfr.h: No such file or directory
1160 | #include <mpfr.h>
It's RIGHT THERE. Configure passed, you built 8 gazillion subdirectories, and then suddenly you can't find YOUR OWN COMPONENT THAT'S IN THE TREE. ("Autoconf is useless" is still to the tune of "every sperm is sacred".)
$ make clean; make | wc -l
2565
It made it 2500 lines into the build before going "boing", most of them one file compiled per line. Sigh... (I'm ONE GUY and I try to test all the mkroot targets before each release. They can't test all their build targets. Oh well...)
I have been pointed at a small simple public domain blake3 implementation, which seems a good thing to glue into toybox.
It means I'm skipping blake2. And there's no agreed-upon /etc/shadow "$1$salt$hash" indicator for either hash. (In part because there's no standards authority for that!)
Ah, wait. "man 5 crypt". $6$ is sha512, $5$ is sha256, $sha1$ is sha1. Nothing for blake2 or blake3 (was there a blake1?), or sha3. I tried to look at "yescrypt" but it's an intentionally obfuscated magic implementation that SMELLS like a scam. I suppose I could ask Michael Kerrisk? (I know he handed the man pages project off, but the new guy DOES NOT HAVE A WEBSITE.)
I want to replace TOYBOX_LIBCRYPTO and TOYBOX_LIBZ (and WGET_LIBTLS) with a single global switch that says always use internal implementations even when there's an external library with a potentially faster version of stuff. Then the default behavior (when the config symbol is disabled) would be to __has_include() the relevant header and use it if it's there, but use the internal one (or disable the functionality) if it's not. It can check for LIBCRYPTO and fall back to checking for LIBTLS (because if both are installed it would pull hash functions from libcrypto already so might as well use it for everything).
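Shape-wise, something like this (a sketch: CFG_TOYBOX_NOLIBS is a placeholder for whatever the switch ends up being called, see below):

// With the switch off, probe for external libraries at compile time and
// prefer libcrypto over libtls. With it on, always build internal code.
#if !CFG_TOYBOX_NOLIBS && defined(__has_include)
#  if __has_include(<openssl/evp.h>)
#    define USE_LIBCRYPTO 1
#  elif __has_include(<tls.h>)
#    define USE_LIBTLS 1
#  endif
#endif
#if !defined(USE_LIBCRYPTO) && !defined(USE_LIBTLS)
// fall back to the built-in implementations (or disable https)
#endif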
The problem is, what to call the new switch. TOYBOX_NOLIBS? TOYBOX_INTERNAL? Hmmm... Until I implement my own https "internal" isn't right because the switch would disable https support. NOLIBS isn't the best name, but it's sort of what's going on here? TOYBOX_NO_EXTERNAL? TOYBOX_NODEPS?
Woo, I edited past the blockage and may actually have 2024 up and be on to a 2025 public blog file soon! (Before the end of February even!)
[Spoilers: nope.] [Further spoilers: I'm editing this on April 8th. There was more politics and ball-curling before the anger/spite caught up.]
I often make bullet point todo lists while working, it's a good way to organize my thoughts, but they tend to look like this which is not the same as a blog entry. And I usually edit such lists a bunch of times as I go along. And there's the temptation to do a bullet point list in a blog entry, and sometimes that becomes my active "keep track of current work" list because it's the one that's up to date, and I should really know better by now because it never ends well. Editing should be "is this coherent, does it render well, did I finish my thoughts, look up the URLs I meant to link to". Simple, quick to do stuff. Not "completely rewrite this for hours to explain what it means".
The second bug making mkroot/testroot.sh hang doing "toybox timeout -i 10 bash -c ./run-qemu.sh -drive format=raw,file='$TEST'/init.$BASHPID < /dev/null 2>&1" was that for recursive command calls, toy_exec() wasn't clearing the old command's signal handlers, so potentially calling an inappropriate function and segfaulting if it received a signal after the fork. This one was ANNOYING to track down, so many printf()s to dredge through to the failure point. Also, it had 4 interacting processes (timeout forked toysh which backgrounded a shell script using the ampersand, and that shell script called toybox's dirname command). In a defconfig build all four of those were toybox processes forked from the parent toybox process, and ASAN positively lost its MARBLES at that. Me, I just started each printf() with the current pid number, ala dprintf(2, "%d message", getpid()); so I could keep the output straight. Seriously, you can debug just about anything with printf()s.
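The fix amounts to something like this (a minimal sketch of the idea, not the literal toy_exec() change):

#include <signal.h>

// Before re-entering a new command in the same process, put every
// catchable signal back to its default disposition so a signal arriving
// after the fork can't call the previous command's stale handler.
void reset_signal_handlers(void)
{
  struct sigaction sa = {0};
  int i;

  sa.sa_handler = SIG_DFL;
  for (i = 1; i < NSIG; i++) sigaction(i, &sa, 0); // KILL/STOP just error
}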
The November 28 blog entry is being really annoying to edit and post, because I did my normal todo bullet point notes-to-self as just a blog entry while working on the shell function call stack redesign, and it does NOT translate to HTML easily. And alas Google Chrome has been absolutely terrible about <pre> tags forever, because if you don't explicitly set the font size the default font size for the monospaced font ISN'T the size of the previous font, it's 1. I.E. the smallest possible size of tiny unreadable font, yes it's an obvious bug, no they haven't fixed it in... 6 years I think? Because every page should have a stylesheet and if it doesn't that's just silly, even though stylesheets regularly make stuff worse. I note that <pre> tags without gratuitous micromanagement render just fine on firefox, and I made puppy eyes at the Vivaldi guys the first time I tried using that.
Look: I don't know if you prefer a white or black background for your text, why would I be making these decisions for you? Here is some text, with paragraph breaks, links to other pages, and the occasional bold and bulletpoint list. If your browser can't render that, it is a CRAPPY BROWSER that's less capable than Mosaic was 30 years ago back BEFORE netscape hired its developers away with silicon valley VC money. (VC money: turning open source internet infrastructure into exploitative gatekeeping spyware since William Shockley moved all the way across the country from Bell Labs because he was such an asshole nobody wanted to have anything to do with him. Capitalism is a bad thing.)
The tiny desk thing remains better than nothing. If nothing else, it prevents the laptop from overheating when it's directly on a blanket. It's ungainly and trying to get up from under it I have already spilled a lemonade (clipped it with a wooden leg) and caused a surprising amount of laundry. Still, soldiering on...
Suspend and resume fixed two finger scrolling in xfce. I have no idea what's going on with that. (After a suspend and resume it attached the touchpad hardware to a different driver? What, race condition or uninitialized variable or something in the driver? Who knows. If linux-kernel wasn't so intensely self-fellating these days I might try to track it down, but...)
I was curious if Tim Bird's boot log scraper would run on mkroot (I.E. under toysh: spoiler yeah but the UI is apocalyptically bad and does a complaint-reboot cycle adding command line options and kernel boot arguments multiple times until eventually mollified, at which point I had a large text file I wasn't entirely clear about what to do with).
To get images to run this script on (and confirm that toysh _can_ run the script, and maybe use it as a test load to fix any missing features it needed), I did a mkroot build all (mkroot/mkroot.sh CROSS=allnonstop LINUX=~/linux/linux) in a newly cloned toybox directory, which was also a chance to regression test the current linux-git against my ongoing patch stack. Everything built fine but of course mkroot/testroot.sh initially failed for all targets because I hadn't switched the "timeout bash -c blah" to "timeout /bin/bash -c blah" because toysh has a bash alias so it recursed into toysh, which fails to run the relevant command line for some reason. I should track that down and fix it.
The first bug is that once upon a very long time ago, getval() (or whichever equivalent I was using early in toysh's development) returned the whole name=value string, and the current behavior just returns the value, so adding 6 to skip SHLVL= is wrong because the function I called to fetch the data already did that for me. (This only happens in the fork/exec path, and I've mostly been testing the nommu subshell path, so I hadn't spotted it.)
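In code terms the first bug boiled down to roughly (paraphrased, not the literal toysh source):

// old: getval("SHLVL") returned "SHLVL=3", so skip the 6 byte prefix
//   level = atoi(getval("SHLVL")+6);
// now: getval("SHLVL") returns just "3", so the +6 reads past the value
//   level = atoi(getval("SHLVL"));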
But there's a second bug, and ASAN totally craps the bed on it: $ ASAN=1 make clean toybox && ./toybox timeout -i 10 bash -c "root/i686/run-qemu.sh -drive format=raw,file=root/build/test/init.sqf < /dev/null 2>&1" goes:
AddressSanitizer:DEADLYSIGNAL
=================================================================
==20711==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7f2c7659de7e bp 0x5e24a6a36a05aade sp 0x5e24a6a36a05aade T0)
==20711==The signal is caused by a READ memory access.
==20711==Hint: this fault was caused by a dereference of a high value address (see register values below). Disassemble the provided pc to learn which register was used.
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer: nested bug in the same thread, aborting.
AddressSanitizer:DEADLYSIGNAL
[Repeat twice more with different numbers]
AddressSanitizer: nested bug in the same thread, aborting.
There are no threads. Toybox is not a threaded program. It forks and backgrounds processes, but they are NOT THREADS.
I've been variants of under the weather for over a month (the stress isn't helping), and haven't even gone out to the apartment's front office since... new year's? Nor have I done a lot of sitting at the kitchen table counter-island-thing, because it's not very comfortable (tall thin chair, doesn't really work for me), nor the desk in the bedroom (less uncomfortable, but still kind of terrible, and I find the room claustrophobic).
I currently have whatever cold Fade spent saturday through tuesday home sick with. (Monday was Not My President's Day and Tuesday was a snow day, so she got a 4 day weekend to be sick without having to use PTO for it. Wasn't HAPPY about going back to work on wednesday, but was capable thereof.) This is a different sickness from the one she had a couple weeks before that where she only managed to work 2 days out of 9, the rest being basically bedridden. Winter in Minnesota! And a public schoolteacher taking light rail to a bus to a half-dozen rooms full of small children each day tends to pick up ALL the colds. (Her employer warned her this would happen her first year, until her immune system ramps up to cope with an endless stream of small children.) At least it's warmed up enough it's not quite so BRUTALLY dry in here... merely painfully dry. Anyway, I been sick.
Fade had a tiny little wooden desk thing (it's a shelf on two folding legs) that she tried to use for her laptop on the bed, but the dog didn't like it (it dampened his cling). I've fished it out from under the bed and am trying it on the couch. It's... better than nothing?
I am tired and sick and grumpy. Fairly certain these are related.
Dreamhost wants money. They do this every 2 years, and given how often I lose/cancel debit cards they probably wouldn't keep a payment method on file if I DID give them one, so I tend to mail them a check. And every time I have to look up how to do that again, and they do NOT make it easy to find because it's not how they want stuff to go. I have been unable to find instructions on their website this time, and google is utterly useless now (I expect the AI feature would HAPPILY suggest a way to do that if I ever did a search that didn't end -ai these days, but I would NEVER send money to its suggestion), and Dreamhost's web page has a chatbot that tries to answer my question, fails, asks if I want a human, and then says humans are available monday through friday starting at 9am. Oh well, try again during normal business hours I guess...
Either Linux or Xfce randomly broke two finger scrolling. Digging into it, the problem is xfce's control panel is seeing two mouse sources: an ALPS GlidePoint touchpad (which has a two finger scrolling option in the touchpad tab, which is enabled), and an ALPS mouse, which does not have a touchpad tab. If I disable the "mouse" entry, the pointer freezes. If I disable the GlidePoint entry, there's no difference.
I'm not entirely sure when this broke, but it worked until recently. I didn't do anything obvious to break it. Thunderbird's UI is terrible enough as it is (I can't figure out how to get XFCE to give me the little move up/down by one line arrows at the end of scrollbars back, I had that before the forced version "upgrade" from Devuan Bronchitis to Devuan Diphtheria but it no longer seems to be available because removing stuff is considered progress).
I note that hexchat still has the up/down arrows on its slider bars, because it's not using xfce's default window manager toolkit preferences. Which in this case is a good thing because I want ALL my apps to behave like hexchat is behaving. I wouldn't miss two finger scrolling if I had the darn up/down buttons back. (Clicking in the empty space in the slider above/below the grab thingy jumps by more than a full screen, meaning it misses entries. Open source development cannot do user interface design.)
Ken Burk had backups of the 2015 ELC talks, and sent me copies of my shrinking C code (outline) and toybox status update (outline) talks!
Thank you! (I poked Tim Bird in case he wants to get any of the others back online...)
Over on the gnu/coreutils gnu/mailman gnu/list somebody claimed the linux kernel's mineral rights for Richard Stallman, and while trying to write a civil reply I cut and pasted this out as too inflammatory for that list:
You realize that Stallman's entire rationale for sticking gnu/ on stuff was to claim credit for the larger system, because it's not like Coherent shipped the first full Unix clone 3 years before Stallman announced the FSF, BSD Net-1 shipped a year before Hurd (after starting work in either 1976 or 1974 depending whether you credit Bill Joy or Bob Fabry), and of course Linus developed his own kernel under Andrew Tanenbaum's Minix and announced it on comp.os.minix.
No, clearly nobody ever thought of cloning Unix except Stallman, and therefore it was his gnu/idea. Because his attempts to get people to stop calling it "Linux" at all failed back in 1998. (And when I read that page back in 1998, sadly I tried to explain marketing to RMS, which did not go well.)
Interestingly, Stallman himself said in the above ("his attempts") link that the first Hurd based system shipped in 1996, but Linux not only came out in 1991 (0.0.1 booted and ran) it already had its own mailing list (here's a few selected interesting posts from 1991 and 1992), having migrated off comp.os.minix when professor Tanenbaum got back from summer recess and objected to the off-topic project culminating in the famous Tanenbaum-Torvalds debate. (There's no similar Stallman/Torvalds debate because Stallman didn't matter.)
Tanenbaum declared Linux off-topic on his list because while it used Minix's filesystem format and compiled under minix, Tanenbaum did not claim ownership of Linux. He acknowledged it was a separate project, needing a separate discussion list. Meanwhile Stallman, who never submitted a patch to Linux or even put out a Linux distro, seamlessly transitioned from railing against Linux (insisting the Hurd would replace it) to retroactively trying to claim credit for Linux, and has politically tried to "embrace and extend" it ever since. He's also been keen to recapture errant forks such as glibc from Ulrich Drepper (see "And now for some not so nice things" at the end of this release announcement), or recapturing gcc from egcs with that "steering committee" business...
And now you're sticking GNU/ explicitly onto the _kernel_. I'm sure Alpine Linux (based on busybox) or Android (no GPL in userspace) would be happy to receive such a "correction".
Along the way I stumbled upon Oliver's page on this stuff, which is quite good (and I could probably add a dozen things to). I feel REALLY BAD about not being able to properly harness his enthusiasm. It's a me problem, and I know it. It seems like the kind of thing that could easily spin into the toybox version of Alpine Linux if I'd played my cards right, but I just don't have the skillset and have been curled into a ball at the prospect of "what the Boomers are going to break next" since moving out of Texas because politics and climate change had made staying untenable.
And then on my SECOND attempt at writing a civil on-topic reply, I cut THIS out and pasted it here, again as too inflammatory.
The kernel Linus started on comp.os.minix using the minix filesystem and compiling it under minix, which got its own mailing list in 1991 after the "Tanenbaum-Torvalds debate", 5 years before the first Hurd-based system shipped?
Systems built under LLVM (as Android's been doing since 2015) using another libc (like musl or bionic or bsd) can be built and run without a single line of gnu/code in them. Alpine Linux uses busybox, Chimera Linux uses BSD userspace, etc.
This is often done to avoid GPLv3 (and in Android's case even GPLv2, hence the lack of busybox). Or for performance. Or various other reasons.
Are you claiming Stallman invented the idea of cloning Unix? Because Coherent shipped the first full Unix clone in 1980, and BSD development started in either 1974 (under Bob Fabry) or 1976 (under Bill Joy) depending how you want to count it. Andrew Tanenbaum responded to AT&T's 1983 licensing policy change (enabled by the Apple vs Franklin decision) by starting work on Minix even before Stallman announced he was gonna gnu, and he finished it and sent it to his book publisher to stick on a floppy in the back of a textbook in 1986, having written his own kernel, compiler, and userspace in 3 years.
(There were lots of others, even Dos 2.0 was all about adding Unix features to a CP/M clone and I actually _used_ Vax "Eunice" as a child. The concept of "subdirectories" came from Unix. Stallman was not on the Posix committee, and while wikipedia's been edited to include his claim to have named the Portable Operating System unix project "POS-IX", you'll note the "citation needed". Similarly, when I drove to Boston to interview him for computer history research in 2001 he told me (in person, to my face) that he gave the BSD guys the idea of shipping their own operating system, which Kirk McKusick actually laughed at when I relayed it to him at Ohio LinuxFest in 2013. Stallman tends to retroactively insert himself into stuff.)
And now you're saying "GNU/Linux kernel". Really. This was 10 years ago. Stallman didn't write GPLv2, Eben Moglen did. Linux used libc5 first. The driving force behind gcc development (making it better than pcc or minix's compiler or any of the MANY others available at the time) was Sun's VP of Marketing Ed Zander "unbundling" the compiler from the base OS during the SunOS->Solaris migration (which was about AT&T shaking down vendors with IP claims and forcing them to switch from BSD to System V codebases, as explained in Red Hat co-founder Robert Young's book "Under the Radar") and selling the compiler and command line tools like "tar" as add-ons you had to pay extra for. That had NOTHING to do with Stallman, he was hoping Project Jupiter would ship as a PDP-10 successor so he could keep maintaining MIT's ITS, he only retrenched to Unix when his previous project FAILED.
Sigh, there's a 2010 rant on this already. (Which was itself a sequel to the earlier history posting that it links to at the start...)
I keep thinking shell trap handling works differently than it actually does. Interactive shell editing is NOT interrupted by trap handlers:
$ trap 'echo hello' USR1
$ (sleep 1; kill -s USR1 $$)&
[1] 5531
$ so now what
hello
bash: so: command not found
[1]+ Done ( sleep 1; kill -s USR1 $$ )
Curled up in a ball for another week. I was functional-ish while caring for Fade, but now she's back at work and the Circular Firing Squad is of course causing yet more collateral damage. (The Boomers will die, and thus stop voting for every nigerian prince email and scam phone call. The crazy 27% is just a tie breaker, without the Boomers they go back to being a shouty minority. A 78 year old man with Progressive Supranuclear Palsy will die (and xi is 71, and putin is 72). A cult of personality is not transferrable. Florida is already uninsurable and will be uninhabitable soon. The EU as a whole generated more electricity from solar than from coal last year. Oligarchs only die of natural causes when they DON'T stir the pot. The rubble can be rebuilt around basic income, subsidized food and housing (instead of subsidized fossil fuels and suburbs), a proper right to privacy, and a policy of guillotining anyone who retains control of a billion dollars longer than 30 days. Dave Barry predicted Boomerdamarung back in 1996. The Boomers will die.)
I've spent so much time wearing headphones (listening to distractions) that I've developed an ear infection and had to stop wearing them for a bit. Quite possibly just a zit I picked at too much, but still. Alas, earbuds still cover it and I want it to heal, so...
If kexec needs to work from a single processor kernel (because I can't figure out how to get the second processor back into the power-on state, and I don't want to edit the turtle board startup code to handle two cases which each get half as much testing), then I need to be able to power cycle the Turtle board to get it to reload that UP kernel from which I can kexec the new kernel I want to test. That way, I can set up a remote test environment where I can boot a newly built kernel without sneakernet.
I know USB hardware can do this: I've read the specs and gone through low level programming registers for various hardware. The host can cut power to a device and restore power to a device. But it looks like Linux can't, because it did not occur to the Linux kernel clique that intentionally power cycling a USB device from software is a thing anyone would ever want to do.
It looks like there potentially used to be support but it was "improved" away. (Or at least attempts to write 0 and such to the control thingies under sysfs all say "illegal write: invalid argument" from sudo /bin/bash's echo, but they accept "auto" which is the default value and apparently the ONLY value so why does the knob exist...) And the various suggestions online about how to set the autosuspend timeout to 1 millisecond or unbind drivers from the device are both A) useless (the LEDs on the board stay on, it is clearly still getting power) and B) persistent (it doesn't work as a serial device anymore despite unplugging and replugging it, I think I need to reboot my laptop to get that back). They've replaced manual "do this" switches with automatic transmission nonsense that does the wrong thing in 5 different ways, all behind a black box.
All this is DESPITE the USB fan I was plugging into the thing when programming at the UT geology building's picnic tables (in the dead of summer when it was still 90 degrees at 2am) breaking a couple years back because Linux would power it down after 30 seconds despite being a dumb device like a USB book light. So the device I did NOT want to power down would power down, and the device I _do_ want to power down can't be powered down. The 6.x kernel! Knows better than you do, and will not let you gainsay its decisions.
I'm trying to get a "cursor up, press enter" style compile-and-test out of my turtle board, like I have with qemu. If I have to stop and fiddle with hardware then the friction of testing on turtle is way higher than testing in the emulator so I won't do it as often. (I know me.)
This is why I want kexec, so the kernel that loads from the sdcard doesn't have to be the kernel I'm testing. It's best to have a known good kernel boot first anyway, so I have easy recovery if I send it something that didn't work: if I was replacing the boot kernel on the sdcard and I gave it a bad one, I'm back to popping the card out and sneakernetting it back to the host to do recovery, and that's a pain. Being able to power cycle the board if I gave it a kernel that hangs is also generally good for test cycling: I need USB to be able to switch the board off and back on.
The rest of the stuff seems like a solved problem, if a bit awkward. The USB connection provides a serial console through which I can easily transfer files to the remote board via uuencode/uudecode and similar, so they wind up in initramfs without having to go to flash at all, avoiding wear and tear on the finite planned obsolescence technology capitalism moved us to (storage that wears out with use so you have to buy more). Power cycling the board means the Association of Computing Machinery serial port (/dev/ttyACM0) goes away and comes back, so I have to re-bind microcom or similar to it, but that all seems scriptable. (With sleeps and/or spinning.)
If it were plugged into a wall outlet I could buy any number of "smart outlet" variants with bluetooth or serial connections for doing exactly this. But this USB connection is also what provides the serial port I want to talk to the device through. The real problem is modern Linux kernels are _less_ capable than $15 crap from digikey.
Sigh. I'm going to wind up buying an outlet-powered USB hub just so I can power cycle THAT with a software controllable wall outlet, aren't I? Because the Linux kernel clique is too focused on rusting the kernel to pieces rather than actually letting people control the hardware.
One problem is kexec.c didn't redo the crt0.c setup from the bootloader, so when it enters the new kernel the inherited stack pointer is pointing into unclaimed memory and the registers aren't in a known state. That's probably part of the problem.
I still need to do my hello world kernel spinning writing to the serial port, because I need to stick printf() into stuff to debug things. When I can move the printf() I can see incremental progress. Without that it's just throwing darts at a bullseye and hoping to get lucky.
By the way: sh2eb-linux-muslfdpic-cc --static -s kexec3.c && toybox uuencode kexec < a.out | xclip -sel c and then in the turtle board run uudecode with no arguments and paste the clipboard into the terminal. Easy way to fling a binary onto the board via serial console. Of course this would be more convenient if running the binary DIDN'T brick the board and require me to power cycle it each time, but for NORMAL compile-and-test cycles on an embedded board, that's an old trick and part of the reason uuencode/uudecode is in toybox.
The kexec I wrote for turtle doesn't work yet.
The first problem is putting CPU1 back into the state the kernel expects isn't really possible with this hardware. At power on, CPU1 starts in a perpetual memory read stall with the Vector Base Register set to 0, and you have CPU0 poke a register to unblock it, at which point it runs a reset interrupt loading PC from vbr[0] and SP from vbr[1]. The turtle SOC maps a small SRAM at physical address 0 (ouch) so Linux SMP bringup just has to write two pointers to the zero page and poke the unblock register.
Note that this stall unblock register is NOT in board.h, it's memory mapped off in la-la land. It's mentioned in the bootloader device tree but it's not a properly documented hardware block. So that's nice.
The problem is, when SMP Linux is already up I can run code on CPU1 (using the same taskset+SCHED_RR trick I use to take over CPU0), and I can lobotomize the interrupt controller (typecast DEVICE_AIC0_ADDR from board.h to (unsigned *), then aic[0] = 0 to stop the Programmable Interval Timer, and aic[3] = 0 to mask IRQs, although in this case I'd probably want to use DEVICE_AIC1_ADDR to write to CPU1's interrupt controller instead). And I THINK I can even put CPU1 back into the stall state (write a 0 into the control register). But I can't call a reset interrupt from assembly, it's not a normal raiseable interrupt.
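The takeover trick itself is bog standard Linux API, something like this sketch (grab_cpu() is my name for it here, not the actual code):

#define _GNU_SOURCE
#include <sched.h>

// Pin the calling process to one CPU and go realtime round-robin, so the
// scheduler stops migrating or preempting it while we poke registers.
int grab_cpu(int cpu)
{
  cpu_set_t set;
  struct sched_param sp = {.sched_priority = 1};

  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  if (sched_setaffinity(0, sizeof(set), &set)) return -1;

  return sched_setscheduler(0, SCHED_RR, &sp);
}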
What I need is for CPU1 to go into the read stall WHILE trying to run the reset interrupt. It should block trying to read the PC from memory location 0. Otherwise, when it unblocks it's going to read the next instruction from wherever PC points to and try to execute it, and when I hand off to the new kernel no area of memory is guaranteed not to get rewritten (not even that zero page), so whatever PC points to is unprotected. (The most precise control would be to have CPU1 do the stall poke, but then it would try to advance to the next instruction when unblocked. I might be able to do Branch Delay Slot shenanigans to have that advance go anyway, although in that case (possibly ANY case due to CPU pipelining?) it would read and decode the next instruction from the old memory contents before hanging trying to do a read from more memory at some point down the line.) (It's a 5 stage pipeline, instruction read is always at LEAST one clock ahead of execution.)
In theory there's a design element to fix this! Jeff thought about it, and if you write 7 (bottom 3 bits set) to that PIT control register in AIC, it should reboot that processor. Unfortunately, the contractor who implemented it didn't hook that reset line UP to anything. (It raises a reset line that's not plugged in. Great. Well we never tested this in FPGA, that was an ASIC feature. Different SOC layout.)
I can change the vmlinux bringup code to take control of CPU1 via an IPI (have the reset vector go to an infinite loop and then the IPI jumps us to the real entry point), but changing the kernel's bringup for kexec is a bit dodgy.
So I punted on all that and built a non-SMP kernel, and just wrote a very simple KEXEC that doesn't mess with CPU1 at all. That way you can have a simple stage 2 bootloader (UP linux) that hands off to an SMP kernel, and CPU1 is in the state it's been programmed to bring up. Just load the kernel into memory, SCHED_RR ourselves, disable AIC0, do the ELF relocation, and jump to the entry point. The downside is you can't kexec from the REAL kernel (yet anyway), and to do automated boot tests you'd need to be able to power the USB port down and back up to forcibly reset the board (there's likely a /sys/bus/usb thingy I can poke at, the fact /dev/ttyACM0 goes away and comes back each time this happens is awkward but a script can work around it).
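The "do the ELF relocation and jump to the entry point" part is conceptually just this (heavily simplified sketch; the real thing has to run from memory the copies won't clobber, with interrupts already masked):

#include <elf.h>
#include <string.h>

// Copy each PT_LOAD segment of the new kernel to its physical address,
// zero the bss tail, and jump to the entry point. This never returns.
void jump_to_elf(char *image)
{
  Elf32_Ehdr *eh = (void *)image;
  Elf32_Phdr *ph = (void *)(image+eh->e_phoff);
  int i;

  for (i = 0; i < eh->e_phnum; i++, ph++) {
    if (ph->p_type != PT_LOAD) continue;
    memcpy((char *)(long)ph->p_paddr, image+ph->p_offset, ph->p_filesz);
    memset((char *)(long)ph->p_paddr+ph->p_filesz, 0,
      ph->p_memsz-ph->p_filesz);
  }
  ((void (*)(void))(long)eh->e_entry)();
}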
And it hangs. No further output from the board. Which is always the most annoying kind of thing to try to debug.
I got older again. Happens every year. Odometer ticking over...
I considered baking myself a cake, but wasn't up for it. I looked up an orange bread recipe. (Well, three of them and averaged them out.) Might try to make that, but it calls for two oranges and Fresh Thyme wants half a kidney per orange right now. (It has a "free for kids" bowl of tiny oranges and bananas in front of the sushi display, but that would be cheating.)
Back when I first moved to Austin in 1996 I found a SUPERB orange bread recipe on yahoo, and had a printout magneted to the fridge for a while but lost the piece of paper (thinking I could always print it out again) and could never track down the URL again. It was a simple quick bread: orange juice, flour, baking something (I can never keep powder and soda straight), possibly a bit of salt, MAYBE another dry ingredient? And the second time I made it I added a shot of lemon juice (which was a suggestion, along with poppy seeds which I didn't because I don't socially know any vampires who aren't allergic to citrus). I recall being impressed there wasn't any liquid other than the orange juice (no eggs, milk, water...) and the instructions were very explicit about stirring JUST enough to moisten the ingredients and NO MORE. (I think it suggested folding it over like meringue, in fact. I had to look up what that meant.) But alas my attempts to recreate it since without ratios or cooking instructions resulted in an inedible brick, twice.
Or I could just go to target and get a box of spice cake and a tub of cream cheese frosting, which always seems way fancier than it is. But Fade is usually more enthusiastic about cake than I am and she's still sick.
Fade got me a second humidifier for the living room (bringing the apartment's total to FIVE running at once). It is RELENTLESSLY dry here when the heat's on, and I've found that when I get sufficiently dehydrated A) I stop being thirsty, B) all my joints ache, C) my skin becomes very easily nicked and abraded (my knuckles look like I've been in a fight because of getting oven mitts and such out of the kitchen drawers). Being able to feel 15 years younger by downing two cans of Arnold Rimmer's half-lemon tea (reasonably priced at Aldi's, although the "lite" version is still 80 calories of sugar per can) is... disturbing. Better than NOT being able to do that, I suppose.
The new ELC call for papers came out, and I find I have nothing I want to say to the community. I'm happy to learn go, zig, oberon... I would LOVE more excuses to do lua. But the last version of the kernel I can patch rust out of is the last version I run. (And python 3 goes in the GPLv3 bucket: you can pay me to do it, but never for free.)
Fade was out sick wednesday and thursday (teaching tuesday trashed her voice to full laryngitis), and is in today so she doesn't fall TOO far behind then has the weekend to recover. I still haven't fully recovered from whatever this is either (cough cough), but am doing my best to be a dutiful wife for the breadwinner of the house.
Had a long call with Jeff walking through the turtle VHDL code, and I think I understand the pieces needed to implement kexec now. Or at least we tracked down the answer to all the questions I knew to ask. (Half the output of this should be BETTER DOCUMENTATION.)
Fade took monday off, but has now gone back to work. Where was I...
So I was adding sh_fcall layers from the signal handler, but that meant two consecutive references to TT.ff might not refer to the same structure instance. About halfway through triaging all the call_function() instances to make sure we were consistently using the returned pointer rather than fiddling with TT.ff to initialize the new object, I hit the one in run_command(), traced through the use of the "prefix" variable, and just went "this is not sustainable". So I changed to having a separate linked list of pending signals which the handler appends to (registered to be called with all signals blocked until it returns, so two handlers don't interfere with each other) and then run_lines() processes under sigprocmask(sigemptyset()) as the other half of the locking.
This of course meant I didn't NEED all the changes to make sure sh_fcall initializers were using the returned value instead of the global list pointer, and backing them out was where I transplanted the cleanup work I was doing to run_command() over to the _previous_ checkpoint (in the toybox/toybox directory, as opposed to its extension in toybox/clean3).
Which means the "TT.signal" value I added and nobody uses can go away again because that was my first stab before going "no, I can leave THAT signal blocked until the loop handles it but I can't leave all the OTHER signals blocked, so this needs to be a list of signals seen, meaning I can't trust atomic assignment but have to make sure signals are disabled in both places the list is modified".
This is fundamentally the same problem that adding to TT.ff had, but that list also has a bunch of USERS that could be inconvenienced by the list changing out from under them, and the new list has no existing users that aren't already being, sigh, essentially thread-safe about it. (I can DO threaded programming. OS/2 was heavily threaded, SMP in the kernel is "threaded but slightly worse", and realtime programming on bare hardware is often the same general mindset. I just don't WANT to when I don't HAVE to, it's like introducing nuclear isotopes into an engineering project: containment is key, if it spreads all over the place it will not end well, and it's so much easier to just not go there in the first place. This still ISN'T threading, meaning all the libc internal locking nonsense isn't necessary. Signal handling already has its own rules about what is and isn't safe to call from signal context. So does vfork() child context, just generally not as explicitly documented. :)
Current status: doing "diff -pu file1.c file2.c | nl | less" to work out the line ranges so I can do (as it turned out) "diff -pu ./toys/*/sh.c ../toybox/toys/*/sh.c | sed '3,28d;132,$d' | patch -p1", and yes the ./ on the first argument was so patch -p1 had the appropriate number of directory levels to eat. (I _could_ have added enough to match the other one, but it tries both and takes either one that works.)
Because it's easier than manually editing the patches, that's why.
What happened was I was in the middle of a largeish structural change to a file, encountered a hunk of code I really needed to clean up to reason through how it worked (another way to say it is that reasoning through how it worked suggested multiple simplifications, mostly leftover scar tissue from before the most recent round of changes) and then I wanted to check in just that part because the change was getting huge and it's good to have checkpoints. So I diverged into trying to split up a large change, which is itself a lot of work.
*shrug* The usual.
Fade has my cold, which says it's a cold and not just "four humidifiers are not enough" dryness from the building's heating trying to keep up with the loss of the polar vortex. (The northernmost layer of jetstream used to contain the freezing air up north. Now it's gone intermittent which lets bursts of arctic cold leak out. It apparently first collapsed in 2014 and has been unstable ever since. Yes, this is because global warming.)
It's a pet peeve of mine: people keep saying "we're having a polar vortex" and no, the problem is we're NOT having one. That's why the cold that should be UP THERE is instead DOWN HERE. It's about the same as trying to cool yourself on a hot day by leaving your refrigerator open: you get a cool breeze but all the food goes bad. Any snow/ice added down _here_ will be gone by june, meanwhile the permafrost isn't and the glaciers aren't reforming after another summer of melt. THAT'S how blizzards in texas are a sign of global warming.
I wonder if Florida will submerge fast enough to drown the invasive pythons? Probably not. Most snakes can swim.
Trying to do kexec for j-core (turtle boards), because it came up recently and I think the guys in japan could use it too. It basically lets us use Linux as a stage 2 bootloader, so you can just power cycle a turtle and feed it a kernel+initramfs+cmdline (and maybe dtb) so test cycles don't involve sneakernet but could be done entirely remotely/automatically. Heck, you could feed it a tarball over serial console, or have the builtin ethernet wget something from a web server. You have linux running arbitrary initramfs as your secondary bootloader.
Since this is a nommu board, I can theoretically do all this from userspace, although it's a bit awkward. Jeff pointed me at the existing bootloader code that loads an ELF image: basically just a series of flash load commands that grab the header, confirm it looks like the right kind of ELF, then iterate through the program header segments loading each PT_LOAD entry into memory at its p_vaddr, copying p_filesz many bytes from storage and zeroing from the end of that to p_memsz. (Presumably bss has a zero filesz?)
The ELF header check and relocation are easy enough to port from the bootloader code, the big change is swapping out the flash_load() calls with memmove() calls, although I simplified it a bit.
The relevant guts of taskset for putting ourselves on CPU0 is just long x = 1; sched_setaffinity(getpid(), sizeof(x), &x); usleep(100);
The header check is making sure it's the right kind of ELF and e_phnum (the program header count) isn't more than 4 segments (presumably code, data, rodata, bss).
The relocation iterates through the program header segments loading each PT_LOAD entry into memory at its p_vaddr, copying p_filesz many bytes from storage and zeroing from the end of that to p_memsz. (Presumably bss has a zero filesz.) If I pass a vmlinux pointer in RAM the result is just an ELF32_Ehdr typecast at the start, a for loop over Elf32_Phdr array checking p_type==PT_LOAD and calling memmove() and memset(), and then a void (*e_entry)(void) function pointer call.
I've got the header check first, the taskset, and then the relocation code at the end, followed by the jump to the entry point. Turtle vmlinux is conventional ELF so it's all absolute addresses, meaning I don't even have to calculate relative to the start of physical memory or anything, just copy and jump where it says in the file.
But between the taskset and the relocation is the quiescing of the old kernel, and that's a pain. Trying to read through device tree code and kernel's arch/sh/kernel/cpu/sh2/smp-j2.c is NOT FUN. The C code is looking up cpu-release-addr out of the device tree, but arch/sh/boot/dts/j2_mimas_v2.dts hasn't got that field... Ah, because the device tree it's actually USING is the one out of the boot ROM, and that DOES have it (in cpu@1) which says we're writing 0x8000 to address 0xabcd0640. But that's to ENABLE the second processor, what do I do to DISABLE it?
(I could also have my kexec command fork(), taskset the child to the second processor, sched_setscheduler() itself to hard realtime priority, spin in a for (;;) loop for a quarter second, and then execute the processor's sleep instruction. Giving CPU0 time to switch off the timer and interrupt controller. But that involves inline assembly, asynchronous timeouts... smells a bit janky.)
In THEORY, stopping CPU1, stopping the interrupt controller, and stopping the timer is three pokes.
I really hope there isn't a race where an interrupt can happen the clock AFTER we write to the AIC to disable interrupts, so we wind up in the kernel and it can't get back OUT again, stuck in some kernel thread or something. I don't THINK so? Question for Jeff...
Trying to finish the trap builtin, but it's hard to concentrate when it's so dry breathing HURTS. (And that's _with_ three humidifiers going full time.) I can't pace the halls because THAT's so dry I have to come in and chug a beverage after five minutes. Not doing wonders for my sleep schedule either.
My blog doesn't have comments so I get emails and/or mastodon posts, and one of the emails replying to a post said:
Kind of like how the Thumb2-base of CortexM4 patent expired, I am curious as to whether DM&P needed an x86 license to produce the 386. While it's been 40 years since the 386 was made, would they try to block the manufacture of it at 22nm or 40nm in large enough numbers? (Also, there might not be as much demand- with new generations preferring phones that can run tiktok apps and youtube, etc...but still)
To which I replied:
Linux yanked 386 support and made 486 the baseline years ago because they wanted to assume the existence of lockless atomic cmpxchg. (Even UP kernels can take interrupts in the middle of stuff these days, and leverage the SMP plumbing to task switch out of the middle of system calls, mostly to run high priority kernel tasklets outside of interrupt context.)
That said a 486 might have some demand, but the thing is JIT compiling of bytecode was a big deal back in the 90s, then transmeta proved you can support foreign instruction sets relatively cheaply back in 2000, and qemu kinda took over the world translating a page of instructions at a time and keeping the cached ones around (even with the first dyngen code) and was everywhere by around 2005.
And of course apple had the m68k->ppc transition, the ppc->intel transition, and the intel->arm transition each with an emulation layer for running old binaries doing JIT-style dynamic translation of one instruction set to another.
The thing about the x86 instruction set is it's a horrible hack of extension prefixes where the longest legal instruction is 15 bytes (the decoder faults on anything longer).
Of course variable instruction length means the instruction starts are not aligned, so jumps need all the bits of precision and can't cheat like other architectures do. Now ask yourself what happens when instruction decoding crosses a page boundary and requires a cache line fetch which fails and generates a fault partway THROUGH instruction decoding. There have been security thingies about that!
This is why nobody wants to clone an x86 chip. If you're going for big fire breathing 64 bit extensively vectorized parallel nonsense, half your chip is going to be an x86 translation and reordering pipeline (true ever since the Pentium Pro), which is a recipe for security problems (like spectre/meltdown), and you'll waste tons of effort discarding speculative execution results, which means power consumption sucks because you're doing a lot of work you don't keep. And if you want something small and power efficient, just do something sane like j-core and then run a dynamic translation layer to run legacy x86 binaries.
I do long writeups in email all the time, I just thought I'd copy and paste that one here because I haven't been blogging reliably ever since the return of fascism became likely. Kinda undermines the desire to do anything productive instead of just repeating "The Boomers will die" as a calming mantra.
Toybox 0.8.12 is out.
I am very tired.
Ok, the release notes are caught up. I'm not happy with the paragraph breaks in it (too chopped up), then again collating different topics into a big run-on sentence isn't great either. But that's nitpicking.
I rebuilt the mkroot targets against linux-6.13-rc7 and there were no obvious differences: the same patches apply, and the same targets pass. I was thinking of waiting for 6.13 proper but even if that does happen this sunday, with the reichstag fire scheduled for monday I need to get this out of the way NOW, while I can still convince myself programming has meaning. (My burst of productivity has already shrunk to every other day. Yes the worst people in the world are forming a circular firing squad just like last time, but nowhere is safe to stand when that happens. Schadenfreude isn't the same as hope, and it's hard to program productively while simultaneously longing for a carrington event leading to kessler syndrome.) So I'm going with 6.12 this time.
Of course mkroot/testroot.sh fails all the targets in a clean checkout because I still need to manually patch the timeout line to call /bin/bash instead of bash out of the $PATH, because otherwise toysh has an alias for bash so toybox calls the internal command and I haven't tracked down what's going wrong here and fixed it in toysh yet (workaround found, todo list entry added, haven't gotten back to it yet). But I don't want to open any new development cans of worms THIS release, and am not holding the release for anything in toys/pending even if it IS load bearing pending. (A category that should not exist.)
I need to tag a commit to build release binaries (so --version says the right thing), and I should check in the release notes to do so, and there's always a bit of a chicken and egg problem here in that I want to check in the release notes at the last possible moment in case I hit something while cutting the release, but need the tag to build the binaries I'm testing and uploading... Circular dependencies!
This is why I have a release checklist. First thing in it: "make distclean defconfig tests", which fails because of that toolchain bug I hit when I upgraded debian versions. Right, add a release note about switching off mkpasswd to pass "make tests", because of the debian ASAN toolchain bug breaking crypt(). (Toolchain bug! Like pending: not holding up the release! Yes I could de-promote mkpasswd like I did passwd, but... NEXT release.)
The version lives in 3 places (grumble): toys.h and www/header.html need to match www/news.html, but since 2 of those are documentation it's kinda awkward to have a Single Point of Truth for that.
Ah: scripts/mkstatus.py (to update the status.html page) says #!/usr/bin/python, which Devuan Deathwish (I.E. Debian Brainworm) removed from the repository for being too useful. (The path no longer having "python" in it, just "python3", would be like the path no longer having "cc", just "c++". Or just "c99" (which is what posix said to do, because posix was fscking stupid). Nope, that's not how it works.) Yes I need to move off of python now that "python" no longer exists, and Ray Gardner did send me an awk rewrite of the rss generator for this blog, but I am NOT fiddling with scripts/mkstatus.py right now. (It will never be rewritten in python++. I might do a bash version, but mostly I want to make the NEED for it go away by finishing and promoting enough stuff.)
Luckily, since python was open source back when it wasn't dead, the source and build instructions are still available, and I can feed it /usr/local/bin as the prefix because it's easy to wipe and reinstall all that if I need to. (Mine's just got toybox, qemu, and now python 2 in it.) I'm not adding whatever LFS's security patch was, nor installing any optional packages... heh, autoconf didn't even find a c++ compiler. (It's in the $PATH. Needed to build musl-cross-make toolchains. Dunno why that faceplanted but it's apparently only needed for modules I didn't use...)
Of course since I didn't install it at /usr/bin (because I'm not mixing repository and locally compiled packages at the same level) I have to say "python scripts/mkstatus.py" but that's fine. Except THAT still fails because despite ./configure and install knowing where it put the files, it tries to load a shared library it can't find. (The BLFS instructions didn't say how to statically link it.) Ok, prepend LD_LIBRARY_PATH=/usr/local/lib and... yay, it ran.
Yay, the release note writing process has consumed all the toybox commits up to the master tag! I am caught up, and can cut a release!
Alas, this is step one of like two dozen, and I'm tired now.
I've been trying to finish and commit the shell trap builtin, but I may have to punt it to next release.
I want to queue up the shell function from the signal handler, but leave the signal disabled, and only re-enable it _after_ the shell function returns. The recent redesign allows me to add a new sh_fcall layer with a function call from signal context, but if you spam signals to the shell I don't want it to interrupt the signal handler function call in the middle with the same signal again. But I don't want to DROP them either, or at least the same signal coming in while the signal is being handled should probably be queued to restart (one instance of) the signal handler at the end. (So you can still stunlock the shell with constant signals, but not establish a BACKLOG for it to process.)
So what I want to do is leave the signal blocked when I exit the signal handler. I _think_ what I do for that is use sigaction(SA_SIGINFO) and then toggle the appropriate bit in context->uc_sigmask, because man 7 signal says the kernel restores the signal mask from that when the signal handler returns? (How would I test this?)
The function to atomically re-enable the signal mask when popping the fcall stack is sigprocmask(SIG_UNBLOCK). If I call sigprocmask(SIG_BLOCK) from within the signal handler, would that leave it blocked when it returns? I don't THINK so: it's already blocked in the current signal mask while the signal handler is running and then the old signal mask is restored on return from the function, so sigprocmask() would have to know to modify the signal mask that's waiting to be restored. I actually hit a bash bug in 2011 where longjmp() out of the signal handler left the signal blocked, and bash was doing that, so my shell script having an alarm timeout left SIGALRM blocked for all children, causing autoconf to hang in an aboriginal linux build. (I debugged into the kernel and back out again finding that one.) But "it used to do it this way" doesn't mean it still DOES, which is something I've ALSO hit on multiple occasions. And alas "try it and see" then becomes "Does macos have this field of this structure with the same name treated the same way? What does posix say? Where would this even BE in posix if it is mentioned..."
Another fun corner case is the vfork() callback setting up children needs to restore all the signal handlers to their default values and unblock everything, because who knows what's blocked at any given moment.
And then, of course, there's interrupting/restarting builtins. The main reason I'm NOT trying to use signalfd() for all this is I want to make sure blocking builtins like "read" and "wait" (and for that matter echo > /dev/thing-that-blocks) get handled properly. So what counts as "properly"? Should they abort? Restart their operation? I can't REALLY execute arbitrary shell stuff in the same process and then return to the middle of a shell builtin function, that's a recipe for disaster. (Do a blocking "read i" and have a signal handler assign to i in the middle, without leaking memory; I mean MAYBE, but...)
Right now everything's mostly trying to SA_RESTART so the OS restarts the operation behind the scenes when you do things like suspend/resume the process. Which is a signal delivery, which causes syscalls to return short reads and -EINTR, and yes I hit this years ago, and it broke stuff I had to fix. (That wasn't even the first time, before THAT suspending and resuming pipelines could return zero length reads that piping stuff to tar interpreted as premature EOF. Because tar or gzip or whatever it was didn't check EINTR when it got the zero length read, and wasn't doing the SA_RESTART thing that makes the kernel auto-restart instead of returning EINTR, except (at the time) there were times it COULD still return EINTR so you still had to check! Probably all fixed now, but I was gun-shy about suspending and resuming piped processes for years after that.)
Anyway, getting signal handling basically in: easy. Thinking through all the corner cases and making sure they're covered (and working out what the right behavior even IS): not easy.
So if you SIGSTOP/SIGCONT the shell while doing a "wait", it probably should NOT return prematurely. If you send the shell SIGUSR1 with a trap 'echo hello' USR1 installed, it should presumably print "hello" immediately. The question is, does it then RESUME waiting? Which means restarting the builtin, since that had to return.
If you "echo $MEGABYTE_OF_CRAP" to something that accepts the data as multiple short writes (like a serial port), and the echo gets interrupted by a signal, you clearly don't want it to restart at the beginning, just flush the REST of the data...
All this stuff boils down to coming up with tests that demonstrate a corner case it needs to get right. Alas, that's REALLY HARD...
Trying to close things down for a release makes the todo list longer, every single time. I've been reminded to resurrect my diff rewrite at the start of next dev cycle.
Going through and auditing the github issues and pull requests for things I've already closed, or anything that's REALLY easy to fix. The current process involves finding the date the commit/issue was opened, fishing through my email archive to find the email sent to me, and doing a "reply list" to that which goes to the magic hash that appends my comment to the relevant discussion.
And I made a gitwhack.sh script that takes the number of the issue or pull request to close as its argument (they seem to share a namespace), prepends a "Closes #123" comment to an explanatory file about microsoft github embracing and extending away my access to the web interface, the commit itself being "touch dummy; git add dummy; git commit blah; rm dummy", pushes the commit, waits 5 seconds to make sure microsoft's server side processing gets to do all its data harvesting and AI training, and then "git reset HEAD^1; git push --force" to expunge the commit from the history. And a local git gc for good measure.
Seriously, if I'd EVER responded well to ultimatums my entire career would have progressed very differently.
Sigh. So one of the things I've been cleaning up with the Money Concierge is collating retirement savings. (The fact I have ANY is because I'm at the financially lucky end of Gen X, mostly due to a pathological avoidance of debt since graduating college, and because Fade and I have not proven fertile together (in a sufficiently non-obvious way that the medical establishment gets dollar signs in its eyes at any mention of trying to track down why), so our expenses remain low. But if I tried retiring now the money would run out in a single digit number of years, though maybe with 10-15 more years compounding I could live very modestly? Especially now that Fade's working and we're no longer paying for a second residence. It would be nice if social security still existed, but people recently voted to cash that out and hand it all over to billionaires for some reason, so...)
I've been saving for retirement ever since my very first job (IBM, straight out of college), but since I'm not a Boomer nor were my parents rich, I had a lot of debt to pay off. I've also been part of the precariat my entire career, meaning I've had Financial Crisis Du Jour that made me pull money OUT of retirement savings (and pay both taxes and 10% penalty on multiple occasions), sometimes closing the resulting account and sometimes leaving small amounts in it. Hence lots of little accounts to clean up. I remember when I cashed out my IBM stock long ago, a quarterly dividend payment deposited to the account right after I'd cashed it out because date-of-record edge case, so for years I was getting printed mail about a fraction of a share worth less than a dollar. I eventually sat down and dealt with that because they were spending more to print and deliver each of those envelopes than was in the account, and I wasn't exactly _guilty_ but... Please stop.
But some accounts still actually had some money, and one of them was the old 401k (from either Pace or Polycom, both renamed themselves since I worked there and I can't remember which is which) that somehow expired and got sold(?) to Inspira Silicon Valley Scams, which immediately renamed itself Millennium Trust Me Bro, which I've wanted to GET IT OFF ME ever since because DUDE (No!). The money concierge helped me cash that out and roll it over into my existing pre-tax IRA at this bank, because that was the corresponding tax status account I could put it in without having to fork over thousands of dollars of taxes, and the money's been sitting there not actually invested (and thus not accumulating anything, in fact losing to inflation) ever since, because paperwork. (I refuse to say "earning" there, the same way you don't "earn" a lottery or insurance payout. That is not the word for what happens there.)
This is a pre-tax IRA, as opposed to the Roth IRA we've been putting money in more recently. This account has been around forever: Fade and I went down and set up matching IRAs when we first moved in together, and neither of us had much spare cash at the time so we did the pre-tax version that could get us a deduction, and it's more or less coincidentally at the same bank I'm collating stuff in now (actually at their semi-attached brokerage firm)... and it turns out the salesbeing at the time put us in a micromanaged IRA account instead of a self-directed one. (Probably he asked Fade what she'd prefer and then applied it to both our accounts.) So if I were to move the money into an index fund in this account (same as the other account), I'd suddenly owe them over a thousand dollars in fees for the "investment advice". Which... ow?
I have the option to convert the account type to self-directed, and I asked to do that, but I can't use online banking because I refuse to agree to the Binding Arbitration shenanigans (if you're not planning to screw me over, you don't need to preemptively take away my ability to sue). So I had them mail me paperwork, and today I sat down to sign the paperwork... which ALSO has half a page on agreeing to binding arbitration this time. So that's a no.
So I asked the Money Concierge if he can just roll it over into the existing Roth IRA account I already have (again, been there for years, grandfathered in from before binding arbitration became scam du jour among finance bros), and I'll just take the tax hit. (It shouldn't do the 10% penalty because it's still in a retirement account, just tax-free compounding instead of tax-deferred, but that means investing post-tax money so it gets taxed now.)
The invention of the Roth IRA was a trick Bill Clinton pulled back in the 1990s to balance the federal budget: giving people a good deal on rolling over their retirement money into "we will never tax the interest this accumulates ever again" status, so he could get a big one-time hit of tax revenue up front when the existing retirement money was rolled over (and taxed once as income for that year). And then he KEPT it balanced once it had BEEN balanced, back when shame worked on legislators. Until George "putting the duh in W" Bush intentionally restarted the oligarch embezzlement trough by proclaiming that the government running a surplus (and thus slowly paying down the giant debts accumulated by Ronald Reagan and his father Bush Sr.) meant the american people were "being overcharged" and he was "demanding a refund". Which is not CLOSE to how any of this works. (We could have had Norway's sovereign wealth fund! But instead we had republicans.)
Anyway, that's why the Roth IRA was actually a good deal in the long run, but a big tax hit in the short run. And it's why the money paperwork continues.
Huh, Elliott hit a thing, and I don't see anything obvious in his wrapper that would cause it? Says #!/bin/bash at the top, which should work? It's not reproducing here, the test works for me when built with the Android NDK. I want to help but I'm just not seeing it...
I went "git log master..0.8.11" which produced no output, because it only accepts "git log 0.8.11..master". (Why not show the range in reverse order? I know there's some sort of --reverse option somewhere, but git UI having "good" and "bad" hardwired backwards half the time is not a new issue.)
And then "git --stat log 0.8.11..master" barfed and I'm going "is it --status or something?" and read through "git help log" (I.E. man git-log) forever until line one thousand, seven hundred, and forty six finally went "no, it's --stat". Because it's "git log --stat" not "git --stat log", of course.
I'm a bit out of practice with git, and not instinctively avoiding all the sharp edges its UI is constructed entirely out of.
Anyway, time to make the release notes. (We've ALMOST got linux-6.13 out, I should retest everything against -rc6 and wait, but if I don't get a release out before the 20th I suspect my mental health may take a dip again.)
Mailing list threads! I mostly haven't been blogging here, instead there's the long thread on qemu-devel that eventually wandered to linux-sh and from there to private email about the turtle ethernet driver, various posts on linux-embedded about the boot time stuff that were mostly me doing unsolicited computer history infodumps at people, plus some things that would have been blogs went to the toybox list instead, just because I've been so behind on editing and uploading entries for so long that if I _want_ interactive feedback from people, putting it here is kind of moot. (I figured the issue out on my own anyway...)
Been busy, which remains a bit of a relief. It's just hard to tell from the commits and list posts because I have SO MUCH BACKLOG to shovel through now I'm out of the rut I was in. (Off to find a NEW rut. "A whole new rut..." to the tune of Aladdin.)
I think I've cleaned up the mkroot images about as much as I can at the moment, and confirmed that at least two of the targets that still fail require patching qemu instead of the kernel. Punt to next release.
Alas, musl-1.2.5 from 11 months ago is still the current release, so not much point rebuilding the toolchains (and I need to migrate off musl-cross-make anyway). Punt THAT to next release.
Sitting on my hands about more shell work before release. I've done most of "trap" support locally, need to fix the remaining $BROKEN tests' underlying issues and have SO many more tests in local "sh.txt" notes-to-self files. I can probably do command editing and history now (I long ago learned that a polished GUI makes people assume the plumbing must be all done), and triage the TODO entries in sh.c... Ahem. Release first.
In a conversation on qemu-devel I said:
There are some targets I have to poke harder, armv5l and armv4tl have QEMU="arm -M versatilepb -net nic,model=rtl8139 -net user" for some reason... Huh, apparently I've been doing that since 2007?
And digging through my blog I found the commit saying "switch to using the rtl8139 driver because PIO doesn't work on the qemu-system-arm PCI controller yet so I need something with mmio." Maybe that's fixed by now and I can go back to the default network card there?
So hw/arm/versatilepb.c says the default is smc91c111 and the kernel driver for that is CONFIG_SMC91X, but it won't enable because kconfig has (!OF [=y] || GPIOLIB [=y]) in one of its stanzas, so if you ENABLE device tree support it DISABLES the driver, unless you enable some extraneous GPIO support library I _actively_ don't want to have to care about, which seems like it's "selected" by 8 zillion things in a horrifically micromanaged staircase.
Why does this driver not have a "selects GPIOLIB" if it needs it? Why would it have a BLOCKING DEPENDENCY that prevents the driver from showing UP if something unrelated isn't selected, instead of just SELECTING IT?
This sort of cleanup is hard because I have to repeatedly prove a negative. But minimizing variables is science, and "circle the pot widdershins three times for luck" is alchemy. Accumulating endless dependencies is NOT SCIENCE.
Anyway, fixed now. I should copy the relevant text back into the original conversation...
And the TODO item that comes out of this is figuring out how to use the provided example to add -hda for or1k...
Nuts to your white mice.
The sh4eb network thing is weird, when I sntp or wget, eth0 lists a bunch of dropped packets, but loopback shows the same number of packets sent/received. I don't even know how you'd screw that up, but it smells kernel-side? I also tried qemu 9.2 and 8.0 and it behaved the same way, so probably not qemu. Which makes sense since qemu shouldn't know what the loopback interface IS, that's an abstraction within the kernel not emulated hardware.
I hate when I bisect a problem (in this case the sh4eb kernel not seeing qemu's emulated hard drive) to a merge commit. Right, two parents, first parent works, second parent panics during boot because "irq123: nobody cared". Use "git describe" on that second parent to find the last tagged commit before it, check out that tag: which works. Bisect between the tag and the parent commit... And the breakage is "sh: Convert the last use of 'optional' property in Kconfig" which seems like it's just mangling config stuff? Except diffing the .config files produced by the two commits, the change is adding CONFIG_CMDLINE_OVERWRITE=y and CONFIG_CMDLINE="console=ttySC1,115200" which doesn't SEEM like it would cause this, but... The patch itself is adding CONFIG_CMDLINE_FROM_BOOTLOADER=y to a bunch of defconfigs, which is the DUMBEST SYMBOL EVER. (The default value is NOT to listen to the bootloader, but to use a hardwired command line. That's the DEFAULT now. You have to switch on a symbol to NOT do that. Bravo.) Ok, add the symbol to my miniconfig to tell it not to be so FUCKING STUPID, and... working again. What does that do to 6.12... and that's working.
So once again just bisecting where a new config symbol needed to be flipped. Wheee. The network card's still borked though, although now it THINKS it's working, but no packets are passed and attempts to use it time out.
Fixing up the mkroot targets: microblaze's network isn't working, because the ethernet driver isn't binding. Checking current qemu's source, qemu-system-microblaze is running -M petalogix-s3adsp1800 which runs the qemu source qemu/hw/microblaze/petalogix_s3adsp1800_mmu.c which is pulling in qemu/pc-bios/petalogix-s3adsp1800.dts, which says compatible = "xlnx,xps-ethernetlite-2.00.a" which lines up with EMACLITE from mkroot.sh's microblaze section.
So it's TRYING to enable the driver, but although that string is in the miniconfig it does not wind up in the resulting root/microblaze/docs/linux-fullconfig, why is that... Grepping the 6.12 kernel source, drivers/net/ethernet/xilinx/Kconfig says config XILINX_EMACLITE depends on HAS_IOMEM, which grep does not find under arch/microblaze at all. According to "git annotate" that dependency was added in commit 46fd4471615c in April 2021 by Randy Dunlap to fix a build break, and git describe --tags on that hash says v5.12-rc7, so back up to the last release before that...
Huh. I did a git checkout v5.12 which SEEMS like a thinko (that's the release AFTER that -rc7, not the one before), but the dependency isn't there in the file and "git log" is saying the 46fd commit _isn't_ in v5.12? And doing a "git log" from the 46fd commit doesn't find the hash for the 5.12 release. I think there's some git branch shenanigans going on here, git describe is finding a misleading last common ancestor. Oh well, the point is v5.12 is a release that does NOT have the commit.
Alas, my first attempt at feeding the current miniconfig into 5.12 does not give me any serial output from qemu, and rather than debug THAT let's just build the board's defconfig... huh, arch/microblaze/configs only has one file "mmu_defconfig". Microblaze was one of the first nommu targets I was introduced to, but while this arch has a nommu build option the kernel ships no defconfig for it. How nice. Anyway, it built, and using the run-qemu.sh command line against the vmlinux... broke with a register dump. Try the other endianness? That's spinning eating cpu, hung with no output.
This smells familiar... because it is. Except 6 months ago was 2 kernel releases back tops, so I was trying to get something like 6.10 working, not 5.12. (Did this EVER work for me? The downside of a year out of control is I'm not entirely sure what my baseline was, and with switching debian versions and different qemu builds underneath mkroot it's a bit of shoveling to reestablish a baseline.) Still, let's try firing up menuconfig and switching the endianness... nope. The resulting vmlinux hangs for 8 seconds, _then_ barfs with "unaligned PC=12", whatever that means.
Right. So if 5.12 builds the driver without either the spurious dependency or the reported build break, but doesn't produce serial output so I can't TEST it, that's the classic "too many variables changed" problem of doing science outside of laboratory conditions. I have a mostly working current (6.12) build, so let's walk back from the version I can TEST through the older kernel versions to see where the miniconfig stops producing serial output. (It's not "looking for my keys under the streetlamp" if I use the streetlamp as a base camp and install a chain of mirrors to build an illuminated path out from that to where I last saw the keys. It's just tedious. Or "systematic" if you're feeling posh.)
Ok, 6.10 works, 6.5 works, 6.0 works. The 5.x series goes to 5.19 so 5.15... no output. Ok, git bisect between the v5.15 and v6.0 tags, but once again I have to reverse "good" and "bad" because the old one is broken and the new one is working, and git calls the old behavior "good" and new "bad", which was always a terrible assumption. Bisect, bisect, bisect... commit 8f0f265e6cf5 made it start working again, which replaced the "memset" implementation because gcc 10 apparently introduced a stupid "optimization" that turned providing your own memset implementation into a recursive call to itself. (Libc is not special!) That commit went in on top of 5.18-rc1, so how far back does the patch apply to unbreak earlier versions with current compilers? Hmmm, it applies to v5.12 but the result still produces no output... So bisect between THOSE, and commit f8f0d06438e5 is what made it start producing output. (Sigh: global kconfig shenanigans.) Ok, checkout v5.12 and apply BOTH those patches to it and... I get a shell prompt! Woo! But still no network interface. Fire up menuconfig and pull up the symbol help... it's because of an unmet dependency on a gratuitous CONFIG_NET_VENDOR_XILINX menu guard symbol. And NOW the ethernet interface showed up.
Hang on, was THAT the missing symbol back in 6.12? The gratuitous menu guard? Yes it was, and now the network is present there too. Kind of a long walk to get there, but hey, problem solved.
So next question, why isn't -hda working... Because the board qemu emulates has 16 megabytes of flash but no other obvious storage devices, and no probeable busses (pci, usb, etc) to dynamically attach storage to. So they didn't wire -hda up to anything because there's no obvious way to dynamically insert a hard drive (nothing for it to attach to). Sigh, I suppose it could use a network block device, but part of what I'm trying to TEST on each of these boards is block device support. I suspect I need another mkroot variable that's "how to add -hda" since qemu decided to abandon -hda as a reliable user interface concept. I can export a variable into the run-qemu (sigh at the clutter, but doable) but how to _use_ it when everything else is just "./run-qemu.sh -hda file.img"...