These demos display the full color palette the NES is capable of - 410 colors - on the screen simultaneously. This is done by changing the palette registers mid-scanline, 14 times, in conjunction with switching the color emphasis bits every several scanlines. The NES color palette is typically cited as containing 52 unique colors, but the color emphasis bits allow tinting of the video output, producing additional colors.
The CPU does not have direct access to the palette registers, which exist in the PPU address space, so the CPU must access them through writes to a pair of address and data registers. The address register is also used by the PPU during rendering, so background (and supposedly sprite) rendering must be disabled in order to write to the palette mid-scanline. Each write to the data register autoincrements the address by your choice of 1 or 32 bytes.
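To make the address/data port mechanism concrete, here is a minimal sketch of how an emulator might model it. The class and method names are my own invention, and the register numbers ($2006 for the address, $2007 for the data, bit 2 of $2000 for the increment size) are the standard NES ones rather than anything specific to these demos. It is deliberately simplified: the real PPU latches the address through an internal temporary register, and palette mirroring is not modelled here.

```python
class PpuPort:
    """Toy model of the PPU address ($2006) and data ($2007) ports (hypothetical names)."""

    def __init__(self):
        self.memory = bytearray(0x4000)  # 14-bit PPU address space, no mirroring modelled
        self.addr = 0                    # current PPU address
        self.high_byte_next = True       # $2006 takes the high byte first
        self.increment = 1               # 1 or 32, selected by bit 2 of $2000

    def write_2000(self, value):
        # Bit 2 of the control register selects the autoincrement step.
        self.increment = 32 if value & 0x04 else 1

    def write_2006(self, value):
        # Two writes build the address: high byte, then low byte.
        if self.high_byte_next:
            self.addr = ((value & 0x3F) << 8) | (self.addr & 0x00FF)
        else:
            self.addr = (self.addr & 0x3F00) | (value & 0xFF)
        self.high_byte_next = not self.high_byte_next

    def write_2007(self, value):
        # Store the byte, then post-increment the address (wrapping at 14 bits).
        self.memory[self.addr] = value & 0xFF
        self.addr = (self.addr + self.increment) & 0x3FFF


port = PpuPort()
port.write_2006(0x3F)   # high byte of $3F00
port.write_2006(0x00)   # low byte of $3F00
port.write_2007(0x21)   # write a color, address moves on to $3F01
```

The post-increment in `write_2007` is the detail that matters for the rest of the discussion: after every data write the address register already points somewhere else.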
Since rendering is disabled, you'd expect it necessary to continually change the zeroth palette register at $3F00, which controls the overall background color. If the address register were configured to increment by 1, we'd have to reload the address each time to reset it to $3F00. Adding two absolute store instructions to reset the address would take 8 clock cycles each time. The CPU is very slow compared to the video signal - an NTSC NES PPU outputs 3 pixels during each CPU cycle, so 8 CPU cycles would increase the width of each color cell in the picture above by 24 pixels. They wouldn't fit on the screen! Alternatively, the 32 byte increment mode (really intended for updating vertical stripes of background tiles) saves us this trouble, because the palette registers repeat every 32 bytes between $3F00 and $3FFF, so each increment leaves us pointing at a mirror of the $3F00 palette register ($3F20, $3F40, etc.). This works until the eighth write, after which the address wraps around to $0000, which isn't a palette register at all. At that point we'd still have to do our two writes to reset the address to $3F00, introducing a wider stripe in the middle of the screen which clearly doesn't exist in the screenshot above. Or perhaps you could do three writes (the address is latched after the first two), so that you only have to do one write to reset the address, but that's not what's going on here.
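The arithmetic behind the 32 byte increment idea can be checked with a few lines. This is only a sketch: the function names are made up for illustration, and the 5-bit mask reflects the 32-byte repeat of the palette registers described above.

```python
PIXELS_PER_CPU_CYCLE = 3  # NTSC: the PPU outputs 3 pixels per CPU cycle


def addresses_with_increment_32(start=0x3F00, writes=8):
    """Return the address used by each write, plus the address left afterwards."""
    addr = start
    used = []
    for _ in range(writes):
        used.append(addr)
        addr = (addr + 32) & 0x3FFF  # PPU addresses are 14 bits wide
    return used, addr


def palette_index(addr):
    """Which of the 32 palette entries a $3F00-$3FFF address selects."""
    return addr & 0x1F


used, after = addresses_with_increment_32()
print([hex(a) for a in used])  # $3F00, $3F20, ... $3FE0: all mirrors of entry 0
print(hex(after))              # 0x0: wrapped out of the palette area

# Cost of resetting the address with two 4-cycle absolute stores:
extra_pixels = 2 * 4 * PIXELS_PER_CPU_CYCLE
print(extra_pixels)            # 24
```

Eight writes land on mirrors of the zeroth entry, and then the address leaves the palette area, which is where the wider stripe would have to appear.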
So how does it do it? The demo relies on two odd features of the hardware. Most importantly (and perhaps by accident), when rendering is disabled and the address register points at a palette register, that color will be displayed rather than the expected background palette entry (that is, $3F00). This allows the built-in autoincrement by 1 to cycle through which palette register determines the color at the current raster position. Because this is a post-increment, the displayed color is always one entry ahead of the one we've just written, so the screen actually shows the palette entries we programmed on the previous scanline, but this isn't a problem. Using this approach, unrolling the two-instruction sequence "stx $2007; inx" allows us to change the color in 4+2=6 CPU cycles, or every 18 pixels, which corresponds to the width of each color stripe in the screenshot. Notice that the color stripes are not smooth, but rather have a rough edge. To the best of my knowledge it is not possible to synchronize precisely with the previous scanline and straighten this out, because the scanline width in pixels (including the horizontal blank) does not correspond to an integral number of CPU cycles.
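The numbers in this paragraph are easy to verify, and the display rule itself is short enough to write down. The following is a sketch under assumptions: the function names are hypothetical, and the figure of 341 pixels per NTSC scanline (including horizontal blank) is the commonly cited one, not something stated above.

```python
from fractions import Fraction

PIXELS_PER_CPU_CYCLE = 3   # NTSC
STX_ABSOLUTE_CYCLES = 4    # stx $2007
INX_CYCLES = 2             # inx
SCANLINE_PIXELS = 341      # assumed: commonly cited NTSC scanline length incl. hblank


def stripe_width_pixels():
    """Width of one color stripe produced by 'stx $2007; inx'."""
    return (STX_ABSOLUTE_CYCLES + INX_CYCLES) * PIXELS_PER_CPU_CYCLE


def cpu_cycles_per_scanline():
    """Exact CPU cycles per scanline; a fraction, so the stripes drift line to line."""
    return Fraction(SCANLINE_PIXELS, PIXELS_PER_CPU_CYCLE)


def displayed_palette_index(addr, rendering_enabled):
    """Palette entry shown as the backdrop color at the current raster position."""
    in_palette_area = (addr & 0x3F00) == 0x3F00
    if not rendering_enabled and in_palette_area:
        return addr & 0x1F   # the quirk: the addressed entry is displayed
    return 0                 # otherwise the usual entry at $3F00


print(stripe_width_pixels())       # 18
print(cpu_cycles_per_scanline())   # 341/3
```

Since 341 is not divisible by 3, each scanline starts at a slightly different phase relative to the CPU, which fits the rough stripe edges described above.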
The other peculiar point, which might occur to anyone familiar with the NES hardware, is that the machine really organizes palettes into groups of three colors (plus a background/transparency color) each: four groups for background tiles and four for sprites. Despite there being four distinct palettes which can be applied to background characters, normally the first entry of each mirrors the overall background color in the first palette, at $3F00. If that held in this case, one in every four of those color stripes would be the wrong color, which isn't occurring. So it seems there are three more palette registers ($3F04, $3F08, $3F0C), but during normal rendering they all defer to $3F00, which seems a tremendous waste. (You might be able to explain this in a way which doesn't require these three registers to exist, such as the last written color value getting latched somewhere, but that's not what the guys who've actually done the reverse engineering say.) I wonder what the rationale was for designing it that way.
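To illustrate why those three registers look wasted, here is a small sketch of the lookup used during normal background rendering, as described above. The function name and argument layout are my own; only the four background palettes are covered.

```python
def rendered_index(palette, pixel):
    """Palette entry used for a background pixel during normal rendering.

    palette: which of the four background palettes (0-3)
    pixel:   2-bit pixel value within the tile (0-3)
    """
    if pixel == 0:
        return 0x00                # entry 0 of every palette defers to $3F00
    return palette * 4 + pixel     # entries 1-3 of the selected palette


# Every entry normal background rendering can ever reach:
reachable = {rendered_index(p, c) for p in range(4) for c in range(4)}

# Entries in $3F00-$3F0F that are never reached this way:
unused = sorted(set(range(16)) - reachable)
print([hex(0x3F00 + i) for i in unused])  # ['0x3f04', '0x3f08', '0x3f0c']
```

With rendering disabled and the address register stepping through the palette, those three entries are displayed like any other, which is why none of the stripes comes out the wrong color.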
With respect to the emulator, I realized I only needed to add one or two lines of code to emulate this effect. During normal rendering, my background and sprites are rendered instantaneously at the start of the scanline. If the palette were to change in the middle of that, it would indeed require reworking the video rendering so that it could be interleaved with the CPU execution, mindful of timing (being called upon to catch up with the CPU after some number of cycles when a control register is about to be changed). My audio code already works in this way, as does the handling of the color emphasis and mono bits (which are filled into a buffer parallel to the current line's color buffer and combined by the video output filter before the next line). Then I realized I'd overlooked the obvious - you only do these mid-scanline palette tricks when rendering is disabled anyway, so I'm free to fill over the contents of the color buffer using the same catch-up mechanism. So it fell out for free.
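For illustration, the catch-up idea might look something like the sketch below. This is not the emulator's actual code: the class, its fields, and the simplification that CPU cycle 0 lines up with pixel 0 of the scanline are all assumptions made for the example.

```python
PIXELS_PER_CPU_CYCLE = 3  # NTSC


class ScanlineBuffer:
    """One scanline of palette indices, pre-rendered, then patched during catch-up."""

    def __init__(self, rendered_line):
        self.colors = list(rendered_line)  # filled in one go at the start of the line
        self.caught_up_to = 0              # first pixel not yet finalized

    def catch_up(self, cpu_cycle, rendering_enabled, addr, palette_ram):
        """Finalize pixels up to the CPU's position, before a register changes."""
        target = min(cpu_cycle * PIXELS_PER_CPU_CYCLE, len(self.colors))
        if not rendering_enabled:
            # Rendering is off: show the addressed palette entry if the address
            # is in the palette area, otherwise the entry at $3F00.
            in_palette_area = (addr & 0x3F00) == 0x3F00
            index = (addr & 0x1F) if in_palette_area else 0
            for x in range(self.caught_up_to, target):
                self.colors[x] = palette_ram[index]
        self.caught_up_to = max(self.caught_up_to, target)


palette_ram = list(range(32))            # dummy palette: entry n holds color n
line = ScanlineBuffer([99] * 256)        # 99 stands in for normally rendered pixels
line.catch_up(6, False, 0x3F01, palette_ram)    # pixels 0-17 take entry 1
line.catch_up(12, False, 0x3F02, palette_ram)   # pixels 18-35 take entry 2
```

The call would be made just before each write that changes the address register or the rendering-enable bits, passing the state as it was before the write. When rendering is enabled the pre-rendered pixels are left alone, so normal scanlines cost nothing extra.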