Emulating ARM under ARM (link) interesting blog

Exophase

Emulator Developer
Sorry, really late response! I need to check these forums more! It's always interesting to read about stuff like this.

I've done an ARM->ARM recompiler too if anyone wants to compare, although it's not especially high quality. It looks like we both took the same approach of static allocation of registers based on some statistical analysis.

Since this was for a GBA emulator I had to do cycle counting. One weird trick I did, which I'm sure is pretty bad on newer ARM CPUs (this was for ARM9 originally), was to use the most significant bit of the cycle counter to increment the PC, in order to skip a branch at the end of blocks. This way flags weren't modified in the process. The GBA emulator vBagx would go on to use this approach too.
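To make the idea concrete, here is a minimal C model of that kind of block epilogue. It is only a sketch under assumptions the post does not state: the counter is assumed to count down and go negative once the time slice is used up, ARM-mode instructions are assumed to be 4 bytes, and `next_slot`, `slot_base` and `block_cost` are hypothetical names. The real recompiler would emit ARM code for this rather than C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical block epilogue: charge the block's cycle cost, then choose the
 * continuation address from the counter's sign bit instead of compare + branch.
 * Assumption: the counter counts down and becomes negative when time is up. */
static uint32_t next_slot(int32_t *cycles, int32_t block_cost, uint32_t slot_base)
{
    assert(block_cost >= 0);
    *cycles -= block_cost;                       /* plain subtract, no flag update needed */
    uint32_t expired = (uint32_t)*cycles >> 31;  /* most significant bit: 1 once negative */
    /* +0 lands on the instruction at slot_base (e.g. a branch to the next block),
     * +4 steps over it and falls into whatever follows (e.g. the exit path). */
    return slot_base + (expired << 2);
}
```

With 10 cycles left and a block cost of 4 this returns `slot_base`; with 3 cycles left it returns `slot_base + 4`, so the instruction at `slot_base` is skipped without any comparison touching the flags.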

Another big difference is that I had to emulate memory accesses in software, which I'm sure this doesn't. Unfortunately that also involves flag wrecking. I have an ARM recompiler lying around (that doesn't target ARM.. yet) that has much more extensive liveness analysis for flags and registers. This way you can know when it's okay to wreck flags, or at least some flags.. where that makes sense on ARM. And you can grab dead registers when you need temporaries, limiting the number of registers you can't statically allocate.
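For anyone unfamiliar with flag liveness, a rough sketch of the backward pass over one block might look like this in C. The `FlagUse` struct, the flag bit values and `flag_liveness` are all made up for illustration and are not taken from the recompiler mentioned above; the same pass works for registers with a wider bitmask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { FLAG_N = 1, FLAG_Z = 2, FLAG_C = 4, FLAG_V = 8, FLAG_ALL = 15 };

/* Hypothetical per-instruction summary produced by the decoder. */
typedef struct {
    uint8_t reads;   /* flags the guest instruction consumes */
    uint8_t writes;  /* flags it completely overwrites */
} FlagUse;

/* Backward pass: live_after[i] holds the flags some later instruction may
 * still read once instruction i has finished. live_out is what is assumed
 * live at the block exit (FLAG_ALL is the conservative choice). */
static void flag_liveness(const FlagUse *insns, size_t n,
                          uint8_t live_out, uint8_t *live_after)
{
    assert(n == 0 || (insns != NULL && live_after != NULL));
    uint8_t live = live_out;
    for (size_t i = n; i-- > 0; ) {
        live_after[i] = live;
        /* A flag stops being live above an instruction that overwrites it,
         * and becomes live above an instruction that reads it. */
        live = (uint8_t)((live & ~insns[i].writes) | insns[i].reads);
    }
}
```

Flag-wrecking helper code generated for instruction `i` (a software memory access, say) is then safe whenever `live_after[i]` contains none of the flags the helper clobbers. For CMP, LDR, ADDS, BEQ the LDR gets `live_after == 0`, because ADDS overwrites every flag before anything reads them.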

But for these reasons I didn't bother trying to turn guest conditional ops into native conditional ops; like with other platforms I targeted I just compiled them into branches. But I at least tried to group them where possible.
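The grouping could be planned roughly like the C sketch below. This is a guess at the general shape rather than the actual code: `GuestOp`, `plan_guards` and the rule that a flag-setting op closes the current group are assumptions made for the example. Condition values follow the ARM encoding (EQ = 0, NE = 1, AL = 14).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { COND_EQ = 0, COND_NE = 1, COND_AL = 14 };

/* Hypothetical decoded guest instruction. */
typedef struct {
    uint8_t cond;       /* ARM condition field */
    bool    sets_flags; /* true if the op may change N, Z, C or V */
} GuestOp;

/* Marks where a guarding branch has to be emitted and returns how many are
 * needed. Consecutive ops with the same condition share one guard, as long
 * as no op in between changes the flags or has a different condition. */
static size_t plan_guards(const GuestOp *ops, size_t n, bool *guard_before)
{
    assert(n == 0 || (ops != NULL && guard_before != NULL));
    size_t guards = 0;
    bool open = false;            /* is a guarded group currently open? */
    uint8_t open_cond = COND_AL;
    for (size_t i = 0; i < n; i++) {
        bool conditional = ops[i].cond != COND_AL;
        bool joins = open && conditional && ops[i].cond == open_cond;
        guard_before[i] = conditional && !joins;
        if (guard_before[i])
            guards++;
        /* The group stays open only behind a conditional op that leaves the
         * flags alone; otherwise the next op has to re-test the condition. */
        open = conditional && !ops[i].sets_flags;
        open_cond = ops[i].cond;
    }
    return guards;
}
```

For the sequence EQ, EQ, AL, EQ (setting flags), EQ, NE this emits four guards instead of five: the first two EQ ops share one, and the flag-setting EQ op forces a fresh test for the op after it.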

As for cache clearing, which he mentioned as a caveat: what I try to do is clear the data cache where the new code was written, then clear a single icache line where the new code starts in the translation cache. It's not enough to just clear icache, because otherwise the freshly generated code might not exist in main RAM when it needs to be loaded into icache. This might not be the case if the page is configured with the inner attribute write-through instead of write-back (or it's not write-allocate and you only write the instructions without reading them), but it's something to watch out for.
 
OP
Cyberman

Moderator
I thought it would be helpful, especially with big ARM machines in the works (64-bit ARM) and things like the BeagleBoard and Pandora.

I was considering those two systems for GBA emulation. They both use an ARMv7 Cortex-A8 core with NEON extensions. They both also have a high-performance GPU (sprites are easy with EGL), and a high-power DSP which can be used for synthesis of extremely high quality sound and other things. That's how I found it, but again I got busy with other stuff. :D

Cyb - my next emulation project will be a whole lot humbler I suspect LOL
 
Stoppers

New member
Thanks for the interest!

Sorry, really late response! I need to check these forums more! It's always interesting to read about stuff like this.

I'm the author of the referenced blog.

I only discovered the stats functionality of Blogger quite recently, and it sent me here (as well as to some spammy sites). A comment or two wouldn't have gone amiss!

I've done an ARM->ARM recompiler too if anyone wants to compare, although it's not especially high quality. It looks like we both took the same approach of static allocation of registers based on some statistical analysis.

Since this was for a GBA emulator I had to do cycle counting. One weird trick I did, which I'm sure is pretty bad on newer ARM CPUs (this was for ARM9 originally), was to use the most significant bit of the cycle counter to increment the PC, in order to skip a branch at the end of blocks. This way flags weren't modified in the process. The GBA emulator vBagx would go on to use this approach too.

I don't really understand what you mean; GBA = GameBoy Advance? But why did you have to count cycles (and are they cycles in the emulated device, or the processor it's running on)? I'm avoiding privileged modes, so I don't have access to the copro registers.

Another big difference is that I had to emulate memory accesses in software, which I'm sure this doesn't. Unfortunately that also involves flag wrecking. I have an ARM recompiler lying around (that doesn't target ARM.. yet) that has much more extensive liveness analysis for flags and registers. This way you can know when it's okay to wreck flags, or at least some flags.. where that makes sense on ARM. And you can grab dead registers when you need temporaries, limiting the number of registers you can't statically allocate.

You're right, I don't have to emulate memory accesses; I just ensure that the emulator code sits somewhere that RISC OS applications never look at, and use mmap to locate things where I want them. I do have a more complete machine emulator on x86 that gets quite a long way into RISC OS (to the desktop), but it's pretty slow and there's some major bug in it.

But for these reasons I didn't bother trying to turn guest conditional ops into native conditional ops; like with other platforms I targeted I just compiled them into branches. But I at least tried to group them where possible.

As for cache clearing, which he mentioned as a caveat: what I try to do is clear the data cache where the new code was written, then clear a single icache line where the new code starts in the translation cache. It's not enough to just clear icache, because otherwise the freshly generated code might not exist in main RAM when it needs to be loaded into icache. This might not be the case if the page is configured with the inner attribute write-through instead of write-back (or it's not write-allocate and you only write the instructions without reading them), but it's something to watch out for.

Again, sticking to user mode, I just have to rely on the implementation of clear_cache to work well.
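For readers following along, the user-mode pattern is roughly the C sketch below. It assumes the clear_cache in question is the GCC/Clang `__builtin___clear_cache` builtin and a POSIX `mmap`; the helper name `publish_code` is invented here and the blog's own code may well be organised differently.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: copies freshly generated instruction words into new
 * memory and makes them safe to execute. Returns NULL on failure. */
static void *publish_code(const uint32_t *words, size_t count)
{
    size_t bytes = count * sizeof *words;
    assert(words != NULL && bytes > 0);

    /* Writable first, so the generated code can be stored. */
    void *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    memcpy(buf, words, bytes);

    /* Then executable instead of writable. */
    if (mprotect(buf, bytes, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, bytes);
        return NULL;
    }

    /* Let the compiler runtime / OS do whatever data-cache write-back and
     * instruction-cache invalidation this platform needs for the range. */
    __builtin___clear_cache((char *)buf, (char *)buf + bytes);
    return buf;
}
```

On an ARM host the returned pointer could then be cast to a function pointer and called; the sketch stops short of that so it stays architecture-neutral.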

I've made a couple new posts recently, one with bad news, and the next with better; I'm working on cache-independent speedups at the moment.

Cheers,
Simon
 
