crest
23Mar23
@ Expose the extra operations ARM v6M CPU cores are capable of to Mecrisp Stellaris
@ with the same optimisations available for other unary operations.
@ The extra operations added by this file differ only in their opcodes, which are captured in a single macro argument.
.macro extra_op, op
@ -----------------------------------------------------------------------------
Wortbirne Flag_foldable_1|Flag_inline|Flag_allocator, "\op" @ ( x -- x' )
@ -----------------------------------------------------------------------------
1: \op tos, tos @ Perform the operation on the top stack element
bx lr @ and return to the caller.
Alloc_\op:
2: mov r1, pc @ Save PC to r1 so the shared allocator can find <OP> tos, tos
b extra_op_shared_alloc @ and tail-call the shared allocator.
.endm
@ -----------------------------------------------------------------------------
@ The common allocator shared by all operators
@ -----------------------------------------------------------------------------
@ To save code space all operators share a common allocator, but
@ the shared allocator still needs to know which opcode it should produce
@ for each invocation.
@
@ All operators in this file, including their stub allocators, are identical
@ except for their opcode. It follows that the offset between the
@ stub allocator entry point and the opcode is the same for every operator.
@ All operations use the same instruction encoding format:
@ a 10 bit opcode followed by the source and destination registers.
@ The source and destination are always register r6 (TOS).
@
@ Exploit these properties to recover the opcode from the allocator entry point address
@ saved by the stub allocators into register r1 before tail-calling the shared allocator.
.equiv PC_BIAS , 4 @ On ARM v6M (in Thumb mode) the PC-value read by MOV is biased by 4 bytes.
.equiv OPCODE_DELTA , PC_BIAS + (2f - 1f) @ Calculate the (biased) offset of <OP> r6, r6 from MOV r1, pc.
.equiv OPCODE_REGS , (6 << 3) + (6 << 0) @ The register source and destination suffix common to all operations.
extra_op_shared_alloc:
dup @ Save the top stack element in accordance with the calling convention.
subs r1 , #OPCODE_DELTA @ Calculate the address of the instruction,
ldrh tos, [r1] @ fetch the 16 bit instruction, and
subs tos, #OPCODE_REGS @ replace <OP> r6, r6 with <OP> r0, r0 as template for kernel allocator.
ldr r1 , =smalltworegisters+1 @ Load the address of the kernel allocator befitting these operations,
bx r1 @ and tail call it to keep the call stack as shallow as possible.
@ Define Forth words using a macro loop over the opcodes by invoking the extra_op macro for each opcode.
.irp op, rev, rev16, revsh, sxth, sxtb, uxth, uxtb
extra_op \op
.endr
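The address arithmetic of the shared allocator can be checked off-target. The following is a minimal Python sketch, not part of the port: the helper names `layout` and `recover_template` and the base address are made up for illustration, and the opcode values are the ARMv6-M Thumb encodings of `<op> r0, r0`. It models what `mov r1, pc`, `subs r1, #OPCODE_DELTA`, `ldrh` and `subs tos, #OPCODE_REGS` compute for each word.

```python
# Hypothetical model of extra_op_shared_alloc; names and addresses are illustrative.
PC_BIAS = 4                          # Thumb: reading PC yields the instruction address + 4
OP_TO_MOV = 4                        # <op> tos, tos (2 bytes) + bx lr (2 bytes) precede the MOV
OPCODE_DELTA = PC_BIAS + OP_TO_MOV   # same value the .equiv computes from 2f - 1f
OPCODE_REGS = (6 << 3) + (6 << 0)    # Rm = r6, Rd = r6

# Thumb encodings of "<op> r0, r0" (the template handed to the kernel allocator).
TEMPLATES = {"rev": 0xBA00, "rev16": 0xBA40, "revsh": 0xBAC0,
             "sxth": 0xB200, "sxtb": 0xB240, "uxth": 0xB280, "uxtb": 0xB2C0}

def layout(op, base):
    """Place one word body at a made-up address; the MOV r1, pc sits at base + 4."""
    return {base: TEMPLATES[op] + OPCODE_REGS,  # <op> r6, r6
            base + 2: 0x4770}                   # bx lr

def recover_template(halfwords, mov_addr):
    """Replay the shared allocator: find <op> r6, r6 and turn it into <op> r0, r0."""
    r1 = mov_addr + PC_BIAS      # value left in r1 by: mov r1, pc
    r1 -= OPCODE_DELTA           # subs r1, #OPCODE_DELTA
    tos = halfwords[r1]          # ldrh tos, [r1]
    return tos - OPCODE_REGS     # subs tos, #OPCODE_REGS

if __name__ == "__main__":
    for op in TEMPLATES:
        code = layout(op, 0x20001000)
        print(op, hex(recover_template(code, 0x20001000 + 4)))
```

Subtracting `OPCODE_REGS` works because both register fields of the fetched instruction are known to hold 6, so the subtraction clears them without borrowing into the opcode bits.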
irc
<crest> i've cleaned up the optimising words for the arm v6m data manipulation instructions normally unreachable from mecrisp stellaris
<tp> thats cortex-m0 ?
<crest> among others
<crest> it works for all cortex m0-m7, but the m3,m4,m7 have additional instructions this code doesn't know to generate
<crest> but it should work on a m0 core like the f051 as long as you use the register allocator
<crest> e.g. the stm32f051-ra kernel
<crest> just include it somewhere in the port where it doesn't cause problems
<crest> (by exceeding the range of 16 bit branches or putting a literal pool out of range)
<crest> the arm cores have useful data processing instructions like sign extend for dsp code
<crest> or extracting 8/16 bit values from 32 bit values
<crest> or endian conversion including packed data
<crest> e.g. swap the bytes in both 16 bit halves of a 32 bit register
<crest> let's say your external audio ADC returns two 16 bit big endian samples
<crest> one of the added instructions could convert the endianness of both samples in a single cycle
<crest> two more instructions would split the samples into their own registers
<crest> one to shift the upper 16 bits down into the lower 16 bits, the other UXTH to clear upper 16 bits
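The ADC example can be traced with a small Python model of the three instructions. This is only a sketch: the function names mirror the Thumb mnemonics, the sample bytes are invented, and the samples are treated as unsigned as in the chat (signed samples would presumably use SXTH and an arithmetic shift instead).

```python
MASK32 = 0xFFFFFFFF

def rev16(x):
    """Model of Thumb REV16: swap the two bytes inside each 16 bit half."""
    return (((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)) & MASK32

def uxth(x):
    """Model of Thumb UXTH: keep the lower 16 bits, clear the upper 16."""
    return x & 0xFFFF

def split_samples(word):
    """Split one 32 bit word holding two big endian 16 bit samples."""
    swapped = rev16(word)        # fix the byte order of both samples at once
    first = uxth(swapped)        # sample that arrived first (lower half)
    second = swapped >> 16       # lsrs #16: sample that arrived second
    return first, second

if __name__ == "__main__":
    # Invented byte stream: sample 0x1234 followed by sample 0xABCD, both big endian,
    # read as one little endian 32 bit word the way an LDR would see it.
    raw = int.from_bytes(bytes([0x12, 0x34, 0xAB, 0xCD]), "little")
    first, second = split_samples(raw)
    print(hex(raw), hex(first), hex(second))
```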
<crest> i've found some crazy uses for them already
<crest> e.g. using endian swaps as part of fast and small address calculations to combine a boolean flag and 8bit variable into the i/o register address
<crest> by computing the reversed endian address i can add immediates to the lower byte and swap the address
<crest> which is faster and avoids the need to allocate temp registers to hold the shifted constants
<crest> none of this would be a problem on arm v7m cores (m3, m4, m7) because they support shifting immediates before using them in the ALU
<crest> such features make it fast at reasonable code size to write the straightforward code for those bigger cores
<crest> but the smaller m0 and m0+ cores lack such quality of life features
<crest> and the obvious code for them would often be (noticeably) slower and bigger
<crest> but sometimes a clever programmer can find ways to express his intent through unusual combinations of those limited instructions
<crest> e.g. : led! ( ? -- ) led pin! ;
<crest> compiles to this with my optimisations
<crest> 200113AC: 2380 movs r3 #80
<crest> 200113AE: 049B lsls r3 r3 #12
<crest> 200113B0: 1E76 subs r6 r6 #1
<crest> 200113B2: 41B6 sbcs r6 r6
<crest> 200113B4: 0FF6 lsrs r6 r6 #1F
<crest> 200113B6: 06B6 lsls r6 r6 #1A
<crest> 200113B8: 36D0 adds r6 #D0
<crest> 200113BA: BA36 rev r6 r6
<crest> 200113BC: 6173 str r3 [ r6 #14 ]
<crest> 200113BE: CF40 ldmia r7 { r6 }
<crest> 200113C0: 4770 bx lr
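To see what the listing above computes, here is a Python model that replays it instruction by instruction. It is a sketch under stated assumptions: `led_store` is a made-up name, the final `ldmia`/`bx lr` (stack pop and return) are left out, and the reading of the result as a set/clear register pair is an inference from the addresses, since the chat does not name the target chip. The two addresses it produces, 0xD0000014 and 0xD0000018, would match the GPIO_OUT_SET and GPIO_OUT_CLR registers of the RP2040 SIO block.

```python
MASK32 = 0xFFFFFFFF

def rev(x):
    """Model of Thumb REV: reverse the byte order of a 32 bit value."""
    return int.from_bytes(x.to_bytes(4, "little"), "big")

def led_store(flag):
    """Return (address, value) written by the str at the end of the led! listing."""
    flag &= MASK32
    r3 = (0x80 << 12) & MASK32               # movs r3, #80 ; lsls r3, r3, #12
    carry = 1 if flag >= 1 else 0            # subs r6, r6, #1 sets C when there is no borrow
    r6 = (flag - 1) & MASK32
    r6 = (r6 - r6 - (1 - carry)) & MASK32    # sbcs r6, r6: 0 if flag was set, -1 if it was 0
    r6 >>= 0x1F                              # lsrs r6, r6, #1F: reduce to 0 or 1
    r6 = (r6 << 0x1A) & MASK32               # lsls r6, r6, #1A: 0 or 4 in the top byte
    r6 = (r6 + 0xD0) & MASK32                # adds r6, #D0: base byte goes in the low byte
    r6 = rev(r6)                             # rev r6, r6: D0000000 or D0000004
    return r6 + 0x14, r3                     # str r3, [r6, #14]

if __name__ == "__main__":
    for flag in (0xFFFFFFFF, 1, 0):
        addr, value = led_store(flag)
        print(hex(flag), hex(addr), hex(value))
```

The flag selects between the two register offsets while the address still sits byte-reversed, which is why an 8 bit `adds` immediate is enough to supply the 0xD0 top byte.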
<crest> and the really crazy part is that i didn't write a special purpose allocator just for pin! to accomplish this
<crest> instead i taught mecrisp stellaris about additional cpu instructions mathias didn't include
<crest> and wrote an inline cache for pin!
<crest> which is basically a list of forth words and constants the compiler replays as fake input when the word is compiled in
<crest> from your folk docs i know that you've used : my-word ... inline ;
<crest> those words are really just copied instruction for instruction
<crest> words with an inline cache instead have a number of tokens stored just after their return instruction
<crest> instead of flushing the cached stacked elements to sram
<crest> the compiler just treats the call to be compiled as equivalent to those tokens
<crest> an inline cache can do just as well as if you typed the equivalent code every time you used the function having the cache
<crest> the important difference is that the compiler doesn't have to flush the top up to five stack elements from their allocated registers
<crest> for small simple words flushing the stack cache to ram is often more expensive than the call you avoided with a simple inline
<crest> but writing inline caches is totally undocumented and i had to reverse engineer how it works from the un(der)documented code
<crest> the amazing thing about mecrisp stellaris is that this puny little compiler is able to take my inline cache
<crest> and fold all the constant expressions away
<crest> to produce optimal code for all combinations of compile time constants and variable arguments to pin!
<crest> it really reduces the general case to the special case just using the optimiser