> What’s New ? <

crest

23Mar23

https://gist.githubusercontent.com/Crest/aea89f9d7a072b15eb2b6787c54dac21/raw/e2db6ecb5363fc5b7867f1ff4aa1ecac3be814c5/extra-ops.s

@ Expose the extra operations ARM v6M CPU cores are capable of to Mecrisp Stellaris
@ with the same optimisations available for other unary operations.

@ The extra operations added by this file differ only in their opcodes which is captured in a single macro argument.
.macro extra_op, op
@ -----------------------------------------------------------------------------
  Wortbirne Flag_foldable_1|Flag_inline|Flag_allocator, "\op" @ ( x -- x' )
@ -----------------------------------------------------------------------------

1:  \op tos, tos                  @ Perform the operation on the top stack element
    bx  lr                        @ and return to the caller.

Alloc_\op:
2:      mov r1, pc                @ Save PC to r1 so the shared allocator can find <OP> tos, tos
        b   extra_op_shared_alloc @ and tail-call the shared allocator.
.endm

@ -----------------------------------------------------------------------------
@ The common allocator shared by all operators
@ -----------------------------------------------------------------------------

@ To save code space all operators share a common allocator, but
@ the shared allocator still needs to know which opcode it should produce
@ for each invocation.
@
@ All operators in this file, including their stub allocators, are identical
@ except for their opcode, so the offset between the stub allocator entry
@ points and their opcodes is the same for all operators.
@ All operations use the same instruction encoding format:
@ a 10 bit opcode followed by the source and destination register.
@ The source and destination is always register r6 (TOS).
@
@ Exploit these properties to recover the opcode from the allocator entry point address
@ saved by the stub allocators into register r1 before tail-calling the shared allocator.

.equiv PC_BIAS        , 4                   @ On ARM v6M (in Thumb mode) the PC-value read by MOV is biased by 4 bytes.
.equiv OPCODE_DELTA   , PC_BIAS + (2f - 1f) @ Calculate the (biased) offset of <OP> r6, r6 from MOV r1, pc.
.equiv OPCODE_REGS    , (6 << 3) + (6 << 0) @ The register source and destination suffix common to all operations.

extra_op_shared_alloc:
        dup                                 @ Save the top stack element in accordance with the calling convention.

        subs r1 , #OPCODE_DELTA             @ Calculate the address of the instruction,
        ldrh tos, [r1]                      @ fetch the 16 bit instruction, and
        subs tos, #OPCODE_REGS              @ replace <OP> r6, r6 with <OP> r0, r0 as the template for the kernel allocator.

        ldr  r1 , =smalltworegisters+1      @ Load the address of the kernel allocator befitting these operations,
        bx   r1                             @ and tail call it to keep the call stack as shallow as possible.

@ Define Forth words using a macro loop over the opcodes by invoking the extra_op macro for each opcode.
.irp op, rev, rev16, revsh, sxth, sxtb, uxth, uxtb
extra_op \op
.endr
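
The opcode recovery performed by extra_op_shared_alloc can be checked on a host with a small model. This is only a sketch: the base encodings in the table are my reading of the ARM v6M Thumb encodings (only `rev r6 r6` = BA36 and `bx lr` = 4770 appear in the log below), the word address 0x1000 is an arbitrary assumption, and the name shared_alloc_template is hypothetical.

```python
PC_BIAS = 4                               # PC read by MOV is 4 bytes ahead in Thumb mode
OP_TO_STUB_BYTES = 4                      # 2f - 1f: one 16 bit <OP> plus one 16 bit BX LR
OPCODE_DELTA = PC_BIAS + OP_TO_STUB_BYTES
OPCODE_REGS = (6 << 3) + (6 << 0)         # source r6 and destination r6

# Assumed encodings of "<OP> r0, r0" (10 bit opcode, then Rm and Rd fields).
BASE_ENCODING = {
    "sxth": 0xB200, "sxtb": 0xB240, "uxth": 0xB280, "uxtb": 0xB2C0,
    "rev": 0xBA00, "rev16": 0xBA40, "revsh": 0xBAC0,
}

def encode(op, rm, rd):
    """Encode <OP> rd, rm as a 16 bit Thumb instruction."""
    return BASE_ENCODING[op] | (rm << 3) | rd

def shared_alloc_template(op, word_address=0x1000):
    """Model the stub allocator plus the shared allocator for one operator."""
    memory = {word_address: encode(op, 6, 6),   # 1: <OP> tos, tos
              word_address + 2: 0x4770}         #    bx lr
    stub_address = word_address + OP_TO_STUB_BYTES
    r1 = stub_address + PC_BIAS                 # 2: mov r1, pc
    r1 -= OPCODE_DELTA                          # subs r1, #OPCODE_DELTA
    tos = memory[r1]                            # ldrh tos, [r1]
    return tos - OPCODE_REGS                    # subs tos, #OPCODE_REGS

for op in BASE_ENCODING:
    print(f"{op:6s} {encode(op, 6, 6):04X} -> {shared_alloc_template(op):04X}")
```

The subtraction works because both register fields are known to hold 6, so removing OPCODE_REGS leaves the bare 10 bit opcode with r0 in both fields.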

irc

<crest> i've cleaned up the optimising words for the arm v6m data manipulation instructions normally unreachable from mecrisp stellaris
<tp> thats cortex-m0 ?
<crest> among others
<crest> it works for all cortex m0-m7, but the m3,m4,m7 have additional instructions this code doesn't know to generate
<crest> but it should work on a m0 core like the f051 as long as you use the register allocator
<crest> e.g. the stm32f051-ra kernel
<crest> just include it somewhere in the port where it doesn't cause problems
<crest> (by exceeding the range of 16 bit branches or putting a literal pool out of range)
<crest> the arm cores have useful data processing instructions like sign extend for dsp code
<crest> or extracting 8/16 bit values from 32 bit values
<crest> or endian conversion including packed data
<crest> e.g. swap the bytes in both 16 bit halves of a 32 bit register
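
For reference, the seven operations can be modelled on 32 bit values in a few lines of Python. This is a sketch of the instruction semantics as I understand them, not code from the kernel, and the function names simply mirror the mnemonics.

```python
MASK32 = 0xFFFFFFFF

def uxtb(x):
    """Zero extend the low 8 bits."""
    return x & 0xFF

def uxth(x):
    """Zero extend the low 16 bits."""
    return x & 0xFFFF

def sxtb(x):
    """Sign extend the low 8 bits to 32 bits."""
    x &= 0xFF
    return (x - 0x100 if x & 0x80 else x) & MASK32

def sxth(x):
    """Sign extend the low 16 bits to 32 bits."""
    x &= 0xFFFF
    return (x - 0x10000 if x & 0x8000 else x) & MASK32

def rev(x):
    """Reverse the byte order of the whole 32 bit word."""
    return int.from_bytes((x & MASK32).to_bytes(4, "little"), "big")

def rev16(x):
    """Swap the bytes inside each 16 bit half independently."""
    return ((x & 0x00FF00FF) << 8) | ((x >> 8) & 0x00FF00FF)

def revsh(x):
    """Swap the bytes of the low 16 bits, then sign extend the result."""
    return sxth(((x & 0xFF) << 8) | ((x >> 8) & 0xFF))
```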
<crest> let's say your external audio ADC returns two 16 bit big endian samples
<crest> one of the added instructions could convert the endianness of both samples in a single cycle
<crest> two more instructions would split the samples into their own registers
<crest> one to shift the upper 16 bits down into the lower 16 bits, the other UXTH to clear the upper 16 bits
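
The three step ADC example can be sketched as a host-side model. It assumes the two big endian samples arrive packed in one 32 bit word read by a little endian load; the function name split_samples and the sample values are hypothetical.

```python
def rev16(x):
    """Model of REV16: swap the bytes inside each 16 bit half."""
    return ((x & 0x00FF00FF) << 8) | ((x >> 8) & 0x00FF00FF)

def split_samples(raw):
    """Fix the endianness of both samples, then separate them."""
    swapped = rev16(raw)        # one instruction converts both samples
    upper = swapped >> 16       # LSRS by 16 moves the upper sample down
    lower = swapped & 0xFFFF    # UXTH clears the upper 16 bits
    return upper, lower

# Bytes 12 34 AB CD in memory, loaded as a little endian word.
raw_word = 0xCDAB3412
print([hex(s) for s in split_samples(raw_word)])
```

If the samples are signed, SXTH and an arithmetic shift would take the place of UXTH and the logical shift.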
<crest> i've found some crazy uses for them already
<crest> e.g. using endian swaps as part of fast and small address calculations to combine a boolean flag and an 8 bit variable into the i/o register address
<crest> by computing the reversed endian address i can add immediates to the lower byte and swap the address
<crest> which is faster and avoids the need to allocate temp registers to hold the shifted constants
<crest> none of this would be a problem on arm v7m cores (m3, m4, m7) because they support shifting immediates before using them in the ALU
<crest> such features make it fast at reasonable code size to write the straightforward code for those bigger cores
<crest> but the smaller m0 and m0+ cores lack such quality of life features
<crest> and the obvious code for them would often be (noticeably) slower and bigger
<crest> but sometimes a clever programmer can find ways to express his intent through unusual combinations of those limited instructions
<crest> e.g. : led! ( ? -- ) led pin! ;
<crest> compiles to this with my optimisations
<crest> 200113AC: 2380  movs r3 #80
<crest> 200113AE: 049B  lsls r3 r3 #12
<crest> 200113B0: 1E76  subs r6 r6 #1
<crest> 200113B2: 41B6  sbcs r6 r6
<crest> 200113B4: 0FF6  lsrs r6 r6 #1F
<crest> 200113B6: 06B6  lsls r6 r6 #1A
<crest> 200113B8: 36D0  adds r6 #D0
<crest> 200113BA: BA36  rev r6 r6
<crest> 200113BC: 6173  str r3 [ r6 #14 ]
<crest> 200113BE: CF40  ldmia r7 { r6 }
<crest> 200113C0: 4770  bx lr
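
The listing can be traced by hand with a small model. The sketch below assumes the disassembler prints operands in hex (so #12 is a shift by 18) and that the flag arrives in r6; the function name led_store is hypothetical. What the two resulting addresses mean is not stated in the log; they look like a set/clear register pair, which is my assumption.

```python
MASK32 = 0xFFFFFFFF

def rev(x):
    """Model of REV: reverse the byte order of a 32 bit word."""
    return int.from_bytes((x & MASK32).to_bytes(4, "little"), "big")

def led_store(flag):
    """Return (address, value) of the final STR for a given flag in r6."""
    r3 = (0x80 << 0x12) & MASK32           # movs r3 #80 ; lsls r3 r3 #12
    r6 = flag & MASK32
    carry = 1 if r6 >= 1 else 0            # subs r6 r6 #1: C set when no borrow
    r6 = (r6 - 1) & MASK32
    r6 = (r6 - r6 - (1 - carry)) & MASK32  # sbcs r6 r6: 0 if flag was nonzero, else -1
    r6 >>= 0x1F                            # lsrs r6 r6 #1F: 1 if flag was zero
    r6 = (r6 << 0x1A) & MASK32             # lsls r6 r6 #1A: move into the top byte
    r6 = (r6 + 0xD0) & MASK32              # adds r6 #D0: immediate lands in the low byte
    r6 = rev(r6)                           # rev r6 r6: D0 becomes the top byte
    return r6 + 0x14, r3                   # str r3 [ r6 #14 ]

for flag in (0, 0xFFFFFFFF):
    address, value = led_store(flag)
    print(f"flag={flag:08X} -> store {value:08X} to {address:08X}")
```

The byte-reversed address is what lets an 8 bit ADDS immediate supply the most significant byte, which is the trick described above.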
<crest> and the really crazy part is that i didn't write a special purpose allocator just for pin! to accomplish this
<crest> instead i taught mecrisp stellaris about additional cpu instructions mathias didn't include
<crest> and wrote an inline cache for pin!
<crest> which is basically a list of forth words and constants the compiler replays as fake input when the word is compiled in
<crest> from your folk docs i know that you've used : my-word ... inline ;
<crest> those words are really just copied instruction for instruction
<crest> words with an inline cache instead have a number of tokens stored just after their return instruction
<crest> instead of flushing the cached stack elements to sram
<crest> the compiler just treats the call to be compiled as equivalent to those tokens
<crest> an inline cache can do just as well as if you typed the equivalent code every time you used the function having the cache
<crest> the important difference is that the compiler doesn't have to flush the top up to five stack elements from their allocated registers
<crest> for small simple words flushing the stack cache to ram is often more expensive than the call you avoided with a simple inline
<crest> but writing inline caches is totally undocumented and i had to reverse engineer how it works from the un(der)documented code
<crest> the amazing thing about mecrisp stellaris is that this puny little compiler is able to take my inline cache
<crest> and fold all the constant expressions away
<crest> to produce optimal code for all combinations of compile time constants and variable arguments to pin!
<crest> it really reduces the general case to the special case just using the optimiser
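
The idea of replaying an inline cache as fake input and letting constant folding do the rest can be illustrated with a toy compiler. This is not how Mecrisp Stellaris is implemented: the word double-plus-one, the two tables, and the ("literal", n) / ("call", name) output format are all hypothetical, and the real compiler emits machine instructions from register-allocated values rather than tuples.

```python
from collections import deque

# Hypothetical foldable words: name -> (arity, compile-time implementation).
FOLDABLE = {
    "+": (2, lambda a, b: (a + b) & 0xFFFFFFFF),
    "*": (2, lambda a, b: (a * b) & 0xFFFFFFFF),
}

# Hypothetical inline caches: name -> tokens replayed instead of a call.
INLINE_CACHE = {
    "double-plus-one": [2, "*", 1, "+"],
}

def compile_tokens(tokens):
    """Compile a token list, folding constants known at compile time."""
    pending = deque(tokens)
    consts = []   # literals seen but not yet emitted
    out = []      # emitted pseudo instructions
    while pending:
        tok = pending.popleft()
        if isinstance(tok, int):
            consts.append(tok)
        elif tok in INLINE_CACHE:
            # Replay the cached tokens as if they had been typed here.
            pending.extendleft(reversed(INLINE_CACHE[tok]))
        elif tok in FOLDABLE and len(consts) >= FOLDABLE[tok][0]:
            arity, fn = FOLDABLE[tok]
            args = consts[-arity:]
            del consts[-arity:]
            consts.append(fn(*args))          # fold at compile time
        else:
            out.extend(("literal", c) for c in consts)
            consts.clear()
            out.append(("call", tok))
    out.extend(("literal", c) for c in consts)
    return out

print(compile_tokens([20, "double-plus-one"]))   # all inputs constant
print(compile_tokens(["x", "double-plus-one"]))  # runtime argument
```

With a constant argument the whole word collapses to one literal; with a runtime argument only the general code remains, which mirrors the "general case reduced to the special case" behaviour described above.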