.. index:: crest

> :ref:`What's New ?` <

.. _crest:

.. Created crest.rst: Thu 23 Mar 2023 05:40:27 AEDT
.. Full Path: /home/tp/projects/programming-languages/forth/mecrisp-stellaris/mecrisp-unofficial-doc/crest.rst
.. Author Copyright 2023 by t.j.porter \
.. Made by /home/tp/projects/scripts/makerst.sh -->/usr/local/bin/makerst
.. license: MIT, please see COPYING

crest
=====

23Mar23
-------

https://gist.githubusercontent.com/Crest/aea89f9d7a072b15eb2b6787c54dac21/raw/e2db6ecb5363fc5b7867f1ff4aa1ecac3be814c5/extra-ops.s

::

   @ Expose the extra operations ARM v6M CPU cores are capable of to Mecrisp Stellaris
   @ with the same optimisations available for other unary operations.

   @ The extra operations added by this file differ only in their opcodes which is captured in a single macro argument.
   .macro extra_op, op
   @ -----------------------------------------------------------------------------
     Wortbirne Flag_foldable_1|Flag_inline|Flag_allocator, "\op" @ ( x -- x' )
   @ -----------------------------------------------------------------------------
   1: \op tos, tos                 @ Perform the operation on the top stack element
      bx lr                        @ and return to the caller.

   Alloc_\op:
   2: mov r1, pc                   @ Save PC to r1 allowing the shared allocator to find <op> tos, tos
      b extra_op_shared_alloc      @ and tail-call the shared allocator.
   .endm

   @ -----------------------------------------------------------------------------
   @ The common allocator shared by all operators
   @ -----------------------------------------------------------------------------
   @ To save code space all operators share a common allocator, but
   @ the shared allocator still needs to know which opcode it should produce
   @ for each invocation.
   @
   @ All operators in this file including their stub allocators are identical
   @ except for their opcode from which follows that the offset between the
   @ stub allocator entry points and their opcodes is the same for all operators.
   @ All operations use the same instruction encoding format:
   @ a 10 bit opcode followed by the source and destination register.
   @ The source and destination is always register r6 (TOS).
   @
   @ Exploit these properties to recover the opcode from the allocator entry point address
   @ saved by the stub allocators into register r1 before tail-calling the shared allocator.
   .equiv PC_BIAS      , 4                     @ On ARM v6M (in Thumb mode) the PC-value read by MOV is biased by 4 bytes.
   .equiv OPCODE_DELTA , PC_BIAS + (2f - 1f)   @ Calculate the (biased) offset of <op> r6, r6 from MOV r1, pc.
   .equiv OPCODE_REGS  , (6 << 3) + (6 << 0)   @ The register source and destination suffix common to all operations.

   extra_op_shared_alloc:
     dup                           @ Save the top stack element in accordance with the calling convention.
     subs r1 , #OPCODE_DELTA       @ Calculate the address of the instruction,
     ldrh tos, [r1]                @ fetch the 16 bit instruction, and
     subs tos, #OPCODE_REGS        @ replace r6, r6 with r0, r0 as template for kernel allocator.
     ldr  r1 , =smalltworegisters+1 @ Load the address of the kernel allocator befitting these operations,
     bx   r1                       @ and tail call it to keep the call stack as shallow as possible.

   @ Define Forth words using a macro loop over the opcodes by invoking the extra_op macro for each opcode.
   .irp op, rev, rev16, revsh, sxth, sxtb, uxth, uxtb
     extra_op \op
   .endr

irc
---

::

   i've cleaned up the optimising words for the arm v6m data manipulation instructions normally unreachable from mecrisp stellaris
   thats cortex-m0 ?
   among others
   it works for all cortex m0-m7, but the m3,m4,m7 have additional instructions this code doesn't know to generate
   but it should work on a m0 core like the f051 as long as you use the register allocator e.g.
   the stm32f051-ra kernel
   JUST INCLUDE IT SOMEWHERE IN THE PORT WHERE IT DOESN'T CAUSE PROBLEMS (by exceeding the range of 16 bit branches or putting a literal pool out of range)
   the arm cores have useful data processing instructions like sign extend for dsp code or extracting 8/16 bit values from 32 bit values
   or endian conversion including packed data e.g. swap the bytes in both 16 bit halves of a 32 bit register
   lets say your external audio ADC returns two 16 bit big endian samples
   one of the added instructions could convert the endianness of both samples in a single cycle
   two more instructions would split the samples into their registers
   one to shift the upper 16 bits down into the lower 16 bits, the other UXTH to clear the upper 16 bits
   i've found some crazy uses for them already
   e.g. using endian swaps as part of fast and small address calculations to combine a boolean flag and 8bit variable into the i/o register address
   by computing the reversed endian address i can add immediates to the lower byte and swap the address
   which is faster and avoids the need to allocate temp registers to hold the shifted constants
   none of this would be a problem on arm v7m cores (m3, m4, m7) because they support shifting immediates before using them in the ALU
   such features make it fast at reasonable code size to write the straightforward code for those bigger cores
   but the smaller m0 and m0+ cores lack such quality of life features and the obvious code for them would often be (noticeably) slower and bigger
   but sometimes a clever programmer can find ways to express his intent through unusual combinations of those limited instructions
   e.g.
   : led! ( ? -- ) led pin!
   ;
   compiles to this with my optimisations
   200113AC: 2380  movs r3 #80
   200113AE: 049B  lsls r3 r3 #12
   200113B0: 1E76  subs r6 r6 #1
   200113B2: 41B6  sbcs r6 r6
   200113B4: 0FF6  lsrs r6 r6 #1F
   200113B6: 06B6  lsls r6 r6 #1A
   200113B8: 36D0  adds r6 #D0
   200113BA: BA36  rev r6 r6
   200113BC: 6173  str r3 [ r6 #14 ]
   200113BE: CF40  ldmia r7 { r6 }
   200113C0: 4770  bx lr
   and the really crazy part is that i didn't write a special purpose allocator just for pin! to accomplish this
   instead i taught mecrisp stellaris about additional cpu instructions mathias didn't include
   and wrote an inline cache for pin! which is basically a list of forth words and constants the compiler replays as fake input when the word is compiled in
   from your folk docs i know that you've used : my-word ... inline ;
   those words are really just copied instruction for instruction
   words with an inline cache instead have a number of tokens stored just after their return instruction
   instead of flushing the cached stack elements to sram the compiler just treats the call to be compiled as equivalent to those tokens
   an inline cache can do just as well as if you typed the equivalent code every time you used the function having the cache
   the important difference is that the compiler doesn't have to flush the top up to five stack elements from their allocated registers
   for small simple words flushing the stack cache to ram is often more expensive than the call you avoided with a simple inline
   but writing inline caches is totally undocumented and i had to reverse engineer how it works from the un(der)documented code
   the amazing thing about mecrisp stellaris is that this puny little compiler is able to take my inline cache and fold all the constant expressions away
   to produce optimal code for all combinations of compile time constants and variable arguments to pin!
   it really reduces the general case to the special case just using the optimiser
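
The audio ADC example from the chat can be sketched in Forth roughly as follows. This is only an illustration and assumes a kernel built with extra-ops.s, so that ``rev16`` and ``uxth`` exist as Forth words. The word name ``split-samples`` and its stack layout are made up for this example and do not come from the chat or the gist.

::

   \ Hypothetical helper, not part of extra-ops.s.
   \ Takes one 32 bit cell holding two big endian 16 bit samples
   \ and leaves the two samples as separate cells.
   : split-samples ( packed -- low high )
     rev16            \ swap the two bytes inside each 16 bit half
     dup uxth         \ low:  zero the upper 16 bits of a copy
     swap 16 rshift   \ high: shift the upper 16 bits down
   ;

Going by the REV16 semantics, ``$12345678 split-samples`` should leave ``$7856`` and ``$3412`` on the stack. For signed samples ``sxth`` from the same file is probably the better fit for the low half; that choice is left open here.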