I’ve been thinking a lot about how to make the generated code faster, smaller, and use less memory. Probably I’ve been thinking too much. A simple approach seems feasible, with the promise of significant efficiencies.
First, I assume we want to preserve the execution model, which is documented as guaranteeing only a topological-sort order. Not noted is that a Node is only evaluated if at least one of its inputs is dirty. Implied in the “semi-instantaneous” evaluation within a transaction is that things sort of happen in parallel (which is very desirable). The documentation, though, explicitly says that parallel execution is not promised: “even M1-M2-M3”. But I think we should promise some kind of time-slice-like parallelism, since that gives us a huge advantage over trying to write the code procedurally (cf. all the Scratch-style systems).
The current implementation appears to do a topological sort, then a breadth-first ordering. This gives a nice “interleaving” of all the graph’s branches, i.e., a time-slice-like parallelism.
It looks like you could do a simple translation of your current code generation into something that can run much faster and use less memory.
Your current implementation walks the sorted node list and calls evaluate() if the node is dirty. That performs many memory fetches on every pass.
I propose unrolling the loop (same sort-order) which then makes the code susceptible to optimizations, particularly cross-function-boundary optimizations. Something like this:
void loop() {
    xod::idle();
    // transaction
    if (isNodeDirty[0])
        g_wiring[0].evaluate(...);
    if (isNodeDirty[1])
        g_wiring[1].evaluate(...);
    ...
}
Now a clever compiler can lift code out of the functions, apply register reuse, elide stores/fetches, inline whole functions (e.g. “add”), precompute the target of the evaluate() call to avoid a fetch, etc.
I further propose that C++ objects be used, though it may not make a difference (I suspect it might for method calls). I would just convert your wiring and storage into a class:
class NodeBase {
public:
    // might as well have the base class; it costs nothing extra
    DirtyFlags dirty;
    Wiring wiring ....;
    virtual void evaluate(...);
};

class xod_core_add : public NodeBase {
public:
    // no "storage" needed
    int OUT;
    void evaluate() {
        emit( xxx, wiring.input_a + wiring.input_b );
    }
};

xod_core_add node_1(…);
Note that the compiler should be able to avoid the method call altogether:
// generated code
....
if (node[3].is_dirty) {  // one of the "add" nodes
    node[3].evaluate();
    ...
Should be optimized to:
...
static int _node_1_out;              // lifted instance variable; possibly register-only
static DirtyFlags _node_1_is_dirty;  // the object is completely gone
if ( _node_1_is_dirty ) {
    _node_1_out = node[0].out + node[1].out;  // inlined
    // if node[0] can be "inlined" too, that method call goes away as well
    dirty appropriate nodes....;  // partial inline
    ...
The compiler is allowed to reorder code, and may change our “perfect” interleaving. Probably it won’t change it that much. There are techniques for putting in “fences”.
Ideally, a “constant” node should optimize to:
const static int node_0_out = 5;
And, any expressions using it may optimize away if they can be calculated at compile time.
I observe that XOD-level optimizations become more obvious. E.g., the “isdirty” check can be skipped when a node has exactly one input:
a -> flip-flop -> ...
if (a.isdirty)
    a.eval();
if (flip_flop.isdirty)
    flip_flop.eval();
It is easily seen that flip_flop is dirty if and only if a is dirty, so the second check can be dropped:
if (a.isdirty) {
    a.eval();
    flip_flop.eval();  // the only time it can be dirty, and then it is dirty
}
I believe other such ideas become more apparent once unrolling is done.