Optimizations: Making generated code run faster

I’ve been thinking a lot about how to make the generated code faster, smaller, and use less memory. Probably I’ve been thinking too much. A simple approach seems feasible, with the promise of significant efficiencies.

First, I assume we want to preserve the execution model, which is documented as guaranteeing only a topological-sort order. Not noted is that a node is only evaluated if at least one of its inputs is dirty. Implied in the “semi-instantaneous” evaluation within a transaction is that things sort of happen in parallel (which is very desirable). The documentation, though, explicitly says that parallel execution is not promised: “even M1-M2-M3”. But I think we should promise some kind of time-slice-like parallelism, since that gives us a huge advantage over trying to write the code procedurally (cf. all the Scratch-style systems).

The current implementation appears to do a topological sort, then a breadth-first ordering. This gives a nice “interleaving” of all the graph’s branches, and time-slice-like parallelism.

It looks like you could do a simple translation of your current code generation into something that can run much faster and use less memory.

Your current implementation walks the sorted node list and calls evaluate if the node is dirty. This performs many memory fetches.

I propose unrolling the loop (same sort-order) which then makes the code susceptible to optimizations, particularly cross-function-boundary optimizations. Something like this:

void loop() {
	xod::idle();
	// transaction
	if (isNodeDirty[0])
		g_wiring[0].evaluate(...);
	if (isNodeDirty[1])
		g_wiring[1].evaluate(...);
	...
}

Now, a clever compiler can lift code out of the functions, and apply register reuse, elide store/fetches, inline whole functions (e.g. “add”), precalculate the evaluate() call avoiding a fetch, etc.

I further propose that C++ objects be used, though it may not make a difference (I suspect it might for method calls). I just convert your wiring and storage to a class:

virtual class NodeBase {
	// might as well have the base class, costs nothing extra
	DirtyFlags dirty;
	Wiring wiring ....;
	void evaluate(...);
};

class xod_core_add : public NodeBase {
	// no "storage" needed
	int OUT;
	void evaluate() {
		emit( xxx, wiring.input_a + wiring.input_b);
	}
};

xod_core_add node_1(…);

Note that the compiler should be able to avoid the method call altogether:

// generated code
....
if (node[3].is_dirty) { // one of the "add" nodes
	node[3].evaluate()
...

Should be optimized to:

...
static int _node_1_out; // lifted instance variable. possibly register only.
static DirtyFlags _node_1_is_dirty; // Object completely gone.
if ( _node_1_is_dirty ) {
	_node_1_out = node[0].out + node[1].out; // inline
	// if node[0] can be "inlined", then that method call goes away
	dirty appropriate nodes....; // partial inline
...

The compiler is allowed to reorder code, and may change our “perfect” interleaving. Probably it won’t change it that much. There are techniques for putting in “fences”.

Ideally, a “constant” node should optimize to:

const static int node_0_out = 5;

And, any expressions using it may optimize away if they can be calculated at compile time.

I observe that XOD-level optimizations become more obvious. E.g., the “isdirty” check can be skipped if a node has one input:

a -> flip-flop -> ...
if a.isdirty
	a.eval()
if flip_flop.isdirty
	flip_flop.eval()

It is easily seen that flip_flop is dirty if-and-only-if a is dirty:

if a.isdirty
	a.eval()
	flip_flop.eval() // the only time it can be dirty, and is dirty

I believe other such ideas become more apparent once unrolling is done.

Wow. I like the way you think!

Utilizing rich compiler optimization features is very much desired, especially the possible inlining. I racked my brain too, trying to find a realistic way to give all the required hints to the compiler to make it possible. But I never thought about unrolling.

At first sight it could work, although some problems would likely appear during implementation. We should try.

A few comments on the code samples:

// I suspect the virtualness is not used anyhow, i.e. no runtime polymorphism
// and vtbl overhead
virtual class NodeBase {
    // Flags are going to be fragmented in memory. So we can no longer call
    // memset once to reset all dirtiness among all nodes. Not a big deal perhaps
    DirtyFlags dirty;
    // Wiring is immutable and should not appear in RAM. On Harvard architectures
    // (e.g. AVR) I’m not sure one is allowed to mix RAM-space and Flash-space
    // data in a single class instance
    Wiring wiring ....;
    void evaluate(...);
    }

class xod_core_add : NodeBase {
    // no "storage" needed
    // And that’s cool. I see that access to the permanent state would be much
    // more natural as well
    int OUT;
    void evaluate() {
        // Do you mean that `xxx` is some kind of reference to the `OUT`?
        emit( xxx, wiring.input_a + wiring.input_b);
    }
};

I will definitely perform an experiment and try to reproduce what you suggest. If you’d like to hack on it too, I’d be glad to point to the key places in the XOD source code to do that. The root is in the xod-arduino package, and if you need any help with building the project or navigating the source code, feel free to ask.

Thanks, I was worried you already had a better way. I assume you’ve looked at all the other flow-based stuff? E.g. https://groups.google.com/forum/#!forum/flow-based-programming . I have no idea what they have figured out in terms of optimizations.

Yes, yes. Programming is much easier if you don’t write code.

Obviously we are trading some code space for some of the previous node-graph space. I suspect not much difference in total space though. And, optimizations ought to make it less.

I think I did the right thing: there is no dynamic polymorphism, and everything is obvious to static analysis. The polymorphism is statically analyzable, so everything can be precalculated into straight function calls. That should expose the methods for cross-function-boundary optimizations, inlining, common-subexpression elimination, etc.

Ah. If it works with a “struct”, it would work. Of course, does it work with a struct? Oh well, I really only used classes/objects for clarity. I bet you could make it do the right thing with allocation helpers…

I wonder if “const” would always let the compiler inline the values?

virtual class NodeBase {
    ...
    const Wiring wiring = { ... }

Maybe that’s getting too tricky and reliant on exact optimizations.

Possibly the “wiring” could be “hard coded” avoiding the instance variables. I thought something like this:

if a.dirty
    a.eval( src_a, src_b, src_c )

But I didn’t think through how that works exactly.

Though really it just makes the generated code a little clearer (and easier to think about). It could change how a .cpp “native” part is written: it’s just the body of a class (instead of a namespace).

Yes! I am lazy. “emit” also does the appropriate “make dirty” work, and so on. The goal would be to make “emit” amenable to extensive inlining/lifting by the compiler. Emit is essentially: assign and mark downstream dirty.

I wish I had time. I think XOD is very worthwhile and I want to help it develop and succeed. If I find time, I’ll try.

Not quite. I have thought about graph coloring and assigning a shared RAM storage for each color. It would save precious RAM but alone will not help with evaluate inlining. Maybe these two ideas can co-exist together.

I’ve looked at a few FBP systems like NoFlo and Node-RED. They have the luxury of much more computing resources, so they’re able to use “very high-level” things like pub-sub queues. Not realistic on low-end MCUs. What’s more interesting in the XOD context are efforts to implement LISP and Haskell (ouch, lost the link) on Arduino, and a JS interpreter (Espruino) on STM32. Only the Haskell one does the unroll, and I can’t say any system has an optimization idea that causes an “Ahaaa!”.

Yes, why not. At least I can still try to keep the two lands (static and dynamic) explicit, i.e. have two instances of objects per node.

No, unfortunately. Constness is a recommendation, not an enforcement. One can easily shoot oneself in the foot by doing *reinterpret_cast<int*>(aConstantPointer) = 42.

I can’t figure out how that is possible. We have a single evaluate for every instance of a particular native node, and different instances refer to different data. Those references have to be stored somehow. They’re immutable, so currently they are stored inside Flash-space Wiring struct instances. Otherwise, we’d have to replicate a slightly different evaluate per instance. Intuitively that would lead to big code bloat and extensive Flash usage.

Hmm… but it sounds like another unroll… I need to think about it.

Of course. Time is invaluable :slight_smile: Thank you for sharing the ideas!

I’ve got some time to experiment with evaluation unroll. Here are some intermediate thoughts.

A simple unroll on its own would not give much. That’s because nodes (hence their output values) are stored globally. Here’s a trivial example:

void setup() {
}

static bool n1;
static bool n2;

void loop() {
  bool x = digitalRead(7);
  n1 = !x;
  n2 = !n1;
  digitalWrite(13, n2);
}

Compilation gives a binary with 796 bytes of Flash and 11 bytes of RAM usage (796/11 for short). Now, let’s try:

void setup() {
}

static bool n1;
static bool n2;
static bool n3;
static bool n4;

void loop() {
  bool x = digitalRead(7);
  n1 = !x;
  n2 = !n1;
  n3 = !n2;
  n4 = !n3;
  digitalWrite(13, n4);
}

This gives 804/13. And the usage grows linearly as I add new nX’es. However, if the compiler saw room for optimization, the program size should not grow regardless of the nX count.

Now, let’s move nX’es inside the loop:

void setup() {
}

void loop() {
  bool n1;
  bool n2;
  
  bool x = digitalRead(7);
  n1 = !x;
  n2 = !n1;
  digitalWrite(13, n2);
}

774/9

And add more Not’s:

void setup() {
}

void loop() {
  bool n1;
  bool n2;
  bool n3;
  bool n4;
  
  bool x = digitalRead(7);
  n1 = !x;
  n2 = !n1;
  n3 = !n2;
  n4 = !n3;
  digitalWrite(13, n4);
}

A miracle! 774/9 again. And it stays the same regardless of the nX count. The compiler is able to optimize it. A slight rewrite toward XOD realities gives us:

namespace xod__core__not {

struct Inputs {
    bool* input_IN;
};

struct Outputs {
    bool output_OUT;
};

struct Context {
    Inputs inputs;
    Outputs* outputs;
};

void evaluate(Context ctx) {
    ctx.outputs->output_OUT = !*ctx.inputs.input_IN;
}

} // namespace

void setup() {
}

void loop() {
    bool x = digitalRead(7);

    xod__core__not::Context ctx1;
    xod__core__not::Outputs outs1;
    ctx1.inputs.input_IN = &x;
    ctx1.outputs = &outs1;
    xod__core__not::evaluate(ctx1);

    xod__core__not::Context ctx2;
    xod__core__not::Outputs outs2;
    ctx2.inputs.input_IN = &outs1.output_OUT;
    ctx2.outputs = &outs2;
    xod__core__not::evaluate(ctx2);

    xod__core__not::Context ctx3;
    xod__core__not::Outputs outs3;
    ctx3.inputs.input_IN = &outs2.output_OUT;
    ctx3.outputs = &outs3;
    xod__core__not::evaluate(ctx3);

    xod__core__not::Context ctx4;
    xod__core__not::Outputs outs4;
    ctx4.inputs.input_IN = &outs3.output_OUT;
    ctx4.outputs = &outs4;
    xod__core__not::evaluate(ctx4);

    digitalWrite(13, outs4.output_OUT);
}

The same story here: the compiler is able to optimize the code. Its size and content do not depend on the number of xod__core__not evaluations. But this is possible only if we move Context and its members inside a function.

That’s OK if a node is purely functional and its results are only used within a single transaction scope. But we can’t do the same for nodes with side effects (e.g. flip-flop), or when different branches converge.

One solution I see is keeping computations of functional and transient nodes inside a transaction function and placing the other data in the global scope. It would require a clear notion of whether a node is pure or not, but that’s technically possible. Another problem to solve is proper convergence handling.

I was more concerned with execution speed, though hoping for some memory savings

I was thinking about that too, though more in terms of where the generated code goes.

I meant hard-coding it in the calling code: it looks like you did this in your “slight rewrite toward XOD realities”. No need for a list of the wiring; just generate the code with the right “context” in the call. That saves the memory for the wiring list.

...
xod__core__not::evaluate(ctx1);
...

It would be interesting to compare your “slight rewrite toward XOD realities” style vs. the current generated code and see how code-size differs. And how speed differs. I would think that a run time speed up is enough to make this worthwhile, even if there isn’t much memory savings. If I’m lucky, I may find time to do a test of this. Then, further optimizations could be made (like a “const” node becoming a single line).

Thanks @awgrover, that’s valuable. We are definitely going to do the loop unroll someday. If you or someone else would help with preliminary research to understand exactly how the desired code should look, it would help the issue get a high priority.

Some tests of unrolling the 0.14.0 XOD generated code

See the XOD project in xod-optimization-test/. This README, the test code, and the unrolling code are at https://github.com/awgrover/XOD-Unroll-Test.

The test XOD project is a fairly simple one that just adds some numbers, does an ifelse, and a digitalWrite. It only runs once, since the tops of all the branches are constants.

Configuration

XOD 0.14.0
Arduino IDE 1.8.1
Compile for UNO

Measuring

We’ll compare the standard “generated code” (“Deploy: Show Code for Arduino”) with some “hand” attempts at unrolling. I’ll compile for the UNO as a reference, using the Arduino IDE.

I’ll make several arduino projects, see all the subdirectories in the git archive:

  • The original xod generate code
  • Unrolled

And a doctored version of each with timing code, and dirtying the graph each time (so it re-runs the graph):

void loop() {
    static int loop_count = 0;
    static unsigned long start_time = millis();

    // the original XOD body:
    xod::idle();
    xod::runTransaction();

    // doctoring: report how long 100 executions take
    memset(g_dirtyFlags, 255, sizeof(g_dirtyFlags));
    loop_count++;
    if (loop_count >= 100) {
        Serial.println( millis() - start_time );
        start_time = millis();
        loop_count = 0;
    }
}

Unrolling

I’m more concerned with execution speed than RAM usage, but expect to see some memory improvements.

I expect that unrolling, and especially inlining the “evaluate” call, should make more static analysis possible for the compiler, so code size should decrease (unrolling would increase it though!). And, inlining allows removing some stored values. Also, the unrolling and inlining removes many fetches (most of the fetches are from PROGMEM, so are particularly slow).

For the first test, unroll the loop, and inline the “evaluate” call.

The Loop

The generated code has done a topological sort of the graph, and then a breadth-first traversal to produce the node list(s). It runs the “program” like this:

for (NodeId nid = 0; nid < NODE_COUNT; ++nid) {
    if (isNodeDirty(nid)) {
        evaluateNode(nid);
...

Data structures

Basically, the data is several lists holding the dirty flags, evaluate function pointer, parent nodes, dependent nodes, etc. All the data is put in PROGMEM (except the dirty flags).

isNodeDirty(nid)

Is just

g_dirtyFlags[nid] & 0x1;

and

DirtyFlags g_dirtyFlags[NODE_COUNT] = {
    ... for each node

So, I’ll just use that code.

evaluateNode

It gets the function pointer for the node class’s “evaluate” and calls it with the node index:

EvalFuncPtr eval = getWiringValue<EvalFuncPtr>(nid, 0);
eval(nid);

and

g_wiring[NODE_COUNT] PROGMEM = {
    pointer to a Wiring struct
    ... for each node

‘Wiring’ is a structure, possibly different for each type of node:

// xod__core__add's:
struct Wiring {
    EvalFuncPtr eval;
    UpstreamPinRef input_X;
    UpstreamPinRef input_Y;
    const NodeId* output_VAL; // a list of node-indexes
};

And an “instance” of Wiring looks like:

const xod__core__add::Wiring wiring_6 PROGMEM = {
    &xod__core__add::evaluate,
    ....

Some gyrations have to be gone through to return the function pointer from PROGMEM storage.

I’ll just put the function call inline.

A program to edit the XOD code

Of course I’ll use Perl to edit the generated code and create the unroll code. The XOD code is very stereotyped, so it should be reliable to just do text processing on it.

First Unrolling Performance Test

I only unrolled the loop, inlining the “evaluation” call. I even left the unused EvalFuncPtr in the Wiring structures.

./unroll arduino_optimization_test/arduino_optimization_test.ino

Project             Flash/RAM    code-growth%   ms/100-loops   speedup%
(UNO board)         32256/2048
XOD Generated Code   2756/105
XOD w/timing         3990/294    45%            23 ms
Unroll1              3214/105    17%
Unroll1 w/timing     4464/294    12%            18 ms          22%

Adding the timing code causes a large increase in code and RAM: I think that’s all due to Serial.

As I guessed, unrolling speeds things up quite a bit (22% faster). I’m still guessing that’s because it skips a lot of fetches from PROGMEM.

Code size goes up, not surprisingly, since I unroll the loop. The percentages don’t take into account the overhead of the basic Arduino code (444/9 for an empty loop() and setup()), etc. A better percentage would take into account the XOD library code, etc.

Without the timing code, RAM usage doesn’t change. That’s not surprising, since the RAM is almost all the dirty flags and the node’s output values.

Conclusion

Significant speed improvement by unrolling, but it’s not obvious that there was much code-optimization.

Remove EvalFuncPtr Performance Test

I deleted the unused EvalFuncPtr in the Wiring structures.

code/ram code-growth%
3070/105	14%

It’s not surprising to see an improvement, but we just saved 144 bytes over 10 nodes, i.e. 14.4 bytes per node. That seems like a funny number.


Have you seen “Meta-Compilation of Language Abstractions” by Pankaj Surana? It describes a way to plug in compiler optimizations: really, it is a tree-traversal mechanism and discipline.

I’d like to see that be the mechanism for future optimizations of the generated code in XOD.

It does not seem to be accessible on the web anymore. I put it in: https://github.com/awgrover/XOD-Unroll-Test/raw/master/SuranaThesis.pdf

@awgrover, thanks for the research :heart: Using Perl to generate the code from code is smart.

The optimization proposed would be quite easy to implement. The speedup is obvious, but I worry about the growth in Flash consumption. So I’d restrain myself from implementing the unroll right away. As you said, Flash consumption increased because of the multiple explicit evaluate calls, and it should always take more space compared to the current loop. However, the compiler should be able to eliminate some evaluate calls completely if their implementations are trivial and can be inlined at compilation time. AFAIK that’s not possible as long as the data referred to inside evaluate is stored globally (or has static storage duration). Correct me if I’m wrong. So, to make the loop unroll an absolute winner, we have to find a way to move some/all transient data right into the loop body.

So:

  • Stage 1: done, thanks, Flash efficiency -14%, RAM efficiency +0%, performance +22%
  • Stage 2: yet to be done, should defeat the Flash efficiency drop

No. I just scanned the paper; it looks like great reading. Thank you for sharing. Gone to read it :wink:

I’ve read half of the paper (so far) and it blew my mind. The ideas behind it are brilliant!

Just so I don’t forget what I thought while reading the dissertation, here are the points XOD could borrow from it:

  • A node is given a few “plug-in points” triggered at various moments of its lifetime, starting from the decision of how the node looks (what inputs and outputs, with which labels) down to how it affects the C++ AST.
  • The plug-in points are implemented by library authors in JS independently of XOD core itself and get published as first-class citizens along with nodes.
  • Core nodes use the same mechanism to implement many core optimizations and features.

For example:

  • We have the continuously node. On transpilation all its instances are squashed into a single one for better performance. This is done by a hard-coded transpilation stage, although with the plug-in mechanism the continuously node could do the same by itself.
  • We have a special meaning for not-implemented-in-xod node. It barks if a C++ implementation for a given node is not found. With plugins it could do the same for itself.
  • One could make a pure node that acts as a marker for a patch and affects the generated C++ AST so that it declares node’s inputs as transient.
  • One could make a math-expression node that accepts an Excel-alike formula and generates a node with the corresponding inputs and outputs which depend on the expression entered. That could be possible without XOD core code intervention.
  • One could implement enumeration type in XOD without touching the core code.
  • One could create a meta-node that takes a URL to a machine-readable API definition and generates itself as a consumer of that API.

…ooh. Such mechanism could bring many great and unforeseeable contributed extensions if done properly. Need to think about it in more detail. :brain:

Thanks for the PDF


Yes, I agree. I was hoping that the optimizations might balance it out. I haven’t looked at the asm, but I assume there were some optimizations.

I was playing around, and it looks like many functions are inlined at this point. So I think most address+offset calculations are done/doable at compile time. Evaluate()’s do NOT seem to be inlined (add, for instance): I put __attribute__((always_inline)) on add’s evaluate just to try, and saw a code size increase (which is weird since I only have one ‘add’).

The evaluate()’s fetch from structures and store to structures. If getValue() returned a reference, that might help. Or use your ideas about moving things into registers. I tried some of this below.

The unroll makes some optimizations possible: drop is_dirty/eval of all constant nodes, for example. That’s worth about 50 bytes of code each (and speed-up of course).

If you know that a node always dirties the output (e.g. add), then you can fold the is_dirty() test:

if (isNodeDirty(6)) {
    xod__core__add::evaluate(6); // node 6's output feeds node 7
    xod__core__greater::evaluate(7);
    xod__core__if_else::evaluate(8);
}

That saves about 40 more code bytes each (and is faster again).

When I did the constant-nodes removal, and the above folding, I actually got smaller code than the “rolled” version. Of course, it all depends on how many constants, and how much folding you can do.

I want to believe that the compiler could do the folding for us: if evaluate() could be inlined, its dirty-output(x) could be seen, and the if(is_dirty) could be folded in.

So, I tried this:

// xod__core__add::
__attribute__((always_inline)) void evaluate(Context ctx, Number input_X, Number input_Y, NodeId output_nid, Number &output) {
    // auto x = getValue<input_X>(ctx);
    // auto y = getValue<input_Y>(ctx);
    // emitValue<output_SUM>(ctx, x + y);
    output = input_X + input_Y;
    markNodeDirty(output_nid);
    markPinDirty(ctx, 1);
}

// in runTransaction(), unrolled stuff:
if (isNodeDirty(6)) {
    // xod__core__add::evaluate(6); // was
    // I looked up the wiring, and put the values in, nb, output will be by ref
    xod__core__add::evaluate(6, storage_0.output_VAL, storage_1.output_VAL, 7, storage_6.output_SUM);
    if (isTimedOut(6))
        clearTimeout(6);
}
if (isNodeDirty(7)) {
    ...

Sadly, the compiler can’t know that the only place we mark-dirty is in add’s evaluate, so it can’t fold this:

markNodeDirty(7);
if (isNodeDirty(7)) {

But, it does seem to fold if it can trace the dirty_flag (40 bytes less of code):

g_dirtyFlags[7] = 0; // clear the flag just before the upstream
if (isNodeDirty(6)) {
    // xod__core__add::evaluate(6);
    xod__core__add::evaluate(6, storage_0.output_VAL, storage_1.output_VAL, 7, storage_6.output_SUM);
}
if (isNodeDirty(7)) {
    xod__core__greater::evaluate(7);
    if (isTimedOut(7))
        clearTimeout(7);
}

Including removal of the EvalFuncPtr storage, and the change to add’s eval (and explicitly clearing dirty), gives me:

XOD Generated Code  2756
simple unroll       3214    +20%
use & in add::eval  2982    +10%

That’s promising.

That seems to suggest that the dirty_flags should be cleared one at a time, rather than a bulk memset.

The dirty flags should be able to be moved into registers/local:

  • Scheduled nodes can set dirty at the beginning of each transaction.
  • Emit can set dirty during a transaction.
  • Dirty is always cleared at the end of a transaction, so the flags don’t need to be global/static.

In a perfect world, that would require only 1 or 2 dirty flags instead of 1 per node. And they could even live in registers.

You’d have to move idle() into runTransaction() and unroll it! I haven’t tried that. Hmm, move the idle() functionality into the runTransaction loop:

for (NodeId nid = 0; nid < NODE_COUNT; ++nid) {
    TimeMs t = g_schedule[nid];
    if (t && t < now)
        markNodeDirty(nid);
    if (isNodeDirty(nid)) {
        ...

You can then drop the clearing of dirty flags.

You could do this now, I think; it doesn’t require unrolling runTransaction().

yes, yes! I’m a fan of Surana’s idea, especially the user-library-added optimizations. And, many (all?) of the optimizations I’ve been testing could be done.

I just found Surana’s code from the paper: https://github.com/suranap/sausage. It’s written in Scheme.

Surana seems to currently be at Symbiont.io. He has a Twitter and a blog:

https://mobile.twitter.com/pinkusurana
https://surana.wordpress.com/

I’m not sure what I want to ask him; he said everything in the paper. But anyway, thank you, @awgrover

Well, first things first. I’ve created an issue on GitHub to do the basic loop unroll. I will try to get to it in the foreseeable future. Hope it will succeed.