Why you should design your own operating system.

(Hint: I said design...)

Regardless of who writes your operating system, a successful embedded software project must carefully design the operating system.  Not necessarily every aspect of it, but the principal aspects.  Even if you eventually end up using lots of available or off-the-shelf operating system software, your effort in design will ensure that what you finally choose meets your needs.

Why your operating system should be a polling loop.

(If that works for you...)

The simplest operating system type for your embedded design is the polling loop.  This single thread of execution can be nicely expressed as a single looping subroutine (albeit one that calls other subroutines): easy to follow, easy to debug, and (usually) no problems with race conditions.  And as a side benefit, no time wasted locking resources or switching context.  Why bother with messy interrupts?

    for (;;)
    {
        checkA();
        checkB();
        checkC();
        checkD();
    }
Congratulations, you've written an operating system.

If a polling loop won't work, then make it work.

Polling loops may not work for two principal reasons.

First: There may be single operations which cannot be easily shortened and that take longer than the shortest realtime polling requirement.  You can make a lot of ugly code by artificially trying to break the operation into little pieces, or you can make ugly code that aborts the operation after periodic checks to see if something else should be going on... this can work if things don't get too complicated, but otherwise multi-threading shows a significant advantage.  Ok, so a single polling loop doesn't always work.
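
For what it's worth, here is a hedged sketch of the slicing approach; the flash-erase scenario, the helper name, and the per-pass budget are all made up for illustration:

    /* Hypothetical sketch: erase a large flash region a little at a time,
       so that each call to checkFlashErase() stays within the polling budget.
       The names, the hardware call, and the slice size are illustrative only. */
    extern void erase_one_block(unsigned block);   /* hypothetical hardware call */

    static unsigned erase_next_block;              /* where we left off */
    static unsigned erase_blocks_remaining;        /* 0 means idle */

    void startFlashErase(unsigned first_block, unsigned block_count)
    {
        erase_next_block = first_block;
        erase_blocks_remaining = block_count;
    }

    void checkFlashErase(void)                     /* called from the polling loop */
    {
        unsigned budget = 4;                       /* blocks per loop pass */
        while (erase_blocks_remaining && budget--) {
            erase_one_block(erase_next_block);
            erase_next_block++;
            erase_blocks_remaining--;
        }
    }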

Second: the shortest realtime requirement is significantly longer than any single operation, but shorter than the sum of all operations.  Suppose A is the critical operation; then you could do:

    for (;;)
    {
        checkA();
        checkB();
        checkA();
        checkC();
        checkA();
        checkD();
    }
If it doesn't get too much more complicated than that, you may be home.  Otherwise, it is time for the next step in sophistication.  If you are lucky enough to have time-counting hardware with reasonable resolution, the following may work:

    for (;;)
    {
        time = gettime();
        checkA();
        checkB();
        if ( gettime()-time > max_loop_time ) continue;
        checkC();
        if ( gettime()-time > max_loop_time ) continue;
        checkD();
    }
The above assumes that A, B, C, and D are in decreasing order of realtime priority.

For larger systems, especially those that should be dynamically configured, it may be better to use a polling node scheme, e.g.:

    struct polling_node { struct polling_node *next; void (*check_fnc)(void); };

    for (;;)
    {
        time = gettime();
        struct polling_node *next = ...;
        while ( next )
        {
            (next->check_fnc)();
            if ( gettime()-time > max_loop_time ) break;
            next = next->next;
        }
    }

The simple scheme above (which omits how the linked list of nodes was created...) is remarkably effective, particularly for soft realtime systems. It can be made somewhat more efficient by calling the check_fnc only if a flag in the node says it needs to be called (the flag being set by whatever code generates the need for a given check, or set every time for checks that must always run).
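
A minimal sketch of that flag variant (the field names are illustrative):

    /* Sketch of the flag variant described above; field names are assumptions. */
    struct polling_node {
        struct polling_node  *next;
        void                (*check_fnc)(void);
        volatile int          needs_check;   /* set by code that creates work for this node */
        int                   always_check;  /* nonzero for checks that must run every pass */
    };

    static void run_node(struct polling_node *node)
    {
        if (node->always_check || node->needs_check) {
            node->needs_check = 0;            /* clear before the call so new work isn't lost */
            (node->check_fnc)();
        }
    }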

If you don't have a decent time source, then you may be able to fake it by having each check function return an approximation of how long it took; this can be boolean (short vs. long), in which case we should go back to the top of the list every time a check took a long time.
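
A sketch of that, assuming check_fnc is changed to return nonzero when the check ran long:

    /* Sketch: the check function returns 1 when it took a "long" time.
       This replaces the inner list walk; it is still called from the outer for (;;) loop. */
    void poll_list(struct polling_node *head)
    {
        struct polling_node *node = head;
        while (node) {
            if ((node->check_fnc)())
                node = head;      /* a slow check ran; re-check the high priority nodes first */
            else
                node = node->next;
        }
    }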

If a polling loop won't work, use two polling loops.

In the example above,  you can also reduce the load at the low priority end by having a second inner loop that processes a second node list sorted by deadline, executing nodes only when they approach their deadline.  This can work even if you have only a low resolution time source, if the low priority items have realtime requirements that are longer than the time source resolution.
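
A hedged sketch of the deadline-list idea (the node layout, time type, and slack parameter are assumptions; re-arming deadlines and keeping the list sorted are omitted):

    /* Sketch of a second node list kept sorted by deadline, earliest first. */
    struct deadline_node {
        struct deadline_node *next;
        unsigned long         deadline;    /* absolute time by which check_fnc must run */
        void                (*check_fnc)(void);
    };

    void poll_deadline_list(struct deadline_node *head, unsigned long slack)
    {
        unsigned long now = gettime();
        struct deadline_node *node = head;
        /* Run only the entries whose deadlines are within 'slack' of now;
           the list is sorted, so stop at the first one that isn't close. */
        while (node && (long)(node->deadline - now) <= (long)slack) {
            (node->check_fnc)();
            node = node->next;
        }
    }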

Or you may use polling loops in two different threads: a high priority thread with a high priority polling loop and rapid cycle time, and a low priority thread with a low priority polling loop and lengthy cycle time.  But now we're into multi-threading, and things start to get messier.

Once you have multiple threads, life is never the same.

Once you have multiple threads, almost every piece of code that uses a resource will need to be labelled (if only implicitly) to indicate which threads it can run in, or which locks must be taken per resource (if you go that route).  The loss of innocence is something a project never gets over... practice safe coding!

Essentially each thread is its own virtual cpu, with your virtual cpus sharing common memory.  You may communicate with other threads only by carefully thought out means.  Every case where two threads share common memory must be carefully evaluated for race conditions... dangerous stuff.

Don't lock resources; lock threads.

You will (I confidently predict) have many more resources than threads.  If you use a mutual exclusion lock for every resource, then you will have a lot of locks (and waste a lot of cycles locking and unlocking).  So instead, assign every resource to a thread, using as few threads as possible.  Try if at all possible to use a resource only from the thread it belongs to.  This avoids time wasted taking locks, and keeps things clean.  But if you do occasionally need to use a resource for a short period from another thread (e.g. a debugging thread), this can be facilitated by having a mutual exclusion lock per thread, and having the thread loop look like:
    for(;;)
    {
        ... obtain lock for our thread ...
        ... check for a bunch of stuff to do and do it ...
        ... release lock ...
        ... sleep ...
    }
The above code is efficient in as much as it does the bunch of stuff while locking the resources just once, instead of having lots of lock calls that waste cpu time.

Your debugging thread can of course do something like:
        ... obtain lock also used by thread owning resource ...
        ... do something fast ...
        ... release lock ...
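
Filled in with POSIX mutexes purely for illustration (your kernel's locking primitive may differ, and the names are made up), the two pieces might look like:

    /* Sketch of the one-lock-per-thread idea, using POSIX mutexes for illustration. */
    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t widget_thread_lock = PTHREAD_MUTEX_INITIALIZER;

    void *widget_thread(void *arg)            /* owns all widget-related resources */
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&widget_thread_lock);
            /* ... check for a bunch of widget work and do it ... */
            pthread_mutex_unlock(&widget_thread_lock);
            usleep(10000);                    /* sleep; placeholder for the real wait */
        }
        return 0;
    }

    void debug_peek_at_widgets(void)          /* runs in the debugging thread */
    {
        pthread_mutex_lock(&widget_thread_lock);
        /* ... do something fast with the widget thread's resources ... */
        pthread_mutex_unlock(&widget_thread_lock);
    }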

Don't use both interrupt service routines and normal task threads

Most common operating systems distinguish between normal task threads and interrupt service routines (ISRs), where ISRs are strictly higher priority and can interrupt a task thread at almost any time (with some disorganized exceptions, such as locking out all interrupts). In such software systems an interrupt service routine (ISR) is a would-be task thread that suffers from lack of privileges (can't make many kernel calls, etc.) and yet at the same time makes task threads suffer due to its inflexibly high priority. The last issue is the most distressing one... your very high priority widget processing task thread may have trouble meeting realtime deadlines because you keep getting ethernet packet interrupts which really could wait or even be lost... do you write a hack to disable the ethernet controller while you are processing widgets? What is needed is a more organized approach, but using both ISRs and normal task threads in the same system makes it hard to get organized.

There is a better way... though usually one very specific to your system (thus you need to design your own operating system... :~).

A commonly recommended fix for the above mentioned problem is to have the ISR do some minor hardware twiddling and then wake up a waiting task thread (e.g. using a kernel semaphore), where the waiting task thread is properly prioritized with the kernel.  This works well for some systems, but not for others.  For some systems, the hardware twiddling turns out to be not so minor and/or the extra context switching time is a problem.  Given sufficient hardware buffering (which is frequently the case with all but the cheapest hardware), it would be better if the hardware interrupt could as directly as possible wake up a properly prioritized task thread.  This is possible if you write your own kernel; you will still have a kernel ISR but it will directly perform a context switch as appropriate.  The trick is to properly prioritize the interrupts that are allowed to interrupt at any given time.

Let's review interrupts and interrupt priorities.  There are some systems where the hardware provides very little buffering, so exceptionally fast interrupt service (both to begin AND to finish service) is essential; I've seen a cpu where the service routine could be written in one instruction and might need to be in some cases.  I've also seen people write interrupt routines that would run for milliseconds at a time. The number of interrupt vectors that a cpu has can range from one to many, but this really doesn't matter except to provide a small amount of optimization in the search for the source of the interrupt (important for that fast extreme case but not otherwise).  The really important thing is how many levels of interrupt priority are supported, and whether the priority can be dynamically assigned to an interrupt source (and with what granularity, of course).  I have yet to see a cpu that had anywhere near as many interrupt priority levels as there were interrupt sources; a popular cpu such as the ARM7 family has only two levels, and the PowerPC has only one.  The inevitable result is that interrupt handling ends up being poorly prioritized; more pressing interrupt sources have to wait until an already running service routine for a less pressing purpose finishes.

So what I'm getting at is that at least sometimes we need a way for a thread to wait as directly as possible for an interrupt. The first trick is to disable (mask out) all interrupt sources except for those that are being waited for by threads of higher priority than the currently running one.  Most (all?) hardware provides a way of masking out individual interrupt sources.  This does require some code; you will need a minimal kernel interrupt service routine.  Suppose your cpu has only one interrupt level; the algorithm for your custom kernel interrupt routine can be something like (assuming that the task control blocks are in a linked list in order of inverse priority):

Enter kernel ISR with all interrupts disabled via e.g. cpu status register.
Save registers of current task thread.
Begin new, empty set of interrupts to enable.
For task in prioritized task linked list, loop:
If task was waiting for a now-active interrupt source, break loop.
If task is waiting for a non-active interrupt source, add that source to set.
End loop.
Now, with the task to activate selected by the loop:
Enable in hardware only those interrupt sources that are part of the newly computed set.
Restore registers of newly selected task.
Dispatch the new task thread by doing a return from interrupt to it.
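
Below is a rough C sketch of that algorithm.  The register names, the context-switch helpers, and the convention that a runnable task has an empty wait set are all assumptions for illustration; the task list is assumed to be sorted highest priority first and to end with an always-runnable idle task.

    /* Sketch of the kernel ISR algorithm above.  Everything hardware-specific
       (register names, the context save/restore helpers) is a hypothetical
       placeholder for whatever your cpu and board actually provide. */

    typedef unsigned long irq_set_t;                 /* one bit per interrupt source */

    struct tcb {
        struct tcb *next;                            /* list sorted highest priority first */
        irq_set_t   waiting_for;                     /* sources this task is blocked on; 0 = runnable */
        /* saved register context lives here in a real kernel */
    };

    extern struct tcb *task_list;                    /* head = highest priority task */
    extern volatile irq_set_t *IRQ_STATUS;           /* hypothetical: which sources are active */
    extern volatile irq_set_t *IRQ_ENABLE;           /* hypothetical: which sources may interrupt */

    extern void save_context(struct tcb *t);         /* placeholders for the assembly glue */
    extern void restore_context_and_rti(struct tcb *t);

    void kernel_isr(struct tcb *current)             /* entered with all interrupts disabled */
    {
        irq_set_t active = *IRQ_STATUS;
        irq_set_t enable = 0;                        /* new, empty set of interrupts to enable */
        struct tcb *t;

        save_context(current);

        for (t = task_list; t; t = t->next) {
            if (t->waiting_for == 0 || (t->waiting_for & active))
                break;                               /* highest priority task that can run now */
            enable |= t->waiting_for;                /* higher priority tasks may still preempt */
        }

        /* The idle task at the tail of the list guarantees t is non-NULL here. */
        *IRQ_ENABLE = enable;                        /* only sources awaited by higher priority tasks */
        restore_context_and_rti(t);                  /* dispatch via return-from-interrupt */
    }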

Incidentally, for many systems it takes only a very few more cycles to save all of the registers of a thread than it does to save just those registers required to run a traditional ISR.  And the above algorithm searches for the interrupt source in the best way... by looking for the highest priority case first.  The discussion does leave out some issues about how the threads get prioritized in the first place; note especially the common safeguard of priority inversion prevention.  You might also note that the above algorithm can be extended to handle all types of events, not just interrupts.

Of course, there are hardware designs where you might still want some very short interrupt code snippets, e.g. written in assembler and using only a few registers.  You can use higher interrupt priority levels for this, or your kernel interrupt routine could handle these special cases directly before going on to the general case of waking up threads.

Make friends with the hardware designer

If you are so lucky as to be able to design hardware and software at the same time, pay especial attention to how you design the system interrupt controller.  I'm talking about custom hardware that has interrupt status for each of the principal peripherals.  You want to design this hardware to make the software work best.  A common design is to have a status register with a bit for each peripheral, and matching mask and latch-clear registers.  To this you should add a set register, designed such that any bit written as 1 causes the corresponding interrupt to be latched.  Also make sure that there are a bunch of bits that don't correspond to any actual peripheral.  Now instead of waking up another task with a software construct such as a semaphore, you can do the job much more efficiently with a single store instruction, causing an interrupt that will (with appropriate software; see above) wake up a waiting thread.
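
As a hedged sketch (the register address and the spare bit number are made up for illustration), the single-store wakeup might look like this:

    /* Sketch of the single-store wakeup.  The register address and bit
       assignment are hypothetical; substitute your controller's layout. */
    #define INTC_SET  (*(volatile unsigned long *)0xFFFF0010)  /* hypothetical set register */
    #define IRQ_BIT_WAKE_LOGGER  (1ul << 24)   /* a spare bit with no real peripheral behind it */

    static inline void wake_logger_thread(void)
    {
        /* Latches the "interrupt" in hardware; the kernel ISR (see above) then
           dispatches whatever thread is waiting on this source. */
        INTC_SET = IRQ_BIT_WAKE_LOGGER;
    }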

It's 10 p.m. -- do you know where your code is?

There is no substitute for timing your code components, which in turn requires hardware.  See if your hardware guys can give you two timers for debugging... one timer for actual elapsed time and one for elapsed time in the current thread only.  It will take assistance from the multi-threaded kernel to swap the thread timer as part of the thread context, and to log all of the context switches.  And you will have to add code to log the start (and perhaps stop) times of every software component.  And memory to hold the log.  And organization to give labels to every software component and thread. You will need code reviews to see if some components hide code that could potentially cause the component to take much longer than normal.  You will need testing and statistics gathering to find the expected time for checks that do nothing and checks that do significant processing (and the frequency of the latter).
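
As one hedged sketch of what that instrumentation could look like (the counter function, the log layout, and the component labels are illustrative, not from any particular kernel):

    /* Sketch of per-component timing, assuming a hypothetical free-running
       hardware counter read_cycle_counter(); all names are illustrative. */
    struct time_log_entry {
        unsigned short component_id;     /* every component and thread gets a label */
        unsigned long  start_cycles;
        unsigned long  stop_cycles;
    };

    #define TIME_LOG_SIZE 1024
    static struct time_log_entry time_log[TIME_LOG_SIZE];
    static unsigned time_log_next;

    extern unsigned long read_cycle_counter(void);   /* hypothetical hardware timer */

    static unsigned component_start(unsigned short id)
    {
        unsigned slot = time_log_next;
        time_log_next = (time_log_next + 1) % TIME_LOG_SIZE;
        time_log[slot].component_id = id;
        time_log[slot].start_cycles = read_cycle_counter();
        return slot;
    }

    static void component_stop(unsigned slot)
    {
        time_log[slot].stop_cycles = read_cycle_counter();
    }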