The Wayland Zombie Apocalypse is Near

Quite some time ago I received a report of a nasty Wayland bug: under certain circumstances a Wayland event was being delivered with an incorrect file descriptor. The reporter dug deeper and determined the root cause of this; it wasn’t good.

When a client deletes a Wayland object, there might still be protocol events coming from the compositor destined for it (as a contrived example, I delete my keyboard object because I’m done processing keys for the day, but the user is still typing…). Once the compositor receives the delete request it knows it can’t send any more events, but due to the asynchronous nature of Wayland, there could still be some unprocessed events in the buffer destined for the recently deleted object.

The Zombie Scourge

This is handled by making the deleted object into a zombie, rather, THE zombie, as there is a singleton zombie object in the client library that gets assigned to receive and eat (like a yummy bowl of brains) events destined for deleted objects. Once the compositor receives the delete request it will respond with a delete event, and at that time the client object ceases to be a zombie, and ceases to exist at all. Any number of objects may have been replaced by the zombie, a pointer in the object map just points to the zombie instead of a live object.

When an event is received, it undergoes a process called “demarshalling” where it’s decoded: a tiny message header in the buffer indicates its length, its destination object, and its “op code.” The type of the destination object and the op code are used to look up the signature for that event, which is a list of its parameters (integers, file descriptors, arrays, etc). Even though file descriptors are integer values, for the purposes of the Wayland protocol, integer is distinct from file descriptor. This is because when a Wayland event contains a file descriptor, that file descriptor is sent in a sort of out-of-band buffer along side the data stream (called an ancillary control message), instead of in the stream like an integer.

The demarshalling process consumes the main data stream and the ancillary buffer as it parses the message signature. Once a complete message is demarshalled, it is dispatched (client callbacks for that object plus op code are passed as parameters, and the client program gets to do its thing). When an event is destined for the zombie object, this demarshalling process is skipped. The length of data from the header is simply used to determine how much data to discard, and we proceed to the next event in the queue.

Here Lies the Problem

The file descriptors aren’t in the main data stream, so simply consuming that many bytes does not remove them from the ancillary buffer. The signature for the object is required to know how many file descriptors must be removed from the ancillary buffer, and the singleton zombie doesn’t (and can’t) have any specific event signatures at all.

So, if an event carrying a file descriptor is destined for a zombie object:

  • At best, the file descriptor is leaked in the client, is unclosable, and counts towards the client’s maximum number of open file descriptors forever.
  • At worst, there is a different problem; Since the file descriptors are pulled from the ancillary buffer in the order they’re inserted, if there is a following event that carries a file descriptor for a live object, it will get the file descriptor the zombie didn’t eat. The client will have no idea that this has occurred, and no idea what the file descriptor it received is actually going to provide for it. Bad things happen.

Not the Fix

We can’t change the wire protocol (to indicate the number of fds in the message header) because this would break existing software. We can’t simply keep the old object alive and mark it as dead, the object interface that contains the signatures is in client code, possibly in some sort of plug-in system in the client, and the client is allowed to dispose of all copies of it after deleting the object.

The Fix? More Zombies!

I recently sent a new edition of a patch series to fix this (and other file descriptor leaks) to the Wayland mailing list. The singleton zombie is permanently put to rest and is now replaced by an army of bespoke zombies, one for any object that can receive file descriptors in events, created at the time the object is instantiated (see, you can’t create at time of object destruction because it requires memory allocation, and that would allow deletion to fail…).

When the object is no longer viable, its zombie will live on, consuming its unhandled file descriptors until such time as the compositor informs the client it will no longer send any.

Author: Derek Foreman

Derek has a long history of writing graphics software and has spent much of his career working in open source.

11 thoughts on “The Wayland Zombie Apocalypse is Near”

  1. So, is it turning into hack monster that X11 was or is this fix temporary and we’ll see a proper fix with new major version of Wayland protocol?

    1. I’m puzzled by the implication that this fix isn’t proper – if you have issue with it, now is the time to add your voice to the review on the wayland mailing list.

      Wayland isn’t turning into a “hack monster”, and I don’t feel it’s in danger of doing so. There have been a handful of cases where things might have been solved a little differently if a problem had been fully understood sooner, but nothing that’s resulted in measurably inefficient or hard to understand code.

      If we had the ability to break wire protocol to solve this then every event and request would be made at least 1 byte larger, despite the fact that the vast majority of them don’t contain any file descriptors at all, so this approach would still be under consideration.

  2. I don’t mind breakage, better do it sooner rather than risk not being able to do it later (ie X11).

    1. X11 may be guilty of a number of sins, but a stable wire protocol isn’t one of them.

      I think we’re already past that “sooner” point, and breakage of the wire protocol has been unacceptable for a while now. The parts discussed here are very core – not just in some unstable protocol xml file, but actually in the central event/request marshalling/demarshalling code.

      Changes to that have the potential to break every Wayland client and compositor all at once. It doesn’t take too many commits like that before the community disappears and the project dies. People have to be confident that their deployed apps will continue to work (and perhaps work better) after upgrading the Wayland libraries.

  3. @Riya Sateesh

    Too late for that — Wayland is already in widespread use. Just invoking “because X11 heheh” isn’t an argument.

  4. This might not be a change to the protocol going between clients but what this anti lost fd is doing need to go into the “Wayland Protocol and Model of Operation” why its simple you have altered a behavior. There are people attempting to implement libwayland-server and libwayland-client in rust and the just fixing it in the libraries and not writing into protocol documentation is kind of a mistake.

    Ideal world is someone picks up only the protocol documents implements as per that document and their implementation is compatible. Little things like garbage collection not documented in protocol documents can lead to incompatible implementations.. So even alterations that don’t effect over the wire but effect what happens at server and client ends needs to be documented.

    Basically what is a zombie fd and when will it come zombie and who is required to dispose of zombie fd all need to be in the protocol so that those implementing the protocol stand a chance of all doing it the same way..

    1. This is actually only a change to the wayland reference implementation in C – whether “zombie objects” exist at all is really implementation defined. I don’t think it’s an accident that (Wayland Protocol and Model of Operation) doesn’t use the word “zombie” at all. They’re part of the C implementation and not part of the protocol.

      All wayland implementations should not leak file descriptors, and should not deliver file descriptors in the wrong events. How that’s accomplished is up to the implementer, but it’s certainly not a change in documented/expected behavior.

      A rust implementation won’t need any changes to continue to communicate perfectly with clients and compositors that use the C implementation, so nothing in the Wayland Protocol documentation changes (which is good, as changing that would likely indicate broken API or ABI).

      1. The issue is ch04 refers to creating objects then does not state who is responsible for cleaning them up.

        **All wayland implementations should not leak file descriptors, and should not deliver file descriptors in the wrong events.**

        Unfortunate due to the protocol documentation not stating who is responsible to clean up file descriptors or objects a valid to protocol implementation can have the fault of leaking file descriptors or objects. So in Chapter 4. Wayland Protocol and Model of Operation there need to be an entry mirroring the Creating Objects being Disposing of Objects stating who is responsible to cleaning up created objects and when. Then the fault you just fixed would have been implementation not operating to protocol if responsibilities were stated in the protocol documentation..

        Of course adding responsibilities to the protocol documentation will not require implementations that are functioning properly to-do anything. Also does not mean implementer cannot not find other ways of meeting their responsibilities stated in the protocol..

        The problem you just fixed is undocumented behavior because its not stated what party has responsibility for the objects/fd items and the fix is still undocumented behavior where you cannot point to somewhere in protocol documentation that states where clean-up/garbage collection must be performed.

        I know it hard but if it not documented implementer is free not to-do it no matter how bad the outcome is. This is how standards end up being abused/broken. Fixing the documentation means if a future implementation turns up with the issue they can be called non conforming.

      2. General object lifetime management is straying towards off-topic here, but there are plenty of people (including myself) that would be happy to discuss it on the wayland irc channel and mailing list.

        If I understand your point, you are correct that a wayland library that never frees objects could be considered spec compliant. I don’t think such an implementation would find widespread use, as many developers don’t like libraries that silently leak memory behind their backs.

        Back to the topic at hand, our use of sendmsg() to pass file descriptors between processes doesn’t really leave any options regarding the responsibility of clean-up. If the client library receives a file descriptor and is unable to deliver it to the client application for handling, then it is the only entity capable of closing that file descriptor. It is then common sense that it must do so in order to prevent a leak (which is a generic bug condition for any library in any language).

        If we overdefine the specification we risk forcing C-like paradigms on non-C developers, or binding ourselves to data structures we may wish we could replace later. If we underdefine we risk interoperability problems. It’s definitely a bit of a balancing act. I do feel we’ve got it right in this case though…

  5. For the moment, more zombie processes is ok… if we are assuming that this fix is temporary (no matter how long that is).

    I am going to have to agree with Riya Sateesh that in a prefect world, fixing this, no matter what breaks, would be a good thing to do. That said, you are right… this hits a rather sensitive part of the protocol, which I would not go making extreme changes unless you have to.

    I get the feeling that even with this ‘fix’… it will come downto that, it’s just a question of when. I would assume that even with this patch accepted, it would not close the bug?
    If that is the case, great because this is one of those bugs that you keep an eye on.

    1. Just to be clear, we’re not talking about “zombie processes” at all, we’re talking about zombie objects in the wayland object map.
      I think you understand the distinction, but I don’t want any casual readers to be confused by the typo.

Comments are closed.