Quite some time ago I received a report of a nasty Wayland bug: under certain circumstances a Wayland event was being delivered with an incorrect file descriptor. The reporter dug deeper and determined the root cause of this; it wasn’t good.
When a client deletes a Wayland object, there might still be protocol events coming from the compositor destined for it (as a contrived example, I delete my keyboard object because I’m done processing keys for the day, but the user is still typing…). Once the compositor receives the delete request it knows it can’t send any more events, but due to the asynchronous nature of Wayland, there could still be some unprocessed events in the buffer destined for the recently deleted object.
The Zombie Scourge
This is handled by making the deleted object into a zombie, or rather THE zombie, as there is a singleton zombie object in the client library that gets assigned to receive and eat (like a yummy bowl of brains) events destined for deleted objects. Once the compositor receives the delete request it will respond with a delete event, and when the client receives that event the object ceases to be a zombie, and ceases to exist at all. Any number of objects may have been replaced by the zombie; for each of them, a pointer in the object map just points to the zombie instead of a live object.
When an event is received, it undergoes a process called “demarshalling” where it’s decoded: a tiny message header in the buffer indicates its length, its destination object, and its “op code.” The type of the destination object and the op code are used to look up the signature for that event, which is a list of its parameters (integers, file descriptors, arrays, etc). Even though file descriptors are integer values, for the purposes of the Wayland protocol, integer is distinct from file descriptor. This is because when a Wayland event contains a file descriptor, that file descriptor is sent in a sort of out-of-band buffer alongside the data stream (called an ancillary control message), instead of in the stream like an integer.
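To make the header and signature lookup concrete, here is a minimal sketch. It assumes the header layout described in the Wayland wire format documentation (a 32-bit object ID, then a 32-bit word holding the message size in its upper 16 bits and the op code in its lower 16 bits, in host byte order) and the libwayland convention of marking a file descriptor argument with the letter h in a signature string. The function names are hypothetical and chosen for illustration only.

```python
import struct

def parse_header(buf, offset=0):
    """Return (object_id, size_in_bytes, opcode) for the message at offset."""
    object_id, size_and_opcode = struct.unpack_from("=II", buf, offset)
    return object_id, size_and_opcode >> 16, size_and_opcode & 0xFFFF

def count_fds(signature):
    """Count file descriptor arguments ('h') in a signature string."""
    return signature.count("h")

# A keymap-style event (uint, fd, uint) sent to object 3 with op code 0.
# The fd travels out of band, so only the two uints occupy the body:
# 8 bytes of header + 8 bytes of body = 16 bytes.
KEYMAP_SIGNATURE = "uhu"
message = struct.pack("=II", 3, (16 << 16) | 0) + struct.pack("=II", 1, 4096)

print(parse_header(message))        # (3, 16, 0)
print(count_fds(KEYMAP_SIGNATURE))  # 1
```

Note that nothing in the 8-byte header says how many file descriptors accompany the message; that information only exists in the signature.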
The demarshalling process consumes the main data stream and the ancillary buffer as it parses the message signature. Once a complete message is demarshalled, it is dispatched (the client’s callback for that object and op code is called with the decoded parameters, and the client program gets to do its thing). When an event is destined for the zombie object, this demarshalling process is skipped. The length of data from the header is simply used to determine how much data to discard, and we proceed to the next event in the queue.
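The two-source nature of demarshalling can be sketched as follows. This is an illustration only, not libwayland's code: it handles just signed integers (i), unsigned integers (u) and file descriptors (h), models the ancillary buffer as a simple queue, and uses a hypothetical demarshal function.

```python
import struct
from collections import deque

def demarshal(body, fd_queue, signature):
    """Decode one message body: 'i'/'u' come from the byte stream,
    'h' comes from the queue of out-of-band file descriptors."""
    args = []
    offset = 0
    for kind in signature:
        if kind == "h":
            # File descriptors are not in the body at all.
            args.append(fd_queue.popleft())
        elif kind in "iu":
            fmt = "=i" if kind == "i" else "=I"
            (value,) = struct.unpack_from(fmt, body, offset)
            args.append(value)
            offset += 4
        else:
            raise ValueError(f"unsupported argument type: {kind!r}")
    return args

fd_queue = deque([7])               # one fd received out of band
body = struct.pack("=II", 1, 4096)  # two uints in the main stream
args = demarshal(body, fd_queue, "uhu")

print(args)           # [1, 7, 4096]
print(len(fd_queue))  # 0
```

Skipping this step for a zombie means the body bytes can still be discarded by length, but nothing ever calls popleft on the fd queue.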
Here Lies the Problem
The file descriptors aren’t in the main data stream, so simply consuming that many bytes does not remove them from the ancillary buffer. The signature for the object is required to know how many file descriptors must be removed from the ancillary buffer, and the singleton zombie doesn’t (and can’t) have any specific event signatures at all.
So, if an event carrying a file descriptor is destined for a zombie object:
- At best, the file descriptor is leaked in the client, is unclosable, and counts towards the client’s maximum number of open file descriptors forever.
- At worst, there is a different problem: since the file descriptors are pulled from the ancillary buffer in the order they’re inserted, if there is a following event that carries a file descriptor for a live object, it will get the file descriptor the zombie didn’t eat. The client will have no idea that this has occurred, and no idea what the file descriptor it received is actually going to provide for it. Bad things happen.
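The worst case above can be reproduced in a small simulation. Everything here is a hypothetical model rather than the real client library: events are reduced to (object ID, op code) pairs, the ancillary buffer is a queue of integers, and the singleton zombie is a plain sentinel object.

```python
from collections import deque

ZOMBIE = object()  # stand-in for the singleton zombie

def dispatch_all(events, fd_queue, object_map, signatures):
    """Process events in order; return (object_id, fds) per live delivery."""
    delivered = []
    for object_id, opcode in events:
        target = object_map[object_id]
        if target is ZOMBIE:
            # Only the main-stream bytes are discarded; the zombie has no
            # signature, so fd_queue is left untouched.
            continue
        signature = signatures[(target, opcode)]
        fds = [fd_queue.popleft() for _ in range(signature.count("h"))]
        delivered.append((object_id, fds))
    return delivered

# Object 3 was deleted (now the zombie); object 5 is alive.
object_map = {3: ZOMBIE, 5: "some_interface"}
signatures = {("some_interface", 0): "h"}

# fd 10 was sent with the event for object 3, fd 11 with the one for object 5.
fd_queue = deque([10, 11])
delivered = dispatch_all([(3, 0), (5, 0)], fd_queue, object_map, signatures)

print(delivered)       # [(5, [10])]
print(list(fd_queue))  # [11]
```

Object 5 receives fd 10, which was meant for the deleted object, while fd 11 stays behind in the queue to be handed to whichever event asks for a file descriptor next.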
Not the Fix
We can’t change the wire protocol (to indicate the number of fds in the message header) because this would break existing software. We can’t simply keep the old object alive and mark it as dead either, because the object interface that contains the signatures is in client code, possibly in some sort of plug-in system in the client, and the client is allowed to dispose of all copies of it after deleting the object.
The Fix? More Zombies!
I recently sent a new edition of a patch series to fix this (and other file descriptor leaks) to the Wayland mailing list. The singleton zombie is permanently put to rest and is now replaced by an army of bespoke zombies, one for each object that can receive file descriptors in events, created at the time the object is instantiated (see, you can’t create it at the time of object destruction because that requires memory allocation, and that would allow deletion to fail…).
When the object is no longer viable, its zombie will live on, consuming its unhandled file descriptors until such time as the compositor informs the client it will no longer send any.
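The idea can be sketched with a per-object zombie that records, for each event op code, how many file descriptors that event carries. This is a simplified model under stated assumptions, not the code from the patch series: the Zombie class, its fd_counts field and the closed list are all hypothetical, and where a real client would presumably close each consumed file descriptor, the sketch just records it.

```python
from collections import deque

class Zombie:
    """Per-object zombie: remembers how many fds each event op code carries."""
    def __init__(self, event_signatures):
        self.fd_counts = [sig.count("h") for sig in event_signatures]

def dispatch_all(events, fd_queue, object_map, closed):
    """Live objects are modelled as a list of event signatures by op code."""
    delivered = []
    for object_id, opcode in events:
        target = object_map[object_id]
        if isinstance(target, Zombie):
            # Eat exactly the fds this event carried, then move on.
            for _ in range(target.fd_counts[opcode]):
                closed.append(fd_queue.popleft())
            continue
        fds = [fd_queue.popleft() for _ in range(target[opcode].count("h"))]
        delivered.append((object_id, fds))
    return delivered

# The zombie is built up front, while the signatures are still available,
# and swapped into the object map when object 3 is deleted.
keyboard_events = ["uhu"]
object_map = {3: Zombie(keyboard_events), 5: ["h"]}

fd_queue = deque([10, 11])
closed = []
delivered = dispatch_all([(3, 0), (5, 0)], fd_queue, object_map, closed)

print(delivered)       # [(5, [11])]
print(closed)          # [10]
print(list(fd_queue))  # []
```

Because the zombie only stores counts, it does not depend on the interface description that the client is allowed to throw away after deleting the object.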