Oct 30 2009

More About Endocrys

I previously mentioned that I’ve re-acquired the rights to Endocrys, and that I was excited about it. My copious free time has been spent, of late, ripping it apart, making it cleaner, and applying the lessons learned over 7 years of maintaining a sizable Endocrys network (458 systems at its peak).

Endocrys has two primary modular components: Autocrys and Paracrys.

Autocrys is an extensible communication protocol atop XMPP. It governs the syntax of commands or queries sent to systems or groups, the responses of systems to those queries, how to manage their presence, and how to react to presence changes in others.

Paracrys is a database-driven deployment and configuration system. Paracrys allows module code and configuration data to be stored centrally and deployed to Endocrys nodes on demand. Paracrys fully supports versioning, thus allowing changes to be rolled back in the case of a major oopsie. How small can a Paracrys module be? Here’s an example that implements a command called ‘shell’ that allows you to do, essentially, whatever you want on an Endocrys client:

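# Register the SHELL module and its command handler when this code is compiled.
BEGIN { $Endo::MODS{SHELL}++; $Endo::CMDS{SHELL} = \&shell; }
# Remove both registrations again when the END block runs.
END { delete $Endo::MODS{SHELL}; delete $Endo::CMDS{SHELL}; }

# Join the arguments with spaces, run them as a shell command line, and
# return whatever the command printed. No access control, no sanitizing.
sub shell {
    return `@_`;
}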

Drop that puppy into the Paracrys MODULES table with some other data, issue a mass “fetch module SHELL; refresh;” command, and bingo, all of your systems now let you do very bad things. It’s that easy to create a command to do something… Hopefully something useful.
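
To make the registration above a bit more concrete, here is a minimal sketch of how a client might look up and run an incoming command through %Endo::CMDS. The dispatch function, the harmless ECHO command, and the "command arg arg" line format are assumptions made up for illustration; they are not the actual Endocrys code.

```perl
use strict;
use warnings;

package Endo;
our (%MODS, %CMDS);    # the tables the SHELL example registers itself in

package main;

# Hypothetical harmless module: registers an ECHO command the same way
# the SHELL example registers itself.
$Endo::MODS{ECHO}++;
$Endo::CMDS{ECHO} = sub { return join ' ', @_ };

# Hypothetical dispatcher: split "command arg arg ..." on whitespace and
# look the command up in the table. Unknown commands get an error string.
sub dispatch {
    my ($line) = @_;
    my ($cmd, @args) = split ' ', $line;
    return "ERROR: empty command" unless defined $cmd;
    my $handler = $Endo::CMDS{ uc $cmd };
    return "ERROR: unknown command '$cmd'" unless $handler;
    return $handler->(@args);
}

print dispatch('echo hello world'), "\n";    # prints "hello world"
print dispatch('bogus'), "\n";               # prints "ERROR: unknown command 'bogus'"
```

The point of the table is that adding a command never touches the dispatcher; a module only has to add itself to the hash.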

Of course you should note that there is no access control in the above code… How do we prevent Bad People from using our horrendously very bad shell command? That used to be managed by the Communication Masters using another database called EndoACL, but has been folded into Paracrys’ duties and drastically simplified. Each Endocrys client, when receiving the shell command, will now ask Paracrys if the user who sent it is authorized to issue that command. Previously, the clients never even received commands from users not authorized to send them, at great expense.
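
The per-command check might look roughly like the sketch below. Everything named here is hypothetical: is_authorized, handle_command, the user addresses, the UPTIME command, and above all the in-memory %acl hash, which merely stands in for asking Paracrys. The real lookup goes to the Paracrys database.

```perl
use strict;
use warnings;

# Stand-in for Paracrys (assumption): command name => set of allowed users.
my %acl = (
    SHELL  => { 'admin@example.com' => 1 },
    UPTIME => { 'admin@example.com' => 1, 'ops@example.com' => 1 },
);

# Stand-in command table with one harmless command.
my %cmds = ( UPTIME => sub { return "up since boot" } );

# Hypothetical check: deny by default, allow only listed users.
sub is_authorized {
    my ($user, $cmd) = @_;
    my $allowed = $acl{ uc $cmd } or return 0;
    return $allowed->{$user} ? 1 : 0;
}

# The client checks authorization first, then runs the handler.
sub handle_command {
    my ($user, $cmd, @args) = @_;
    return "DENIED" unless is_authorized($user, $cmd);
    my $handler = $cmds{ uc $cmd } or return "ERROR: unknown command";
    return $handler->(@args);
}

print handle_command('ops@example.com', 'uptime'), "\n";         # prints "up since boot"
print handle_command('ops@example.com', 'shell', 'id'), "\n";    # prints "DENIED"
```

Checking before the handler lookup means an unauthorized user learns nothing about which commands exist on the client.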

One of the major goals of the project originally was to have absolutely minimal dependencies on third-party code, so I reinvented the wheel in numerous places. Now that it’s mine again, those requirements are vapor and I’m ripping out large swaths of my code, and exchanging it for API calls into other code that is the de facto standard to do whatever. For example, I wrote a function that copies a file from one location to another. Ew. The File::Copy module is the Perl Way to do that, so that’s how we do it now. Less code I have to maintain, and less code you have to read to understand Endocrys.
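
For the curious, the File::Copy version is about as short as it gets. This is a small self-contained sketch, not a piece of Endocrys: the file names and contents are made up, and it works in a temporary directory so it can be run anywhere.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Spec;
use File::Temp qw(tempdir);

# Work in a throwaway directory that is removed when the program exits.
my $dir = tempdir(CLEANUP => 1);
my $src = File::Spec->catfile($dir, 'source.txt');
my $dst = File::Spec->catfile($dir, 'copy.txt');

# Create a small source file to copy.
open my $fh, '>', $src or die "Cannot write $src: $!";
print {$fh} "hello from Endocrys\n";
close $fh or die "Cannot close $src: $!";

# The whole "copy a file" function is now a single call.
copy($src, $dst) or die "Copy failed: $!";
```

copy returns a true value on success and sets $! on failure, so the usual "or die" idiom is all the error handling it needs.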

Another major goal of the original project was absolute redundancy on all levels. With a requirement like that, I over-engineered what were called the Communication Masters (CMs) so that they sent each other heartbeats, transferred each other’s sessions, held elections to decide who was authoritative for which IP ranges, dealt with segmentation and partitioning, etc. All of this came at the cost of highly-customized hybrid XMPP/SQL servers that weren’t readily upgradeable. Wednesday night I spent a lot of time diagramming, and tonight I solidified the spec to separate the XMPP server from the SQL database, and to rely on established high-availability tools like pen or an SLB appliance to ensure connectivity to a farm of XMPP servers if needed. Additionally, this separation has allowed me to use MySQL clusters for the Paracrys bits, which adds scary levels of redundancy to those very critical bits.

Lastly for this post, the entire ithreads-based Endocrys implementation has been ripped out and replaced with EV and AnyEvent, and the Net::XMPP code has been replaced with AnyEvent::XMPP for one cohesive event loop that runs very very fast. Originally I envisioned an Endocrys client maintaining dozens of XMPP sessions while handling dozens of system events and receiving dozens of commands, so I stuck everything in threads, and allowed it to scream along on SMP boxes. While this works just fine, there is a LOT of extra complexity involved with sharing variables across threads, dealing with races, etc., and the benefits are dubious when compared against a good, strong, event-loop system. I’m not quite done yet, but the net reduction should be about 30% of the code in the main modules, with reduced complexity for all sub-modules as well.
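
To show the general shape of the event-loop approach, here is a minimal AnyEvent sketch in which two independent watchers share one loop and one plain array, with no threads and no locking. It is not Endocrys code: the timers merely stand in for "system events" and "an incoming command", the names and intervals are made up, and it assumes AnyEvent is installed from CPAN.

```perl
use strict;
use warnings;
use AnyEvent;

my $done = AnyEvent->condvar;    # used to stop the loop
my @log;                         # shared state, no locking needed

# Stand-in for a recurring system event: fires every 0.1 seconds.
my $ticks = 0;
my $event_watcher = AnyEvent->timer(
    after    => 0.1,
    interval => 0.1,
    cb       => sub { push @log, "event " . ++$ticks },
);

# Stand-in for an incoming command: fires once, then ends the loop.
my $command_watcher = AnyEvent->timer(
    after => 0.35,
    cb    => sub {
        push @log, "command received";
        $done->send;
    },
);

$done->recv;    # run the event loop until send is called
print "$_\n" for @log;
```

Both callbacks run in the same thread, one at a time, which is why they can push onto @log without any of the variable-sharing machinery that ithreads requires. The watchers stay active only as long as the variables holding them are in scope.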

I don’t have an ETA as to when the code will be generally available, but I’ve had some pings from some bright people interested in hammering the retooled version in non-critical environments, so hopefully it will be this year.
