Software enhancements and modern monitoring solutions for large-scale, high-availability distributed control systems at CERN
My mandate as a site reliability engineer and senior software engineer at CERN is to maintain C++ real-time software for large-scale distributed control systems that drive 2500+ power converters in 24/7 operation. My responsibilities include software upgrades based on an analysis of the requirements of the machine operation team, as well as the integration, configuration and maintenance of 35 GNU/Linux production servers. In this article I describe the software enhancements and monitoring solutions I have implemented, and their impact on the CERN accelerator operation team and the on-call service.
Maintaining the legacy architecture of the control system
The control systems at CERN use different kinds of fieldbuses. One of them is MIL1553, a military standard that has been used in airplanes and ships since the 1970s. CERN’s implementation of MIL1553 is based on the MIL-STD-1553B revision, published in 1978. However, it has diverged from the standard over the years as it was adapted to the specific needs of the organization. MIL1553 is a multidrop bus based on 5 V, Manchester-encoded differential signals sent over a shielded twisted pair. The communication follows the master/slave paradigm, where one Bus Controller (BC) can manage up to 32 Remote Terminals (RT). All transmissions on the data bus (BC -> RT; RT -> BC) are initiated by the BC and are visible to the BC and all connected RTs. MIL1553 words are 20 bits long (3 bits for sync, 16 bits for payload and 1 bit for odd parity checking) and are transmitted at a rate of up to 1 Mbit/s. Each end of the bus must be properly terminated with a resistor to minimize signal reflections, which distort the waveform. Failing to do so can cause intermittent communication failures.
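The 20-bit layout and the odd parity rule described above can be illustrated with a short C++ sketch. It is not taken from the CERN driver: the constant and function names are assumptions made for illustration, and the sync pattern is only accounted for as three bit times rather than modelled as a waveform.

```cpp
#include <cstdint>

// Layout from the text: 3 bit times of sync + 16 payload bits + 1 parity bit.
constexpr int kSyncBitTimes = 3;
constexpr int kPayloadBits = 16;
constexpr int kParityBits = 1;
constexpr int kWordBitTimes = kSyncBitTimes + kPayloadBits + kParityBits;  // 20

// Counts the bits set to 1 in the 16-bit payload.
inline int count_ones(std::uint16_t payload) {
    int ones = 0;
    for (unsigned value = payload; value != 0; value >>= 1) {
        ones += static_cast<int>(value & 1u);
    }
    return ones;
}

// Sender side: parity bit that makes the total number of ones
// (payload + parity bit) odd.
inline bool odd_parity_bit(std::uint16_t payload) {
    return count_ones(payload) % 2 == 0;
}

// Receiver side: the word passes the check only if payload and parity bit
// together contain an odd number of ones.
inline bool parity_ok(std::uint16_t payload, bool parity_bit) {
    return (count_ones(payload) + (parity_bit ? 1 : 0)) % 2 == 1;
}

// Time one word occupies on the bus, in microseconds, for a given bit rate.
constexpr double word_duration_us(double bit_rate_hz) {
    return kWordBitTimes * 1.0e6 / bit_rate_hz;
}
```

A single flipped payload bit changes the number of ones by one, so `parity_ok` rejects the word. An even number of flipped bits goes undetected, which is a general limitation of single-bit parity.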
Software enhancements and modern monitoring solutions for the control system
MIL1553 is now deprecated, but it is still used at CERN to control critical elements of the accelerator complex. There are 1500+ power converters distributed across the organization that use this fieldbus. Furthermore, many of the RTs use G64 controller cards based on different types of CPUs and firmware versions. Moreover, to synchronize the actions of a distributed control system with the CERN machines, a typical installation includes additional infrastructure for transporting timing pulses. Misconfiguration of these signals, which trigger either bus transactions or equipment actions, leads to complex synchronisation problems. In addition, no hardware is available for fieldbus diagnostics, meaning that all relevant information about the condition of the bus is known only to the MIL1553 driver. Because of these issues, the MIL1553 installations have become very difficult to diagnose and operate, especially after periods such as Long Shutdowns (LS), when upgrades of CERN’s software and hardware infrastructure take place.
After becoming the maintainer of the MIL1553 power converter control systems, I decided to introduce upgrades because the situation at the time was not acceptable. In close collaboration with the CERN accelerator operation team, I drove continuous improvements of the existing systems. I identified many possible enhancements in the C++ software class responsible for controlling 1500+ power converters. I set clear goals and objectives and prioritized them in issue and project tracking software, and I used a version control system to track changes to the control software source code. After extensive tests in a laboratory, I gradually released the binaries to 35 operational front-end computers using a semi-automatic tool written in Bash. The tool ensured configuration consistency and performed basic tests on the new binaries. The upgrades concerned the whole LHC injector chain and experimental areas across the organization. A very structured and organized approach to the work was essential, as a single mistake could have serious consequences.
The new binaries included many enhancements, such as: recognition of corrupted bus packets and RT hardware failures; redesigned handling of requests, statuses and alarms; a mechanism for dynamically defining fault severity; improved logging and information flow; a mechanism for handling timing misconfiguration; and improved supervision and diagnostic information pages.
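As an illustration of the first item, one possible way to tell sporadic corrupted packets apart from a persistent RT hardware failure is to count consecutive failed transactions per RT. The sketch below is an assumption about how such a classification could look, not the logic of the production class; the names `RtHealth`, `RtErrorTracker` and the threshold parameter are hypothetical.

```cpp
// Hypothetical health states for one Remote Terminal (RT).
enum class RtHealth { Ok, CorruptedPackets, HardwareFailure };

// Tracks consecutive failed bus transactions for a single RT.
class RtErrorTracker {
public:
    // failure_threshold: number of consecutive failures after which the RT
    // is considered broken rather than merely affected by corrupted packets.
    explicit RtErrorTracker(unsigned failure_threshold)
        : failure_threshold_(failure_threshold) {}

    // Records the outcome of one transaction and returns the updated state.
    RtHealth record(bool transaction_ok) {
        if (transaction_ok) {
            consecutive_errors_ = 0;  // any success clears the error streak
        } else {
            ++consecutive_errors_;
        }
        return health();
    }

    // Classifies the RT from the current streak of failures.
    RtHealth health() const {
        if (consecutive_errors_ == 0) {
            return RtHealth::Ok;
        }
        if (consecutive_errors_ < failure_threshold_) {
            return RtHealth::CorruptedPackets;
        }
        return RtHealth::HardwareFailure;
    }

private:
    unsigned failure_threshold_;
    unsigned consecutive_errors_ = 0;
};
```

With a threshold of 3, two failed transactions in a row are reported as corrupted packets, a third one escalates to a hardware failure, and a single successful transaction resets the state.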
In addition, to get a better overview of the MIL1553 control systems across the whole CERN complex, I decided to use modern open source tools for monitoring. I selected the Elasticsearch stack and Grafana. Each front-end computer sends the MIL1553 control system log data to an Elasticsearch server using a Filebeat daemon. The important information is then extracted by means of database queries and shown on the supervision panels. I used the templating feature of Grafana to prepare generic dashboards that make it possible to check the status of every part of the control system. A general diagnostic of an accelerator complex hosting power converters is also possible. Operators select the elements of the control system they are interested in by means of drop-down menus. Once the configuration has been chosen, the dashboard loads and presents the relevant data in the form of statistical graphs.
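Log data is easier to query once it reaches Elasticsearch if each log line carries its fields in a structured form, for example one JSON object per line. The following C++ sketch shows that idea only. The `LogEvent` structure, its field names and the helper functions are assumptions made for illustration and do not describe the actual log format of the MIL1553 software.

```cpp
#include <sstream>
#include <string>

// Escapes quotes, backslashes and common whitespace control characters.
// Other control characters are not handled in this sketch.
inline std::string json_escape(const std::string& text) {
    std::string out;
    for (char c : text) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:   out += c;      break;
        }
    }
    return out;
}

// Hypothetical log record for one event on the fieldbus.
struct LogEvent {
    std::string timestamp;  // ISO 8601 text supplied by the caller
    std::string host;       // front-end computer name
    int rt_address;         // address of the Remote Terminal concerned
    std::string level;      // e.g. "INFO", "ERROR"
    std::string message;    // free-text description
};

// Formats the event as a single-line JSON object (no trailing newline).
inline std::string to_json_line(const LogEvent& event) {
    std::ostringstream os;
    os << "{\"timestamp\":\"" << json_escape(event.timestamp) << "\","
       << "\"host\":\"" << json_escape(event.host) << "\","
       << "\"rt_address\":" << event.rt_address << ","
       << "\"level\":\"" << json_escape(event.level) << "\","
       << "\"message\":\"" << json_escape(event.message) << "\"}";
    return os.str();
}
```

Fields such as the host name or RT address, once indexed, could then back the drop-down menus of a templated dashboard, although the exact mapping depends on how the shipper and the index are configured.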
Summary
The upgrade of the control systems was a great success. The new and improved software solutions have been tested and are working as expected. The enhancements fixed several long-standing issues and significantly decreased the number of errors experienced by the CERN machine operation team. Moreover, thanks to the new diagnostics and the deployed monitoring solutions, the on-call hardware teams are able to narrow down a problem and implement a fix more quickly, which has greatly improved the availability of the machines.