2008/05/20

Organizing a 24h support team

In the changing world of information technology, every technician has received different training and accumulated different experience, which progressively specializes him or her. Supporting complex systems, however, requires versatile people with knowledge across a wide range of subjects, though not necessarily in great depth. On-call service makes this harder: since the entire team cannot be on duty at once, each member needs at least the minimum knowledge to resolve as many incidents as possible alone.

Client: a service company

Need: to build a support team of young staff with little or no experience who could maintain a breakdown-handling system and provide 24h support.

Previous situation: support was split between the company that had designed the system and the company that maintained it. The client wanted to stop relying on the developer for maintenance.

System description: the core of the system was an HP cluster with Service Guard, together with a distributed Oracle DBMS, Windows client applications, and web servers based on Java servlets.

Implementation: Initially, each technician was responsible for a set of tasks matched to their training and experience. They were trained through workshops in which they learned how the system worked from a maintenance point of view: clusters, Oracle DB, replicated tables, distributed queries, Unix servers, etc. At the same time, every problem that arose was recorded in a shared document, and the team got used to consulting and maintaining it. A series of DB queries and Unix scripts was developed to check that everything was running well. Every morning each team member reviewed the results, to make sure they understood them. Because each of them had to know what the others were doing, they rotated tasks. To ease night on-call work, an automatic query system was built with BMC Performance Manager (Patrol), based on the same queries, which continuously polled the system and sent messages to the on-call mobile phone as soon as something abnormal was detected. Perhaps the main problem was the technicians' initial stress at carrying responsibility for the whole system alone at night, but having worked through every task during the day gradually gave them enough confidence to reduce it.
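
As a rough illustration of the kind of automatic check described above, here is a minimal Python sketch. It is not the Patrol configuration used in the project: SQLite stands in for the Oracle database so the example runs on its own, and the table name, the thresholds, and the notify callback are all hypothetical. The idea is only that each health query returns one number, and a message is sent when that number crosses a limit or the query itself fails.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    name: str         # hypothetical label shown in the alert message
    sql: str          # query expected to return a single numeric value
    max_value: float  # alert when the result exceeds this threshold


def run_checks(conn: sqlite3.Connection,
               checks: List[Check],
               notify: Callable[[str], None]) -> List[str]:
    """Run every check, notify on each abnormal result, return the alerts."""
    alerts = []
    for check in checks:
        try:
            row = conn.execute(check.sql).fetchone()
            value = row[0] if row and row[0] is not None else 0
            if value > check.max_value:
                alerts.append(f"{check.name}: {value} exceeds {check.max_value}")
        except sqlite3.Error as exc:
            # A query that cannot run is itself something on-call should know.
            alerts.append(f"{check.name}: query failed ({exc})")
    for message in alerts:
        notify(message)
    return alerts


def demo() -> List[str]:
    # In-memory SQLite database as a stand-in for the real system (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE replication_queue (id INTEGER, status TEXT)")
    conn.executemany(
        "INSERT INTO replication_queue VALUES (?, ?)",
        [(1, "pending"), (2, "pending"), (3, "done")],
    )
    checks = [
        Check("pending replication rows",
              "SELECT COUNT(*) FROM replication_queue WHERE status = 'pending'",
              1),
        Check("failed replication rows",
              "SELECT COUNT(*) FROM replication_queue WHERE status = 'failed'",
              0),
    ]
    # print() stands in for the message sent to the on-call mobile.
    return run_checks(conn, checks, print)


if __name__ == "__main__":
    demo()
```

Running the same list of checks by hand every morning and automatically at night mirrors the approach in the text: the team already understands each query before it starts sending alerts.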

Project cost: 400 days x 14 technicians
