"Troubleshooting" is the problem of integrated active diagnosis and repair: the repairer must repeatedly choose which (imperfect) diagnostic test or repair to perform next in order to restore a faulty system to working order, while minimising the expected outage time and/or repair cost. The problem arises in many technical domains, such as power networks, cloud computing, and vehicle repair.
Given prior probability distributions over faults and over the false positive and false negative rates of tests, the troubleshooting problem is a partially observable Markov decision problem (POMDP). Finding optimal solutions to POMDPs is intractable in general. In practical contexts, however, the way that faults affect the system, and the set of available tests, often exhibit a simple structure, such as a tree. We aim to exploit that structure to identify tractable optimal troubleshooting strategies.
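To make the partially observable setting concrete, the following is a minimal sketch of the Bayesian belief update a troubleshooter would perform after one imperfect test. It assumes a single fault is present and that each test has one false positive rate and one false negative rate. The function name `update_belief`, its parameters, and the vehicle fault names are hypothetical and chosen only for illustration.

```python
def update_belief(prior, detects, false_pos, false_neg, positive):
    """Return the posterior over faults after one imperfect test result.

    Assumptions (illustrative, not from the project description):
      prior     -- dict mapping fault name to prior probability (sums to 1)
      detects   -- set of faults the test is meant to reveal
      false_pos -- P(test positive | actual fault not in `detects`)
      false_neg -- P(test negative | actual fault in `detects`)
      positive  -- the observed test outcome (True or False)
    """
    unnormalised = {}
    for fault, p in prior.items():
        # Probability of a positive result if this fault is the true one.
        p_pos = (1.0 - false_neg) if fault in detects else false_pos
        likelihood = p_pos if positive else 1.0 - p_pos
        unnormalised[fault] = p * likelihood

    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the prior")
    # Bayes' rule: normalise so the posterior sums to one.
    return {fault: w / total for fault, w in unnormalised.items()}


# Hypothetical example: a car that will not start, with a battery test.
prior = {"battery": 0.5, "starter": 0.3, "fuel_pump": 0.2}
posterior = update_belief(
    prior, detects={"battery"}, false_pos=0.1, false_neg=0.2, positive=True
)
```

In a POMDP formulation, this posterior is the belief state on which the next choice of test or repair would be based. The difficulty lies in planning over all possible future beliefs, which this sketch does not attempt.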
Troubleshooting is not the only problem that is naturally modelled as a POMDP with a specialised, simple structure. For example, the problem known as "attack planning" (automated cyber red-teaming) has also been cast in this framework. In this project, we therefore aim to identify general structures in a POMDP that allow it to be solved to optimality efficiently.
- Heckerman, Breese, Rommelse. Troubleshooting under Uncertainty. Workshop on the Principles of Diagnosis, 1994.
- Xu, Zhu, Sun, Tran, Weber, Fu, Bass. Error Diagnosis of Cloud Application Operation Using Bayesian Networks and Online Optimisation. European Dependable Computing Conference, 2015.
- Hoffmann. Simulated Penetration Testing: From "Dijkstra" to "Turing Test++". Invited talk at ICAPS 2015.
- Vomlelova. Complexity of Decision-Theoretic Troubleshooting. International Journal of Intelligent Systems, vol. 18, 2003.