Publications

On this page you can find a complete list of my publications.

AI-based Proactive Failure Management in Large-scale Cloud Environments

Paolo Notaro, 2024
Doctoral Dissertation (link)

Modern IT infrastructures are becoming increasingly large and complex, creating challenges for O&M teams in managing and optimizing cloud services. AIOps supports O&M through the use of AI and large-scale monitoring. This thesis presents the current state of research in the domain of AIOps and proposes new methods for proactive failure management, covering applications in all layers of a cloud stack model (infrastructure, platform, software).

Command-line Risk Classification using Transformer-based Neural Architectures

Paolo Notaro, Soroush Haeri, Jorge Cardoso, Michael Gerndt, 2024 (link)

To protect the large-scale computing environments necessary to meet increasing computing demand, cloud providers have implemented security measures to monitor Operations and Maintenance (O&M) activities and thereby prevent data loss and service interruption. Command interception systems are used to intercept, assess, and block dangerous Command-line Interface (CLI) commands before they can cause damage. Traditional solutions for command risk assessment include rule-based systems, which require expert knowledge and constant human revision to account for unseen commands. To overcome these limitations, several end-to-end learning systems have been proposed to classify CLI commands. These systems, however, have several other limitations, including the adoption of general-purpose text classifiers, which may not adapt to the language characteristics of scripting languages such as Bash or PowerShell, and may not recognize dangerous commands in the presence of an unbalanced class distribution. In this paper, we propose a transformer-based command risk classification system, which leverages the generalization power of Large Language Models (LLMs) and transfer learning to provide accurate classification and to identify rare dangerous commands effectively. We verify the effectiveness of our approach on a realistic dataset of production commands and show how to apply our model to other security-related tasks, such as dangerous command interception and auditing of existing rule-based systems.

An Optical Transceiver Reliability Study based on SFP Monitoring and OS-level Metric Data

Paolo Notaro, Soroush Haeri, Qiao Yu, Jorge Cardoso, Michael Gerndt, 2023
IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID 2023), Bangalore, India (link)

The increasing demand for cloud computing drives the expansion in scale of datacenters and their internal optical networks, in pursuit of higher bandwidth, higher reliability, and lower latency. Optical transceivers are essential elements of optical networks, yet their reliability has not been studied as well as that of other hardware components. In this paper, we leverage large quantities of monitoring data from optical transceivers and OS-level metrics to provide statistical insights about the occurrence of optical transceiver failures. We estimate transceiver failure rates and normal operating ranges for monitored attributes, correlate early-observable patterns to known failure symptoms, and finally develop failure prediction models based on our analyses. Our results enable network administrators to deploy early-warning systems and enact predictive maintenance strategies, such as replacement or traffic re-routing, reducing the number of incidents and their associated costs.

HiMFP: Hierarchical Intelligent Memory Failure Prediction for Cloud Service Reliability

Qiao Yu, Wengui Zhang, Soroush Haeri, Paolo Notaro, Jorge Cardoso, Odej Kao, 2023
IEEE/IFIP International Conference on Dependable Systems and Networks (IEEE IFIP DSN 2023), Porto, Portugal (link)

In large-scale datacenters, memory failure is one of the leading causes of server crashes, and uncorrectable error (UCE) is the major fault type indicating defects of memory modules. Existing approaches tend to predict UCEs using Correctable Errors (CEs). However, bit-level CE information has not been fully discussed in previous works, even though CEs with error bit patterns are strongly correlated with UCE occurrences. In this paper, we present a novel Hierarchical Intelligent Memory Failure Prediction (HiMFP) framework, which can predict UCEs on multiple levels of the memory system and associate them with memory recovery techniques. In particular, we leverage CE addresses on multiple levels of memory, especially the bit level, and construct machine learning models based on spatial and temporal CE information. Results of algorithm evaluation using real-world datasets indicate that HiMFP significantly enhances the prediction performance compared with the baseline algorithm. Overall, around 66% of server crashes caused by UCEs can be avoided by using HiMFP.

LogRule: Efficient Structured Log Mining for Root Cause Analysis

Paolo Notaro, Soroush Haeri, Jorge Cardoso, Michael Gerndt, 2023
IEEE Transactions on Network and Service Management (TNSM) (link)

Accurate, timely Root Cause Analysis (RCA) is essential to successful IT operations as a primary step to incident remediation. RCA automation using data mining techniques in large heterogeneous systems is, however, a challenging task, because it requires correlating multimodal information across various data sources. An increasing number of services are migrating to structured logging to enable automated monitoring and debugging of complex large-scale systems. In this paper, we leverage structured logs and association rule mining (ARM) to automate RCA. We propose the LogRule algorithm, which automatically analyzes structured logs to generate a list of explanations for an event of interest. It achieves an F1-score of 0.921 on the diagnosis task while computing results 37x faster than the state of the art, making it a time-efficient, accurate, and interpretable ARM-based RCA algorithm. Evaluation results show that LogRule enables RCA in complex multidimensional datasets, where the execution time of the current state-of-the-art algorithm is prohibitively large.

A Survey of AIOps Methods for Failure Management

Paolo Notaro, Jorge Cardoso, Michael Gerndt, 2021
ACM Transactions on Intelligent Systems and Technology (TIST) (link)

Modern society is increasingly moving toward complex and distributed computing systems. The increase in scale and complexity of these systems challenges O&M teams that perform daily monitoring and repair operations, in contrast with the increasing demand for reliability and scalability of modern applications. For this reason, the study of automated and intelligent monitoring systems has recently sparked much interest across applied IT industry and academia. Artificial Intelligence for IT Operations (AIOps) has been proposed to tackle modern IT administration challenges thanks to Machine Learning, AI, and Big Data. However, AIOps as a research topic is still largely unstructured and unexplored, due to missing conventions in categorizing contributions for their data requirements, target goals, and components. In this work, we focus on AIOps for Failure Management (FM), characterizing and describing 5 different categories and 14 subcategories of contributions, based on their time intervention window and the target problem being solved. We review 100 FM solutions, focusing on applicability requirements and the quantitative results achieved, to facilitate an effective application of AIOps solutions. Finally, we discuss current development problems in the areas covered by AIOps and delineate possible future trends for AI-based failure management.

A Systematic Mapping Study in AIOps

Paolo Notaro, Jorge Cardoso, Michael Gerndt, 2020
AIOps Workshop co-located with the International Conference on Service-Oriented Computing (ICSOC '21) (link)

IT systems of today are becoming larger and more complex, rendering their human supervision more difficult. Artificial Intelligence for IT Operations (AIOps) has been proposed to tackle modern IT administration challenges thanks to AI and Big Data. However, past AIOps contributions are scattered, unorganized and missing a common terminology convention, which renders their discovery and comparison impractical. In this work, we conduct an in-depth mapping study to collect and organize the numerous scattered contributions to AIOps in a unique reference index. We create an AIOps taxonomy to build a foundation for future contributions and allow an efficient comparison of AIOps papers treating similar problems. We investigate temporal trends and classify AIOps contributions based on the choice of algorithms, data sources and the target components. Our results show a recent and growing interest towards AIOps, specifically to those contributions treating failure-related tasks (62%), such as anomaly detection and root cause analysis.

Radar Emitter Classification with Attribute-specific Recurrent Neural Networks

Paolo Notaro, Magdalini Paschali, Carsten Hopke, David Wittmann, Nassir Navab, 2019
Submitted to International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (link)

Radar pulse streams exhibit increasingly complex temporal patterns, so emitter classification can no longer rely on a purely value-based analysis of the pulse attributes. In this paper, we employ Recurrent Neural Networks (RNNs) to efficiently model and exploit the temporal dependencies present inside pulse streams. With the purpose of enhancing the network prediction capability, we introduce two novel techniques: a per-sequence normalization, able to mine the useful temporal patterns; and attribute-specific RNN processing, capable of processing the extracted information effectively. The new techniques are evaluated with an ablation study and the proposed solution is compared to previous Deep Learning (DL) approaches. Finally, a comparative study on the robustness of the same approaches is conducted and its results are presented.