ATMOSPHERE (Adaptive, Trustworthy, Manageable, Orchestrated, Secure, Privacy-assuring, Hybrid Ecosystem for REsilient Cloud Computing) is a 24-month project aiming at the design and development of an ecosystem, a framework, a platform, and applications for next-generation trustworthy cloud services on top of an intercontinental, hybrid, and federated resource pool. The framework considers a broad spectrum of properties and their measures. The platform supports the building, deployment, measurement, and evolution of trustworthy cloud resources, data network and data services.

Within ATMOSPHERE, WP4 is a core work package whose goal is to develop advanced solutions for federated hybrid cloud resources (virtual machines, containers, and GPGPUs) that address “trustworthiness” requirements on Quality of Service (QoS), Privacy, Reliability, Performance, and Transparency. This deliverable reports the activities performed during the first year within Task T4.5, which develops novel hybrid machine learning models to predict the performance of federated cloud resources. Many factors, such as multi-tenancy, virtualization, and resource sharing, can affect the performance of cloud services and their resources. Performance models are used to evaluate trustworthiness properties related to the performance of federated systems and have a two-fold objective. On the one hand, they can be used after an application runs to assess the infrastructure performance, evaluating a posteriori whether the system was affected by performance degradation due to resource contention (with an impact on the overall system trustworthiness, evaluated according to the metrics defined in WP3). On the other hand, they can be used before an application runs to estimate its execution time given the allocated resources; in this way, we can determine the minimum amount of resources that must be allocated to meet an a priori deadline.
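As an illustration of the a priori use case, the sketch below searches for the smallest core count whose predicted execution time meets a deadline. The `model` object, its feature encoding, and `minimum_cores` are hypothetical placeholders, not the actual ATMOSPHERE interface.

```python
# Minimal sketch of a priori resource sizing, assuming a trained regression
# model exposing a scikit-learn-style predict() (all names hypothetical).
def minimum_cores(model, data_size_gb, deadline_s, max_cores=48):
    """Return the smallest core count whose predicted time meets the deadline."""
    for cores in range(1, max_cores + 1):
        # The feature layout (data size, cores) is illustrative only.
        predicted_time = model.predict([[data_size_gb, cores]])[0]
        if predicted_time <= deadline_s:
            return cores, predicted_time
    return None, None  # no feasible configuration within max_cores
```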

The adoption of accurate models will allow the ATMOSPHERE framework to anticipate QoS violations and to increase the trustworthiness of cloud services by improving their performance. These models encode how the system configuration impacts application performance metrics and, during the second year, will be integrated into a performance assessment and prediction service. This service will be triggered periodically as a component of the monitoring platform developed in WP3 to check, on the basis of the current system configuration, for potential QoS violations. Performance estimates will feed the Proactive Policy module, which allows the specification of high-level rules. The Proactive Policy module will enact these rules, triggering the reconfiguration of the federated platform through the adaptation mechanisms of Task 4.3 to react to potential resource contention, meet applications’ QoS requirements, and improve the trustworthiness of the federated infrastructure.
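Since the service interface will only be defined in the second year, the following is merely a sketch of the intended control flow, with every name and method an assumption of ours rather than the ATMOSPHERE API.

```python
# Sketch of the intended control flow (all names hypothetical): periodically
# compare predicted execution times against deadlines and notify the
# Proactive Policy module when a QoS violation is anticipated.
def assessment_cycle(model, monitoring, policy_module, slack=0.9):
    for app in monitoring.running_applications():
        features = monitoring.current_configuration(app)
        predicted = model.predict([features])[0]
        # Flag the application if the prediction leaves too little slack
        # with respect to its deadline.
        if predicted > slack * app.deadline:
            policy_module.notify_potential_violation(app, predicted)
```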

In the literature, two main performance modelling approaches have been proposed: white-box analytical models (AMs) and black-box machine learning (ML) models. AMs require knowledge of the system internals, which is not always available, and typically rely on simplifying assumptions at the expense of accuracy. On the other hand, black-box ML models learn from data and make predictions without knowledge of the system internals. ML models require a training phase that uses experimental data gathered from different workloads and configurations; obtaining these initial data requires running a set of experiments, which is costly and time-consuming. Moreover, since ML models are usually characterized by a wide set of hyper-parameters that influence their accuracy, training should be coupled with hyper-parameter tuning to achieve the best possible results. The training phase might therefore take a long time. Once trained, however, ML models produce predictions that are very fast and usually very accurate, which is why they have become popular for studying the performance of large systems.
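To make the dichotomy concrete, the toy example below contrasts a white-box model (an Amdahl-style formula, chosen purely for illustration) with a black-box regressor fitted on profiled runs; neither is the model actually used in the deliverable.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# White-box: an Amdahl-style analytical model (illustrative only) that
# assumes a fixed serial fraction and perfect parallel speedup.
def analytical_time(cores, t_serial=20.0, t_parallel=400.0):
    return t_serial + t_parallel / cores

# Black-box: a regressor that learns the same relation from observed runs
# without any assumption about the system internals.
cores = np.array([[2], [4], [8], [16]])
times = np.array([analytical_time(c[0]) for c in cores])  # stand-in for real profiles
ml_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(cores, times)
print(ml_model.predict([[12]]))  # fast once trained, but bound by training coverage
```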

The main focus of Task T4.5 during the first year was the performance modelling of cloud services with ML-based models. We consider four classic regression models (ℓ1-regularized Linear Regression, Neural Networks, Decision Trees, and Random Forests) for predicting application execution time. This selection limits the number of techniques considered while covering the main possible types of models (i.e., linear, non-linear, built after feature selection, and built by ensemble methods). An ML library was also developed to automate the training and evaluation of ML models and the tuning of their hyper-parameters. Two complementary modelling approaches, namely gray-box and black-box models, have been proposed. Both can be used a priori or a posteriori and, as the experimental results demonstrate, achieve different accuracy levels. A heterogeneous set of applications was considered, including: (i) Spark SQL-based applications and ML benchmarks (the core of the LEMONADE tool), (ii) an ad-hoc Spark benchmark for image processing, and (iii) the training of popular Convolutional Neural Networks on systems exploiting GPGPUs (the core of the ATMOSPHERE case study). Furthermore, gray-box models have been complemented with hybrid models, which use some analytical data to improve the accuracy of pure ML models, in particular to investigate their extrapolation capabilities on scenarios not explored during the training phase.
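A minimal sketch of the kind of automation such a library provides, using scikit-learn counterparts of the four model families; the interface and the hyper-parameter grids are assumptions of ours, not the actual ATMOSPHERE library.

```python
from sklearn.linear_model import Lasso
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hyper-parameter grids are illustrative, not those used in the deliverable.
CANDIDATES = {
    "lasso": (Lasso(max_iter=10000), {"alpha": [0.01, 0.1, 1.0]}),
    "nn": (MLPRegressor(max_iter=5000), {"hidden_layer_sizes": [(16,), (32, 16)]}),
    "tree": (DecisionTreeRegressor(), {"max_depth": [3, 5, None]}),
    "forest": (RandomForestRegressor(), {"n_estimators": [50, 100]}),
}

def train_all(X, y, cv=3):
    """Tune each model family via grid search and return the best estimators."""
    best = {}
    for name, (model, grid) in CANDIDATES.items():
        search = GridSearchCV(model, grid, cv=cv,
                              scoring="neg_mean_absolute_percentage_error")
        best[name] = search.fit(X, y).best_estimator_
    return best
```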

The results of our experiments show that no single ML model always performs well or always outperforms the others. Depending on the initial application profiling data available in the training set and on the application data size, different ML models provide significantly different accuracy levels. It is therefore paramount to have a library that automates the training and hyper-parameter tuning process, so that several scenarios can be investigated effectively. Gray-box models, complemented with hybrid models, are effective in determining a posteriori whether an application ran with degraded performance in a system whose trustworthiness level can then be updated accordingly.
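In practice, the a posteriori check can be as simple as comparing the measured execution time against the model estimate for the same configuration; the sketch below illustrates this, with the tolerance threshold chosen arbitrarily rather than taken from the deliverable.

```python
# Sketch of an a posteriori degradation check (threshold is an assumption):
# flag contention when the measured time exceeds the model estimate by more
# than the model's expected relative error.
def degraded(model, features, measured_time_s, tolerance=0.25):
    expected = model.predict([features])[0]
    return (measured_time_s - expected) / expected > tolerance
```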

Overall, gray-box models can achieve very accurate results for performance assessment (between about 4% and 25% error) depending on the family of ML models in use. Frequently, two models yielding a similar mean absolute percentage error (MAPE) in one scenario differ in accuracy for other configurations or training sets. For this reason, we also investigated the importance of the sampling schemes used to choose the number of cores when gathering training data. In general, gray-box models work better when the application data size is fixed. Conversely, when interpolating or extrapolating on the application data size, the accuracy of gray-box models degrades, especially when extrapolating to a smaller data size. To overcome this degradation, we implemented a hybrid approach and evaluated whether it can reduce prediction errors in scenarios that require extrapolating on the number of cores and on the application data size. Our experiments with hybrid models show that the extrapolation capability on data size improves results significantly, achieving MAPE values below 30%. This extrapolation capability has an important practical impact on the ATMOSPHERE case study: for example, we can predict whether the infrastructure experienced resource contention even if we double the number of images analyzed by the Convolutional Neural Network used to detect rheumatic heart disease in echocardiographies.
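For clarity, all accuracy figures quoted in this summary are MAPE values, computed over the n runs of a test set as the average relative deviation between the measured execution time t_i and the model estimate:

```latex
\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{\lvert t_i - \hat{t}_i \rvert}{t_i}
```

Lower values indicate better accuracy; a MAPE of 30% on extrapolated data sizes thus means the estimates deviate from the measured times by 30% on average.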

Regarding the accuracy of gray-box models for performance prediction, results varied across applications, but in general the models performed well in most cases, with a median error below 40%. On the other hand, black-box models (achieving at most 25% error, and about 7% on average) proved to be particularly suitable for predicting performance a priori. We considered two sets of features for learning black-box models. The first set was composed of application and infrastructure features, containing only a priori information about the execution; the second set comprised the first plus some Directed Acyclic Graph (DAG)-related features estimated starting from the first set. Our motivation for exploring these two sets was: (i) to understand whether a simpler model with fewer features could accurately reflect the true values of the datasets, and (ii) to see whether adding the estimated DAG metrics would improve the quality of the predictions. In all the application scenarios we considered, the simpler feature set yielded better results in virtually all cases. Interestingly, we observed that using the full set of features significantly increases the overall prediction error; the inherent noisiness of the real values of these additional features, combined with the error introduced when predicting them, could explain this increase.
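The two feature sets can be pictured as follows; the individual feature names are illustrative assumptions of ours, the deliverable only fixing the split between a priori information and estimated DAG metrics.

```python
# Illustrative layout of the two feature sets (names are assumptions).
A_PRIORI = ["data_size", "num_cores", "executor_memory", "num_executors"]
DAG_ESTIMATED = ["est_num_stages", "est_num_tasks", "est_shuffle_bytes"]

FEATURE_SETS = {
    "simple": A_PRIORI,                # a priori information only
    "full": A_PRIORI + DAG_ESTIMATED,  # plus estimated DAG metrics
}
# Training the same model on both sets lets us quantify whether the
# estimated DAG metrics help or, as observed, add noise that hurts accuracy.
```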

As a final step, we compared our results with those of the Ernest method, proposed by the Spark inventors. In most cases, our best models improve upon Ernest considerably, especially when few profiling configurations are available in the training set. Finally, concerning the estimation of the training time of Convolutional Neural Networks on GPUs, the results show that the proposed models can estimate execution times very accurately, both for performance assessment (MAPE smaller than 7%) and for performance prediction (MAPE smaller than 26%). By relying on the best models, the ATMOSPHERE KPI K4.2-Prediction accuracy has already been achieved during this first year. During the second year, performance models will be integrated into the ATMOSPHERE monitoring infrastructure by implementing the performance assessment and prediction service, which will be queried by the Proactive Policies module; moreover, an algorithm able to identify the optimal initial deployment of an application, providing a priori QoS guarantees (execution time lower than a fixed deadline), will also be developed. On the modelling side, feature selection techniques will be investigated, and the performance assessment and prediction models will be validated on the ATMOSPHERE case study.
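For reference, the Ernest baseline mentioned above fits a small fixed set of basis functions of the data scale s and the number of cores c via non-negative least squares; the feature set below follows the published method, while the implementation details are ours.

```python
import numpy as np
from scipy.optimize import nnls

def ernest_features(scale, cores):
    """Ernest basis functions: constant, scale/cores, log(cores), cores."""
    return np.array([1.0, scale / cores, np.log(cores), cores])

def fit_ernest(scales, cores, times):
    """Fit non-negative coefficients, as in the original Ernest method."""
    X = np.array([ernest_features(s, c) for s, c in zip(scales, cores)])
    theta, _ = nnls(X, np.array(times))
    return theta

def predict_ernest(theta, scale, cores):
    return float(ernest_features(scale, cores) @ theta)
```

Because the model form is fixed, Ernest needs very few profiling runs but cannot capture behaviors outside its basis functions, which is consistent with our best models outperforming it when richer training data are available.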