Finding the Right Cloud Configuration for Analytics Clusters

Vanir quickly finds a good-enough configuration and then further optimizes it during production runs.

  • metrics-based optimizer for the benchmarking runs
  • Mondrian forest-based performance model
  • transfer learning during production runs

Vanir is designed for a setting where a user needs to provision and set up an on-demand analytics cluster for each run of a batch processing job. In this scenario, a large fraction of deployments are often recurring: reports indicate that more than 40% of the jobs in production clusters are recurring computations.

The main principle that Vanir adopts to cope with a large configuration search space is to find a good enough configuration via a fast benchmarking phase, and optimize that configuration during production runs, as the job recurs.

Vanir jointly identifies both the type and number of instances for each framework. A cloud configuration is denoted as a vector \(C=\left\{\left\langle N_{1}, I_{1}\right\rangle, \ldots,\left\langle N_{n}, I_{n}\right\rangle\right\}\), where \(N_{F}\) is the number of instances for framework \(F \in \{1,\ldots,n\}\) and \(I_{F}\) is the corresponding instance type.

A valid cloud configuration satisfies user-specified constraints on the maximum execution time and maximum execution cost.
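To make the notation concrete, here is a minimal Python sketch of a configuration as a tuple of \(\langle N_F, I_F \rangle\) pairs, together with the validity check. The class and function names, the instance type names, and the pro-rata hourly cost model are all assumptions made for illustration; the notes do not specify how execution cost is computed.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class FrameworkAllocation:
    """One <N_F, I_F> pair for a single framework."""
    num_instances: int   # N_F
    instance_type: str   # I_F (type names below are hypothetical)


# A cloud configuration C: one allocation per framework, in framework order.
CloudConfiguration = Tuple[FrameworkAllocation, ...]


def run_cost(config: CloudConfiguration,
             hourly_price: Dict[str, float],
             exec_time_s: float) -> float:
    """Assumed cost model: every instance is billed pro rata at an hourly price."""
    hours = exec_time_s / 3600.0
    return sum(a.num_instances * hourly_price[a.instance_type] * hours
               for a in config)


def is_valid(exec_time_s: float, exec_cost: float,
             max_time_s: float, max_cost: float) -> bool:
    """A configuration is valid if the run meets both user-specified limits."""
    return exec_time_s <= max_time_s and exec_cost <= max_cost


# Example: two frameworks, hypothetical instance types and prices.
config: CloudConfiguration = (
    FrameworkAllocation(num_instances=4, instance_type="type-a"),
    FrameworkAllocation(num_instances=2, instance_type="type-b"),
)
prices = {"type-a": 0.5, "type-b": 1.0}
cost = run_cost(config, prices, exec_time_s=1800)
print(is_valid(1800, cost, max_time_s=3600, max_cost=2.5))
```

In practice the execution time would come from a benchmarking or production run (or from the performance model), and the cost would follow the provider's actual billing rules.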

An instance type is characterized by its CPU, memory, and storage.

Vanir uses a metrics-based algorithm as its offline optimizer. The algorithm relies on CPU and memory utilization metrics, monitored during profiling runs, to determine the configuration of each framework.
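The notes do not spell out the rule the optimizer applies, so the following is only an illustrative sketch of what a metrics-based adjustment step could look like, not Vanir's actual algorithm. The threshold values, the single-step scaling, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UtilizationMetrics:
    """Mean utilization observed for one framework during a profiling run."""
    cpu: float     # fraction in [0, 1]
    memory: float  # fraction in [0, 1]


def suggest_instance_count(current: int, metrics: UtilizationMetrics,
                           high: float = 0.8, low: float = 0.3) -> int:
    """Hypothetical threshold rule: adjust the instance count by one step."""
    bottleneck = max(metrics.cpu, metrics.memory)
    if bottleneck > high:
        return current + 1   # a resource is near saturation: add an instance
    if bottleneck < low and current > 1:
        return current - 1   # both resources are mostly idle: remove one
    return current           # utilization is in the target band: keep as is


# Example: CPU-bound framework currently running on 4 instances.
print(suggest_instance_count(4, UtilizationMetrics(cpu=0.9, memory=0.5)))
```

A rule of this kind needs only the monitored metrics, which is what makes a metrics-based search cheap compared with exploring the configuration space exhaustively.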