Ray: Application-level scheduling with custom resources

Ray intends to be a universal framework for a wide range of machine learning applications, including distributed training, machine learning inference, data processing, latency-sensitive applications, and throughput-oriented applications. Each of these has different, and at times conflicting, requirements for resource management. Ray intends to cater to all of them as the newly emerging microkernel for distributed machine learning. To achieve that kind of generality, Ray gives developers explicit control over task and actor placement via custom resources. In this blog post we walk through the main use cases and provide examples. The article is intended for readers already familiar with Ray; if you are new to Ray and are looking to easily and elegantly parallelize your Python code, please take a look at this tutorial.
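As a brief refresher, a custom resource is simply a user-defined label with a numeric capacity that Ray's scheduler tracks exactly like CPUs or GPUs: nodes advertise capacity, and tasks or actors declare demand. Here is a minimal sketch, assuming a single-node ray.init() for brevity; on a cluster, each node would instead advertise its resources via ray start --resources='{...}'. The resource name "special_hardware" is a placeholder, not a built-in.

```python
import ray

# Advertise 2 units of a user-defined "special_hardware" resource on
# this node, alongside the CPUs Ray detects automatically.
ray.init(resources={"special_hardware": 2})

@ray.remote(resources={"special_hardware": 1})
def f():
    # This task can only be scheduled on a node advertising
    # "special_hardware", and at most 2 instances run concurrently
    # (2 units of capacity / 1 unit per task).
    return "done"

print(ray.get([f.remote() for _ in range(4)]))
```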

USE CASES

  1. Load Balancing. In many cases, the preferred behavior is to distribute tasks across all available nodes in the cluster in a round-robin fashion. Examples include workloads that are network I/O bound. Spreading such tasks as thinly as possible eliminates network bottlenecks and minimizes stragglers (see the first sketch after this list).
  2. Affinity. Some applications require affinity between tasks and generally prefer the workload to be packed as tightly as possible. This is helpful, for instance, when dynamically auto-scaling a workload in the cloud, since densely packed nodes leave other nodes idle and free to be released (see the second sketch below).
  3. Anti-affinity. Other applications gain significant performance from preventing task co-location, in direct contrast to the former. In such cases there is a need for a repelling force between units of work, with the goals of (a) better performance by avoiding resource contention and (b) higher availability by eliminating single points of failure. This class of workloads has a preference for anti-affinity (see the third sketch below).
  4. Packing. Yet other applications require both: strong affinity within a group of tasks or actors, combined with an anti-affinity constraint across those groups. Examples include distributed data processing on partitioned datasets (Figure 1), which requires affinity between data processing tasks and the tasks that, for instance, read a partition of a CSV file. At the same time, we want to spread the vertical dataframe partitions (to maintain column-wise locality within partitions) broadly across the cluster, to maximize computational parallelism (a combined sketch follows the list).
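For load balancing, one common pattern (a sketch under assumptions, not the only approach) is to give every node a unique custom resource at startup and then assign tasks round-robin by requesting a small fraction of the matching node's resource, using the .options() placement override available in recent Ray releases. The resource names "node_0", "node_1", ... and NUM_NODES below are placeholders for whatever labeling your cluster uses.

```python
import ray

# Assumes an existing cluster in which node i was started with a unique
# resource, e.g. `ray start --resources='{"node_0": 1}'` on the first
# node, '{"node_1": 1}' on the second, and so on.
ray.init(address="auto")

NUM_NODES = 4  # hypothetical cluster size

@ray.remote
def fetch(i):
    # Stand-in for network-I/O-bound work that benefits from spreading.
    return i

# Requesting a small fraction of node (i mod NUM_NODES)'s unique
# resource pins each task to that node without monopolizing it,
# producing a round-robin spread across the cluster.
futures = [
    fetch.options(resources={f"node_{i % NUM_NODES}": 0.01}).remote(i)
    for i in range(100)
]
results = ray.get(futures)
```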
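For affinity, the inverse works: have all tasks in a group request the same custom resource, advertised by only one node, so the group packs onto that machine. A hypothetical sketch, assuming a single node advertises a "pack_group" resource with enough capacity for the whole group:

```python
import ray

ray.init(address="auto")

# Assumes one node was started with --resources='{"pack_group": 8}'.
@ray.remote(resources={"pack_group": 1})
def member_task(x):
    # Every task requesting "pack_group" co-locates on the single node
    # that advertises it; capacity 8 admits up to 8 concurrent tasks.
    return x * x

results = ray.get([member_task.remote(i) for i in range(8)])
```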
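For anti-affinity, give every node exactly one unit of the same custom resource and have each task or actor request the full unit; the scheduler then cannot place two of them on the same node. A sketch, assuming every node was started with --resources='{"exclusive_slot": 1}' (the name is a placeholder):

```python
import ray

ray.init(address="auto")

# Each replica consumes a node's entire "exclusive_slot" capacity, so
# no two replicas can share a node: losing one node loses one replica.
@ray.remote(resources={"exclusive_slot": 1})
class Replica:
    def ping(self):
        return "ok"

# At most one replica per node; on a 3-node cluster a 4th replica would
# stay pending until a node frees up or joins.
replicas = [Replica.remote() for _ in range(3)]
print(ray.get([r.ping.remote() for r in replicas]))
```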
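Finally, packing combines the two: tag node p with a group-specific resource such as {"partition_p": K}, and have every task in group p request one unit of it. Tasks within a group then pack together, while distinct groups necessarily land on distinct nodes. The partition resource names and counts below are assumptions for illustration.

```python
import ray

ray.init(address="auto")

NUM_PARTITIONS = 4       # hypothetical: node p advertises {"partition_p": 8}
TASKS_PER_PARTITION = 8  # matches the per-node capacity

@ray.remote
def process_column(partition_id, column_id):
    # Stand-in for column-wise work against one vertical partition.
    return (partition_id, column_id)

# All tasks for partition p request "partition_p" (affinity within the
# group); since only node p advertises that resource, distinct groups
# never share a node (anti-affinity across groups).
futures = [
    process_column.options(resources={f"partition_{p}": 1}).remote(p, c)
    for p in range(NUM_PARTITIONS)
    for c in range(TASKS_PER_PARTITION)
]
results = ray.get(futures)
```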