Disclaimer: NIST-developed software is provided by NIST as a public service. You may use, copy and distribute copies of the software in any medium, provided that you keep intact this entire notice. You may improve, modify and create derivative works of the software or any portion of the software, and you may copy and distribute such modifications or works. Modified works should carry a notice stating that you changed the software and should note the date and nature of any such change. Please explicitly acknowledge the National Institute of Standards and Technology as the source of the software.

NIST-developed software is expressly provided “AS IS.” NIST MAKES NO WARRANTY OF ANY KIND, EXPRESS, IMPLIED, IN FACT OR ARISING BY OPERATION OF LAW, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT AND DATA ACCURACY. NIST NEITHER REPRESENTS NOR WARRANTS THAT THE OPERATION OF THE SOFTWARE WILL BE UNINTERRUPTED OR ERROR-FREE, OR THAT ANY DEFECTS WILL BE CORRECTED. NIST DOES NOT WARRANT OR MAKE ANY REPRESENTATIONS REGARDING THE USE OF THE SOFTWARE OR THE RESULTS THEREOF, INCLUDING BUT NOT LIMITED TO THE CORRECTNESS, ACCURACY, RELIABILITY, OR USEFULNESS OF THE SOFTWARE.

You are solely responsible for determining the appropriateness of using and distributing the software and you assume all risks associated with its use, including but not limited to the risks and costs of program errors, compliance with applicable laws, damage to or loss of data, programs or equipment, and the unavailability or interruption of operation. This software is not intended to be used in any situation where a failure could cause risk of injury or damage to property. The software developed by NIST employees is not subject to copyright protection within the United States.

Author: Timothy Blattner

Installation

Getting Started

Approach

The intent of the Hybrid Task Graph Scheduler (HTGS) API is to transform an algorithm into a pipelined workflow system, which aims at fully utilizing a high performance compute system (multi-core CPUs, multiple accelerators, and high-speed I/O). HTGS defines an abstract execution model, framework, and API.

The HTGS model combines two paradigms, dataflow and task scheduling. Dataflow semantics represents an algorithm at a high-level of abstraction as a dataflow graph, similar to signal processing dataflow graphs. Task scheduling provides the threading model. A variant of task scheduling is used in which a task is assigned a pool of threads, which processes data using a shared queue. In this way data is sent to each task, which processes each item independently using the thread pool. This method of parallelism and orchestration requires a deep understanding of an algorithm and how to decompose the problem domain, which is required when parallelizing any algorithm.

The HTGS framework defines the components for building HTGS task graphs: tasks, bookkeepers, memory managers, and execution pipelines. These components are defined in such a way to provide a separation of concerns between computation, state maintenance, memory, and scalability. Through this representation an algorithm can be approached at a high level of abstraction.

The HTGS API is used to implement an algorithm that is represented within the HTGS framework. The API is designed in such a way that the graph representation from the model and framework is explicit. This allows for mapping the analysis phase of using HTGS with the implementation (and back). The primary use of this representation is to allow rapid prototyping and experimentation for performance.

The general idea with using HTGS is to overlap I/O with computation. I/O can be represented by disk, network, PCI express, or any metric that involves shipping data closer to the compute hardware. HTGS tasks should operate within memory for that compute hardware, such that the underlying task implementation be designed to improve utilization on the desired architecture (i.e. use vector processing units). Ultimately, HTGS provides abstractions for four components, which we treat as first-class issues: (1) Memory management, (2) Concurrency, (3) Expressing data locality, and (4) Managing data dependencies.

HTGS Design Methodology

There are five steps in the HTGS design methodology, as shown below. The first three steps are white-board stages; the remaining steps are coding and revisiting of the pictoral stages.

1. Parallel Algorithm Design

Represent algorithm as a parallel algorithm where computational kernels are laid out as a series of modular functions.

2. Represent as Dataflow

Map computational functions to nodes
Edges are data dependencies
- Annotate edges with data types/parameters

3. HTGS Task Graph Design

Computational entities are tasks
Dependencies are managed by Bookkeeper

4. HTGS API Coding

Implement using the HTGS API
One Create a htgs::ITask for each computational entity.
htgs::IData is used to represent any data that is needed for each htgs::ITask
- htgs::IData is transmitted as input and output from a htgs::ITask, which is based on the htgs::ITask's template types T, U, respectively.
Process dependencies using htgs::Bookkeeper
- Uses htgs::IRule to process dependencies
- htgs::IRule determines when to produce work
Connect each task using the htgs::TaskGraphConf::addEdge or htgs::TaskGraphConf::addRuleEdge
Debug using dot file generation with htgs::TaskGraphConf::writeDotToFile
- Real-time visualization (coming soon)

5. Refine and Optimize

Modify data granularity and scheduling behavior
- Improve data locality and data reuse
Create CUDA variants with htgs::ICudaTask
- Add htgs::ICudaTask to copy data to/from GPU before/after CUDA variant
Scale to multiple GPUs with htgs::ExecutionPipeline
Use HTGS profiling (add the directive "-DPROFILE" in compilation)
- Output the graph after execution using htgs::TaskGraphConf::writeDotToFile
- Visually identify bottlenecks
Add memory edges where necessary to regulate memory with htgs::TaskGraphConf::addMemoryManagerEdge
- Restricts the amount of memory transmitted through graph
- Defines data reuse with programmer-defined htgs::IMemoryReleaseRule
Real-time visualization (coming soon)

Overview of HTGS

The HTGS API is split into two modules:

The User API
- Classes that begin with 'I' denote interfaces that are to be implemented: htgs::ITask, htgs::IMemoryReleaseRule, etc.
The Core API

The user API is found in <htgs/api/...> and contains the API that programmers use to create a htgs::TaskGraphConf. The majority of programs should only use the User API.

The core API is found in <htgs/core/...> and holds the underlying sub-systems that the user API operates with.

Although there is a separation between the user and core APIs, there are methods to add high-level abstractions that can be used to new functionality into HTGS. The htgs::EdgeDescriptor is one such abstraction that is used as an interface to describe how an edge connecting one or more tasks is applied to a task graph and copied. This can be extended to create new types of edges. Currently the htgs::ProducerConsumerEdge, htgs::RuleEdge, and htgs::MemoryEdge are used to define how to connect tasks for the htgs::TaskGraphConf::addEdge, htgs::TaskGraphConf::addRuleEdge, and htgs::TaskGraphConf::addMemoryManagerEdge functions, respectively.

Tutorials

Tutorial 0 - Getting started with the HTGS Tutorials - HTGS Tutorials Github
- Compiling and running the tutorials
- Using the command-line with the HTGS Tutorials
- Using CLion with the HTGS Tutorials
- Using Eclipse CDT with the HTGS Tutorials
Tutorial 1 - Adding two numbers in HTGS - Source Code
- Making a htgs::ITask
- Creating htgs::IData
- Using the htgs:TaskGraphConf
- Executing with the htgs::TaskGraphRuntime
- How to interact with the htgs::TaskGraphConf
Tutorial 2a - Implementing the HadamardProduct - Source Code
- Creating a dependency with the htgs::Bookkeeper and htgs::IRule
- Methods for domain decomposition
- Debugging
Tutorial 2b - Augmenting the graph from Tutorial 2a to use disk - [Source Code]()
- Managing memory
  - Basic htgs::IMemoryReleaseRule and htgs::IMemoryAllocator
  - How to htgs::ITask::getMemory
  - How to htgs::ITask::releaseMemory
Tutorial 3a - Implementing Matrix Multiplication [Source Code]()
- Complex dependencies
- Terminating cyclic graphs
- Profiling and optimizations
Tutorial 3b - Profiling and debugging the graph from Tutorial 3a.
- GraphViz visualizations
- Development experimentation for performance
- Identifying bottlenecks
- Incorporating high performance libraries into HTGS tasks
  
  Tutorials below are still under development
Tutorial 3c - Augmenting the graph from Tutorial3a to use memory mapped files.
- How to incorporate memory mapped files into a task
- Performance comparrisons
Tutorial 4 - Augmenting the graph from Tutorial 3a to use CUDA - [Source Code]()
- Sending data to/from the GPU with htgs::ICudaTask
- Techniques for data locality
Tutorial 5 - Augmenting the graph from Tutorial 4 to use the htgs::ExecutionPipeline - [Source Code]()
- Scaling to multiple GPUs for matrix multiplication
- Decomposition strategies to improve locality
Tutorial 6a - Implementing Block LU Decomposition - [Source Code]()
- Methods for handling non-uniform computation
- Understanding the effects of poor decomposition strategies
Tutorial 6b - Implementing Block+Panel LU Decomposition - [Source Code] ()
- Methods for changing decomposition strategies
- Analyzing the effects of good decomposition strategies
Tutorial 7a - Augmenting the graph from Tutorial 6b to use CUDA - [Source Code]()
- Understanding the effects of locality and data motion
Tutorial 7b - Incorporating advanced locality techniques to Tutorial 7a - [Source Code] ()
- Implementing a sliding window to improve locality
- Effects of out-of-core computation on the GPU
Tutorial8 - Augmenting the graph from Tutorial 7b to use htgs::ExecutionPipeline
- Scaling to multiple GPUs for LU Decomposition
- Profiling to understand future improvements

Table of Contents