

High Performance Computing Cluster: Matilda

Computing
The Oakland University (OU) central HPC cluster, Matilda, is a Linux-based cluster intended to support parallel, GPU, and other applications that are not suitable for individual computers. The cluster consists of approximately 2,200 cores, and all nodes are interconnected with 100 Gbps InfiniBand networking.

The Matilda HPC cluster includes the following compute nodes:

  • 40 standard compute nodes, each with 192 GB of RAM and 40 CPU cores at 2.50 GHz.
  • 10 high-throughput nodes, each with 192 GB of RAM and 8 CPU cores at 3.80 GHz.
  • 4 large-memory nodes, each with 768 GB of RAM and 40 CPU cores at 2.50 GHz.
  • 4 hybrid nodes, each with 40 CPU cores at 2.50 GHz and capacity for specialized accelerator cards or GPUs.
  • 3 GPU nodes, each with 4 NVIDIA Tesla V100 16 GB GPUs with NVLink, 192 GB of RAM, and 48 CPU cores at 2.10 GHz.

Storage
The system includes 690 TB of high-speed scratch storage on a high-performance parallel file system, connected to each compute node via 100 Gbps InfiniBand.

Home directories, project space, and shared software reside on a Dell EMC Isilon H500, a storage system with an integrated backup solution. Data is replicated to a Dell EMC Isilon A2000 located in a secondary data center with independent power and HVAC systems. The Dell EMC Isilon A2000 can also provide an archival mechanism to Amazon Web Services.

Intra-networking
All Matilda HPC cluster nodes are interconnected with HDR100 InfiniBand, delivering up to 100 Gbps of bandwidth with sub-0.6 microsecond latency.

Inter-networking
The Matilda HPC cluster is connected to the Oakland University campus network with 10 Gbps connectivity, providing access to storage systems and connectivity from researchers' labs and workstations.

Software
The Matilda HPC cluster includes a comprehensive suite of open-source research software, including major compilers and many common research-specific applications.

Data Center Facilities
The Matilda cluster is housed within the North Foundation Hall data center. This facility is equipped with fire suppression systems, a standby generator and environmental controls.

Base Resource Allocations
Upon request, all OU-affiliated researchers receive 50 GB of home directory storage and 10 TB of scratch storage[1] on the Matilda cluster. This allocation gives OU-affiliated researchers access to the Matilda cluster and allows them to submit jobs as part of a PI project/group.

PIs are also provided with shared project space for research projects or group projects. These allocations are assigned to the PI and can be used by members of their group:

  • Compute hours[2]: 1,000,000 per year
  • GPU hours[3]: 50,000 per year
  • Shared project/group storage: 1 TB
  • Shared project/group scratch storage[1]: 10 TB

Compute and GPU hours are convertible, so researchers can use their allocation in whatever way makes the most sense for their specific needs. GPU hours carry a 10x billing weight, meaning that 100 GPU hours is the equivalent of 1,000 CPU hours, while 100 CPU hours is the equivalent of 10 GPU hours. Consequently, each PI and their group has an effective annual allocation of 1.5 million CPU-hour equivalents available for use. Usage is tracked in the aggregate for the PI and their group, and usage resets to zero at the start of each calendar year.
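
As a minimal illustration (a sketch only, not an official billing tool), the following Python snippet converts a mix of CPU and GPU usage into CPU-hour equivalents using the 10x billing weight described above:

    # Sketch only: convert CPU and GPU usage into CPU-hour equivalents
    # using the 10x GPU billing weight described above.
    CPU_HOURS_BASE = 1_000_000   # base compute-hour allocation per year
    GPU_HOURS_BASE = 50_000      # base GPU-hour allocation per year
    GPU_WEIGHT = 10              # 1 GPU hour is billed as 10 CPU hours

    def cpu_hour_equivalents(cpu_hours, gpu_hours):
        """Express a mix of CPU and GPU hours as CPU-hour equivalents."""
        return cpu_hours + GPU_WEIGHT * gpu_hours

    # The full base allocation expressed in CPU-hour equivalents:
    print(cpu_hour_equivalents(CPU_HOURS_BASE, GPU_HOURS_BASE))  # 1500000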

Rates for Additional Computational Resources
Researchers who need additional computational time beyond the annual base allocation can purchase additional resources. Current costs (which will be revised every two years) are:

  • Compute hours[2]: $0.024 per hour
  • GPU hours[3]: $0.24 per hour
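
As a rough illustration of these rates (a sketch only; the hour quantities below are hypothetical), the following Python snippet estimates the cost of purchasing additional hours:

    # Sketch only: estimate the cost of purchasing hours beyond the base
    # allocation at the per-hour rates listed above. The quantities in the
    # example are hypothetical.
    COMPUTE_RATE = 0.024  # dollars per compute (CPU-core) hour
    GPU_RATE = 0.24       # dollars per GPU hour

    def purchase_cost(extra_cpu_hours=0, extra_gpu_hours=0):
        """Return the dollar cost of the requested additional hours."""
        return extra_cpu_hours * COMPUTE_RATE + extra_gpu_hours * GPU_RATE

    # e.g. 200,000 extra CPU hours plus 5,000 extra GPU hours:
    print(f"${purchase_cost(200_000, 5_000):,.2f}")  # $6,000.00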

Additional purchased computational resources are placed in a separate account that is accessible to the researcher and any other group members they choose. Unlike base allocation amounts, which are "use or lose" (unused portions do not roll over from one year to the next), unused purchased resources remain available until exhausted. To use additional purchased hours, a researcher or group member must specify the account to be used during job submission.
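
For example, assuming a Slurm-style scheduler (the actual scheduler, flags, and account names on Matilda may differ; see the Research Computing and HPC documentation site), charging a job to a purchased-hours account might look like the following sketch, where "smith_purchased" and "my_job.sh" are hypothetical placeholders:

    # Sketch only: submit a batch script and charge its usage to a specific
    # allocation account. Assumes a Slurm-style scheduler; "smith_purchased"
    # and "my_job.sh" are hypothetical placeholders.
    import subprocess

    def submit_with_account(script_path, account):
        """Submit a job, charging usage to the given allocation account."""
        subprocess.run(["sbatch", f"--account={account}", script_path], check=True)

    submit_with_account("my_job.sh", "smith_purchased")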

Buy-In Nodes
Researchers who need hardware capacity beyond what is currently available on the Matilda cluster can purchase additional nodes. UTS staff will add purchased nodes to the cluster and manage them together with the rest of the cluster. Buy-in users and their research groups will have priority access[4] to all cluster resources they purchase. They will also receive additional compute time (CPU or GPU, as needed or desired) in the calendar year they purchase resources, based on rates in effect at the time of purchase.

To purchase a node, contact UTS at [email protected] to discuss your needs and get a quote. The exact price will depend on the hardware chosen, plus any incidentals that may be needed to connect the new hardware to the cluster.

Rates for Additional Storage
Researchers or groups who need additional storage beyond the annual base allocation can purchase additional space, depending on their specific storage needs. There are two base storage types: storage on the Matilda HPC cluster itself, or storage in one or more OU data centers without direct access to/from the Matilda cluster. Current costs (which will be revised every two years) are:

  • Matilda project or home directory quota: $260 per TB per year
  • Matilda scratch space quota: $72 per TB per year
  • Performance tier: $170 per TB per year
  • Archive tier: $90 per TB per year
  • Replicated performance tier: $250 per TB per year
  • Replicated performance tier with deep archive: $260 per TB per year
  • Archive tier with deep archive: $90 per TB per year

Support
The Matilda HPC cluster services are provided through a collaboration between the Oakland University Research Office and University Technology Services. For more information, visit the University Technology Services Research Support page or the Research Computing and HPC documentation site. To request access, fill out the Matilda HPC Cluster Access Request form (scroll down to "Matilda"; the online form requires an OU login).


[1] Scratch storage is short-term storage used only for working files. It is not backed up or mirrored. Inactive files (determined by the last time they were accessed) are deleted after 45 days.

[2] Compute hours are measured per CPU core used in a job. A job running on 40 CPU cores for one hour would consume 40 compute hours.

[3] GPU hours are measured per GPU requested, as typically only one job can run on a GPU at a time. A job requesting 2 GPU resources and running for one hour would consume 2 GPU hours.

[4] Priority access means that users are guaranteed to be able to start a job on a purchased resource in less than four hours when they need the purchased resource for a research project. Priority access to purchased resources lasts for five years from the date of purchase or the anticipated useful life of the hardware, whichever is less. When the purchaser is not using a purchased resource, it will be available to other cluster users for a maximum walltime of 4 hours per job.
