Neutrino Platform EHN1 Cluster Computing Model For Use By ProtoDUNEs
Overview
Definitions
- Data Taking Operations – during beam time (includes taking Beam generated and Cosmic Ray generated data at the same time)
- Cosmic Ray Operations – sustained running taking Cosmic Rays
- Commissioning – taking “ready” software and hardware and getting them ready for Operations
- Analysis Operations Phase – after Data Taking and Cosmic Ray Operations
Offline processing and analysis activities take place in each of these phases.
Introduction to the EHN1 Neutrino Platform Cluster
Compute resources
Goal:
| | # of Racks | # of Nodes/Rack | # of cores |
| Odd Numbered | 6 | 23 | 138 |
| Even Numbered Full | 4 | 24 | 96 |
| Rack08 | 1 | 18 | 18 |
| Total | 11 | 252 | 2016 |
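As a consistency check on the goal figures, the short sketch below recomputes the totals from the per-rack numbers. It suggests that the 138, 96 and 18 entries are racks times nodes per rack (node counts), and that 2016 cores over 252 nodes corresponds to 8 cores per node. The even spread of cores over nodes is an assumption inferred from the totals, not something the table states, and the names used are illustrative only.

```python
# Rack inventory from the "Goal" table: (label, racks, nodes per rack).
GOAL_RACKS = [
    ("Odd Numbered", 6, 23),
    ("Even Numbered Full", 4, 24),
    ("Rack08", 1, 18),
]
TOTAL_CORES_GOAL = 2016  # "Total" row of the table


def summarize(racks):
    """Return (total racks, total nodes) for a list of rack groups."""
    total_racks = sum(n_racks for _, n_racks, _ in racks)
    total_nodes = sum(n_racks * per_rack for _, n_racks, per_rack in racks)
    return total_racks, total_nodes


total_racks, total_nodes = summarize(GOAL_RACKS)

# Assumption: cores are spread evenly over nodes; the table does not say so.
cores_per_node = TOTAL_CORES_GOAL / total_nodes

print(total_racks, total_nodes, cores_per_node)
```

The per-group products (6 x 23, 4 x 24, 1 x 18) reproduce the third column of the table, which is why that column reads as node counts for the individual rows.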
Dec 2018:
| Cores Total now | 1847 |
| Hosts up | 232 |
| Hosts Down | 20 |
| Cores Final | 2016 |
Some trays in Rack08 are missing nodes.
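The status figures can be turned into simple availability numbers, which may be a useful starting point for the metrics discussed in the next section. This is only a sketch using the four values quoted above; the variable names are illustrative and no monitoring API is implied.

```python
# Status figures quoted in the table above.
HOSTS_UP = 232
HOSTS_DOWN = 20
CORES_NOW = 1847
CORES_FINAL = 2016

hosts_total = HOSTS_UP + HOSTS_DOWN          # all hosts, up or down
cores_missing = CORES_FINAL - CORES_NOW      # cores still to come online
host_availability = HOSTS_UP / hosts_total
core_availability = CORES_NOW / CORES_FINAL

print(f"hosts: {HOSTS_UP}/{hosts_total} up ({host_availability:.1%})")
print(f"cores: {CORES_NOW}/{CORES_FINAL} available "
      f"({core_availability:.1%}), {cores_missing} missing")
```

The host total of 252 agrees with the node count in the goal table, so the "Hosts up" and "Hosts Down" figures account for the full planned cluster.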
Validation and Metrics
Validation procedures and usage metrics for the Cluster need to be defined and collected:
- Nagios and Ganglia plots are presented on the web regularly.
- ProtoDUNE-DP runs benchmarks after any change in the software.
- ProtoDUNE-SP depends on the centralized DUNE Continuous Integration system. It will be useful if the EHN1 Cluster is included as a test site for this.
EHN1 Cluster Configuration and Usage Model
Accounts
During beam-time and cosmic-ray operations, pDUNE-DP will have an np04-dataprod service account (and e-group) as a privileged account with only a few administrative users. Each account will have an associated description of its use.
Outside of these times, job queues can be added to allow additional users, mapped through the DUNE VO, to use the resources.
ProtoDUNE Dual Phase Model of Use
ProtoDUNE Single Phase Model of Use
NP04 will let NP02 use its share of the NP EHN1 Cluster during NP02 Commissioning, Beam and Cosmic Data Taking.
(An agreement is in progress between the two experiments for an equivalent number of slots on the Tier-0 to be made available from the NP02 share for use by NP04.)
Given current input from ProtoDUNE-DP, NP04 will relinquish use of the EHN1 NP cluster from June 1 to Dec 1 2018. (PLEASE CHECK/ADD here..) Given current HEPSPEC benchmarks, we expect to discuss a ratio of about 1 Tier-0 core-day for every 3 EHN1 NP Cluster core-days as the equivalent to be provided by DP on the Tier-0.
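To make the proposed exchange concrete, the sketch below converts EHN1 core-days into Tier-0 core-days using the approximate 3:1 ratio and the June 1 to Dec 1 2018 window quoted above. The size of the NP04 share is not given here, so `example_cores` is a purely hypothetical placeholder, and the function name is illustrative only.

```python
from datetime import date

# About 3 EHN1 NP Cluster core-days per Tier-0 core-day (ratio still under discussion).
EHN1_PER_TIER0 = 3.0


def tier0_core_days(ehn1_cores, start, end, ratio=EHN1_PER_TIER0):
    """Tier-0 core-days equivalent to lending `ehn1_cores` from start to end."""
    days = (end - start).days
    return ehn1_cores * days / ratio


start, end = date(2018, 6, 1), date(2018, 12, 1)
example_cores = 1000  # hypothetical NP04 share, NOT a figure from this page

print((end - start).days, tier0_core_days(example_cores, start, end))
```

Keeping the ratio as a parameter makes it easy to redo the estimate once the HEPSPEC-based figure and the dates are confirmed.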
Outside of these dates NP04 has asked the DUNE Software and Computing group to include the EHN1 Computing Cluster as part of the transparently usable distributed offline facility.
ProtoDUNE-SP will work with NP and DUNE S&C on how best to accomplish this and make use of the EHN1 NP Cluster resource. (PLEASE CHECK if this is true: the Torque job management system currently preferred by ProtoDUNE-DP is not supported as part of the DUNE S&C distributed offline facility.)
Operations and Training - initial thoughts
The Joint Data Challenge (currently scheduled for April 9, 2018) includes a component for Operations. The EHN1 NP Computing Cluster is expected to be part of that team, which will define and then exercise a model for support and operations during data taking. The team is currently coordinated by the DUNE S&C Coordinators, Andrew Norman and Heidi Schellman.
DUNE S&C provides training materials and sessions at Collaboration meetings. Input for the EHN1 NP Computing Cluster needs to be provided to them. As initial input:
Training is needed on how to use the overall distributed computing infrastructure, including all clusters. This group and task will be responsible for release preparation and for support of data preparation and physics analysis; for example, it would like to have some extra cores and/or nodes for analysis. It also needs to be settled which release users want and how to include it in the input path.
Individual items to get into working order, along the same lines:
- Where is the software infrastructure, and where are the releases?
- Where are the validations?
- Everything needs to be automatic, so that it can be done with the press of a button.
-- RuthPordes - 2017-12-12