Activities and deployment history for the CMS multicore project
Deployment to T2 sites (Feb-Apr 2016)
2015 data reprocessing in multicore (Dec. 2015-Feb 2016)
Deployment of multicore monitoring (Aug. 2015, ongoing)
Scale test for multicore jobs submission to Tier1s (Mar/Apr. 2015)
Running PromptReco jobs at T1s, trying to achieve the 50% target of allocated multicore resources.
Increased pilot submission scale tests at Tier1s (Feb/Mar. 2015)
It is planned that CMS will be running prompt reco jobs at the Tier 1s in support of the T0 in the coming data taking period. In contrast to the centralized production jobs run until now, these jobs will be multicore, which means that CMS needs to be able to access CPUs at the T1s in multicore mode. Resource allocation is however decoupled from the actual use, which may be single core, multicore, or a combination of both types of jobs, thanks to the internal resource partitioning capability of our pilots. CMS T1s have been running multicore pilots with multiple single core jobs inside them. We have been doing this at all T1s for several months now, so we have verified that multicore resource allocation works (even if not at full scale yet). However, we now need to test the resource allocation and job submission tools up to the scale at which they are expected to actually be needed, which is about 50% of the T1 resources.
This is why we need to increase the number of multicore pilots running at our T1s, even if they are initially filled with single core jobs, so that we can then start testing the T0 machinery to support running real multicore jobs at the T1s at the required scale. Sites running a mix of single core and multicore pilots/jobs would ideally allocate resources to each type dynamically, depending on the load.
In the previous step we included one entry per site, selecting one of the endpoints of each site, as opposed to the single core pilots, which try all of the available CEs. We need to homogenize the configuration of our pilot factories by adding new multicore entries for the remaining T1 CEs. (This was already done for KIT, at their request, to balance the mcore pilot load across their CEs, and a second entry was also added for PIC.) The purpose of this is also to increase the ratio of multicore to single core pilots at the T1s in order to reach the required fraction of resources allocated in multicore mode (50% of their pledged CPU) for the coming T0 mcore job scale tests. This has to be done taking into account site pledges and the total number of pilots a site should be receiving.
FNAL has expressed its intention to move 100% to multicore pilots for CMS. For the remaining sites, which are shared with other VOs, we will proceed more carefully in order to ensure that we use the resources as efficiently as possible. In particular, the pilot maximum allowed walltime and retire time should be adjusted to minimize CPU wastage during the pilot draining period at the end of its lifetime.
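As a rough illustration of the sizing involved, here is a minimal sketch (Python, with a hypothetical function name and pledge value, not part of any CMS tool) of how many N-core pilots a site would need to run to cover a given fraction of its pledged cores:

# Illustrative sketch only: estimate how many multicore pilots are needed so that
# pilots * cores_per_pilot covers a given fraction of a site's pledged cores.
def pilots_for_target(pledged_cores, target_fraction=0.5, cores_per_pilot=8):
    target_cores = target_fraction * pledged_cores
    return int(round(target_cores / cores_per_pilot))

# Hypothetical example: a site pledging 4000 cores to CMS would need about
# 250 eight-core pilots running to reach the 50% multicore allocation target.
print(pilots_for_target(4000))   # -> 250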
Multicore jobs submission tests (October/November 2014)
With respect to real multicore jobs, by October 2014 we had a first version of the software (CMSSW_7_2_0_pre8) which could be run multithreaded. Even if still not fully efficient, it was adequate for the project in terms of testing whether single core and multicore jobs could be injected into the system and run simultaneously in the same pilots. It was first tested by Alan, assigning multithreaded jobs using 4 cores to PIC, after patching the code for the ReqMgr. The jobs executed DIGI+RECO running on RelVal GEN-SIM datasets. Then RECO jobs on RAW data were also tested in order to compare CPU performance. By the end of October, the basic milestone of being able to submit multicore jobs using the WMAgent was achieved.
However, it became apparent that neither the dashboard nor the job reports (the FJR.xml files) were ready for multicore jobs, as the number of used cores was not being reported and the dashboard was ignoring any job with CPU/walltime efficiency above 100%. On the job report side, a GitHub issue was created to implement the needed changes (https://github.com/dmwm/WMCore/issues/5437). The work should then continue on the dashboard side, where efficiency should be redefined as CPU/(walltime x number of cores used) and it should also become possible to filter jobs by number of cores.
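To illustrate the redefinition, a minimal sketch (hypothetical function name and numbers) of how the efficiency of a multithreaded job would be computed:

# Illustrative sketch of the redefined efficiency for multicore jobs:
# old definition: cpu_time / walltime (can exceed 100% for multithreaded jobs),
# new definition: cpu_time / (walltime * ncores).
def cpu_efficiency(cpu_time_s, walltime_s, ncores=1):
    return cpu_time_s / float(walltime_s * ncores)

# Example: a 4-core job using 10 hours of CPU over 3 hours of walltime shows
# 333% with the old definition (rejected by the dashboard) and 83% with the new one.
print(cpu_efficiency(10 * 3600, 3 * 3600))      # -> 3.33
print(cpu_efficiency(10 * 3600, 3 * 3600, 4))   # -> 0.83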
Discussion also focused on how to manage the different parts of a workflow when it includes both parallelizable and non-parallelizable steps. At the moment, multicore can only be set to yes or no for the whole workflow (see https://github.com/dmwm/WMCore/issues/5420).
ADD PLOTS EXPERIENCE AT PIC
PromptReco tests were then run in November 2014 (see the monitoring plots). They worked technically, showing that we can indeed submit and run this type of job at the T1s. However, they also showed that not all T1s have the same flexibility in switching their resources from single core to multicore use, which is related to the different configurations of the schedulers on the site side.
ADD PLOTS EXPERIENCE AT ALL T1s.
Sustained running pilots tests at all Tier1s (ready by August 2014)
Performed at all the Tier1 sites, as soon as step 1 was ready for each of them. This second step involves getting closer to a "real life" amount of resources from the sites, being able to run real workflows through multicore pilots. Lacking a ready multicore application (initially foreseen for Fall 2014 at the earliest), we started testing the infrastructure by scheduling single core jobs, such as production jobs, to the T1s. The main reason to run these continuous submission tests as soon as possible was to expose sites to multicore jobs so that they could adapt their batch system and scheduler policies, and to understand how our model behaves when running alongside ATLAS and other VOs at shared sites (which most of them are). This also provided us with feedback from the sites to continue the discussions in the WLCG multicore task force.
For each of these sites we started the tests with a new entry in the production factory, limited to 50 total pilots with 20 idle. For RAL, KIT, and IN2P3, as well as for PIC, we use the default value shared with ATLAS, that is, 8 cores per pilot, in order to help sites turn resources over from one user to another with minimal impact on their farm utilization. This is also the ncore value for FNAL. For JINR, we use 12 cores per pilot.
The first issue we noticed was that pilots were running 7 jobs internally instead of 8. This was related to the total memory assigned to the mcore pilot in its configuration (initially 8 x 2048 MB), whereas the default value production jobs request in their classads is apparently 2300 MB. We reconfigured the pilot definition and the new pilots started running 8 jobs simultaneously, as they should.
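A quick worked example of the mismatch, using the numbers quoted above (the snippet itself is only illustrative):

# Worked example: the mcore pilot advertised 8 x 2048 MB in total, while each
# production job requested 2300 MB in its classads, so only 7 jobs fit per pilot.
pilot_memory_mb = 8 * 2048        # 16384 MB advertised by the pilot
job_request_mb = 2300             # default memory request of production jobs
print(pilot_memory_mb // job_request_mb)   # -> 7, leaving one core idle
# After raising the pilot memory (e.g. to at least 8 x 2300 MB), 8 jobs fit again.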
Then the pilot running time had to be increased from the original proposal of 30 hours to 60 hours, with a GLIDEIN_Retire_Time of 30 hours (previously the default value of 6h). This was motivated by the low CPU usage observed and reported by PIC and RAL: multicore pilots stopped requesting jobs from the central pool too soon, just 6h after starting, and were therefore not running fully loaded with 8 jobs for most of their lifetime (see the RAL report at https://indico.cern.ch/event/327276/contribution/2/material/slides/1.pdf). In July, RAL reported that 25% of their used walltime was due to multicore pilots, yet their CPU efficiency remained around 55% on average, compared with 70% for single core jobs. This was a significant waste, hence the need to increase the CPU usage of multicore pilots by extending both the total walltime and the retire time.
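As a rough illustration of the change (assuming, as described above, that the retire time is counted from the pilot start), the fraction of the pilot lifetime during which it still accepts new jobs goes from 20% to 50%:

# Illustrative comparison of the old and new pilot lifetime settings:
# GLIDEIN_Retire_Time is the period (from pilot start) during which new jobs are accepted.
def accepting_fraction(walltime_h, retire_time_h):
    return retire_time_h / float(walltime_h)

print(accepting_fraction(30, 6))    # old settings -> 0.2: pilots drain for most of their life
print(accepting_fraction(60, 30))   # new settings -> 0.5: far less time spent half-empty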
Results for multicore tests at Tier1s:
- PIC: Accepting multicore pilots since the first half of April. PIC set up a separate queue to receive CMS multicore pilots; the queue is used internally for scheduling purposes (see https://indico.cern.ch/event/327276/ for reports on PIC's use of the mcfloat script, developed and originally used by NIKHEF). PIC requested that the number of mcore pilots be increased gradually. For example, out of 900 slots, we started by running 50 8-core pilots, so 400 single core job slots are being used via multicore pilots. The site also started developing monitoring tools to better evaluate pilot behavior from the site side.
- JINR: Started by May 6th 2014.
- FNAL: Started after the deployment of partitionable slots on its cluster (32 and 64 core machines managed by HTCondor) and the re-enabling of the multicore entry in the production factories, around May. Reconfigured on August 6th to run 8-core pilots with parameters similar to the rest of the sites.
Submission feasibility tests for Tier1s (ready by April/May 2014)
Performed at all CMS T1 sites. The objective of this step was to check that our multicore submission tool works, i.e. that the multicore pilots are able to get multiple cores from the remote WNs. Since this was not a real scale test yet, only a small amount of local resources was actually needed. The key was understanding how to pass our resource requirement from the pilot to the remote batch system, through the CEs.
With respect to the configuration of the remote resources, we need to get multiple cores from one node, but we don't need the WNs to be partitioned a priori in any special way. For example, in the PIC case, a node with N cores would normally get allocated N single core jobs, so the WN is seen by the batch system as N (single core) job slots. Pilots get the N cores by being allocated those N job slots simultaneously, but the actual partitioning of the machine with respect to Torque is not changed, meaning that we don't request "whole nodes" or "multicore slots". It is also up to the sites to decide whether they want to use the same or different queues for single core and multicore jobs, etc., according to their batch system and scheduler tools and policies.
This test started successfully at PIC. The pilot definition (entry in the pilot factory) needs to include an rsl string containing the resource request. For each site the required configuration was first tested in the ITB glidein factory. For example, in CREAM syntax:
WholeNodes = False, HostNumber = 1, CPUNumber = N
In the case of PIC, this is then translated by the CE and passed to Torque as the following request: qsub -l nodes=1:ppn=N, which tells Torque to assign N slots on the same WN to our pilots.
The status of this first step for the Tier1s is the following (CE/batch system technologies):
- PIC (CREAM/Torque): as mentioned, multicore pilots have been successfully sent to PIC. READY
- RAL (Nordugrid/HTCondor): Tested by Alison by sending pilots from the ITB factory running sleep jobs. RAL had also been receiving multicore jobs from ATLAS. We use the ARC CEs, with an rsl string such as (count=8)(memory=2000)(runtimeenvironment=ENV/GLITE). READY
- FNAL (GT5/HTCondor): FNAL has received multicore pilots previously, so the entries in the current production factory should work. READY
- KIT (CREAM/UGE): Receiving ATLAS multicore jobs through CREAM CEs since Jan. 2014. Our resource request, the same as in the case of PIC, should be correctly passed to the batch system. As of April 10, multicore tests in the glideinwms ITB have succeeded in running 4 sleep jobs/glidein. READY
- CNAF (CREAM/LSF): Discussed our first steps with Claudio. At first the queue coordinates did not seem to be correct and we kept coordinating with the admins; solved by April 24th, our pilots run at CNAF successfully. READY
- CCIN2P3 (CREAM/GE): Sebastien provided us with the CE and queue names on May 6th; testing was done by May 7th. READY
- JINR (CREAM): Able to run multicore glideins (12 cores) through the ce{01,02} mcore queues by April 29th. READY
- ASGC: hybrid openstack/opportunistic/multicore cluster is working properly using parrot and 4 cores/glidein. Monte Carlo tests planned soon. NO LONGER A CMS T1 SITE
Initial multicore pilot tests (ready by Jan 2014)
Our proposal to schedule CMS jobs, both single core and multicore, is to make use of a common tool, the multicore pilot with internal partitioning of resources, provided by our GlideinWMS infrastructure. This tool has been under "test and debug" work for a few months, using mainly PIC as a test site: see the slides of the CHEP13 contribution and the CMS Computing Meeting. It is ready.
List of OLD QUESTIONS and ToDo's
The final goal of the scheduling problem is to answer the question: "what is the best solution to schedule jobs efficiently, if we have to allocate resources for single core and multicore jobs at the same time?". In the WLCG Multicore TF discussions, we have received input from some sites anticipating that we should probably send both single core and multicore pilots so that they can better optimize job scheduling and resource usage on their own. This is of course to be evaluated on a site by site basis, depending on the mix of jobs from different VOs each site expects to get. The default, however, should be to send only multicore pilots. But this means that we need to work out a "plan B": sending both single core and multicore pilots to a site to run single core jobs. This is also true for an additional reason: when we start sending multicore pilots to a given site, we would like to proceed progressively, so a mixture of single core and multicore pilots is also needed. It is now clear that the solution seems to be a mixture of single core and multicore jobs/pilots run at the sites, with dynamic allocation of resources to each pool. CMS will continue to submit single core jobs even if the main code is parallelized, as support tasks (merge jobs, log collector, etc.) as well as analysis jobs will probably remain single core. These single core jobs may run via single core or multicore pilots.
Currently, we can't send single core jobs through multicore pilots without disabling the single core entries for a given site in the pilot factory (so that the entries producing multicore pilots are the only ones active). Alternatively, if we want to keep both types of entries active, we have to hack the WMAgent configuration (so that it considers "Production" to be a type of multicore task), making it add "DESIRES_HTPC = True" to the job classads and thus trigger multicore pilot creation for properly defined multicore entries.
The end result should be:
- multicore jobs: always point to multicore entries therefore always trigger multicore pilots
- single core jobs: may point to single core or multicore entries
OLD PROPOSAL -> My proposal (Antonio) to achieve such functionality would be to keep both single core and multicore entries in the pilot factory, distinguished by a tag, "GLIDEIN_Is_HTPC = True". Then the WMAgent can assign the load not only to a site (or group of sites) but also via single core or multicore pilots. We get this switch by adding the requirement "DESIRES_HTPC = True" to the job classads at workflow assignment time (instead of at creation time). NEW PROPOSAL -> We can remove the HTPC tag from all entries, keeping both types active, so that single core jobs trigger both types of pilots. We can control the proportion of single core to multicore pilots through the number of pilots per factory entry; in fact we could set the number of pilots to zero, effectively suppressing the creation of pilots even if the entry is enabled. The number of pilots is then the tuning knob.
TO BE SOLVED: The problem then is that with the current job-pilot matching policy, with no HTPC tags, multicore jobs trigger single core pilots, which would therefore run empty! See the slides from the glidein chat; we have to avoid this behavior. The frontend matching expressions included the condition (get("GLIDEIN_Is_HTPC")=="True")==(job.get("DESIRES_HTPC")==1). This is now solved in the latest versions of GlideinWMS, so the tag is no longer needed: new versions of glideinWMS (apparently from 3.2.7) match request_cpus in the job JDL to GLIDEIN_CPUs in the factory configuration by default; if neither is explicitly set, both are considered equal to 1.
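For reference, a rough sketch of the two matching approaches as Python functions; the exact attribute names, dictionary layout and comparison operator used internally by glideinWMS are assumptions made purely for illustration:

# Hedged sketch, not the literal glideinWMS code.
# Old explicit matching via the HTPC tags (reconstructed from the expression above):
def match_htpc(job, glidein):
    return (glidein.get("GLIDEIN_Is_HTPC") == "True") == (job.get("DESIRES_HTPC") == 1)

# Approximate behavior of the newer default (glideinWMS >= 3.2.7, as noted above):
# the job's request_cpus is compared to the entry's GLIDEIN_CPUs, both defaulting to 1,
# so a multicore job can no longer be matched to (and leave empty) a single core pilot.
def match_cpus(job, glidein):
    return int(job.get("RequestCpus", 1)) <= int(glidein.get("GLIDEIN_CPUs", 1))

print(match_cpus({"RequestCpus": 1}, {"GLIDEIN_CPUs": 8}))   # True: single core job fits an 8-core entry
print(match_cpus({"RequestCpus": 8}, {"GLIDEIN_CPUs": 1}))   # False: multicore job skips single core entries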
OLD suggestion: start by switching sites completely to multicore (by disabling the single core entries). The need for sending both types of pilots had not yet been demonstrated, so we should go ahead with our base model, "all through multicore pilots". This was motivated by the difficulty of making any change in the WMAgent code. As the new proposed solution only involves GlideinWMS configuration, we don't need to go directly to all multicore; in fact, sites prefer to proceed incrementally so that they can adjust to the new situation (reconfiguring their local scheduler algorithms).
TO BE REVIEWED: Accounting of allocated and used resources is not properly done by the official tools (APEL parser, EGI accounting portal, etc.). The issue was raised repeatedly by the WLCG multicore task force until it received attention from the people in charge (see https://indico.cern.ch/event/272785/contribution/6/material/0/1.pdf) and it is currently being addressed. In this context, PIC was expected to start accepting multicore pilots on April 1st, so we could start with PIC.
Update: the issue is going to take longer than anticipated. The sites have nevertheless decided to go ahead and start running multicore jobs.
Update 2: See http://indico.cern.ch/event/336243/contribution/3/material/slides/0.pdf
TO BE REVIEWED: In order for the multicore pilots to be highly efficient, the following points should be considered:
- Avoid requesting N multicore pilots for N single core jobs, but rather apply N_cores as the weight in the pilot assignment to sites.
- Draining multicore pilots can hold thousands of empty job slots, as they have to stay around until their last job is done. These should not be counted as open slots by the negotiator, causing it to not request any more pilots at other sites. From Krista: "This may be an issue with how the frontend is counting slots? The draining ones may fall under a state it thinks is available so it doesn't request any more". Being followed up by Krista.
- If after some time interval the remaining workload is not enough to fill all the slots allocated to the multicore pilots running at a given site, the system should not keep, e.g., two half-empty multicore pilots running; instead, it should assign all the jobs to one of them and let the other one end so that its slots are released. We should use a suitable job allocation policy to minimize the resources we need to keep running for a particular workload (condor_negotiator? RANKING expressions?). From Krista: "Maybe we should add a rank expression to favor multicore pilots that are already running jobs" (see the illustrative sketch after this list).
- From James: "We have encountered a problem in using multi-core pilots for analysis under certain circumstances, in particular when different workflows with widely varying job lengths are submitted in large number to a particular site and matched to multi-core glideins, and also subsequently no more jobs are queued for that site. The short jobs complete quickly, and the slots occupied by the short jobs become unmatched, while the remaining longer-running jobs ensure that the glideins persist for many hours, even if they have many empty slots. This can result in thousands of unmatched glidein job slots, which places tremendous strain on the negotiator since it has to negotiate all the unmatched slots even if there is no demand for them.
In the CMS Analysis Pool we place an upper limit on the total number of unmatched glideins beyond which we stop requesting new pilots, in order to protect the negotiation cycle length from blowing up. When many thousands of unmatched glidein job slots became available at Purdue on May 8th, this global limit stopped any pilots from being requested at any other site overnight, effectively draining the sites. The system quietly recovered on its own as the long jobs at Purdue finished and the sparsely occupied pilots terminated."
- Some deviations have been detected when comparing the factory monitoring pages for multicore entries with the information received directly from the sites. The problem seems to be that one can't check just one factory, but has to check the three production factories in total. A glidein monitoring task force should be organized, in case it does not exist yet. Following the previous example, an aggregated view per site/entry including all factories and frontends could be implemented (this issue has been pending for some time though, see https://cdcvs.fnal.gov/redmine/issues/4992).
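As referenced in the point on half-empty pilots above, a minimal sketch (purely illustrative Python, not the actual negotiator or frontend configuration; names are hypothetical) of the packing preference idea, i.e. favoring multicore pilots that are already running jobs:

# Illustrative sketch only: when placing a new single core job, prefer multicore
# pilots that are already busy, so that empty pilots can retire and free their slots.
def rank_pilot(pilot):
    busy = pilot["busy_slots"]
    return (busy > 0, busy)   # prefer non-empty pilots, then the fullest one

pilots = [
    {"name": "pilot_A", "busy_slots": 0},   # freshly started, completely empty
    {"name": "pilot_B", "busy_slots": 5},   # already running 5 of its 8 jobs
]
print(max(pilots, key=rank_pilot)["name"])   # -> pilot_B: fill it before touching pilot_A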
DONE: Wait for patch/fix for the partitionable slots problem (DAEMON_SHUTDOWN - https://cdcvs.fnal.gov/redmine/issues/5359).
--
AntonioPerezCalero - 2016-02-17