WLCG Multicore Deployment task force meeting on July 1st.

Attendance: A. Sedov, A. McCrea, A. Filipcic, A. Lahiff, A. M. Levin, C. Acosta, C. Wissing, C. Grandi, D. Crooks, G. D. Roy, M. Alef, S. Dal Pra, T. Hartmann, A. Forti and A. Pérez-Calero.

The meeting was dedicated to reviewing the status and impact of CMS multicore pilot jobs running at CMS T1 sites, with reports from KIT (M. Alef), PIC (C. Acosta) and RAL (A. Lahiff).

* KIT report: KIT has been running CMS and ATLAS multicore jobs since each experiment started submitting them. No changes to the KIT setup, as described in their last report on the system and its configuration, were needed to accommodate CMS jobs. In summary, no issues have been identified at this stage and the site is running multicore jobs smoothly.

Some difference is noticed, however, between the job submission patterns of ATLAS and CMS. In the ATLAS case the "wavelike pattern" typically remains, while that of CMS is relatively smoother. Since there is still no walltime estimation in either case, no backfilling is possible, and therefore irregular multicore job submission tends to degrade overall system performance. In a more regular scenario, such as that of CMS, multicore jobs can in general replace one another in the same slot, except when higher priority jobs from another VO introduce fluctuations.
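The dependence of backfilling on walltime estimates can be illustrated with a minimal sketch. This is not the Maui algorithm or the KIT configuration; the function name and its parameters are hypothetical and only show why a missing estimate rules out backfilling a draining core.

```python
from typing import Optional


def can_backfill(estimated_walltime_h: Optional[float],
                 hours_until_slot_ready: float) -> bool:
    """Decide whether a single core job may use a core that is being
    drained for a multicore slot (hypothetical, simplified rule).

    The job may run only if it is expected to finish before the
    multicore slot is complete. With no estimate, the scheduler cannot
    know that, so it must leave the core idle.
    """
    if estimated_walltime_h is None:
        return False  # no walltime estimate: backfilling is not possible
    return estimated_walltime_h <= hours_until_slot_ready


# A core will be idle for 3 hours while the rest of the node drains.
print(can_backfill(2.0, 3.0))   # short job with an estimate
print(can_backfill(5.0, 3.0))   # job too long for the gap
print(can_backfill(None, 3.0))  # job without an estimate
```

Under this simplified rule, only the first job could fill the gap; the job without an estimate leaves the core unused, which is the degradation described above.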

* PIC report: PIC has explored two approaches to the problem of multicore job scheduling: first a pure Maui configuration, and then the mcfloat script developed at NIKHEF. The conclusion from the results presented is that the mcfloat tool is clearly the better solution, as it keeps multicore slots open for multicore jobs, thus minimizing additional WN draining and providing better CPU usage overall.

In view of these positive results at NIKHEF and PIC, sites using the Torque and Maui combination as batch system and scheduler are encouraged to consider incorporating the mcfloat script into their configuration.

* RAL report: RAL introduced some changes in their system configuration with respect to the last report, in order to increase the probability that multicore slots are reused by consecutive multicore jobs and so avoid degradation from node draining. No additional changes were needed to accept CMS jobs. Degradation of the farm is kept under control (below 3% of CPU slots unused).

An analysis of CMS multicore jobs follows. Walltime for these jobs is ~10 hours on average, although with long tails up to 30 h. The main issue is job efficiency. Even though CMS single core jobs are generally CPU efficient (80%) and multicore pilots run multiple single core jobs, the overall efficiency of a pilot is clearly worse than that of the individual jobs (below 60%). According to the study presented by RAL, pilots only run full, with 8 jobs, for a fraction of the time after they start. After that (slides 11 and 12) they start draining internally. As a result, in the tails of longer running pilots, the longer the pilot runs, the less efficient it is (slide 8).
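The effect of internal draining on pilot efficiency can be sketched with a small calculation. The job end times below are illustrative values, not figures from the RAL slides, and the function name is hypothetical; only the 8 cores and the 80% single core efficiency come from the report above.

```python
def pilot_cpu_efficiency(job_runs, cores, job_efficiency):
    """CPU efficiency of a multicore pilot.

    job_runs: list of (start_h, end_h) intervals, one per single core
              job run inside the pilot (hours since pilot start).
    cores: number of cores the pilot holds for its whole lifetime.
    job_efficiency: CPU time / walltime of each single core job.

    The pilot occupies all its cores until its last job ends, so cores
    that finish early and get no new work count as wasted walltime.
    """
    pilot_walltime = max(end for _, end in job_runs)
    busy_core_hours = sum(end - start for start, end in job_runs)
    cpu_hours = job_efficiency * busy_core_hours
    return cpu_hours / (cores * pilot_walltime)


# Pilot that stays full: 8 jobs, all running for the full 10 hours.
full = [(0, 10)] * 8

# Pilot that drains internally: cores free up at 4, 6 and 8 hours and
# receive no new jobs, while the pilot itself lasts 10 hours.
draining = [(0, 10), (0, 10), (0, 8), (0, 8), (0, 6), (0, 6), (0, 4), (0, 4)]

print(f"full pilot:     {pilot_cpu_efficiency(full, 8, 0.8):.2f}")
print(f"draining pilot: {pilot_cpu_efficiency(draining, 8, 0.8):.2f}")
```

With these assumed numbers the full pilot matches the 80% of its single core jobs, while the draining pilot falls to 56%, which shows how a pilot can drop below 60% even though every job inside it is individually efficient.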

Antonio comments that this probably points to a suboptimal configuration of the lifetime parameters in CMS multicore pilots, which should be corrected by CMS.

* Other topics: Next week is the WLCG Workshop in Barcelona (see https://indico.cern.ch/event/305362/), where the Multicore TF has a dedicated slot to present the status and activities of the group.

As ATLAS multicore job submission is expected to resume soon, we are getting closer to the combined experience of scheduling CMS and ATLAS multicore jobs at shared sites. New sessions to review results will be proposed through the TF mailing list when those results become available.

-- AntonioPerezCalero - 01 Jul 2014

Topic revision: r1 - 2014-07-01 - AntonioPerezCalero