WLCG Multicore Deployment task force meeting on September 9th.

Attendance: A. Chierici, A. Lahiff, C. Acosta, J. Templon, J. Belleman, J. Hernandez Calama, M. Alef, M. Raso-Barnett, P. Love, Rod, S. Gedrat, T. Hartmann, A. Forti and A. Pérez-Calero.

Meeting dedicated to the review of the setups at CCIN2P3, KIT and RAL, along with a discussion on the passing of job parameters to remote batch systems via the CEs.

* CCIN2P3 report (S. Gedrat): The UGE setup for multicore jobs at CCIN2P3 is presented to the TF, based on a more detailed recent contribution to HEPiX. The main aspects of the installation are its complexity and the need for manual intervention to allocate resources to the pool available for multicore jobs, which arrive at the farm from several VOs and with a variety of requested core counts, from 2 to 24. These numbers cannot be agreed to a single common value, as CMS and ATLAS have done with 8 cores, because of the diversity of the VOs participating in CCIN2P3.

As at other sites, CPU cycles are wasted during the draining needed to move WNs from the single core job pool to the multicore job pool, via a mixed pool acting as a buffer. The variety of core counts requested and the unpredictability of user requests also contribute to CPU wastage. Users therefore normally get in contact in advance in order to plan the use of resources.

The need for manual intervention arises because the complexity of the installation makes the use of node reservations in the scheduler (as done at KIT) impractical: a scheduling cycle involving reservations would take several minutes, causing system instabilities. Work is ongoing to simplify the batch system configuration in order to recover the reservation functionality.

A comment is made on the need to drain the nodes when returning them to the single core WN pool. This represents additional wastage that is in principle not needed: simply allowing single core jobs to run would fill the node as the multicore job finishes.

A suggestion is made to limit the number of nodes or slots being drained simultaneously, in the spirit of the mcfloat script in use at NIKHEF and PIC, in order to keep farm utilization high.
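
To illustrate the idea (not the mcfloat script itself, which differs in detail), a minimal sketch of such a throttle is given below in Python; the parameter values and the two farm-state inputs are purely illustrative assumptions.

    # Throttling logic in the spirit of mcfloat: given the current state of the
    # farm, decide how many additional WNs may start draining right now.
    MAX_DRAINING = 4        # illustrative cap on simultaneously draining WNs
    TARGET_MC_NODES = 20    # illustrative target size of the multicore pool

    def nodes_to_start_draining(mc_nodes, draining_nodes):
        missing = TARGET_MC_NODES - mc_nodes        # WNs still to be converted
        headroom = MAX_DRAINING - draining_nodes    # room left under the cap
        return max(0, min(missing, headroom))

    # Example: 12 WNs already in the multicore pool, 3 currently draining
    print(nodes_to_start_draining(12, 3))           # prints 1

Capping the number of simultaneously draining nodes bounds the number of slots idling at any given time, which is what keeps overall farm utilization high.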

* Passing parameters to the batch system (A. Forti): The discussion on this topic has been ongoing for the last few years; however, no agreement has been reached in WLCG on a systematic and standardized list of parameters that need to be passed from the VOs to the sites and that works for all CE types and batch systems. In summary, there is no complete list of parameters and their meanings such that they can be used in the JDL of the jobs submitted by the experiments. However, at least with respect to multicore jobs, the passing of the number of machines and CPUs seems to be working, as proven by the fact that we are running multicore at many sites representing different CE and batch system technologies.
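
For reference, with a CREAM CE the core count is typically expressed with JDL attributes such as the following; this fragment for an 8-core, single-node request is only illustrative, and the exact behaviour depends on the CE version and the site's batch system adaptor scripts.

    CpuNumber      = 8;
    SMPGranularity = 8;
    WholeNodes     = false;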

Alessandra proposes some documents, such as the NIKHEF twiki, as a starting point to advance this topic, which is within the scope of the Multicore TF. The implementation and testing of a complete list is certainly beyond the timeline of the TF; however, as the most urgent cases, the job's required memory and predicted walltime should be addressed as a first step. The proposal is therefore to communicate to WLCG that, in order for the scheduling of multicore jobs to perform efficiently, and given its interference at sites with high-memory jobs, a systematic way to pass memory requirements is needed. The same is probably true for walltime requirements, even though the reliable calculation of actual walltime values is a much longer discussion.
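
Purely as an illustration of what such a convention could look like, and assuming the site's CE and its batch system submit scripts honour the corresponding GLUE attributes, memory and walltime requests might be forwarded through the CREAM CERequirements JDL attribute in a form such as:

    CERequirements = "other.GlueHostMainMemoryRAMSize >= 2000 && other.GlueCEPolicyMaxWallClockTime >= 2880";

The values (2000 MB and 2880 minutes) and the choice of attributes are illustrative only; agreeing on the attribute names, units and semantics is precisely what the proposal to WLCG would have to settle.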

The TF will continue to discuss this topic in order to identify the minimal set of parameters needed by the sites to schedule jobs efficiently, and will then propose this list to WLCG so that each parameter is systematically checked for every CE and batch system configuration.

* KIT report (M. Alef and T. Hartmann): GridKa reports on the unpredictable job submission patterns observed for both ATLAS and CMS in the last months. A steadier flow of jobs would result in better performance. The seesaw patterns of the two VOs however overlap out of phase, which somewhat mitigates the effect.

KIT also reports on ATLAS jobs with extremely low CPU efficiency and long running times, possibly empty pilots or jobs idling on the nodes with no significant CPU usage, which end up failing after several days. A discussion follows on the ATLAS infrastructure and how an expected walltime could be passed to the batch system so that it automatically terminates jobs exceeding this value, to prevent them from wasting so much CPU. In the particular case of CREAM CE and Grid Engine, the passing of a walltime estimate is already available and tested. The detection and elimination of such jobs is particularly important for multicore jobs, as the price paid for these jobs is multiplied by the number of cores held.
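
On the batch system side, a walltime estimate turned into a hard limit in Grid Engine is typically expressed as a run time resource request, after which the job is terminated automatically; in this illustrative submission the limit value and the parallel environment name (mcore) are assumptions, since both are site-specific.

    qsub -l h_rt=48:00:00 -pe mcore 8 pilot_wrapper.sh

With such a limit in place, an idling multicore pilot would be removed after at most the requested walltime instead of holding its eight cores for several days.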

KIT and the rest of the sites are encouraged to produce combined plots showing running and queued jobs from both ATLAS and CMS, for monitoring purposes and as a way to show progress towards the objectives of the TF.

* RAL report (A. Lahiff): Andrew reported on the status of the multicore jobs running at RAL for the past month. On average, the site has 400-500 combined ATLAS+CMS multicore jobs running.

The fraction of multicore jobs with respect to the total is quite different for the two VOs, as in the case of CMS they represent only a small fraction so far. RAL requests that CMS increase this fraction to the same level as ATLAS (20-40%).

The job walltime distribution also looks quite different for jobs from the two experiments. In the case of ATLAS, the pilot lifetime follows the job lifetime, whereas CMS multicore pilots internally still run only single core jobs in parallel, which produces a larger spread in running times.

The overall waste of CPU due to draining nodes seems to be well under control, representing at most 2.5% of the slots of the whole farm at RAL.

RAL also gave an update on the inefficiency of CMS multicore jobs, as compared to single core jobs and with reference to the previous report. CPU efficiency for jobs running during August has been lower than in previous months, even for single core jobs. However, this can be partly attributed to the fact that, in this period, CMS was performing scale tests of its data federation, i.e. running jobs that read from remote storage via xrootd instead of from data samples previously transferred to the site. RAL is asked to update the efficiency numbers separating, if possible, the effect of reading remote data from the actual inefficiency deriving from the internal scheduling of jobs by the multicore pilot.

-- AntonioPerezCalero - 09 Sep 2014
