-The task force coordinators, Alessandra and Antonio, open the meeting with slides, followed by discussion among participants (see https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=296031)

-Questions are first raised about the difficulty of backfilling from the sites' batch systems. Some site representatives, from NIKHEF and ARC, mention that the amount of unused resources should not be a real problem if the job length is moderately short (a few hours), as shown by their experience with recent ATLAS activities. However, both cases are T1s with enough resources to suffer less than smaller sites. This therefore needs to be measured at all sites to understand the impact of the different solutions at different sites.

-The discussion focuses on the relation between job length and pilot lifetime, which are clearly decisive for guaranteeing a reasonable turnaround rate at the sites and an efficient use of the resources by the pilots. Sites asked the experiments to provide their job lifetimes. Some sites observed that >90% of ATLAS jobs are shorter than 3h and could be directed to a short queue, which the batch system can more easily use for backfilling. ATLAS is going to test this with NIKHEF.
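
-As an illustration of the short-queue idea above, the following is a minimal sketch, not an actual ATLAS or batch system configuration. The `Job` class, the queue names "short" and "long", and the function names are hypothetical; only the 3h threshold comes from the discussion. It assumes each job carries an expected walltime estimate supplied by the experiment.

```python
from dataclasses import dataclass

# Threshold quoted in the discussion: jobs shorter than 3h go to the short queue.
SHORT_QUEUE_LIMIT_HOURS = 3.0


@dataclass
class Job:
    """Hypothetical job record with an experiment-provided walltime estimate."""
    job_id: str
    expected_walltime_hours: float


def choose_queue(job: Job, limit_hours: float = SHORT_QUEUE_LIMIT_HOURS) -> str:
    """Return the (hypothetical) queue name for a job based on its expected length."""
    return "short" if job.expected_walltime_hours < limit_hours else "long"


def short_fraction(jobs: list[Job], limit_hours: float = SHORT_QUEUE_LIMIT_HOURS) -> float:
    """Fraction of jobs that would be routed to the short queue."""
    if not jobs:
        return 0.0
    n_short = sum(1 for job in jobs if choose_queue(job, limit_hours) == "short")
    return n_short / len(jobs)
```

A site could apply `short_fraction` to its own job records to check whether a figure like the >90% quoted above holds locally before setting up a dedicated short queue.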

-Further site experiences are mentioned. In particular, it seems to be agreed that with a steady flow of multicore jobs, no large amount of wasted or idle resources is to be expected. Such a steady stream is desirable, but it is the experiments that should try to guarantee it. Furthermore, with a steady stream keeping the slots fully occupied, even dedicated resources at sites would work as well.

-If, in order to run multicore jobs, a site needed to keep only a very limited fraction of its slots empty (1-5%), that would not be a problem either. It is clear, however, that any multicore scheduling proposal coming from an experiment that requires a sizeable fraction of slots (10% or above) to be empty is not acceptable, or would have to be charged to that experiment's resource usage bill. Right now this is not measured and cannot be billed to experiments, as accounting only measures CPU time and walltime, not job queuing time or empty slots.
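
-To make the thresholds above concrete, here is a minimal sketch of how an empty-slot fraction could be estimated and compared against them. It assumes the site can sum the walltime of all jobs over a time window (in slot-hours) and knows its total slot count; the function names and labels are hypothetical, and the minutes do not say how the range between 5% and 10% should be treated, so the sketch leaves it as "unspecified".

```python
def idle_fraction(total_slots: int, window_hours: float, busy_slot_hours: float) -> float:
    """Fraction of slot-hours left empty during the window.

    busy_slot_hours is the summed walltime of all jobs in the window,
    counted once per occupied slot (assumption: one slot per core).
    """
    capacity = total_slots * window_hours
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if not 0 <= busy_slot_hours <= capacity:
        raise ValueError("busy_slot_hours must lie between 0 and the capacity")
    return (capacity - busy_slot_hours) / capacity


def assess(fraction: float) -> str:
    """Compare an idle fraction with the thresholds quoted in the minutes."""
    if fraction <= 0.05:
        return "acceptable"      # up to about 5% empty: not a problem
    if fraction >= 0.10:
        return "not acceptable"  # 10% or above: rejected or billed
    return "unspecified"         # between 5% and 10%: not covered in the discussion
```

For example, a site with 1000 slots that records 23280 busy slot-hours over 24 hours would have 720 empty slot-hours out of 24000, which falls within the range described as unproblematic.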

-It is well understood that the interaction between sites and experiments should lead to minimizing both idle and inefficiently used CPU time, as also expressed in the slides.

-Regarding the timeline and schedule for developments, the task force coordinators propose October as the target date for the system to be working, given the need to make sure that everything is ready for the LHC re-start.

-It is pointed out that a working solution needs to be found if experiments really want to start using multicore capabilities immediately, ahead of the longer-term proposal. However, ATLAS, which seems to be in more of a hurry than CMS on this topic, is already working in this mode, with a number of sites already accepting multicore jobs one way or another. Thus, there should be no clash between the immediate needs of the experiments and the activities and timeline of the task force.

-A more refined schedule of milestones is requested from the TF, to be discussed at a WLCG operations meeting in a few weeks' time.

-Concerning the next TF meetings, it is clear that the views of both experiments and sites need to be understood: what each experiment requires from sites in its current model, and what technology and expertise sites can offer to match those requirements. Good feedback is definitely needed in both directions.

-Thus, the format for the next two meetings should include a) a presentation of the experiment's proposal, with a clear description of the requirements the sites would have to meet, and b) whatever experience we already have regarding this proposal (as expressed, for instance, by some site representatives concerning recent ATLAS activities, as mentioned above). The first meeting will cover ATLAS, the second CMS.

-The task force discussion seems to focus on CMS and ATLAS cases, as other VOs have not been so strongly involved in multicore processing.

-Each meeting in the subsequent series would focus on one particular batch system, aiming at a deeper understanding of its technical capabilities, together with examples of the particularities of its management and optimization drawn from the real experience of site administrators.

-A new Doodle poll has been sent around to decide on a time slot, which will then be used consistently for the coming meetings. The TF coordinators propose an initial frequency of one meeting per week, and the rest of the participants agree.

-- AntonioPerezCalero - 22 Jan 2014

Topic revision: r1 - 2014-01-22 - AntonioPerezCalero