WLCG Workload Management TEG Membership

Modified on: 2011-11-28


Membership

Davide Salomoni (INFN CNAF)

For the past six years I have been the manager of the computing infrastructure at the INFN Tier-1, located in Bologna, Italy. I recently took on the role of director of computing research at the INFN Tier-1.

I am interested in all the areas covered by the mandate of this TEG.

Some general comments:

The INFN Tier-1 is a multi-experiment site (we currently support about 20 international collaborations); it acts, depending on the collaboration, as Tier-0, Tier-1, Tier-2 or Tier-3, thus covering many scopes and roles. We support Grid, local, and most recently also Cloud instantiations, and one of the operational problems we face is the request to support several diverse frameworks (pilot-based or not). We are therefore interested in finding commonalities, safeguarding the needs related to efficient deployment, operation and security.

Since 2009 I have also been the manager of the INFN WNoDeS framework, which aims to integrate Cloud and Grid provisioning through the use of virtualization techniques. At the INFN Tier-1 we currently run about 2,000 VMs in production through WNoDeS to serve both Grid and local jobs. My interests in this area are related to the efficient allocation and use of computing and storage resources, applied both to "traditional" and to "new" computing models (e.g. Cloud computing) and products (e.g. many-core systems and GPUs).

Torre Wenaus (ATLAS, PanDA)

I am (since 2005) a project co-leader, designer and developer for PanDA, the ATLAS distributed workload management system used for production and analysis. I am also involved in ATLAS distributed software development coordination.

I'm interested in all areas.

Topical comments:

- Pilots and frameworks

I've been a big fan of pilot based systems ever since I learned (in 2005) what LHCb were up to with DIRAC. For all the usual reasons. ATLAS experience with PanDA has been very good. The independent developments of pilot based systems in the different experiments happened for various good reasons but I expect there is scope for some convergence. At the pilot level itself glideinWMS could be a point of convergence.

- Resource allocation and job management

For PanDA, what we ask of a site in terms of CE is CondorG support. Or bare Condor. This has worked extremely well.

Global workload management is necessary but experience suggests we should not look to common middleware for it. ATLAS experience has been that having it integrated into the experiment workload management system has many advantages.

On WN connectivity, outbound http(s) is ATLAS' (only) requirement. We have not had problems getting that supported, and it is friendly to the security interests of sites (it can of course be proxied and often is).
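
As a purely illustrative sketch (not ATLAS code), the following Python snippet shows the kind of outbound-only http(s) access a pilot needs, optionally routed through a site proxy taken from the standard environment variables; the proxy host and server URL are hypothetical placeholders.

    import os
    import urllib.request

    def fetch_over_https(url, proxy=None):
        """Fetch a URL using outbound http(s) only; honour a site squid if one is configured."""
        handlers = []
        if proxy:  # e.g. "http://squid.example-site.org:3128" (hypothetical)
            handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
        opener = urllib.request.build_opener(*handlers)
        with opener.open(url, timeout=30) as response:
            return response.read()

    # Sites commonly export the proxy via the standard environment variables.
    site_proxy = os.environ.get("https_proxy") or os.environ.get("http_proxy")
    payload = fetch_over_https("https://workload-server.example.org/getjob", proxy=site_proxy)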

- Use of information services

Plenty of scope here for more commonality. ATLAS has made some efforts in that direction with AGIS, the ATLAS Grid Information System, which is mainly an aggregator of other information sources. Ad hoc info sources, either primary or cache, are easy to create, and ATLAS has many. Efforts are underway to consolidate at the AGIS level, but the potential is there to consolidate at a higher level, at least for many things. Bureaucratic overhead has to be kept to a minimum, and extensibility and flexibility must be well supported.

- Security models

The workload management systems themselves could play an important role in the security model. They hold the most information on user identifiability and activity. Standard info services/APIs to the WMSs could be used by sites and service providers as an effective means of obtaining rich information for monitoring, diagnostics and acting on security incidents. We have done this with PanDA and it's proven useful. It avoids some of the complications of relying entirely on the middleware layer.
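
As a hedged illustration of this idea (the endpoint, parameters and response fields below are hypothetical, not the actual PanDA API), a site security team could map an incident on a worker node back to the submitting users with a simple query to a WMS job-information service:

    import json
    import urllib.parse
    import urllib.request

    def users_on_node(wms_api, hostname, since, until):
        """Return (user_dn, job_id) pairs for jobs that ran on a given worker node."""
        query = urllib.parse.urlencode({"wn": hostname, "from": since, "to": until})
        with urllib.request.urlopen("%s/jobs?%s" % (wms_api, query), timeout=30) as resp:
            jobs = json.load(resp)
        return [(job["user_dn"], job["job_id"]) for job in jobs]

    # Hypothetical usage during an incident investigation:
    # users_on_node("https://wms.example.org/api", "wn042.example-site.org",
    #               "2011-10-20T00:00:00Z", "2011-10-21T00:00:00Z")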

- New computing models

Exploiting virtualization and ensuring we can effectively leverage cloud resources are important, as is being able to utilize multi-core efficiently.

Rod Walker (ATLAS)

Kaushik De (ATLAS)

ATLAS PanDA project co-coordinator, ATLAS distributed computing operations

All topics sound interesting! Pilots, Resource Allocation, Job Management, New Computing Models primarily.

Claudio Grandi (CMS)

Since 2010 I have been one of the coordinators of the "Integration of Distributed Facilities and Services" task of CMS. My past computing-related activities include coordinator of the "Grid Integration" task of CMS (2000-2004), Activity Manager of the "Middleware Re-Engineering" activity in the EGEE projects (2005-2008) and Tier-1 coordinator for CMS (2009-2010).

- indicate which if any areas on the topic list you are particularly interested to focus on

I'm interested in all of the areas. I'd just start by focusing on the first 4 topics and postpone the "new computing models" part to a later stage (maybe between the first report and CHEP).

- for any/all topics on the list, give initial comments summarizing your perspective

- Pilots and frameworks

Given the current working model of the infrastructure and the current needs of the experiments, pilot jobs are needed, but we should be aware of the limitations and concerns, especially in the area of security, that working with pilots creates. Pilot jobs implement the important separation of resource allocation and job management, but the first part is fragile. We should address this for the short-medium term and understand what the model could be in the longer term to accommodate both "private job management" and the classical job submission that many other communities using the same infrastructure will need.

- Resource allocation and job management Requirements for a CE service, evaluation of the needs for a (global?) WMS, requirements for public/private access to WNs (incoming/outgoing connectivity)

Even in a new scenario where resource allocation is completely split from job execution, we will need an interface to the site that is under the site's control. Call it a CE or whatever, but it will be needed because it would not be sustainable to have a "gateway"-like service running on a machine on the boundary of the site under user responsibility instead of under administrator responsibility. An interface is also needed to provide (on demand) dynamic information on the site status.

Global (public) WMSs are services for those who want to use them. The model should not impose them but should be compatible with them. In the longer term they may be needed to facilitate resource allocation, even though for the current LHC VOs this may not be necessary given the limited number of sites (~50-100 for big VOs) and the existence of clear pledges of sites to each VO.

- Use of information services

Needed to describe the topology (e.g. CE-SE association, resource size, ...) but not for scheduling purposes. For that, each WMS needs to find its own way of extracting information from the site gateways.

- Security models e.g. MUPJs, authentication and authorization - in collaboration with the Security TEG

See the comments on MUPJs above. We need to define a responsibility model that is acceptable for sites already in the short term. In general I think we are under-utilizing the possibilities offered by attributes for authorization (a small illustration follows below).
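
As a small, invented illustration of what richer use of authorization attributes could look like, the sketch below maps VOMS FQANs to local account pools; the mapping table is hypothetical and not taken from any real site configuration.

    def map_fqan(fqan):
        """Map a VOMS FQAN such as '/cms/Role=production' to a local account pool."""
        mapping = {
            "/cms/Role=production": "cmsprod",   # production activity
            "/cms/Role=pilot":      "cmspilot",  # pilot submission
            "/cms":                 "cmsusr",    # plain VO member (analysis)
        }
        # The longest (most specific) matching prefix wins.
        for prefix in sorted(mapping, key=len, reverse=True):
            if fqan.startswith(prefix):
                return mapping[prefix]
        return None  # not authorized at this site

    assert map_fqan("/cms/Role=production/Capability=NULL") == "cmsprod"
    assert map_fqan("/atlas/Role=pilot") is None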

- New computing models Use of virtualization technologies, cloud computing, whole node scheduling, requests for special resources (e.g. GPUs, parallel jobs)

Cloud computing may offer interesting concepts for building a model for resource allocation that solves the security concerns on MUPJs.

Depending on the resource allocation model "whole node scheduling" may be an aspect that simplifies the view, but it is not necessarily bound to it. I'd leave this, together with virtualization, parallelization, etc... to a later stage of the discussion.

Will participate remotely in F2F.

Burt Holzman (CMS)

I am the project manager for glideinWMS and manager of the Tier 1 facility at Fermilab. I also have been the manager and a core contributor to the OSG GIP. I'm certainly interested in pretty much all of the TEG topics.

Pablo Saiz (ALICE)

Will participate

Stuart Wakefield (CMS)

I work on the CMS workload management project and am also active in UK grid activities.

As for the face-to-face meeting, I'm not sure if I can make it yet; I have teaching on the Friday, so I need to try to rearrange things.

Igor Sfiligoi (CMS)

I will be there both for the telecon and the face-to-face meetings.

- introduce yourself briefly in terms of what expertise, background, and perspective you bring

I have been working with pilot infrastructures since about 2004, when I helped port CDF to the Grid, using the Condor glideins. I have also been the main force for converting the CDF glidekeeper into the generic glideinWMS now used by CMS (and several other VOs). I have often worked closely with Condor to get what was needed, trying to minimize the amount of code that is glideinWMS-specific.

I also have a fair bit of experience operating pilot services, having done it for both CDF and CMS (and OSG at large).

I have also been involved in grid middleware authz development, being part of the OSG Privilege project. I have been the driving force to get glexec integrated into the OSG infrastructure, deployed on OSG sites and integrated into the glideinWMS (via Condor). (I am mostly out of these two activities by now)

I am also part of the OSG Operational Security team.

Apart from the security SW development, I have very little experience actually running a Grid site; I see the Grid mostly from the client side, treating the CEs as black boxes, as much as possible. I have very little experience with clouds... mostly theoretical knowledge.

- indicate which if any areas on the topic list you are particularly interested to focus on

I am mostly interested in the New computing models and Security models topics.

- for any/all topics on the list, give initial comments summarizing your perspective

- New computing models

I see this coming along, and we may need to adopt them whether we want to or not. The nice uniformity of the past few years is likely to go away, and we should make sure that our tools can handle it. An interesting question is: should we hide it from the user applications? If yes, how and how much?

PS: We already do something in glideinWMS, but it is quite limited for now.

- Security models

We have been talking about MUPJs for a long time, but progress has been very slow. I want to understand why and how to move forward. I also want to discuss whether the current tools (i.e. glexec) are the right thing.

- reference any background materials you think are relevant to the work of the group

Just a link to glideinWMS: http://tinyurl.com/glideinWMS

Manuel Guijarro (CERN)

Ulrich Schwickerath (CERN, grid services, LSF, lxcloud)

I'm willing to participate. I'm working in PES/PS at CERN, where I'm in charge of some of the grid services either as the main responsible (e.g. CEs, glExec/SCAS/Argus/...) or as deputy (e.g. BDII). I was in charge of the batch farm under LSF at CERN for several years, and I maintain the LSF information providers. Apart from that, I'm involved in Cloud activities (lxcloud). This includes recent activities from ATLAS and LHCb, who use our internal cloud for CERNVM, directly connecting to the experiment frameworks and thus bypassing the whole grid infrastructure. My main interest is in virtualization and clouds as a possible long-term option.

On virtualization, I think it makes sense to acknowledge the activity of HEPiX in this respect, where a lot of work has been done specifically on policies.

Manfred Alef (KIT, Tier-1 cluster management)

- introduce yourself briefly in terms of what expertise, background, and perspective you bring

Tier-1 cluster management (WN setup, LRMS [PBS Pro])

Unfortunately I will be out of the office from Oct 24 till Nov 4. Maybe I can join by phone from Vancouver (HEPiX) on Tue Oct 25 or Wed Oct 26, but the timetable there is already quite full. On Nov 3/4 I cannot participate either in person or remotely.

Andreas Heiss (KIT)

Oxana Smirnova (NDGF)

I confirm participation in the group, representing NDGF Tier1 (and partially Nordic Tier2) interests.

Firstly, let me stress that I am not discussing what we have now or will have next year. I am rather looking into how things may evolve in the future.

Some unexpected twists can and will happen, but looking back one can see certain tendencies, and extrapolate them. To list a few:

Tendency #1: more resources will be offered on a fair-share basis or on demand. Meaning, we won't be able to tell that site A has N processors/cores/HS06-hours for the experiment X at any given time. Even now it is mostly impossible.

Tendency #2: fewer resources will be offered with Scientific Linux as the basic OS. This is related to the previous item: if resources are shared, other user communities have a say as well. This is of course not a problem: Amazon EC2 does not use Scientific Linux on its hardware either, and yet it works.

Tendency #3: fewer resource owners will be willing to deal with proprietary protocols and system-level tools not used by other communities.

Tendency #4: there is probably no way back to clusters of single processors: shared memory multi-core processors are here to stay. Which means application framework developers may need to investigate optimisation opportunities.

Now, having these in mind, let me get back to the list:

- Pilots and frameworks

Without any doubt, these will be heavily re-designed to provide application environment virtualisation, to use the hardware in an optimal manner, and to satisfy security requirements.

Convergence to a single framework for the LHC experiments will, on a short-term scale, require a massive investment of human effort, but in the long run it will save effort in both maintenance and deployment.

- Resource allocation and job management Requirements for a CE service, evaluation of the needs for a (global?) WMS, requirements for public/private access to WNs (incoming/outgoing connectivity)

This assumes the CE-WN model and centralised scheduling, which is not always the case even now, and probably will not be the case in future.

A safe assumption is that we don't know where and under which conditions the processing will be made, thus models must be independent from such details as availability of inbound connectivity.

- Use of information services

Let's hope that we will see the day when the distributed computing resources will all be adequately described and this information will be easily and consistently available. This will make a Grid out of our resources, finally.

Perhaps the simplest case is when every resource and service gets fully virtualised and the amount of information to be published is absolutely minimal: say, one URL for the entire Grid infrastructure (much like e.g. www.dropbox.com is today).

- Security models e.g. MUPJs, authentication and authorizaton - in collaboration with the Security TEG

I would like to make sure we don't develop our own custom security solutions again. We have to realise that we use resources that we don't own, and therefore cannot come with our own rules - we always have to comply with local policies if we want to get the service.

Security is there to protect the resource and also to protect the users. LHC researchers should neither expose themselves to attacks nor create backdoors that expose the resources they use. Sounds trivial, doesn't it?

Existing open source solutions with many users are typically more secure, because they are scrutinised much more often than proprietary in-house hacks.

- New computing models Use of virtualization technologies, cloud computing, whole node scheduling, requests for special resources (e.g. GPUs, parallel jobs)

Nothing really new here - if LHC communities can not provide portable application software, they will need to make full use of virtualised application environments.

But if application developers seriously decide to invest effort in re-implementing everything for e.g. GPUs, they should use this opportunity to develop genuinely portable software. This will reduce the dependency on virtualisation technologies.

Laurence Field (CERN)

Andrey Kiryanov (St. Petersburg)

Most of the time I'm a site administrator at PNPI, Russia. I was the author of the performance patches for LCG-CE a couple of years ago. Now I'm working on the Pilot Factory prototype for LCG.

Interested in all areas, but Pilots is my priority.

Di Qing (TRIUMF, Tier-1 grid admin)

I will join the kickoff meeting. For the face to face meeting, I will join remotely through EVO.

I worked for the WLCG project as a member of the certification and testing team at CERN from 2002; the main task was to certify and test the grid middleware selected by LCG. Since 2009 I have been working at the TRIUMF Tier-1 as a grid system administrator. Thus I have plenty of experience with both grid middleware and running a grid site.

Recently I started to investigate the possibility of using new technologies such as virtualization and cloud computing at our Tier-1, but I still have only very little experience with this.

- indicate which if any areas on the topic list you are particularly interested to focus on

New computing models and use of information services.

- for any/all topics on the list, give initial comments summarizing your perspective

- New computing models

The same point as others: most sites support multiple experiments or projects, and those projects inevitably have different requirements, so virtualization seems a reasonable solution. Some of our Tier-2s created different OS images for different projects; when nodes are allocated to one project, those nodes are booted with the OS image for that project. This can be a solution too if it can be done automatically (a sketch follows below).
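
A minimal sketch of that automation, assuming a hypothetical reprovision() hook into whatever the site's provisioning system actually provides (the image names are also invented):

    IMAGES = {
        "atlas": "sl5-atlas-wn-v12",
        "cms":   "sl5-cms-wn-v07",
        "local": "sl6-generic-wn-v03",
    }

    def reprovision(node, image):
        # Placeholder: here the site would call its real provisioning/PXE system.
        print("rebooting %s with image %s" % (node, image))

    def allocate(nodes, project):
        """Boot the given nodes with the OS image of the project they are allocated to."""
        image = IMAGES.get(project)
        if image is None:
            raise ValueError("no OS image defined for project %s" % project)
        for node in nodes:
            reprovision(node, image)

    allocate(["wn101", "wn102"], "atlas")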

- Use of information services

If it's not easy to describe the resources adequately, how about just publishing minimal information? In this case the pilot job itself finds out what resources are available on the execution node and pulls jobs that can run there (see the sketch below). However, not all projects use the pilot model.
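
A hedged sketch of this pull model: the pilot probes the node it landed on and asks the task queue only for work that fits. The request_matching_job() call and the queue URL are stand-ins for a real pilot/server protocol, and the probe assumes a Linux worker node.

    import os

    def probe_node():
        """Collect a few basic facts about the worker node (Linux assumed)."""
        meminfo = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                meminfo[key] = value.strip()
        return {
            "cores": os.cpu_count() or 1,
            "mem_total_kb": int(meminfo["MemTotal"].split()[0]),
            "os_release": os.uname().release,
        }

    def request_matching_job(queue_url, node_facts):
        # Placeholder for the real pull request to the experiment's task queue.
        print("asking %s for a job fitting %s" % (queue_url, node_facts))
        return None

    job = request_matching_job("https://taskqueue.example.org/pull", probe_node())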

Federico Stagni (LHCb)

I'm the coordinator of LHCb DIRAC development (the LHCb extension to DIRAC). DIRAC pioneered the use of pilot frameworks, which have since been adopted by many other VOs. I'm willing to bring the LHCb perspective into this WG.

Ricardo Graciani (LHCb, DIRAC)

I participate in this TEG on behalf of LHCb and have worked during the last 10 years on the development of the DIRAC framework for distributed computing, the tool used by LHCb to control all its computing activities. Along these years I have had time to become familiar with many aspects of the existing (and past) middleware and have participated actively in the introduction of new ideas (like MUPJs, first introduced by DIRAC/LHCb).

I'm willing to bring the LHCb perspective into this meeting, but at the same time, as a DIRAC developer, I'm also in contact with other user communities that present different usage patterns and needs, which we should not forget in our discussions since in the long term the LHC will not be the main user of the resources.

I'm more inclined to contribute to what you call "Pilots and frameworks" and "New computing models".

On the first point, I'm ready to defend that they provide an extra layer of homogenization, isolation and usage control for large user communities (like those we are dealing with) that we are very unlikely to ever get from a common generic middleware.

On the second point, I think that the evolution of the hardware forces us towards the design of new "application" frameworks that allow much more efficient use of many-core systems than the n-single-thread approach that is dominant today.

Peter Solagna (EGI, info systems)

I am Operations Officer at EGI.eu. Among other things, I actively follow the main operations policies bodies (OMB, SPG), I organize the bi-weekly operations meeting where deployment issues are discussed, and I oversee the middleware requirements gathering process across the NGIs.

I am interested in all the areas of discussion, as an observer and to propose the EGI operations point of view, when possible, based on my experience.

Maarten Litmaath (CERN, security)

Xavier Espinal (PIC, operations)

I'm representing PIC as the Services and Production Group Leader.

Let me introduce some of the areas where we are investing time to improve operations:

We use virtualization as a solid strategy both for the present and the future. We chose the ORACLE-VM solution, where we now have about 100 machines running 30+ production services (FT(A)S, LFC, CEs, UIs, etc.). Following the virtualization of services, we are participating in a project to have federated, interoperable clouds in Spain, with OpenNebula. One of the goals related to WLCG is the possibility to instantiate CERNVM and virtualize computing nodes.

We also started to look at different batch systems, as the Maui scheduler started to suffer a bit with 8k jobs (running+queued), and within a year we will need a solution. An interesting open-source candidate could be SLURM, but we have barely started to look at it (could it be APEL-proof?). Maybe some other sites have had a look already.

Other than that, you know our main field of expertise is the dCache-Enstore doublet to serve disk and tape.

Other hot topics we are interested in: the future of databases (LFCs consolidating to CERN, 3D streams stopped for us, and only FTS will be using the ORACLE backend?), storage usage and computing model evolution ("tape" going to JBOD clusters?).

We are very interested in joining the discussions and learning from this new group; having a critical mass of sites with common targets may ease the operational overheads coming from evaluations and integrations and help us choose strategies better. Hope to be useful.

Unfortunately I won't be able to connect to the kick-off meeting, I'm in London for an LTUG (Large Tape Users Group) meeting. Will try not to miss the next one.

Ricardo Silva (CERN)

Massimo Sgravatto

I have been working in Grid-related projects since about 2000. I worked in the workload management area in the context of the DataGrid project first, then in the three EGEE projects, and now in the EMI project (where I am currently leading the gLite Compute Product Team). In these projects we designed and implemented a number of workload management services.

I was (and I am) in particular involved in the CREAM CE activities.

I am interested in all the areas of discussion, in particular the "Resource allocation and job management" one.

Marco Cecchi (INFN, WMS)

Will participate via EVO for the f2f.

- introduce yourself briefly in terms of what expertise, background, and perspective you bring

I have worked on gLite WMS design and development for six years.

- indicate which if any areas on the topic list you are particularly interested to focus on

Evolution of workload management/metascheduling; emerging distributed computing paradigms.

- for any/all topics on the list, give initial comments summarizing your perspective

> - Pilots and frameworks
They are the result of a good idea; nevertheless, the ability of big experiments to (re)build their own frameworks from the ground up with their own people must be taken into account. AFAIK each experiment has a different interpretation, model and software implementation of this concept. They are sort of a shortcut past what has been designed over all these years to serve as a common framework for grid/distributed computing. Sometimes, actually most of the time, this happened for a good reason. The fact that they are the trend in HEP nowadays of course does not mean that the story of workload management has ended. They also have significant drawbacks that rarely seem to be taken into account by HEP; of course this will go on as long as the advantages outnumber the disadvantages, i.e. the free lunch, not caring about security, outbound connectivity, etc. That's fair enough, of course.

>- Resource allocation and job management
> Requirements for a CE service, evaluation of the needs for a
>(global?) WMS,

The need for a WMS comes essentially from the effectiveness of match-making in selecting dedicated resources, in terms of presence of required data, duration of the slot, processing power etc. (see the sketch below), as opposed to taking whatever comes from the wild and then deciding whether to keep it, discard it, or keep it busy doing nothing. The WMS is not meant for opportunistic scheduling as it is now, and we are interested in a possible evolution in this direction. Despite all the claims by big VOs that they will dismiss the WMS in a short time, as far as I can see it is still there. I'd like to know why in the first place, because I just don't know. As far as I can see, some VOs are still interested in selecting resources in advance through an appropriate match-making process able to get slots according to a given 'work' (= processing power over a given time). Besides, a WMS should be there to provide added value; in the current scenario, HEP doesn't seem to be much interested in that, i.e.:
- Resubmission: not needed with pilots, and it deals badly with sites that do job migration.
- Complex jobs, MPI jobs: not particularly appealing to the mainstream HEP community.
- Rich, high-level JDL, configuration tips, parameter passing through the whole chain: something the big frameworks are not interested in either.
Of course many of these points require a fully featured connection with the LRMS and a grid-enabled submission chain, i.e. what CREAM+BLAH cover now.
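
To make the match-making idea concrete, here is a purely illustrative Python sketch (the site records, fields and ranking weights are invented, and this is not the gLite WMS algorithm): hard requirements filter the candidates, then the survivors are ranked.

    def match(sites, needed_dataset, needed_walltime_h):
        """Filter sites on hard requirements, then rank the survivors."""
        candidates = [
            s for s in sites
            if needed_dataset in s["datasets"] and s["max_walltime_h"] >= needed_walltime_h
        ]
        # Rank: prefer more free slots, then higher per-core power (HS06).
        return sorted(candidates,
                      key=lambda s: (s["free_slots"], s["hs06_per_core"]),
                      reverse=True)

    sites = [
        {"name": "SITE-A", "datasets": {"data11_7TeV"}, "max_walltime_h": 48,
         "free_slots": 120, "hs06_per_core": 9.5},
        {"name": "SITE-B", "datasets": {"data11_7TeV", "mc11"}, "max_walltime_h": 24,
         "free_slots": 400, "hs06_per_core": 8.1},
    ]
    print([s["name"] for s in match(sites, "data11_7TeV", 12)])  # ['SITE-B', 'SITE-A']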

>- Use of information services
Goes along with the downsides of push-mode scheduling. Retrieval of dynamic information, in particular, must be done in a different way. But is that something that really must be done afresh each time a pilot lands? How many of these bits of information, both static and dynamic, are really retrievable in a more or less simple way by a pilot?

- reference any background materials you think are relevant to the work of the group

http://web.infn.it/gLiteWMS/

Jhen-Wei Huang (ASGC)

I am working for the ASGC Taiwan Tier-1. I was the T1 service manager for two years, so I have some experience with all the workload management systems supporting WLCG. This year I am mainly focused on ATLAS computing, so pilots/pilot frameworks and CVMFS would be interesting to me.

Xiaomei Zhang (BEIJING_LCG2)

I am from the BEIJING_LCG2 Tier-2. I worked on site support for CMS for about 5 years. Since 2011 I have been focusing on building a grid system for IHEP's own physics experiments, using a pilot framework and Ganga, etc. I am also interested in cloud and virtualization techniques. I expect to contribute to the group, and to learn and benefit from it.

Emmanuel Medernach (IN2P3)

As an LCG site administrator in the French NGI, I am interested in joining and participating in the WLCG Workload Management Technical Evolution Group.

-- TorreWenaus - 25-Oct-2011
