Project Office News Oct 6, 2015

Global Pool (Simplification of Submission Infrastructure)

  • "Local submission" of CRAB3 jobs:
    • Submission to the FNAL LPC via a dedicated JobRouter has been deployed and is undergoing testing.
    • We are testing a solution for the CERN CAF use cases by carving out a section of the AI as a high-priority, high-availability resource, similar to the glideinWMS set-up for the Tier-0.
    • A JobRouter which creates customized glideins linked to a particular user is in testing at the Tier-3 site at Texas A&M University.
    • Generalized user- or site-launched pilots are still some time away, but will piggyback on the work above.
    • Prioritization for local group users on non-pledged resources has been implemented, based on the local site configuration, and a draft of the documentation on how to deploy this has been written.
    • We need to discuss how to give prioritization to national VOMS groups on non-pledged resources.

  • Resource-based fair-share:
    • Not implemented yet. There is a proposed (but complicated) solution that has not been tested yet. For the beginning of Run II, we have worked around the problem by having the Tier-1 sites still set the fair-share between production and analysis manually. However, given that we are running formerly Tier-1-only workflows at Tier-2 sites, could we consider abandoning the enforced 95% fair-share for production at the Tier-1 sites? This would greatly simplify the operation of the glideinWMS frontend.
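The fair-share split described here can be sketched as follows (the 95/5 and 50/50 numbers come from the text; the function name and slot-based formulation are illustrative, not the actual negotiator logic):

```python
# Sketch of the per-resource fair-share targets: 95/5 production/analysis
# at Tier-1 sites (currently enforced manually by the sites), 50/50 elsewhere.
T1_PRODUCTION_SHARE = 0.95
DEFAULT_PRODUCTION_SHARE = 0.50

def desired_pilot_split(total_slots, is_tier1):
    """Return the target number of slots for production and analysis."""
    share = T1_PRODUCTION_SHARE if is_tier1 else DEFAULT_PRODUCTION_SHARE
    production = int(round(total_slots * share))
    return {"production": production, "analysis": total_slots - production}
```

Abandoning the enforced 95% share at the Tier-1 sites would reduce this to the single default case, which is the simplification argued for above.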

  • Multi-core deployment:
    • Still studying and optimizing scheduling/draining/ramping efficiency for multi-core glideins. This is challenging because we run single-threaded workflows within multi-core pilots, and CMS usage has been very volatile this year, resulting in many inefficient ramp-ups and ramp-downs.
    • Many improvements in monitoring of multi-core glideins.

  • AAA and Overflow:
    • A European overflow group has been deployed, which currently contains a few Italian Tier-2 sites. This group can be expanded as we gain experience with AAA and xrootd in Europe.

  • Stability and Scalability:
    • We are currently studying the separation of the CCB (Condor Connection Broker) which handles communications with the pool central manager onto another piece of physical hardware. This was found to be a scaling limitation by the OSG at O(150K) parallel running jobs. After deployment of CCBs on separate hardware in the Global Pool, possibly by the end of October, we plan another scale test of the Global Pool sometime in the 4Q.

  • Support Model:
    • Support Model Document needs revision.
    • Recruitment and integration into the S.I. team of two L3 managers, one based in the U.S. and one in Asia (part-time at CERN).

  • Tier-0:
    • We are in regular contact with the HTCondor and glideinWMS developers and set priorities for development goals from our perspective. One such feature that we requested is for a dedicated high-IO slot for the Tier-0, which was deployed in the latest release of glideinWMS. Currently we are learning about this new feature and will be in a position to deploy once the glideinWMS factories fully upgrade.

  • Opportunistic:
    • We have run workflows at NERSC and SDSC through the Global Pool this year. We did run into issues with how to update CRL's, for example, on opportunistic resources that we do not administer.
    • Considering all of the CPU cores reachable by the Global Pool over the course of ~1 week, we see about 100% more resources available than the pledge, mostly at Tier-2 sites. We compete for these resources with other VOs and typically can utilize around 50% of them on any given day.
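The back-of-the-envelope arithmetic above can be sketched as (the function name and default parameters are illustrative, taken from the ~100%-over-pledge and ~50%-utilization figures quoted):

```python
def usable_opportunistic_cores(pledged_cores, over_pledge_fraction=1.0,
                               daily_utilization=0.5):
    """Cores beyond pledge we can realistically use on a given day:
    roughly 100% of pledge is reachable, of which ~50% is usable."""
    reachable = pledged_cores * over_pledge_fraction
    return reachable * daily_utilization
```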

opportunistic.jpg

JamesLetts - 2015-10-06 (with apologies that I will be unavailable at the meeting time)
JamesLetts - 2015-10-12 - updated.

Project Office News Sep 29, 2015

Data Locality

Locality Principle:

  • It means that jobs should run at the sites where the data is located. This was the rule (and the only way) during Run 1.

Thanks to xrootd and AAA, the rule has been relaxed during Run 2.

There are two ways of avoiding the locality principle: 1) the overflow mechanism (automatic) and 2) the ignoreLocality flag (triggered by the user)

Overflow mechanism

  • Jobs in the Global Pool that are supposed to respect the locality principle are "migrated" to other sites under some conditions (e.g. job idle for more than 6 hours)
    • Overflow regions are calculated, and jobs can be sent to other sites in the region
    • The number of "overflow" glideins is limited to 3000
  • There is a way in CRAB to prevent jobs from using overflow ( https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3FAQ#Why_are_my_jobs_submitted_to_a_s )
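The overflow conditions listed above can be sketched as follows (the names and the region representation are assumptions for illustration, not the actual frontend code; the 6-hour and 3000-glidein thresholds come from the bullets):

```python
# Assumed thresholds, mirroring the bullets above.
MAX_OVERFLOW_GLIDEINS = 3000
IDLE_HOURS_THRESHOLD = 6.0

def overflow_candidates(idle_hours, home_site, region_sites,
                        running_overflow_glideins):
    """Sites a stuck job could overflow to, or an empty list if not eligible."""
    if idle_hours <= IDLE_HOURS_THRESHOLD:
        return []  # job has not been idle long enough
    if running_overflow_glideins >= MAX_OVERFLOW_GLIDEINS:
        return []  # global overflow glidein cap reached
    return [s for s in region_sites if s != home_site]
```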

CRAB3 allows users to set an ignoreLocality flag

  • a whitelist can also be set (and is actually suggested!)
  • since these jobs run on normal glideins, the limit above does not apply
  • a way to limit the number of jobs using ignoreLocality is being tested right now
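For illustration, a minimal CRAB3 configuration fragment using these options might look like the following (a sketch assuming the UserUtilities helper mentioned elsewhere in these notes; the site names are just examples):

```python
from CRABClient.UserUtilities import config

config = config()
# Run anywhere, not only at sites hosting the input data ...
config.Data.ignoreLocality = True
# ... but restrict the jobs to an explicit whitelist (suggested!)
config.Site.whitelist = ['T2_IT_Pisa', 'T2_IT_Bari']
```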

ignorelocality.png

Project Office News Jul 14, 2015

Multicore

  • All T1 sites ready from the Sub. Infrastructure side.
    • FNAL, RAL, CCIN2P3 and CNAF fully multicore now.
    • PIC effectively fully multicore
    • KIT and JINR: a more balanced situation between single- and multicore
  • CPU utilization by multicore pilots continues to be suboptimal
    • Work continues on trying to identify and minimize main sources of wastage
    • For that, an improved monitoring of glideinWMS is needed, work is ongoing on several fronts.
  • No significant amount of multicore jobs for T1s in the system for months: last tests (prompt reco) in mid March for all T1s, then relval only at FNAL in early June.

Project Office News Jul 7, 2015

Global Pool (Simplification of Submission Infrastructure)

Regarding some pending items for the Global Pool:

  • "Local submission" of CRAB3 jobs:
    • Submission to the FNAL LPC via a dedicated JobRouter has been designed and is undergoing testing.
    • There is a plan to handle the CERN CAF use cases by carving out a section of the AI as a high-priority, high-availability resource, similar to the glideinWMS set-up for the Tier-0.
    • User- or site-launched pilots are still some time away (4th quarter?)
  • Resource-based fair-share:
    • Not implemented yet. There is a proposed (but complicated) solution that has not been tested yet. For the beginning of Run II, we have worked around the problem by having the Tier-1 sites still set the fair-share between production and analysis manually.
  • Multi-core deployment:
    • Launched at all Tier-1 sites (some now fully multicore) and some Tier-2 sites, but scheduling/draining/ramping efficiency needs to be further studied and possibly improved. Seems to work best under steady-state conditions with constant pressure from pending CMS workflows.
    • Need to improve on monitoring for multicore resource utilization optimization.

JamesLetts - 2015-07-07

CRAB3

c3-vs-c3-users.png

weekly-users.png

Project Office News Jun 2, 2015

Project Office News May 19, 2015

Space Monitoring Deployment

  • all but 14 sites have uploaded space-usage information to the SSB metric
  • tickets are open for those sites

AAA: Transitional Federation Deployment

  • When to mark site as bad?
    • AAA-related CMS GGUS ticket open for longer than two weeks
      • Support unit "CMS AAA - WAN Access"
      • Type of problem "CMS_AAA Wan Access"
      • Notify CMS SITE
    • SAM access test < 50% for two weeks
    • HC test success rate < 80% for two weeks
    • Note: the metric can only really be included once all sites are tested
    • Scale-test (file open-read)
      • Sites joining federation for the first time and those who didn't pass the high bar:
        • 10Hz file opening rate
        • at least 1.5GB/s throughput: 600 jobs at 2.5MB/s
        • test results are uploaded weekly here
  • Deployment status
    • production federation
      • we will maintain a list of ALLOWed SITEs; everyone else (not on the list) can join the transitional federation but needs to subscribe according to the instructions, otherwise they are not considered part of the federation
      • based on scale tests (which also consider the status of the site in SAM and HC tests and in GGUS)
    • taking a couple of T3s as pilot sites to join the transitional federation (in progress; expecting a couple of US T3s to join the transitional federation sometime next week)
    • 2-stage deployment
      • 1st stage: happening now; not much attention paid to the xrootd service (e.g. version)
      • 2nd stage: (by end of 2015?) assumes all sites are using xrootd >= v4.2.X; brings features for a centrally controlled list of ALLOWed SITEs (aka bootstrap)
    • operations procedures need to be set (in progress)
      • The Transfer Team is in charge of watching the metrics and talking to sites in case of problems; AAA/xrootd experts are always around and are cc'd on the tickets as needed
      • The Site Support Team is welcome to help here, too
    • Communication toward sites
      • a crucial part will be keeping the actual list of ALLOWed SITEs in sync among all regional redirectors; Transfer Team?
      • the rest of the sites need to be notified how to proceed; CompOps management?
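The demotion criteria and the scale-test target above can be sketched as (thresholds taken from the bullets; the function names and two-week windows expressed in days are illustrative):

```python
def mark_site_bad(ggus_open_days, sam_access_fraction, hc_success_fraction):
    """Proposed criteria for marking a site bad (two-week windows assumed)."""
    return (ggus_open_days > 14             # AAA GGUS ticket open > 2 weeks
            or sam_access_fraction < 0.50   # SAM access test < 50%
            or hc_success_fraction < 0.80)  # HC success rate < 80%

def scale_test_throughput_gb_s(n_jobs=600, mb_per_s_per_job=2.5):
    """Aggregate scale-test target: 600 jobs at 2.5 MB/s each -> 1.5 GB/s."""
    return n_jobs * mb_per_s_per_job / 1000.0
```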

CRAB3

Project Office News March 10, 2015

AAA

  • HC-Xrootd test operational (# of serving sites increasing). Will be presented this Fri by A. Sciaba. Fast link here

Multicore

  • T1 sites: PIC, FNAL, KIT, RAL and JINR ready from the glideinWMS factory side. Working with CNAF and CCIN2P3
  • Multicore jobs (4-core RelVal workflows) ran last week at FNAL and PIC. Jobs ran fine at PIC, taking most of the slots there. They took a while to run at FNAL, using just a fraction of the slots, possibly because of higher-priority single-core jobs.
    • This also helped in the deployment of our monitoring
  • Should continue pushing for mcore jobs continuously in the system: RelVal, T0 Prompt Reco, or just backfill.
  • Report Ncores to the dashboard: work advancing (https://github.com/dmwm/WMCore/pull/5662)

Tier0

  • Production Tier0 has switched to new physical node (32 cores, 64GB memory)
  • Started IO scale tests (with repack only)
  • Starting scale tests for multicore PromptReco at T1 (should happen this week)

Project Office News February 24, 2015

Tier0

  • Migration of Tier0 to puppet managed SL6 voboxes is ~done. Virtual machines do not have sufficient IO, we are switching to a physical node basically ~now.
  • Multi-core migration done for CERN resources, can't activate in production because not validated. This eventually will lead to problems, but should be manageable as long as we only take cosmics data.
  • Continuous data taking for CRUZET puts challenge on Tier0 operations, lots of little things we need to rediscover, working our way towards stable operations.
  • Migration to Tier0 condor pool done
  • Starting to plan IO scale tests (with repack only).

Multicore

CRAB3

The March release is going to be mostly a bugfix release with few relevant items for users. We are working with the Site Support Team to define two metrics in SSB to be used by CRAB3 for blacklisting sites and ASO destinations. We are performing some scale tests of schedds under different conditions. Recent tests seemed encouraging (reached 20k jobs running in parallel on one schedd)

The usage of CRAB3 is still increasing; here are the numbers of distinct users per week since last June: CRAB3usage201503.png

Project Office News January 20, 2015

Global Pool (Simplification of Submission Infrastructure)

The Global Pool is fully deployed now and being overseen by the Submission Infrastructure group, co-led by Dave Mason from Computing Operations and James Letts from Physics Support, and backed up by a very capable CAT-A support team at CERN. Weekly meetings are held every Thursday. The Support Model document for operations of the pool is mature, but we still need to identify and recruit some people for off-hours support as well as complete our critical services instructions.

CRAB3, CRAB2 and Monte Carlo Production have all migrated to the Global Pool. Scales of up to 75,000 parallel running jobs have been reached in recent days.

We are still sending pilots with the production role to the Tier-1 sites, since we wish to have a different fair-share (95% for production) there than at other sites (50%). We are co-ordinating closely with the HTCondor developers to find a solution to the resource-dependent fair-share challenge. For the Tier-2 sites, CMS will no longer be sending pilots with the production role in any great numbers. The Tier-2 sites should be told to stop allocating their fair-share based on VOMS roles for generic CMS jobs from this time forward. The Global Pool will only send glideins with the pilot role, and fair-share between production and analysis will be determined at the glideinWMS frontend level.

For the Tier-0, we have decided to conduct scale tests both within the Global Pool but also in a dedicated Tier-0 glideinWMS pool, with flocking between the two, due to the more critical nature of the Tier-0 activity. While we do not expect any scaling limitations up to 200,000 parallel running jobs (or perhaps higher), we do not wish to risk data taking by needlessly tying the somewhat independent Tier-0 activity to Global Pool operations. The overhead in terms of operations burden for this separation is not great.

A Global Pool ITB testbed has been set up to test major changes and upgrades. The latest round of CRAB3 tests with HTCondor 8.3.2 will be performed in the Global Pool ITB, as well as the rationalization of the glideinWMS frontend configuration, the extension to multi-core at Tier-2 sites, and the introduction of overflow (AAA) to the Global Pool itself. Already configurations for high memory jobs have been tested in the Global Pool ITB and deployed in the main Global Pool. Monitoring improvements for multi-core have been developed by the glideinWMS group at FNAL and included in the latest version of the glideinWMS frontend, which will also be tested soon in the Global Pool ITB.

In summary, as a project the "Simplification of Submission Infrastructure" is more or less completed technically, except for the resource-based fair-share, which we expect to be realized during the first quarter of 2015.

JamesLetts - 2015-01-20

CRAB3

The February release of CRAB3 will fill the last known functional gap with CRAB2 by adding the possibility of emulating the multicrab functionality using the client as a library. Here is a short but complete list of improvements that will come with the February release:

New Features:

  • multicrab by means of the CRAB library
  • WMCore "event aware lumi splitting" now available also in CRAB3
  • Don't do local stageout nor file metadata upload if transfer flag is False
Improvements:
  • Possibility to add warnings to the taskDB and show them to users
  • Improved error handling of configuration parameters
  • Limited the number of running jobs in the TaskWorker to 10k
  • Added support for "non-local" configuration files (passing URL instead of filenames)
  • Added the possibility to specify the collector from the client (useful to use the CMSWEB production infrastructure with a GlideInWMS pool different than the Global one)
  • Added a UserUtilities module with functions that users can import and use in their scripts
Bugfix:
  • Improved report command which should also work with TFiles now, and correctly report the statistics
  • Improved postjob/ASO couch interaction to avoid inconsistencies when we get cached results from couch with a view that used "stale=updateafter"
  • Worked around an Oracle bug with large merge queries using insert/update

What's planned for the next releases:

  • For the next releases we have received some requests from the Condor team (aka Brian) and we will work on their implementation (see https://github.com/dmwm/CRABServer/issues?q=is%3Aopen+is%3Aissue+label%3ACondor)
  • We are also working on the "native" support for the FWLite plugin, which is now possible through scriptExe but not user friendly
  • We would also like to add a "crab verify" command to help users debug problems on their machines and also help them figure out some parameters.
  • And as usual we will work on improving the stability of the tool and to ease its support

MarcoMascheroni - 2015-01-20

Project Office News January 13, 2015

Multicore

  • Bring the fraction of resources available in multicore mode at T1s to 50% of the pledges, i.e. increase the proportion of multicore pilots for T1s. Agreed to proceed after the holiday break, with the idea of having it ready by end of January:
    • Agreed to communicate to sites in WLCG Ops. Coord. meeting. However, it was done in the last meeting of the year, should we remind them?
    • Then work together with the sites contacts and batch system admins to achieve this.
  • No advances yet in the other topics mentioned in the last meeting:
    • N_cores in monitoring: the number of cores per job is now available in the Condor monitoring but not yet in the dashboard. The GitHub issue on this has been pending for two months now (https://github.com/dmwm/WMCore/issues/5437)
    • T2 multicore tests: not yet.

Project Office News December 9, 2014

HC for Xrootd

  • Valentina (with Marco's and Andrea's help) has inserted the first test, which
    • uses Pisa as a source, by forcing files like /store/test/xrootd/T2_IT_Pisa/store/...
    • runs everywhere (to be corrected to run at all Tier-2s)
    • has a special dashboard activity name (hcxrootd) (currently this does not appear to be working)
    • the result has been 100% success, with 4 errors due to T3_UK_London_QMUL being used (it has to fall back, but it is not a T2!)
    • runs on the 185 files of the HC dataset, taking ~20 min CPU per job and moving ~400 MB per job. So a full "run" takes 28 CPU hours plus 74 GB of data. The estimate is that running something like this every day has a ~1% impact on the network; CPU-wise (the CPU is spent at the remote sites, of course) it is 28 hours * number of server sites / number of total T2s, so at most 28 hours per site, which should be ~0.1% of the total T2 CPU.
  • next steps:
    • fix the activity name
    • prepare one template per server site
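As a sanity check on the data-volume estimate above (185 files at ~400 MB per job; the function name and parameters are illustrative):

```python
def full_run_volume_gb(n_files=185, mb_per_job=400):
    """Data moved by one full HC xrootd run: 185 jobs x ~400 MB ~= 74 GB."""
    return n_files * mb_per_job / 1000.0
```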

Global Pool (Simplification of Submission Infrastructure)

  • The Global Pool project is quite mature now, with a full complement of 7 CRAB3 schedulers, 5 WMAgent schedulers for production, and 2 CRAB2 schedulers running. Production has reached levels of 15,000 parallel running jobs and will continue to ramp up.
  • Global Pool ITB test bed infrastructure set up, with an ITB factory being deployed this week. This will be used for testing major changes before deployment, such as:
    • Rolling out multi-core to the Tier-2 sites
    • Better architecture (SL5 vs. SL6) matching
    • New versions of HTCondor and glideinWMS
  • Some scalability issues of the schedulers over 10,000 running jobs. Appears to be network related, but investigations (by many people) continue. In any case this problem can be worked around by expanding horizontally with more schedulers.
  • We will be discussing ideas for resource-based fair-share in HTCondor with Todd Tannenbaum, who will be at CERN this week. This is a needed feature if we want to specify in glideinWMS (and not at the sites) a higher fair-share for production at Tier-1 sites.
  • OSG scalability tests have shown that the scales we need to reach during Run 2 are achievable with realistic conditions (200,000 running jobs in a single glideinWMS pool). They have not seen the scheduler scalability problem, however, and can happily run 100,000 jobs from a single schedd.
  • N.B. We will not have 24 hour coverage during Run 2 for this critical service without recruiting the three needed L3 people to help provide it.

JamesLetts - 2014-12-05

Tier0

  • Migration of Tier0 to puppet managed SL6 voboxes is (almost) finished. Last remaining machine is an old-style VM and runs the CERN endpoint for the P5->CERN transfer system, will be migrated over the holidays.
  • Migration to multicore for Tier0 replays is finished; we now always and regularly run multicore replays and consider them the baseline for any further Tier0 tests. When can we switch the production Tier0? If we want to use multicore for data-taking, IMO we should also use it for CRAFT and CRUZET next year, so there is some urgency here. Last time I brought it up in a joint meeting the feedback was that this isn't validated yet. Who is taking care of this validation?
  • Tier0 scale tests have reached 6000 cores, the IO issues with the VMs we have been plagued with have been mostly solved by scheduling one heavy IO job together with one processing job on the VMs. During a week long replay at full scale (at least for 3 days at the end) we only observed one VM that went into some kind of IO overload (we are still investigating this with IT).
  • Still concerns and occasional problems with the IO on the VM voboxes (but nothing that really prevents us from doing what is needed). Not doing any tests on our end as Alan is testing the same for the production schedd voboxes. As soon as we have a conclusion and recommendation from him, we'll adjust the Tier0 vobox setup.

Opportunistic

  • Various technical issues related to running at NERSC have been addressed, submitted a test workflow again on Friday. Still need to check on outcome.
  • SDSC has a working SITECONF now and is setup much more like a "normal" CMS site. A test workflow has been submitted on Friday. Still need to check on outcome.

DirkHufnagel - 2014-12-09

Multicore

  • Following the estimation for PromptReco running at T1s in multicore jobs, the fraction of resources available in multicore mode should be of 50% of T1 pledges: We need to increase the proportion of multicore pilots for T1s. Agreed in the glidein chat last week to proceed after the holidays break.
  • No advances in the other topics mentioned last meeting (N_cores in dashboard and T2 multicore tests).

CRAB3

  • The December release has just been deployed. It is meant to be the release where we officially ask the collaboration to switch from CRAB2 to CRAB3. Almost all the relevant use cases should be covered now. Only the so called "multicrab" is missing, we will see soon if it can be supported by "using the client as a library".
    • Of course we will continue to improve the stability of the tool and the user experience, implementing new features and the missing corner use cases.
  • Continuing scale tests of the latest condor version (8.3.3). We would like to request physical machines or VMs with SSDs, as we are experiencing I/O-related problems (the latest error was related to the kernel killing the schedd process because of a timeout on the Ceph block store)
  • Site readiness: following up on the issues HC is having with the priority of analysis jobs in the Global Pool. Some sites do not accept them (T1_DE_KIT, T1_FR_CCIN2P3, T1_RU_JINR, T1_UK_RAL); we are evaluating the creation of a dedicated HC frontend group to use the crab3 schedd with pilots using role=production.

Project Office News November 25, 2014

Multicore

  • Multicore jobs run at all T1s, initial tests plus now T0 PromptReco workflows. Tested 2, 4 and 8 core jobs inside 8 core pilots.
  • Need to increase the fraction of resources available in multicore mode: increase the number of multicore pilots for T1s
  • Multicore job CPU efficiency monitoring needs N_cores to be included in FJR so that it can be used by dashboard. Email discussion with WM and CMSSW experts on what we need to finally implement
  • Start testing submission of multicore pilots to T2s as soon as the glideinWMS ITB instances of factory and front-end are properly setup.
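Once N_cores is reported in the FJR, the dashboard could compute multicore CPU efficiency along these lines (a sketch; the function and argument names are assumptions):

```python
def multicore_cpu_efficiency(total_cpu_time_s, wall_time_s, n_cores):
    """CPU efficiency of a multi-core job: total CPU time over cores x walltime.
    Without n_cores, a 4-core job at 90% efficiency would wrongly show as 360%."""
    return total_cpu_time_s / (wall_time_s * n_cores)
```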

Project Office News November 11, 2014

Project Office News October 28, 2014

Global Pool (Simplification of Submission Infrastructure)

  • We have begun writing a draft support model document for glideinWMS operations for Run 2. It should be complete by Offline and Computing week.
  • Scheduler and collector nodes are fully Puppet-ized (SL6) in the Global Pool, frontend to follow shortly.
  • Deploying an ITB testbed for the glideinWMS Global Pool, where important changes will be tested before being put into production. It will connect to the GOC ITB factory.
  • Production workflows running on two scheduler nodes in the Global Pool now, and scaling up.
  • For the Tier-0, we will deploy a backup glideinWMS pool and test there as well as in the Global Pool. This is due to the higher criticality of the Tier-0 workflows, for which we may want to have a separate submission infrastructure. Waiting for hardware.
  • Continuing series of meetings between CMS and the HTCondor developers. The past few meetings have focused on results of the OSG scale tests. All top priority CMS tickets will be included in the next release 8.3.2, which will be frozen around Nov. 1.
  • Abstract submitted to CHEP15.

Tier0

  • Migration of Tier0 to puppet managed SL6 voboxes is underway, already moved production Tier0 vobox before Extended Cosmic Run, testing machines will follow.
  • Migration to multicore is basically finished in the code (lacking some configuration options), limited scale tests show no problems and good scaling (order 90% cpu efficiency with 4 cores).
  • Working out scaling issues with AI and integrating Wigner resources into the Tier0 resource mix (have workaround for Meyrin/Wigner problems, will have to see if they behave at scale).
  • Will need to understand scaling issue with multicore and especially how the system behaves with a mix of multicore and singlecore. Might need better multicore/singlecore monitoring on the glideIn pilot side (already brought up in the Multicore project).
  • Still planning to add resources to Tier0 Project for higher scale tests in November (CERN IT contacted us already, they seem to have some Wigner resources ready).

Opportunistic

  • We are still working out NERSC/parrot problems. We are making progress, low level things work, we are at the workflow injection and debugging stage.
  • Recently got a new person to work on pushing SDSC. There is no principal problem there IMO, just no one that can dedicate enough time to push this through. Hopefully this will move now.

Xrootd tests in EU

See the slides attached to the agenda: here.

CRAB3

  • We held the first office hour with CRAB3 experts:
  • Feedback on the CRAB3 Tutorial of the 17th of October: 17 people compiled the survey we prepared: https://indico.cern.ch/event/347706/manage/evaluation/results/
    • Question 6 about which feature people would like to have in the next releases is very useful IMHO
  • Prepared 3.3.11 RPMs. The CMSWEB interface will be installed on the testbed by the HTTP group this week. That's the release candidate for the November release. Most notably includes:
    • LHE workflows are now supported
    • Exit codes of failed jobs are now reported to the users
    • Horizontal couch scalability
    • Disable the default automatic kill of bad tasks (until we find a way to better report it to the users)
    • See https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3Releases for the details about the release

Multicore

  • Testing the multithreaded job submission chain as a whole by running multicore workflows (RECO on RAW data) at PIC. Jobs are configured to use 4 cores each. The CPU efficiency loss with respect to the equivalent single-core workflow is minimal (~5%)
  • On the WMAgent side, the functionality is there given the success of the multithreaded job submission, although still on patched versions.
  • 8-core pilots with partitionable slots, already in place for months at T1s, were used for the tests at PIC with no need to modify their configuration.
  • Next step, consolidate every element of the chain, in particular WMAgent code, then go to scale tests, possibly running RelVal workflows at some T1s.
  • In parallel, Dirk is also testing multicore submission in T0
  • Also in parallel, extend test of multicore pilot submission to T2s (starting at T2_FR_GRIF as offered by the site). Waiting for the ITB for glideinWMS global pool to be in place in order to proceed.

Project Office News September 30, 2014

Global Pool (Simplification of Submission Infrastructure)

  • Scalability Tests:
    • Fifth meeting with the HTCondor developers took place on Wednesday September 24, in which Edgar Fajardo presented results of the recent OSG scale tests performed at UCSD. Using workflows somewhat lighter than those CMS uses, he found that HTCondor and glideinWMS don't break down until ~30,000 running jobs per scheduler node, or ~150,000 running jobs in total according to a late update, and with ~500,000 idle jobs in the pool. We need to provision more hardware for the test in order to reach the desired scale for the Global Pool next year of 200,000 parallel running jobs. Further studies with more CMS-like configurations to come in the next couple of weeks.
    • N.B. Sites may see these test jobs as pilots that start up to 32 STARTD's per job slot, but run sleeper jobs (low CPU usage). This is expected. Please inform us if there is a problem. Some sites have experienced problems with their NAT due to the increased number of network connections.
  • We began discussing a support model document for the Global Pool for Run 2.
  • Scheduler nodes are fully Puppet-ized in the Global Pool, with the other types of nodes (collectors, frontend) to follow shortly.
  • Deploying a testbed for the glideinWMS Global Pool, where important changes will be tested before being put into production.
  • Production workflows running in the Global Pool now. Will scale up over the next weeks. Some minor certificate issues to resolve.
  • Migrated a CRAB2 scheduler node to the Global Pool.

Project Office News September 16, 2014

Global Pool (Simplification of Submission Infrastructure)

  • Puppet-ization of HTCondor installation on CERN machines nearly completed.
  • New operators being trained this month and next.
  • Production scheduler integrated into the Global Pool and ramping up production. Some service certificate issues to iron out.
  • For the first time, this week CRAB3 activity has exceeded CRAB2 at times.
  • We have some upcoming deliverables in terms of documenting a proposed support model for glideinWMS, and also a plan for running analysis on the AI.

-- JamesLetts - 16 Sep 2014

Project Office News September 2, 2014

Global Pool (Simplification of Submission Infrastructure)

  • No news since last week. Stable operations.
  • We did start the discussion with the HTCondor developers about having different fair-shares for different resource types (T1 vs. T2). It may be fairly straight-forward to implement.

-- JamesLetts - 01 Sep 2014

Multicore

  • No news from multicore pilot scheduling part since mid August: running OK at all T1s, filled with single core jobs.
  • Email discussions (Eric, David, Jose) during second half of Aug about job-pilot matching, and the model we have in mind: do we want/need to send both single and multicore pilots to sites? should we get rid of single core pilots completely? can sites accept multicore pilots exclusively?
    • Our pilots (multicore with partitionable slots) can handle both single-core and multicore jobs simultaneously.
      • what about analysis and prod jobs? both mixed in the same mcore pilots?
    • However, ATLAS will continue to send single core and multicore jobs separately: shared sites expressed concerns about an "all multicore pilot model" and how mixing single core and multicore may preserve good farm usage
      • check site by site?
  • No conclusions yet.

-- AntonioPerezCalero - 02 Sep 2014

Project Office News August 26, 2014

CRAB3 Production Readiness

Summary of Friday's slides about CRAB3 production readiness (https://indico.cern.ch/event/336437/contribution/0/material/slides/1.pdf)
  • September release will fix some of the issues presented at the CSA users meeting
    • Possibility to run a portion of the dataset
    • Managing additional output files
    • Use crabclient as a library
  • Two major use cases are still not covered in CRAB3:
    • Ability to use LHE and other generators
    • Running a user-provided script instead of cmsRun
  • Fixes that will help reduce the operational effort:
    • See slide 8 of Friday's talk

Global Pool (Simplification of Submission Infrastructure)

  • Current Configuration: Four scheduler nodes are installed in the glideinWMS pool. Three are for CRAB3, which are in production for CSA-14 users and other analysis users, and one node is for production, which is being tested with WMAgent.
    • Two of the analysis nodes were corrupted by CERN/IT due to an operator error. We will re-install the nodes with Puppet soon.
    • We have had issues with running SCHEDDs blocking on VMs at CERN at scales above 10,000 running jobs. The one SCHEDD on a physical machine at UCSD has not had recent meltdowns.
  • Global Prioritization has been enabled (50% Analysis and 50% Production), and a special frontend group for the Tier-1 sites set up to handle the 95% production share, still using the production role in the pilot. We will need to request development from the HTCondor team to allow for this multi-leveled prioritization so that the special Tier-1 group can be eliminated.
  • We need to discuss management structure of the Global Pool longer-term.
  • One new CAT-A operator will begin at CERN on September 1, and James will be at CERN during the week of September 15 to help train him. Alison is leaving at the end of September and is focusing on documentation and Puppet activities. Interviews for a replacement for Lola are ongoing.
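The 50/50 global split described above is the kind of policy HTCondor expresses with hierarchical group quotas; a minimal negotiator-side sketch (group names hypothetical, and the special Tier-1 frontend group is omitted):

```
# Sketch of negotiator group quotas for a 50% analysis / 50% production
# split (group names hypothetical). Surplus sharing lets one side use
# idle capacity of the other.
GROUP_NAMES = group_analysis, group_production
GROUP_QUOTA_DYNAMIC_group_analysis = 0.5
GROUP_QUOTA_DYNAMIC_group_production = 0.5
GROUP_ACCEPT_SURPLUS = TRUE
```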

-- JamesLetts - 25 Aug 2014

Project Office News August 12, 2014

Global Pool (Simplification of Submission Infrastructure)

Current Configuration:

  • Four scheduler nodes are installed in the glideinWMS pool. Three are for CRAB3, which are in production for CSA-14 users and other analysis users, and one node is for production, which is being tested with WMAgent. (Question about glexec)
  • Global Prioritization has been enabled on the glideinWMS frontend, but not yet tested. (Question: T1 vs T2)
Planning:
  • We need to discuss the management structure of the Global Pool in the short-term among ourselves, balancing the needs of Analysis Operations, Production, Development and Integration.
Operations:
  • Working on putting the HTCondor installation into Puppet this month.
  • There has been some instability in the schedulers during the past week, apparently due to ARGUS using all the file descriptors on the machines, causing the schedd processes to exit.
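File-descriptor exhaustion of this kind can be spotted by comparing a process's open descriptors against its limit; a minimal diagnostic sketch (assumes a Linux /proc filesystem; hypothetical helper, not an actual CMS tool):

```python
import os
import resource

def fd_usage(pid=None):
    """Return (number of open file descriptors, soft fd limit).
    Assumes Linux /proc; checks the calling process by default, and the
    limit reported is the calling process's own."""
    pid = os.getpid() if pid is None else pid
    n_open = len(os.listdir(f"/proc/{pid}/fd"))
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return n_open, soft

print(fd_usage())
```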
Scalability:
  • HTCondor 8.3.0 has been deployed on the Analysis Operations pool at UCSD (CRAB2). This version contains the latest Negotiator improvements. We have easily reached scales of 60,000 running jobs, and presumably can run at much higher scales.
  • Currently deploying HammerCloud for CRAB3 in this pool in order to push the scale higher.
  • However, during the past few days there has been some instability in the scheduler nodes, which may or may not be related to this version of HTCondor. We are investigating.
  • The upcoming version of HTCondor 8.3.1 will include further improvements for scheduler node stability.

-- JamesLetts - 12 Aug 2014

Tier0

  • MWGR processing in production Tier0 instance in semi-manual mode (triggered after MWGR) due to bookkeeping of data not being available
  • AI resources reduced to <3k cores; will have a larger replay soon to keep them busy for a few days to a week
  • Started submitting PromptReco to T1, still some problems to be debugged

-- DirkHufnagel - 27 May 2014

Opportunistic

  • The newest glideinWMS/HTCondor supports submission to batch farms; tested at NERSC, soon to be run at SDSC.

-- DirkHufnagel - 27 May 2014

Multicore

  • We are now running multicore pilots at all CMS T1s: PIC, KIT, RAL, JINR, CCIN2P3, CNAF and FNAL.
  • Running together with ATLAS multicore jobs at shared sites with no additional requirements.
  • Multicore pilots are used to run 8 single core jobs simultaneously, while waiting for the real multicore application (any news on this?).
  • The efficiency of CPU usage has been reported to be lower than that of single-core pilots. Pilot parameters have been recently retuned to get better CPU usage, to be checked as soon as we have some statistics from the new pilots' results.

-- AntonioPerezCalero - 12 Aug 2014

Project Office News June 16, 2014

AAA

Here are a few CSA14 operational issues that are on my mind at the moment:

  • Support for redirectors in the EU. All of the redirectors are redundant through DNS round-robin at different sites, so it is unlikely that there would be total failure in the redirector system. However, it is not clear that we can guarantee 24/7 coverage of the EU redirector, as this would be beyond the MOU agreement of the hosting sites. Thus it is possible (although not probable) that we could lose the AAA service in Europe outside of business hours. While one could argue that analysis is not 24/7 critical (we actually let the students sleep?), this may not be acceptable for production. CompOps should comment on this matter. We might have to adjust expectations, or move the EU redirector to CERN.
    • Related item: there is an xrootd patch needed for failover to work correctly -- is this in the release to be used for CSA14?
  • Scale test performance. We have now identified several sites that have worrisome performance either in file opening rates or file read rates. They are: T1_DE_KIT, T2_DE_RWTH, T2_ES_CIEMAT, T2_IT_Legnaro, T2_IT_Rome, T2_RU_JINR, T2_FR_IPHC, T2_HU_Budapest, T2_UA_KIPT, all of which have low rates for file opens. T1/T2_FR_CCIN2P3, T1_RU_JINR, T2_AT_Vienna, T2_IN_TIFR, T2_UA_KIPT all seem to have some sort of hard bandwidth limitation. See interesting plots here. We have started to make contact with the individual sites on these matters.
  • Shifter instructions. We do have SLS monitoring available now, and it's visible in the services grid map, but still lack directives to shifters on how to react to problems. Nothing stops us from completing this before CSA14.
  • Detailed monitoring. Not a show-stopper, but we still need a number of sites to deploy this. A reminder was sent to sites on Monday. Sites that don't have this should not be counted in any site-monitoring metric.

Project Office News May 27, 2014

Global Pool (Simplification of Submission Infrastructure)

We are installing a second scheduler machine at CERN for CRAB3 analysis in the Global Pool in advance of the next scale test. Also we are investigating recent problems in the Analysis Operations pool involving schedulers blocking under high submission rates and dropping connections with running jobs under high load.

-- JamesLetts - 27 May 2014

AAA

Nothing dramatic to report, but we continue to press along towards readiness for CSA14. We have now run scale tests at 5 T1 sites and 25 T2 sites; we're still trying to digest the results but will subsequently contact sites that do not perform well and perhaps be able to suggest fixes. We've made contact with DPM developers but need to talk more actively to dCache, too. We have some decent redirector monitoring in place, and will soon generate instructions for shifters; we want this in place by the start of CSA14. Getting sites to deploy the detailed monitoring remains somewhat slow going; need another round of poking people soon. We're in touch with the CERN IT dashboard team about making sure that all of this information is displayed correctly.

-- KenBloom - 27 May 2014

Tier0

  • Luis (Fermilab) is running Tier0 replays to test some functional aspects, currently debugging read problem with EOS
  • EOS cleanup (HI data) ~now (should have been brought up in today's XEB), will start setting up larger scale tests (have 5600 cores now)
  • CompOps can start using T2_CH_CERN_T0 through the normal production infrastructure
  • Data placement model discussed/finalized at Fermilab (also how configurable we want this to be), next item on the development agenda (but no support for it in WMCore at the moment, waiting on that, helping there a bit).

-- DirkHufnagel - 27 May 2014

Opportunistic

  • In the process of moving opportunistic resources into production pool, can run MC backfill once that is done.
  • ASGC cloud resource available for CompOps, final test ongoing, then will be enabled for production (together with regular CE resources).
  • Have separate glidein system at Fermilab to do BOSCO integration work, plan to have that operational (at some limited functionality at least) in a few weeks to start testing workflows at NERSC.

-- DirkHufnagel - 27 May 2014

Project Office News May 13, 2014

Global Pool (Simplification of Submission Infrastructure)

A new glideinWMS pool has been set up at CERN as the prototype for the Global Pool, comprising:

  • One HTCondor scheduler (vocms96) for CRAB3
  • One glideinWMS frontend (vocms0167)
  • Two High-Availability HTCondor collectors (vocms097 and vocms099)

The pool was deployed before the last CRAB3 scale test and was used as the submission infrastructure for the production CRAB3 Server. Monte Carlo Production will migrate independently. In the near future we will (re)deploy another large scheduler machine in the Global Pool for CRAB3.

When the Global Pool is fully deployed, merging production and analysis, it will have O(100,000) cores at its disposal. Therefore, studying scale limitations in glideinWMS is essential for its success. We have been making progress. In March, the Analysis glideinWMS pool was limited to ~40K running jobs. Configuration changes (basically reducing the frequency of collector updates) have allowed us to run ~50K glideins in a single pool but not more.

As reported in the last meeting, Todd Tannenbaum of the HTCondor development team built us a release of HTCondor which allows us to increase the size of UDP datagrams for the collector updates to more closely match the size of the glidein ClassAds in CMS, which are much larger (~30kB) than the default UDP packet size in HTCondor (1kB). The small packet size relative to the size of the updates had caused a breakdown in the ability of the collector machines to re-assemble the updates from the glideins after a certain scale (40K running jobs). With the new code, we observe much better performance. The collector machine is rarely pinned to 100% CPU usage now, and it is thought that collector updates will not be a scale limitation in the medium term. Igor Sfiligoi will continue to communicate with the HTCondor developers to investigate scale limitations in our glideinWMS pools, focusing next on the Negotiator.
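The arithmetic behind the improvement is simple: the number of datagrams per update, and hence the reassembly load and the sensitivity to losing any one fragment, drops sharply with larger datagrams. A sketch using the sizes quoted above:

```python
def datagrams_needed(update_bytes: int, datagram_bytes: int) -> int:
    """Number of UDP datagrams a collector update occupies when sent in
    fixed-size chunks; losing any one chunk discards the whole update."""
    return -(-update_bytes // datagram_bytes)  # ceiling division

classad = 30 * 1024  # ~30 kB glidein ClassAd update (size from the text)
print(datagrams_needed(classad, 1 * 1024))   # default 1 kB datagrams -> 30
print(datagrams_needed(classad, 16 * 1024))  # enlarged 16 kB datagrams -> 2
```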

-- JamesLetts - 12 May 2014

CRAB3

  • Successful monthly scale test during the last week of April. See these slides for a detailed analysis of the results. It took us some effort to reach the 20k running jobs scale (competing with CRAB2, few users submitting jobs, not enough jobs in queue, some task submission failure rate, some tasks limited to few sites, etc). Now addressing issues encountered (mainly unresponsive schedd blocked in the communication with the collector).
  • Used in the scale test the standard glidein analysis pool (connected to the pre-production setup) and the new global pool (connected to the CRAB3 production setup).
  • Very useful feedback from a physics user (Ian MacNeill)
  • Andres has prepared a very nice CRAB3 tutorial

-- JoseHernandez - 13 May 2014

AAA

  • Test 1 (file opening): We are getting a few more sites into the game. Good news is that we see excellent scaling in the T1_US_FNAL EOS system (the _Disk PhEDEx endpoint). Less good news is that dCache doesn't look so good there. We still have to work more thoroughly on understanding the scaling issues in dCache and DPM systems in general. By my count we have done these tests at 2 T1 sites and 21 T2 sites; note there are a total of 41 T2 sites in the federation.
  • Test 2 (file reading): Recent results on a number of European sites are quite encouraging, showing appropriate scaling behavior up to 800 simultaneous jobs reading at 0.25 MB/s each. I for one need to understand how to square this with the less good results in the file-opening tests. Here we have tested 2 T1 sites and 17 T2 sites.
  • Test 3 (client hosting): This test is still in development, hope for more news soon.
  • Test 4 (total chaos): We think this is going to happen in CSA14.
  • Other news: We now have an SLS test that as of this morning is even in the critical services grid! Next step there is to start developing instructions for CSP's and CRC's to keep an eye on things. We are resuming the push on getting all sites to deploy the detailed monitoring, so that we can see what's actually going on everywhere. With many dCache sites moving to version 2.6, it should be easier for them to deploy the appropriate plugin. However, we realized that we've got a problem with multi-VO sites; right now there is no way to separate the traffic from different VO's. We'll propose some short-term fix to get us through CSA14, but we'll need some development effort over the next few months to really get things separated.

-- KenBloom - 13 May 2014

Tier0

  • Started training new Tier0 operator (Luis at Fermilab), will help with day-to-day replays to test various things
  • EOS cleanup (HI data) May 25, until then only limited scale tests possible
  • Data placement model discussed/finalized at Fermilab (also how configurable we want this to be), next item on the development agenda
  • not a showstopper, but Ops want to move the CERN cloud resources into the production glidein factory setup (we can use it via the HLT factory for now)

-- DirkHufnagel - 13 May 2014

Opportunistic

  • In the process of moving opportunistic resources into production pool, can run MC backfill once that is done.
  • ASGC cloud resource testing ongoing, working out problems with image configuration (and how it interacts with the ASGC environment).
  • Have separate glidein system at Fermilab to do BOSCO integration work, plan to have that operational (at some limited functionality at least) in a few weeks to start testing workflows at NERSC.

-- DirkHufnagel - 13 May 2014

Multicore

  • As reported in detail in the meeting (see slides https://indico.cern.ch/event/318115/contribution/1/material/slides/0.pdf), the main update is that we are now running multicore pilots at most T1s (PIC, KIT, RAL, JINR, CCIN2P3). CNAF is ready to be tested; FNAL and ASGC to follow soon.
  • From the results so far, after one month running at PIC and a week at the other sites, the multicore pilots are working fine, pulling 8 single-core jobs and running them with reasonable efficiency. However, we should optimize the number of pilots pending at a given site in relation to its workload, in order to avoid wasting our share of CPU on unused idle glideins.
  • The main focus of the project work should now go into interacting with monitoring experts for glideinWMS in order to develop our tools.
  • The real work has also just begun on evaluating the performance of our model when used together with ATLAS multicore jobs at shared sites. This will be discussed in coming meetings of the WLCG multicore task force.

-- AntonioPerezCalero - 13 May 2014

Project Office News April 15, 2014

Disk/Tape separation

  • FNAL changed the TFC last week on April 8th, all sites migrated.

Tier0

  • Tier0 on AI is functionally completely working. Ran one replay using ~7% of the available streamer files (18TB input); did not yet manage to use the maximum number of cores available at the time.
  • Had to stop scaling tests, resources needed for HI processing. Can only resume when that is done and some EOS space available (500TB for full replays).
  • April Global Run used LSF (no resources on AI). New PCL on Cosmics, new Tier0 Data Service and full loop of conditions upload and use in PromptReco has been tested.
  • HLT Factory connected to production frontend, Tier0 replays switched to that setup (HI processing did NOT use the production frontend though).
  • Still sanitizing the list of streamers we saved from 2012/2013; worked out some problems, more remain. Will only get fixed gradually from one replay to another.
  • Next: Implement new data placement model (RAW/RECO/etc on disk/tape at different sites) and have replays use larger AI scales efficiently.

Opportunistic

  • In the process of moving opportunistic resources into production pool, can run MC backfill once that is done.
  • Meeting with Igor/Alison during the Comp&Offl week resulted in a new way to define opportunistic frontend groups in glidein. Need to test this, then apply it to the production setup.
  • ASGC cloud resources accessible with test jobs, testing CMS workflow is next. Idea is to treat it as an extension of the already opportunistic ASGC T1.
  • Have separate glidein system at Fermilab to do BOSCO integration work, will work on this after Easter together with Stefan Piperov and Fermilab glidein team.

WMAgent Latency

  • We have a cron job set to mine WMStats statistics
  • From a development standpoint we are interested in automatic transitions
    • Total time is "first job submitted" to completed
  • Ops/PPD is interested in a longer time
    • Total is from WF acquired by an agent to announced by Ops
  • Saving lots of information for each workflow:
    • acquire, closeout, announce, completed times
    • first job submitted, last job completed
    • percent progress: 1, 10, 25, 50, 65, 75, 80, 85, 90, 95, 98, 99% complete
      • Currently tracking job #'s, will soon also track lumis/events
  • WMStats does not save %-done snapshots, so the data mining exists to provide them.
  • Future:
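The percent-progress tracking described in the list above can be sketched as follows (function and argument names hypothetical; the real mining reads per-job records from WMStats and will also track lumis/events):

```python
def milestone_times(completion_times, milestones=(1, 10, 25, 50, 75, 90, 95, 99)):
    """Given completion timestamps of all jobs in a workflow, return the
    time each percent-complete milestone was first reached (by job count)."""
    times = sorted(completion_times)
    n = len(times)
    reached = {}
    for pct in milestones:
        k = max(1, -(-n * pct // 100))  # ceil(n * pct / 100), at least one job
        reached[pct] = times[k - 1]
    return reached

# e.g. 10 jobs finishing at t = 1..10: the 50% milestone is reached at t = 5
print(milestone_times(range(1, 11))[50])
```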

CRAB3

  • Sorry, I (José) won't be able to attend the meeting (on vacation, lost in the mountains :-) )
  • I'm filling in the news from the integration side here (development and Analysis Operations should complement it)
  • Having reached full scale in the March scale test, for the April scale test (starting the 28th) we would like to have the participation of power users from the physics groups and to concentrate on running a variety of workflows. It is very important to expose users to the new system well in advance of the start of CSA14 to get the appropriate feedback. By the way, how is the negotiation with Physics Coordination to provide power users for the testing going?
  • We have prepared in the CRAB3 integration twiki configuration examples and instructions for running workflows reading a dataset, skimming a dataset and private MC production, but users should bring their own configurations for the testing so that we can verify that all workflows that run in CRAB2 are supported in CRAB3.
  • The CRAB3 services have been deployed in production behind the cmsweb interface. Users will submit their workflows in this production instance.
  • A new glideinWMS analysis pool is being set up with schedd's, 2 HA collectors and the frontend for CRAB3 testing (see below).
  • CRAB3 cannot run yet on user-produced datasets. Some small fix is required to check the location of blocks in the local DBS instance instead of PhEDEx.
  • The status of DBS publication (after successful ASO transfer) has been incorporated into the job status and it is available to the users via the client.

Global Pool (Simplification of Submission Infrastructure)

  • A new glideinWMS pool has been set up at CERN, comprising:
    • One HTCondor scheduler (vocms96) for CRAB3
    • One glideinWMS frontend (vocms0167)
    • Two High-Availability HTCondor collectors (vocms097 and vocms099)
  • Plan is to fully deploy this pool in advance of the next CRAB3 scale test during the last week of April. Production will migrate independently.
  • We are making big progress in understanding scalability limitations:
    • In March, the deployed glideinWMS analysis pool was limited to ~40K running jobs.
    • Configuration changes (basically reducing the frequency of collector updates) have allowed us to run ~50K glideins in a single pool.
    • Further gains of an order of magnitude or more seem possible to achieve in the short-term by switching to TCP collector updates or larger UDP packets, and eliminating unneeded ClassAds. See figure below. Solutions to modify the behavior of the collector as previously discussed may not be needed.
  • Status as of April 14:
    • Machines configured in Puppet
    • glideinWMS pool fully deployed and configured from git
    • Some ports (such as port 80 from the frontend vocms0167) still need to be opened for connections from outside CERN for the pool to become functional for testing by the CRAB3 development team.

global-pool-apr-15-plot.jpg
Figure from Todd Tannenbaum (HTCondor developer) via Igor Sfiligoi: "Experiment where each stream sent 40 representative glideinWMS startd updates per second. UDP with 1k MTU (what glideinWMS is currently using) peaks at 22 streams at 850 hertz, then the hertz nose dives rapidly once the incoming UDP receive buffer does not have enough free space to hold the incoming fragments. The collector does a lot of work but rarely is able to reassemble a complete message. TCP and UDP w/ 16k MTU perform about the same - most importantly, they do not self-destruct when overloaded. Instead, they continue to perform at their peak of ~1000 hertz."

AAA

We've been making some nice progress on the scale testing. The results aren't always nice, but at least we have them in hand!

The file-open tests ("Test 1") have been expanded to many sites in Europe, and results can be seen at this twiki. A brief summary of the behavior is that hadoop and GPFS file systems can scale just fine in this test, up to 250 Hz (which remember is well beyond what we think we'd possibly need). But dCache, DPM and L-Store systems generally do not. There are two different behaviors -- in some cases sites just seem limited to a very low rate independent of the attempted rate of file opens, while in others the rate of opening does increase with the request rate, but not as much as it should. We are probably hitting issues with both the hardware at each site, and with the different file management systems. We are making contact with the developers of the different systems and will work with them to understand what's going on. In the particular case of L-Store, which exists only at Vanderbilt and thus was very much of interest recently, there is a new version deployed that is expected to have better scaling. We will test it out soon.
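Schematically, the file-open tests boil down to measuring a sustained open rate; a simplified sketch (in the real tests the opener would be an xrootd open through the redirector; here it is any callable):

```python
import time

def sustained_open_rate(open_one, attempts=100):
    """Attempt 'attempts' file opens back-to-back and return the achieved
    rate in Hz. 'open_one' stands in for an xrootd open via the redirector;
    failed opens count against the rate."""
    start = time.monotonic()
    ok = 0
    for _ in range(attempts):
        try:
            open_one()
            ok += 1
        except OSError:
            pass
    elapsed = time.monotonic() - start
    return ok / elapsed if elapsed > 0 else float("inf")
```

Comparing the achieved rate against the requested rate distinguishes the two failure modes described above: a hard ceiling versus sub-linear scaling.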

As a separate but related matter -- for these tests to work, we need to have the right TFC changes implemented at all sites. We've again asked sites to do this, and several more have.

The file-read tests ("Test 2") will soon be expanded to Europe. More news next time. Meanwhile we have taken to running them nightly at the US sites to track long-term behavior.

Discussions have started on the client-hosting tests ("Test 3"). We will get help from the glide-in experts at UCSD (Igor and Edgar) to commandeer specific numbers of slots at each site and run the file-reading tests above.

We're presuming that the "total chaos" test will be taking place during CSA14.

Multicore

  • Apologies, I (Antonio) will not be able to attend the meeting, as I am currently on vacation
  • As described during the past Offline and Computing week, we are proceeding in a two-step plan: a technical test (how to send multicore pilots to each of the sites) followed by a scale test (fill the multicore pilots with real jobs; multiple single-core jobs for now, as we don't have a multicore application ready yet).
  • For the second step, in order to test a site's scheduling capability for multicore pilots gradually, we keep sending single core pilots there, so the site receives its assigned workload through a mix of single and multicore pilots. The trick is to adjust the relative fraction of pilots of each type by
    • Keeping both types of entries enabled in the production glidein factories
    • Setting the number of pilots per factory entry accordingly
  • The main highlight is that we have already started scale testing at PIC. According to what I just described, last week Alison added a PIC multicore entry to production, and the site is currently running real work through a mix of single core and multicore glideins, as expected. This opens a very interesting period where we should advance in several fronts:
    • monitoring: it is essential that we learn how to monitor this new type of pilot, both from the site and from the central glideinWMS side. This includes in particular measuring what fraction of the time the pilot has actually spent running payload, as opposed, for example, to time used for job scheduling, partitionable-slot internal reconfiguration, etc.
    • optimization: the relative length of pilots and jobs, along with some other parameters regulating pilot behavior, should be adjusted to minimize CPU wastage derived from scheduling (not referring to inefficiency inherent to the application!), especially during the draining period at the end of each multicore pilot's lifetime.
    • scheduling: our T1 sites might need to re-optimise their local job scheduling algorithms in order to be able to deal with multicore pilots jobs. Some sites have already experience from ATLAS multicore jobs. PIC is the first to receive both ATLAS and CMS multicore jobs, so it should serve as a testing grounds to check the compatibility of the two very different submission models.
  • In parallel, we have also started technical testing at other T1s. RAL, KIT and ASGC's multicore clusters work (filled with sleep jobs). For CCIN2P3 and CNAF, we are pinging/working with the admins in order to start soon. FNAL will also be tested soon. The target date for this part of the project (middle of May) looks definitely feasible.
  • Information for the status of the project is being regularly posted and updated in the twiki for the project
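Adjusting the relative fraction of pilot types per site amounts to simple slot accounting; a sketch (names and interface hypothetical; the real control is via the factory entry limits described above):

```python
def pilot_mix(site_slots, multicore_fraction, cores_per_pilot=8):
    """Split a site's assigned slots between 8-core pilots and single-core
    pilots, given the desired fraction of slots served by multicore pilots.
    Returns (n_multicore_pilots, n_singlecore_pilots)."""
    mc_slots = int(site_slots * multicore_fraction)
    n_mc = mc_slots // cores_per_pilot
    n_sc = site_slots - n_mc * cores_per_pilot
    return n_mc, n_sc

print(pilot_mix(1000, 0.5))  # -> (62, 504): 62*8 + 504 = 1000 slots
```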

Project Office News March 4, 2014

Tier0

  • Tier0 on AI is functionally complete and working (from a code interacting with infrastructure viewpoint).
  • Working towards a workable April Global Run configuration (CMSSW version, Global Tag etc).
  • Resource utilization problems, see job failures when scaling up number of jobs (most likely memory related, still investigating).
  • Connecting HLT factory to production frontend, CompOps will then be able to use T2_CH_CERN_T0 (would not recommend it until the job failure issue is resolved).
  • Once we can reliably use T2_CH_CERN_T0 for normal workflows we'll scale up the resources (pre-request for 2000 cores already communicated to CERN IT).
  • In parallel I've been sanitizing the list of streamers we saved from 2012/2013. Lots of problems (missing files, zero size files), working around them somehow.

Opportunistic

  • In the process of moving opportunistic resources into production pool, can run MC backfill once that is done.
  • Some discussions with Igor on how to dynamically define opportunistic frontend groups in glidein (automatically send CMS jobs to all sites that set the 'support CMS VO' flag but are not a CMS site).
  • Adding ASGC cloud resources still work in progress.

Resource Management Office

  • Discussion with C-RSG members on Wed March 5, regarding the Resource Request and Utilization documents submitted lately
  • Preparation of the ESP (EPR) 2014 procedure, that is (i) the definition of various Computing sub-projects and Needs, and (ii) the announcement to all CMS Federation/Site contacts regarding their expected Site Contribution pledges + their duty to add to the Site Pledge section in SiteDB the intra-federation resource pledges for 2014.

Simplification of Submission Infrastructure

  • Global pool/global prio work is progressing
    • Alison working on setting up collector and frontend machines for analysis global pool, requested & received second set for production.
    • Current plan for production is to gradually move agents over to the new pool once it's ready, decommissioning the old setup.
      • During this, agents will move from half a dozen hardwired prioritized "teams" to a single global priority, provided by requestors.
      • Requestors made aware of the priority plan, and their responsibility in it a month ago
      • We want to do this transition carefully, since we've never run the agents quite in this way before. (work potentially being spread over O(10) agents.)
    • Hope to have CERN-based analysis and globally prioritized production frontends this month, with work to merge the two afterwards.
  • Multicore testing continues
    • Using new glidein release which should improve dynamic partitioning performance
  • Many things learned about glidein scaling during the CRAB 3 scale testing
    • Found scaling limits in number of idle pilots, memory and disk issues from the CRAB 3 DAG's, ...

Disk/Tape Separation

There needs to be a policy discussion now that (as we see below) we're nearly in a fully Disk-Tape separated world. By default should we always write production data to disk and selectively subscribe to tape? Currently production always custodially subscribes produced data to tape, independent of its quality.

  • Site progress
    • T1_ES_PIC - DONE
    • T1_US_FNAL - migration to the Disk endpoint finalized yesterday. Need to choose a date for the switch of the site-local config and run consistency checks on the 2 PB not migrated.
    • Started consistency checks at T1_FR_CCIN2P3 to fix incorrectly injected files; will need to do at all Disk & MSS endpoints.

  • SiteDB:
    • merge T1 Disk sites into respective T1 sites in SiteDB (FNAL, JINR, RAL)

CMS Space Monitoring

  • Documentation and tools for Phase 1 deployment at the sites are ready and tested by the pilot sites.
  • Sites can start producing storage dumps on a regular basis and store them locally.
  • At least one record should be uploaded to the central database to verify the access and for us to know the site has infrastructure in place.
  • Currently there are no new records in the database:
Site                 Last record
T1_DE_KIT            2012-10-11
T1_DE_KIT_MSS        2012-08-13
T1_ES_PIC            2011-11-26
T1_FR_CCIN2P3        2011-11-26
T1_IT_CNAF           2011-11-26
T1_NT_HHJ            2011-10-18
T1_US_FNAL           2011-11-26
T1_US_FNAL_Buffer    2012-02-20
T2_CN_Beijing        2012-06-01
T2_Test_Buffer       2014-02-10
T2_Test_MSS          2011-10-18

  • We need more advertisement for the sites to get involved.

AAA

It's hard to keep it to a few lines....

  • Deployment: I count 40 T2 sites now in the federation, and on the T1 side PIC should be joining soon and KIT is supposedly working on it (but there was confusion at the ops meeting yesterday).
  • I believe we're still waiting on RAL to test out the new version of the proxy server so that they can try out fallback again.
  • Scale tests: My information is a bit stale, but on the file-open tests we continue to work with some US sites on their configuration to make sure they work as well as the others. No real action in the EU yet; should speed up now that DBS3 is settling down. Will get more information about file-read testing shortly.
  • Nicolo and I talked with Jan Iven today about the redirector at CERN. We have something of a plan to work on getting it monitored and report that to shifters; need to work with ops team about what would be done with that information. In fact, look at this already!
  • An issue for computing management: if we want the CERN redirector to be a 24x7 service with gold-plated support, this needs to be requested from CERN IT, and it will be some kind of negotiation over who is responsible for it.
  • It does seem that the SAM test for xrootd access requires more thought, so that we make sure that it's testing the sites rather than the redirector infrastructure. But we might want to postpone this until the SAM3 framework is available.

Project Office News Feb 4, 2014

Insert Individual Project Updates here

Resource Management Office

  • The Resource Utilization 2013 and the Resource Request 2015-16 Reports have been internally reviewed and are now in the hands of CMS Management. To be submitted to the C-RSG by Feb 14.
  • The ESP 2013 Accounting Year has been completed. The Site Contribution credits for 2013 are according to expectation. The ratio between the Work Done and the Pledges for the rest of the Computing project (non-SC) was 99.9%, while the ratio between the Pledges and the Needs was ~60%. The latter deficit is not unusual for Computing compared to previous years, and it is being analyzed with the coordinators of the main concerned projects. This result is an important input for the definition of the 2014 Needs, to be processed during February. A more detailed presentation on ESP 2013 credits will be provided soon.
  • On Thu Jan 30 two Computing Shifts (CSP) tutorials were given, in particular for the new shift personnel located in the 3 main time zones. New shifters make up 22% of the 108 active shifters compared to last year, and attendance at the tutorial was of that order. The 2014 CSP shift coverage until March is 100%, and from subscriptions it looks like it will continue at this level. It is important to find out from CMS Management whether the potential increase of central shift weights (for CSP: 0.75 --> 1.00 and 1.25 --> 1.50), as presented at the XEB last December, will materialize.

Information System Redesign

  • Maria Alandes (CERN/IT) is evaluating AGIS as the tool for CMS, based on the requirements discussed in the O&C week (see this).
  • First results are very positive, and it would match our requirements quite well.
  • She is working to make some CMS-specific data (for instance, from SiteDB) available through the Information System, and evaluating the effort required to do so. For the non-CMS sources of information (i.e. BDII), most of what we need is already there.
  • As soon as she gets some CMS data on it, she will make an AGIS testing instance available for people (CompOps) to give it a try and provide feedback.
  • From OLI:
    • KIT asked for information sources they need to support, Christoph Wissing is collecting information
    • AGIS sounds interesting, in Ops, we are starting to use site status board tables to do something similar (I would call that a poor man's version of AGIS)
  • From Tony:
    • I'd like to know about the security model for writing information to AGIS, both via the web and through APIs. For completeness, we should also consider if we need security for reading information.
    • Can we map and describe transient (opportunistic) resources in AGIS?
    • How does it scale in terms of DB size, interaction rates, number of clients etc?

Simplification of submission infrastructure

  • From OLI: requested slot in Joint meeting to discuss changes in McM prioritization
  • From Dave: Though we are not waiting for the McM folks, we will start working towards global prioritization once the dust settles from the DBS3 migration. Some details are yet to be hammered out operationally, in terms of how agents are configured and the actual logistics of transitioning to a "teamless" arrangement of several agents.

Savannah - GGUS migration

  • Ready to move the first squads from Savannah to GGUS support units
  • Discussions with sites have started to make sure that site contacts are receiving GGUS tickets

AAA

I'll be reporting at this meeting, but the short story is that Test #1 (scaling of file opening) is underway. What we've learned so far is that we can open files through the redirector at 250 Hz, provided the target site has the right xrootd configuration. As soon as we have the details nailed down, we'll propagate this information out to sites.
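The 250 Hz figure is an aggregate open rate, and measuring it amounts to timing many open/close cycles. A rough sketch of such a rate measurement, using local file opens as a stand-in for remote xrootd opens (a real test would open `root://` URLs against the redirector instead):

```python
import os
import tempfile
import time

def measure_open_rate(path, n=1000):
    """Time n open/close cycles on path and return the rate in opens per second."""
    start = time.perf_counter()
    for _ in range(n):
        with open(path, "rb"):
            pass  # open and immediately close, as in a file-open scale test
    elapsed = time.perf_counter() - start
    return n / elapsed

# Local stand-in for a remote file; only the timing structure carries over.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    print(f"{measure_open_rate(path, n=200):.0f} opens/s")
finally:
    os.remove(path)
```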

Production Tools (WMAgent) Evolution

  • DBS migration planning is proceeding
  • A Thursday meeting is scheduled (16:30) to discuss multi-core issues with the glideinWMS and WMAgent developers
  • Valentin has begun looking at CouchDB scaling issues and trying to separate out which details of a job are needed for status and which are to be archived
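Separating a job document into the fields needed for live status reporting and the bulkier detail that could be stored elsewhere might look like the following sketch; the field names here are hypothetical illustrations, not the actual WMAgent/CouchDB schema:

```python
# Hypothetical set of fields a live status view actually needs.
STATUS_FIELDS = {"jobid", "state", "site", "retry_count", "timestamp"}

def split_job_document(doc):
    """Split a full job document into a slim status record and the remainder."""
    status = {k: v for k, v in doc.items() if k in STATUS_FIELDS}
    detail = {k: v for k, v in doc.items() if k not in STATUS_FIELDS}
    return status, detail

job = {
    "jobid": 42, "state": "running", "site": "T2_XX_Example",
    "retry_count": 0, "timestamp": 1391500000,
    "fwjr": {"steps": ["cmsRun1"]},  # bulky framework job report
    "logs": ["..."],
}
status, detail = split_job_document(job)
print(sorted(status))  # ['jobid', 'retry_count', 'site', 'state', 'timestamp']
```

Keeping only the slim status records in the hot database and moving the rest out is one standard way to relieve document-store scaling pressure.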

Topic revision: r106 - 2015-10-13 - JamesLetts