WLCG Operations Coordination Minutes, January 26th 2017
Highlights
- The new WLCG accounting portal has been validated.
- Please check the new accounting reports: if no major problems are reported, they become official as of Jan.
- Please check the baseline news and issues in the MW news.
- Long downtimes proposal and discussion.
- Tape staging test presentation and discussion.
Agenda
Attendance
- local: Alberto (monitoring), Alejandro (FTS), Alessandro (ATLAS), Andrea M (MW Officer + data management), Andrea S (IPv6), Andrew (LHCb + Manchester), Jérôme (T0), Julia (WLCG), Kate (WLCG + databases), Maarten (WLCG + ALICE), Maria (FTS), Marian (networks + SAM), Vincent (security)
- remote: Alessandra (ATLAS + Manchester), Catherine (IN2P3 + LPSC), Christoph (CMS), David B (IN2P3-CC), Di (TRIUMF), Frédérique (IN2P3 + LAPP), Gareth (RAL), Kyle (OSG), Marcelo (LHCb), Oliver (data management), Renaud (IN2P3-CC), Ron (NLT1), Stephan (CMS), Thomas (DESY-HH), Vincenzo (EGI), Xin (BNL)
- Apologies: Nurcan (ATLAS), Ulf (NDGF-T1)
Operations News
- Next WLCG Ops Coord meeting will be on March 2nd
Middleware News
- Useful Links:
- Baselines/News:
- New APEL 1.4.1 and APEL-SSM 2.17 supporting SL7/C7 (GGUS:126009). To be included in UMD 4
- New ARC-CE release (http://www.nordugrid.org/arc/releases/15.03u11/release_notes_15.03u11.html), also available in EPEL. It fixes an issue with HTCondor 8.5.5, reported by some sites, that affected job status queries.
- dCache 2.10.x support ended in 2016. In the BDII we still see 18 instances running this version (including BNL). We will coordinate with EGI to ticket those sites, asking them to upgrade to 2.16 (a query sketch is given below).
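- For reference, a minimal sketch of such a BDII query, assuming the ldap3 Python package and the top-level BDII at lcg-bdii.cern.ch; attribute names follow the GLUE 1.3 schema:

    # Hypothetical sketch: list dCache storage elements and their versions from the BDII.
    # Assumes the ldap3 package and the top-level BDII endpoint lcg-bdii.cern.ch:2170.
    from ldap3 import Server, Connection, ALL

    server = Server('lcg-bdii.cern.ch', port=2170, get_info=ALL)
    conn = Connection(server, auto_bind=True)  # anonymous bind, as for normal BDII queries

    # GLUE 1.3 attributes describing the SE implementation and its version
    conn.search(search_base='o=grid',
                search_filter='(&(objectClass=GlueSE)(GlueSEImplementationName=dCache))',
                attributes=['GlueSEUniqueID', 'GlueSEImplementationVersion'])

    for entry in conn.entries:
        print(entry.GlueSEUniqueID, entry.GlueSEImplementationVersion)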
- Issues:
- T0 and T1 services
- CERN
- check T0 report
- FTS upgrade to v. 3.5.8 and C7.3 planned for next week
- FNAL
- IN2P3
- dCache upgrade 2.13.32 -> 2.13.49 in Dec 2016
- JINR
- dCache minor upgrade 2.13.49 -> 2.13.51, xrootd minor upgrade 4.4.0 -> 4.5.0-2.osg33
- NDGF-T1
- Upgrade to dCache 3.0.5 next week to fix a rare communication bug.
- NL-T1
- SURFsara upgraded dCache from 2.13.29 to 2.13.49 on Dec 1
- RAL
- CASTOR upgrade to v 2.1.15-20 ongoing; tape servers upgraded to v 2.1.16
- Almost completed migration of LHCb data to T10KD drives.
- An update of the SRMs to version 2.1.16-10 has been planned
- RRC-KI-T1
- Upgraded dCache for tape instance to v 2.16.18-1
Tier 0 News
- Apex 5.1, released in December, has been installed on the development instance; the upgrade of the production environments is scheduled for February.
- The main compute services for the Tier-0 are now provided by 31k cores in HTCondor, 70k cores in the LSF main instance, 12k cores in the dedicated ATLAS Tier-0 instance, and 21.5k cores in the CMST0 compute cloud. Some CC7-based capacity is available in HTCondor for user testing. We are in contact with major user groups to provide consultancy on migrating to HTCondor.
- 2016 LHC data taking ended with the p-Pb run; about 5 PB have been collected in December. Since then there has been a lot of consolidation activity.
- The p-Pb run for ALICE was performed using a 1.5 PB Ceph-based staging area, without incident or slow-down. This configuration, offering increased flexibility and easier operation, is now considered production-ready for CASTOR.
- Mostly transparently to users, EOS and CASTOR instances have been rebooted. On 23 January the failover of an EOS ATLAS headnode failed, causing service disruption; the root cause is being investigated. Some EOS CMS instability requiring a headnode to be rebooted has been due to heavy user activity.
- New storage hardware will be installed in February as soon as it becomes available to the service.
- The FTS service has been upgraded to v3.5.8 and to a refreshed VM image based on CentOS 7.3 that fixes vulnerabilities.
- In January, EOS at CERN crossed the mark of one billion files.
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- High to very high activity since the end of the proton-ion run
- Also during the end-of-year break
- In particular for Quark Matter 2017, Feb 5-11
- Thanks to the sites for their performance and support!
- On Wed Jan 4 the AliEn central services suffered a scheduled power cut
- Normally such cuts are short and do not present a problem
- This time the UPS was exhausted many hours before the power was restored
- All ALICE grid activity started draining away
- Fall-out from the incident took ~2 days to resolve
- A recent MC production is using very high amounts of memory
- Large numbers of jobs ended up killed by the batch system at various sites
- Experts are looking into reducing the memory footprint further
ATLAS
- Very smooth production operations during the winter break; MC simulation, derivations and user analysis have been running at full speed since then. Derivations use up to 100k slots to finish the full processing of data15+data16 and their MC samples. Very high user analysis activity for Moriond and spring conferences.
- ATLAS Sites Jamboree took place at CERN on January 18-20. Good feedback and discussions from sites. Interest from sites in virtualization, Docker containers and Singularity; a dedicated meeting is being planned before the Software & Computing week in March.
- Global shares are being implemented to better manage the required resources among different workflows (MC, derivations, reprocessing, HLT processing, Upgrade processing, analysis). More intensive productions will run in the future. Sites will be asked to remove the fairshares set at the site level. This is currently under review and some sites that ran only evgensimul (or part of it) in the past will be affected.
- A tape staging test was run at 3 Tier-1s two weeks ago to understand the performance, in case running derivations from tape input becomes a possibility. Results will be presented today: 20170126_Tape_Staging_Test.pdf
CMS
- good, steady progress on Monte Carlo production for Moriond 2017
- tape backlog at sites worked down
- JINR looking into tape setup to improve performance
- PhEDEx agents were updated in time at all but two sites, ready to switch to FTS3 and new API
- Christoph, Stephan: the old SOAP interface can be switched off now
- FTS team: will be done in the intervention next Tue
- dedicated high-performance VMs for GlideinWMS, under evaluation using the Global Pool scalability test
- slots overfilled at sites due to an HTCondor bug (triggered by nodes with high network utilization/errors); the bug will be addressed in v8.6, to be released early February
- Stephan: a fraction of the jobs used more cores than reserved; so far this could be mitigated
- CentOS 7 plans: no general upgrade planned; sites are free to upgrade and provide SL6 via container, etc.; CMS software is also released for CentOS 7; physics validation expected soon; discussions and tests to move the HLT farm to CentOS 7
- pilots sent to Tier-1 sites consolidated: one pilot with role "pilot" instead of two with roles "pilot" and "production"
LHCb
- Andrew: high activity, nothing special to report
Discussion
- David B: what is the general readiness for CentOS7 WNs?
- Maarten: ALICE and LHCb can use such resources today
- Christoph, Stephan: the CMS report has the details for CMS
- Alessandro: ATLAS production can run OK, analysis to be checked;
we will check if SL6 containers are OK
- Alessandra after the meeting: the situation for ATLAS was presented in the ATLAS Sites Jamboree held on Jan 18-20
- All SL6 SW workflows have been validated
- Site admins should read the presentation for details, considerations and advice
Ongoing Task Forces and Working Groups
Accounting TF
- New WLCG accounting portal (http://accounting-next.egi.eu/wlcg) has been validated and people are welcome to start using it
- Migration to the new accounting reports has started. Two sets of accounting reports (current and new) have been sent for November and December. If no major problems are reported, the new reports become official starting from January.
Information System Evolution
IPv6 Validation and Deployment TF
Machine/Job Features TF
- Further updates to DB12-at-boot in the mjf-scripts distribution. See the discussion at the HEPiX benchmarking WG meeting. These values are made available as $MACHINEFEATURES/db12 and $JOBFEATURES/db12_job (see the sketch below)
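- As an illustration, a minimal sketch of how a payload could read these values, assuming $MACHINEFEATURES and $JOBFEATURES point to local directories of key files (on some sites they may instead be URLs):

    # Hypothetical sketch: read the DB12 values published via Machine/Job Features.
    # Assumes MACHINEFEATURES and JOBFEATURES point to local directories of key files.
    import os

    def read_mjf_value(env_var, key):
        """Return the value of one MJF key file, or None if unavailable."""
        base = os.environ.get(env_var)
        if not base:
            return None
        try:
            with open(os.path.join(base, key)) as f:
                return f.read().strip()
        except IOError:
            return None

    machine_db12 = read_mjf_value('MACHINEFEATURES', 'db12')   # whole-machine DB12 score
    job_db12 = read_mjf_value('JOBFEATURES', 'db12_job')       # share seen by this job slot
    print('DB12 machine:', machine_db12, 'DB12 job:', job_db12)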
Monitoring
MW Readiness WG
Network and Transfer Metrics WG
Squid Monitoring and HTTP Proxy Discovery TFs
- CERN now has the first site-specific http://grid-wpad/wpad.dat service, and CMS is using it for their jobs
- hosted on the same 4 physical 10 Gbit/s servers that CMS uses for the squid service
- supports both IPv4 and IPv6 connections in order to determine whether squids in Wigner or Meyrin should be used first
- for CMS destinations (Frontier or cvmfs), it directs to the CMS squids, otherwise it defaults to the IT squids
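- For illustration, a minimal sketch of how a job could fetch the discovered proxy configuration; the URL is the one above, and parsing is deliberately omitted since a real client evaluates the PAC (JavaScript) function FindProxyForURL:

    # Hypothetical sketch: fetch the site-specific WPAD file for inspection.
    # A real client evaluates the PAC (JavaScript) function FindProxyForURL;
    # here we only download and print the file.
    try:
        from urllib.request import urlopen   # Python 3
    except ImportError:
        from urllib2 import urlopen          # Python 2, as on SL6 / CentOS 7 worker nodes

    WPAD_URL = 'http://grid-wpad/wpad.dat'   # site-local name, resolvable only inside the site

    try:
        print(urlopen(WPAD_URL, timeout=10).read())
    except Exception as exc:
        print('WPAD lookup failed, would fall back to direct connections: %s' % exc)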
Traceability and Isolation WG
- A tool, 'singularity', was identified as a possible solution for isolation (without traceability):
- The WG is now evaluating the tool: a small test cluster is being built
- A security review of this tool is needed (the transition plan might require SUID)
- Next meeting: Wednesday 1 Feb 2017 (https://indico.cern.ch/event/604836/)
Theme: Downtimes proposal followup
See the presentation.
- Andrew: shifting a downtime can upset the experiment planning; it should not be done lightly
- Stephan: only explicit policies can be programmed automatically
- Maarten: we can have SAM apply the numbers of the new policies,
whereas the clauses in purple (see presentation) would be "best practice";
we can always do manual corrections as needed
- Julia: manual operations should only be done in exceptional cases,
e.g. when a downtime had to be extended or postponed
- Maarten: today's feedback plus the outcome of a discussion with the SAM team
will be incorporated into v3 to be presented in the MB
Theme: Tape usage performance analysis
See the presentation.
- Alessandro: can the FTS handle batches of 90k requests per T1, times 5 sites?
- FTS team: v3.6 should be able to handle 500k per link
- CMS: we want to understand the tape usage performance through a similar exercise;
afterwards we can do a common challenge together with ATLAS
- Alessandro: the common challenge would be at a shared T1
- CMS: the monitoring is tricky; a file could first be read from tape
and then multiple times from disk; some files could already be on disk;
such cases would have to be disentangled
- Alessandro: that is why in the ATLAS exercise we took very old files
- Alessandro: the throughput was measured between submission of the request and
the file being present in the ATLAS_DATA_DISK space
- Alessandro: new files could be made bigger (not trivial), existing files cannot
- Renaud: w.r.t. the number of files per request, we will check what limitations exist on our side
- Alessandro: a typical campaign would cover O(100) TB and O(100 k) files;
we can customize that per site
- Renaud: what do you think of the observed performance?
- Alessandro: it did not change over the past years, because the experiments did not work on it!
it also depends on site-specific tape handling aspects
- Julia: shouldn't the recording also be optimized?
- Alessandro: indeed, e.g. through new tape families
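- For context, a minimal sketch of how such bulk staging requests are typically issued to FTS3, assuming the fts3 "easy" Python REST bindings; the endpoint, SURLs and timing values are illustrative placeholders:

    # Hypothetical sketch: submit a bulk staging (bring-online) job via the FTS3
    # "easy" REST bindings. Endpoint, SURLs and timing values are placeholders.
    import fts3.rest.client.easy as fts3

    context = fts3.Context('https://fts3.example.cern.ch:8446')   # hypothetical FTS server

    # One transfer per file to be recalled from tape to the destination disk space
    transfers = [fts3.new_transfer('srm://tape-se.example/atlas/file%04d' % i,
                                   'srm://disk-se.example/atlasdatadisk/file%04d' % i)
                 for i in range(1000)]

    # bring_online: how long FTS may wait for the tape recall;
    # copy_pin_lifetime: how long the staged copy should stay pinned on the disk buffer
    job = fts3.new_job(transfers, bring_online=86400, copy_pin_lifetime=3600)
    print('Submitted job', fts3.submit(context, job))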
Action list
| Creation date | Description | Responsible | Status | Comments |
| 01 Sep 2016 | Collect plans from sites to move to EL7 | WLCG Operations | Ongoing | The EL7 WN is ready (see MW report of 29.09.2016). ALICE and LHCb can use it. NDGF plan to use EL7 for new HW as of early 2017. Other ATLAS sites, e.g. TRIUMF, are working on a container solution that could mask the EL7 env. for the experiments which can't use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M. said the UI bundle is also making progress. Jan 26 update: this matter is tied to the EL7 validation statuses for ATLAS and CMS, which are reported above. |
| 03 Nov 2016 | Review VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | Pending | Jan 26 update: needs to be done in collaboration with EGI |
| 03 Nov 2016 | Discuss internally how to follow up the long-term strategy on experiment data management as raised by ATLAS | WLCG Operations | DONE | Jan 26 update: merged with the next action |
| 03 Nov 2016 | Check status, action items and reporting channels of the Data Management Working Group | WLCG Operations | Pending | |
| 26 Jan 2017 | Create long-downtimes proposal v3 and present it to the MB | WLCG Operations | Pending | |
Specific actions for experiments
Specific actions for sites
AOB