<!-- <font size="6"> %RED% *DRAFT* %BLACK% </font> -->
---+!! WLCG Operations Coordination Minutes, January 26th 2017

%TOC{depth="4"}%

---++ Highlights

   * The new [[http://accounting-next.egi.eu/wlcg][WLCG accounting portal]] has been validated.
   * Please check the new accounting reports: if no major problems are reported, they become official as of January.
   * Please check the baseline news and issues in the [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes170126#Middleware_News][MW news]].
   * [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes170126#Theme_Downtimes_proposal_followu][Long downtimes]] proposal and discussion.
   * [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes170126#Theme_Tape_usage_performance_ana][Tape staging test]] presentation and discussion.

---++ Agenda

   * https://indico.cern.ch/event/607744/

---++ Attendance

   * local: Alberto (monitoring), Alejandro (FTS), Alessandro (ATLAS), Andrea M (MW Officer + data management), Andrea S (IPv6), Andrew (LHCb + Manchester), Jérôme (T0), Julia (WLCG), Kate (WLCG + databases), Maarten (WLCG + ALICE), Maria (FTS), Marian (networks + SAM), Vincent (security)
   * remote: Alessandra (ATLAS + Manchester), Catherine (!IN2P3 + LPSC), Christoph (CMS), David B (!IN2P3-CC), Di (TRIUMF), Frédérique (!IN2P3 + LAPP), Gareth (!RAL), Kyle (OSG), Marcelo (LHCb), Oliver (data management), Renaud (!IN2P3-CC), Ron (NLT1), Stephan (CMS), Thomas (DESY-HH), Vincenzo (EGI), Xin (BNL)
   * Apologies: Nurcan (ATLAS), Ulf (NDGF-T1)

---++ Operations News

   * The next WLCG Ops Coord meeting will be on March 2nd.

---++ Middleware News

   * Useful links:
      * [[https://wlcg-mw-readiness.cern.ch/baseline/current/][Baseline Versions]]
      * [[WLCGBaselineVersions#Issues_Affecting_the_WLCG_Infras][MW Issues]]
      * [[WLCGT0T1GridServices#Storage_deployment][Storage Deployment]]
   * Baselines/news:
      * New APEL 1.4.1 and APEL-SSM 2.17 supporting SL7/C7 (GGUS:126009), to be included in UMD 4.
      * A new ARC-CE release (http://www.nordugrid.org/arc/releases/15.03u11/release_notes_15.03u11.html) is available, also in EPEL. It fixes an issue with HTCondor 8.5.5, reported by some sites, that affected the job status query.
      * dCache 2.10.x support ended in 2016. In the BDII we still see 18 instances running this version (including BNL). We will coordinate with EGI to ticket those sites, asking them to upgrade to 2.16.
   * Issues:
      * Follow-up on the issue reported by the RFC proxy TF (GGUS:124650). A new version of dCache 2.13 was released last week (2.13.51, https://www.dcache.org/downloads/1.9/release-notes-2.13.shtml). It fixes a problem in jglobus, discovered in the context of the RFC Proxy TF, which prevented RFC proxies of certificates belonging to certain CAs from working with dCache 2.13 (from v2.14 jglobus has been removed from dCache). The same issue is still present in BeStMan, though.
      * SLURM high-risk vulnerability (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-CVE-2016-10030). Sites are requested to update to version 15.08.13 or 16.05/8.
      * High-risk CVE-2016-7117 Linux kernel vulnerability (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-CVE-2016-7117). Patched kernels are available for RHEL 6/7 and SL6/7.
   * T0 and T1 services:
      * CERN
         * check the T0 report
         * FTS upgrade to v3.5.8 and C7.3 planned for next week
      * FNAL
         * FTS upgrade to v3.5.7
      * IN2P3
         * dCache upgrade 2.13.32 -> 2.13.49 in Dec 2016
      * JINR
         * dCache minor upgrade 2.13.49 -> 2.13.51, xrootd minor upgrade 4.4.0 -> 4.5.0-2.osg33
      * NDGF-T1
         * Upgrade to dCache 3.0.5 next week to fix a rare communication bug
      * NL-T1
         * SURFsara upgraded dCache from 2.13.29 to 2.13.49 on Dec 1
      * RAL
         * CASTOR upgrade to v2.1.15-20 ongoing; tape servers upgraded to v2.1.16
         * Migration of LHCb data to T10KD drives almost completed
         * An update of the SRMs to version 2.1.16-10 has been planned
      * RRC-KI-T1
         * Upgraded dCache for the tape instance to v2.16.18-1

---++ Tier 0 News

   * Apex 5.1, released in December, has been installed on the development instance; the upgrade of the production environments is scheduled for February.
   * The main compute services for the Tier-0 are now provided by 31k cores in HTCondor, 70k cores in the !LSF main instance, 12k cores in the dedicated ATLAS Tier-0 instance, and 21.5k cores in the CMST0 compute cloud. Some CC7-based capacity is available in HTCondor for user testing. We are in contact with major user groups to provide consultancy for migrating to HTCondor.
   * 2016 LHC data taking ended with the p-Pb run; about 5 PB were collected in December. Since then there has been a lot of consolidation activity.
   * The p-Pb run for ALICE was performed using a 1.5 PB Ceph-based staging area, without incident or slow-down. This configuration, offering increased flexibility and easier operation, is now considered production-ready for CASTOR.
   * Mostly transparently to users, EOS and CASTOR instances have been rebooted. On 23 January the failover of an EOS ATLAS head node failed, causing service disruption; the root cause is being investigated. Some EOS CMS instability requiring a head node to be rebooted has been due to heavy user activity.
   * New storage hardware will be installed in February as soon as it becomes available to the service.
   * The FTS service has been upgraded to v3.5.8 and to a refreshed VM image based on !CentOS 7.3, fixing vulnerabilities.
   * In January the mark of one billion files in EOS at CERN was crossed.

---++ Tier 1 Feedback

---++ Tier 2 Feedback

---++ Experiments Reports

---+++ ALICE

   * High to very high activity since the end of the proton-ion run
      * Also during the end-of-year break
      * In particular for [[http://qm2017.phy.uic.edu/][Quark Matter 2017]], Feb 5-11
      * Thanks to the sites for their performance and support!
   * On Wed Jan 4 the !AliEn central services suffered a scheduled power cut
      * Normally such cuts are short and do not present a problem
      * This time the UPS was exhausted many hours before the power was restored
      * All ALICE grid activity started draining away
      * Fall-out from the incident took ~2 days to resolve
   * A recent MC production is using very high amounts of memory
      * Large numbers of jobs ended up killed by the batch system at various sites
      * Experts are looking into reducing the memory footprint further

---+++ ATLAS

   * Very smooth production operations during the winter break; MC simulation, derivations and user analysis have been running at full speed since then. Derivations use up to 100k slots to finish the full processing of data15+data16 and their MC samples. Very high user analysis activity for Moriond and spring conferences.
   * The ATLAS Sites Jamboree took place at CERN on January 18-20, with good feedback and discussions from sites. Sites showed interest in virtualization, Docker containers and Singularity; a dedicated meeting is being planned before the Software & Computing week in March.
   * Global shares are being implemented to better manage the required resources among different workflows (MC, derivations, reprocessing, HLT processing, Upgrade processing, analysis). More intensive productions will run in the future. Sites will be asked to get rid of the fair shares set at the site level. This is currently under revision, and some sites that ran only evgensimul (or part of it) in the past will be affected.
   * A tape staging test was run at 3 Tier-1s two weeks ago to understand the performance, in case running the derivations from tape input might become a possibility. Results will be presented today: [[%ATTACHURL%/20170126_Tape_Staging_Test.pdf][20170126_Tape_Staging_Test.pdf]]

---+++ CMS

   * good, steady progress on Monte Carlo production for Moriond 2017
   * tape backlog at sites worked down
   * JINR looking into its tape setup to improve performance
   * PhEDEx agents were updated in time at all but two sites, ready to switch to FTS3 and the new API
      * _Christoph, Stephan: the old SOAP interface can be switched off now_
      * _FTS team: will be done in the intervention next Tue_
   * dedicated high-performance VMs for GlideinWMS, under evaluation using a Global Pool scalability test
   * slots overfilled at sites due to an HTCondor bug (triggered by nodes with high network utilization/errors); the bug will be addressed in v8.6, to be released early February
      * _Stephan: a fraction of the jobs used more cores than reserved; so far this could be mitigated_
   * CentOS 7 plans: no general upgrade planned; sites are free to upgrade and provide SL6 via containers etc.; CMS software is also released for CentOS 7; physics validation expected soon; discussions and tests to move the HLT farm to CentOS 7
   * pilots sent to Tier-1 sites consolidated: one pilot with role "pilot" instead of two with roles "pilot" and "production"

---+++ LHCb

   * _Andrew: high activity, nothing special to report_

---++ Discussion

   * David B: what is the general readiness for !CentOS7 WNs?
   * Maarten: ALICE and LHCb can use such resources today
   * Christoph, Stephan: the CMS report has the details for CMS
   * Alessandro: ATLAS production can run OK, analysis is to be checked; <br/> we will check if SL6 containers are OK
   * _Alessandra after the meeting: the situation for ATLAS was_ _[[https://indico.cern.ch/event/579473/contributions/2429454/attachments/1398035/2132472/20160524_ADC_weekly_centos7.pdf][presented]]_ _in the [[https://indico.cern.ch/event/579473/][ATLAS Sites Jamboree]] held on Jan 18-20_
      * _All SL6 SW workflows have been validated_
      * _Site admins should read the presentation for details, considerations and advice_

---++ Ongoing Task Forces and Working Groups

---+++ Accounting TF

   * The new WLCG accounting portal (http://accounting-next.egi.eu/wlcg) has been validated and people are welcome to start using it
   * The migration to the new accounting reports has started. Two sets of accounting reports (the current and the new ones) have been sent for November and December. If no major problems are reported, the new reports become official starting from January.

---+++ Information System Evolution

<br />%INCLUDE{ "EGEE.WLCGISEvolution" section="20170126" }%

   * Next meeting Feb 2.

---+++ IPv6 Validation and Deployment TF

<br />%INCLUDE{ "WlcgIpv6" section="20170126" }%

   * NTR

---+++ Machine/Job Features TF

   * Further updates to DB12-at-boot in the mjf-scripts distribution. See the discussion at the HEPiX benchmarking WG meeting.
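A minimal sketch of how a payload could consume such values, assuming the usual Machine/Job Features convention of one small file per key in directories pointed to by the $MACHINEFEATURES and $JOBFEATURES environment variables; the helper name below is illustrative and not part of the mjf-scripts distribution:

```python
import os


def read_mjf_value(env_var, key):
    """Return one Machine/Job Features value, or None if unavailable.

    MJF publishes each value as a small file named after the key,
    in a directory pointed to by $MACHINEFEATURES or $JOBFEATURES.
    """
    base = os.environ.get(env_var)
    if base is None:
        return None  # site does not publish MJF
    try:
        with open(os.path.join(base, key)) as f:
            return f.read().strip()
    except OSError:
        return None  # this key is not published on this node

# The DB12 benchmark values discussed above:
machine_db12 = read_mjf_value("MACHINEFEATURES", "db12")
job_db12 = read_mjf_value("JOBFEATURES", "db12_job")
```

A payload could use the per-job value, when present, to scale its expected event rate to the measured performance of the slot it landed on.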
   * These values are made available as $MACHINEFEATURES/db12 and $JOBFEATURES/db12_job

---+++ Monitoring

   * NTR

---+++ MW Readiness WG

<br />%INCLUDE{ "MiddlewareReadiness" section="20170126" }%

---+++ Network and Transfer Metrics WG

<br />%INCLUDE{ "NetworkTransferMetrics" section="26012017" }%

---+++ Squid Monitoring and HTTP Proxy Discovery TFs

   * CERN now has the first site-specific http://grid-wpad/wpad.dat service, and CMS is using it for their jobs
      * hosted on the same 4 physical 10 Gbit/s servers that CMS uses for the squid service
      * supports both IPv4 and IPv6 connections, in order to determine whether squids in Wigner or Meyrin should be used first
      * for CMS destinations (Frontier or cvmfs), it directs to the CMS squids; otherwise it defaults to the IT squids

---+++ Traceability and Isolation WG

   * A tool was identified as a possible solution for isolation (without traceability): Singularity
   * The WG is now evaluating the tool: a small test cluster is being built
   * A security review of this tool is needed (the transition plan might require SUID)
   * Next meeting: Wednesday 1 Feb 2017 (https://indico.cern.ch/event/604836/)

---++ Theme: Downtimes proposal followup

See the [[https://indico.cern.ch/event/607744/contributions/2449767/subcontributions/218703/attachments/1402467/2141097/LongDowntimes-170126.pdf][presentation]]

   * Andrew: shifting a downtime can upset the experiment planning; it should not be done lightly
   * Stephan: only explicit policies can be programmed automatically
   * Maarten: we can have SAM apply the numbers of the new policies, <br/> whereas the clauses in purple (see presentation) would be "best practice"; <br/> we can always do manual corrections as needed
   * Julia: manual operations should only be done in exceptional cases, <br/> e.g. when a downtime had to be extended or postponed
   * Maarten: today's feedback plus the outcome of a discussion with the SAM team <br/> will be incorporated into v3, to be presented in the MB

---++ Theme: Tape usage performance analysis

See the [[https://twiki.cern.ch/twiki/pub/LCG/WLCGOpsMinutes170126/20170126_Tape_Staging_Test.pdf][presentation]]

   * Alessandro: can the FTS handle batches of 90k requests per T1, times 5 sites?
   * FTS team: v3.6 should be able to handle 500k per link
   * CMS: we want to understand the tape usage performance through a similar exercise; <br/> afterwards we can do a common challenge together with ATLAS
   * Alessandro: the common challenge would be at a shared T1
   * CMS: the monitoring is tricky; a file could first be read from tape <br/> and then multiple times from disk; some files could already be on disk; <br/> such cases would have to be disentangled
   * Alessandro: that is why in the ATLAS exercise we took very old files
   * Alessandro: the throughput was measured between the submission of the request and <br/> the file being present in the ATLAS_DATA_DISK space
   * Alessandro: new files could be made bigger (not trivial), existing files cannot
   * Renaud: w.r.t. the number of files per request, we will check what limitations exist on our side
   * Alessandro: a typical campaign would cover O(100) TB and O(100k) files; <br/> we can customize that per site
   * Renaud: what do you think of the observed performance?
   * Alessandro: it did not change over the past years, because the experiments did not work on it! <br/> It also depends on site-specific tape handling aspects
   * Julia: shouldn't the recording also be optimized?
   * Alessandro: indeed, e.g. through new tape families

---++ Action list

| *Creation date* | *Description* | *Responsible* | *Status* | *Comments* |
| 01 Sep 2016 | Collect plans from sites to move to EL7 | WLCG Operations | Ongoing | The EL7 WN is ready (see the MW report of 29.09.2016). ALICE and LHCb can use it. NDGF plan to use EL7 for new HW as of early 2017. Other ATLAS sites, e.g. TRIUMF, are working on a container solution that could mask the EL7 environment for the experiments which can't use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M said the UI bundle is also making progress. <br/> Jan 26 update: this matter is tied to the EL7 validation statuses for ATLAS and CMS, which are reported above. |
| 03 Nov 2016 | Review the VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | Pending | Jan 26 update: needs to be done in collaboration with EGI |
| 03 Nov 2016 | Discuss internally how to follow up the long-term strategy on experiment data management, as raised by ATLAS | WLCG Operations | DONE | Jan 26 update: merged with the next action |
| 03 Nov 2016 | Check the status, action items and reporting channels of the Data Management Working Group | WLCG Operations | Pending | |
| 26 Jan 2017 | Create long-downtimes proposal v3 and present it to the MB | WLCG Operations | Pending | |

---+++ Specific actions for experiments

| *Creation date* | *Description* | *Affected VO* | *Affected TF/WG* | *Comments* | *Deadline* | *Completion* |
| 29 Apr 2016 | Unify the HTCondor CE type name in the experiments' VO feeds | all | !InfoSys | Proposal to use HTCONDOR-CE. | | Ongoing |
| 01 Dec 2016 | Open tickets to sites for moving to the FTS3 client | CMS | - | There are !PhEDEx prerequisites | Jan 2017 | DONE |

---+++ Specific actions for sites

| *Creation date* | *Description* | *Affected VO* | *Affected TF/WG* | *Comments* | *Deadline* | *Completion* |
| 01 Dec 2016 | Proposal for advance warning of long site downtimes | All | - | Please give feedback to the Dec GDB [[http://indico.cern.ch/event/394789/contributions/2392230/attachments/1388375/2113909/LongShutdowns.pdf][proposal]] | 20th January 2017 | DONE |

---++ AOB
---++ Topic attachments

| *Attachment* | *History* | *Size* | *Date* | *Who* | *Comment* |
| 20170126_Tape_Staging_Test.pdf | r1 | 640.8 K | 2017-01-26 - 15:35 | AleDiGGi | |
Topic revision: r19 - 2017-02-01 - MaartenLitmaath