WLCG Operations Coordination Minutes - July 24th, 2014
Agenda
Attendance
- local: Maria Alandes (chair), Nicolo' Magini (secretary), Andrea Sciaba', Maria Dimou, Andrea Manzi (MW Officer), Markus Schulz, Alberto Aimar, Oliver Keeble, Felix Lee (ASGC), Maite Barroso (T0), Andrej Filipcic (ATLAS), Maarten Litmaath (ALICE), Stefan Roiser (LHCb), Alessandro Di Girolamo (ATLAS)
- remote: Antonio Perez (PIC), Burt Holzman (FNAL), David Crooks, Jeremy Coles (GridPP), Michael Ernst (BNL), Thomas Hartmann (KIT), Ulf Tigerstedt, Sam Skipsey, Alessandra Forti, Federico Melaccio, Renaud Vernet, Isidro Gonzalez, Christoph Wissing (CMS), Emmanouil Vamvakopoulos, Alessandra Doria, Michel Jouvin, Julia Andreeva
Operations News
- The Gstat tool is no longer necessary for WLCG operations. While it is still available and in use by several sites, it is no longer officially supported, so it should not be considered a WLCG tool. Note that the GGUS SU remains open, since EGI still has some SAM dependencies on Gstat that will be replaced with glue-validator in the upcoming months.
- Notes of the last WLCG Workshop are now available in the following link.
- Next WLCG Operation Coordination meetings: 21st August and 4th September
Middleware News
- Baselines:
- FTS3 v3.2.26 : includes a workaround for a crash due to missing multithreading support in canl/gridsite
- DPM 1.8.8 : all components released in UMD too.
- perfSONAR 3.3.2-17 : 2 vulnerabilities discovered; thanks to the Transfer/Metric WG we have wiki pages describing the issues and how to upgrade
- StoRM 1.11.4 : several bug fixes. Available in UMD from today
- MW Issues:
- “Missing Key Usage Extension in delegated proxy” affecting CREAM: because of this, ATLAS jobs cannot use Rucio functionality, as the WN proxy is defective. Fix foreseen for 1 September.
- “dCache Upgrade side effects” affecting dCache: the file protocol was left enabled and published on the BDII. New version 2.6.31 expected in EMI, but a workaround is available.
- “Problem with opening files from EOS using Brazilian proxies” affecting EOS: this affects some users with a certificate issued by a Brazilian CA. The fix is to be deployed in production this week.
- T0 and T1 services
- NDGF is planning to upgrade to dCache 2.10.1 when it is released.
- Recent changes:
- BNL /CERN/FNAL/RAL
- FTS3 deployment/upgrade to 3.2.26
- KISTI
- IN2P3
- JINR-T1
- dCache upgrade 2.2.27, 2.6.31
- CVMFS upgrade to 2.1.19
- ~ 75 tickets opened to sites for the upgrade. 40 already done, some of the others planned before the end of July or the beginning of August.
- Andrea Manzi comments that tickets were opened also to sites which have installed CVMFS over NFS, and sites are replying about the upgrade situation.
Oracle Deployment
Tier 0 News
- FTS3 Software Upgrade on Tuesday July 22nd, from 3.2.22 to 3.2.26
- new FTS3 pilot dedicated MySQL instance (it used to be shared with the production instance) being deployed today
- The CvmFS stratum 0 services will be upgraded from 2.0.15 to 2.1.19 in the next days. During the intervention writes will be impossible. Clients mounting CvmFS will be unaffected unless they run version 2.0.X or older, in which case they will stop working.
- The CvmFS upgrade schedule is detailed in the CERN IT-SSB
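The client condition quoted above (clients at 2.0.X or older stop working after the stratum 0 upgrade) could be checked with a short script. This is only a minimal sketch, not an official tool: the function names are hypothetical, and the version string is assumed to be supplied by the site admin (for example, copied from the installed package version) rather than queried from CvmFS itself.

```python
import re


def parse_version(version):
    """Turn a dotted version string such as '2.1.19' into a tuple of ints."""
    match = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    # A missing patch level (e.g. '2.1') is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def client_survives_upgrade(version):
    """Return True if a client of this version should keep working.

    Assumption taken from the minutes: clients at 2.0.X or older stop working.
    """
    major, minor, _patch = parse_version(version)
    return (major, minor) > (2, 0)


if __name__ == "__main__":
    for v in ("2.0.15", "2.1.19"):
        print(v, "ok" if client_survives_upgrade(v) else "needs upgrade")
```

Only the major and minor numbers are compared, since the minutes give the cut-off as the whole 2.0.X series.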
Tier 1 Feedback
Tier 2 Feedback
- From GridPP:
- CVMFS monitoring was mentioned at the WLCG workshop. Several UK sites have commented that additional low level hardware monitoring of this type is not needed for every site and it should be made opt-in. More focus could go into functional tests (such as provided through Nagios using the existing CVMFS probe) and building on the outputs of the Squid Monitoring Task Force.
- Gstat 2.0 reports many sites as status 'CRITICAL'. The problem for most (in the UK at least) appears to be that a value was to be declared as "Production" in GLUE 1 and "production" in GLUE 2; at some point they changed both to lower case and sites did not pick up on this change. In principle we can have most sites fix this quickly, but as gstat is largely unsupported (Stephen Burke indicates that certainly the info system tests are unmaintained though the GLUE 1 tests should still be valid) is it worth doing, and if not perhaps the Gstat pages need to come with a warning to avoid causing unnecessary work?
- Progress is being made to pick up banning lists from the central ARGUS, however the implementations for banning users within SEs are not entirely clear or are not presently deployable. For example GGUS 104885 and GGUS 96374 highlight problems for StoRM. Support exists in DPM for syncing banning lists from a central ARGUS, but the required package is not installed from the EMI3 repos by default when installing a DPM system and is also hindered by a packaging issue.
- About the CVMFS monitoring, Maarten comments that there was a meeting between Dave Dykstra and Costin Grigoras to discuss how to proceed from the existing Squid monitoring and Costin's proposal.
- The issues with ARGUS banning are common to other sites, not just the UK. StoRM developers are to be contacted about ARGUS procedures. Oliver Keeble confirms that the DPM issue needs to be fixed.
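The Gstat 'CRITICAL' reports described in the GridPP feedback come down to a published value that differs from the expected one only by letter case. A minimal sketch of how a site could tell that situation apart from a genuinely wrong value is given below. It assumes, as stated in the minutes, that the expected value is now the lower-case "production"; the function name and the way the published value is obtained are hypothetical.

```python
EXPECTED_STATUS = "production"  # lower case, as reported in the minutes


def classify_status(published):
    """Compare a published status value with the expected one.

    Returns 'ok' for an exact match, 'case-mismatch' when only the
    letter case differs, and 'other' for any other value.
    """
    value = published.strip()
    if value == EXPECTED_STATUS:
        return "ok"
    if value.lower() == EXPECTED_STATUS:
        return "case-mismatch"
    return "other"


if __name__ == "__main__":
    for value in ("production", "Production", "Testing"):
        print(value, "->", classify_status(value))
```

A 'case-mismatch' result would correspond to the quick fix mentioned above, while 'other' would point to a different configuration problem.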
Experiments Reports
ALICE
- activities oscillating between high and very low
- ALICE site admins have been reminded of the necessary configuration changes to support the new VOMS servers for LHC experiments and ops
ATLAS
- Various ATLAS internal issues related to the new ATLAS frameworks commissioning/integration
- Commissioning Rucio with FullChain Test ongoing, in parallel using Rucio as file catalog accessed through DQ2.
- Commissioning JEDI and ProdSys2.
- steadily using MCORE resources for approx. 2 weeks: 35/40k core slots, i.e. 4/5k parallel running jobs. All ATLAS sites have been asked (for many weeks) to proceed with setting up dynamically allocated MCORE resources.
- US: discovered that the amount of DarkData is quite high (1PB aggregated); we presented a plan on how it will be solved in the coming days
- heavily affected by the DNS - LB issue CERN had yesterday (reported on the IT SSB): many jobs finishing or starting during that period failed
- we saw the entry in the SSB: is it possible to subscribe by email to updates of one specific IT SSB entry? Like "notify on update this egroup"....
- a new way of measuring the availability of the sites, through their availability for analysis, is under discussion within ATLAS. More news in the coming weeks
- One issue standing for more than one week: SARA-MATRIX GGUS:106748 MCTAPE full
- Alessandro and Alessandra explain that around 70 sites out of 120 are not multicore-enabled.
- Jeremy Coles in chat: In the UK our sites are aware of the MCORE request. We are making sure the setup is working at 5 initial sites and well documented. Several sites are reluctant to make changes before admins go on vacation.
- On the DNS issue, Maite explains that the master was removing machines from the alias without reason; the workaround was to switch to the slave. She suggests submitting Alessandro's SSB feature request to SNOW and reminds that subscriptions can be made by FE.
- Alessandro reminds that the SARA-MATRIX issue was reported repeatedly at the WLCG daily meetings.
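The MCORE figures in the ATLAS report (35/40k core slots, i.e. 4/5k parallel running jobs) can be reproduced with a one-line calculation. The sketch below assumes 8 cores per multicore job; that number is not stated in the minutes and is only inferred from the ratio of the quoted figures.

```python
CORES_PER_JOB = 8  # assumption: not stated in the minutes, inferred from the ratio


def parallel_jobs(core_slots, cores_per_job=CORES_PER_JOB):
    """Number of multicore jobs that fit into the given number of core slots."""
    return core_slots // cores_per_job


if __name__ == "__main__":
    for slots in (35000, 40000):
        print(slots, "core slots ->", parallel_jobs(slots), "jobs")
```

Under this assumption 35000 and 40000 core slots correspond to 4375 and 5000 jobs, in line with the rounded 4/5k quoted above.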
CMS
- Processing overview: Grid resources basically busy
- Finishing samples for CSA14
- Upgrade MC
- CSA14 (Computing, Software and Analysis challenge 2014)
- CRAB3
- New user tool for analysis job submission
- Want to reach 40k concurrent jobs
- Complement user jobs with HammerCloud submission of CRAB3 jobs
- AAA
- Allow remote access of files
- Increase scale over time
- Exercise Dynamic Data Placement and Cache Release
- miniAOD
- Improving EOS access at CERN
- Intense reading from HLT, producing high-PU samples
- Increased replication factor for various directories
- Understanding file opening problems
- Russian Tier-1 T1_RU_JINR
- First usage for processing workflows
- No issues so far - will extend the scale
- Continue MC production
- Move to GGUS for CMS Computing Operations
- Want to disable submission of new Savannah tickets in a few weeks
- Still some small improvements to be implemented in GGUS
- CMS SAM tests
- Want to make xrootd-fallback test critical
- Needed to postpone, still issues at RAL
- Removal of individual release tags from CEs
- Not used by recent CMS submission tools
- Almost completed
- Maria Alandes opened a few tickets in agreement with CMS
- Reminders for sites
- DPM 1.8.8 is now recommended also by CMS
- Move to FTS3 for all transfers
- Upgrade CVMFS to 2.1.19
- Update xrootd fallback configuration
- Add "Phedex Node Name" to site configuration
- About DPM 1.8.8, Oliver reminds that the fix is in the repositories now. CMS was not explicit in lifting the warning when the fix was released, leading to confusion at the sites for one month.
LHCb
- Data Processing Activities
- Reprocessing of "first 2010 14 nb" finished, productions are closed
- Mainly MC and user activities currently ongoing
- Operations
- Access to lcg-voms2 and voms2 from outside CERN is timing out. The team operating the outer perimeter CERN firewall was asked to open the necessary ports such that the connection to the new VOMS servers will immediately be dropped by the host (GGUS:107014)
- On Sunday 20 July a virtual machine at CERN Openstack hosting a vobox service for DIRAC was no longer accessible. The reason was a problem with the underlying hypervisor, and the VM (hosted on the HV) could not be started. The AI team has proposed to overcome such problems in the future by hosting VMs on Ceph (GGUS:107065)
- recurring problem: after an upgrade, dCache sites return the file:// protocol, which it seems needs to be explicitly disabled (GGUS:106368); this has already happened at other sites too. Should be fixed as of version 2.6.31.
- Middleware
- Maarten comments that the voms2 server firewall settings are fixed now.
- Maarten reminds that the VOMS client v3 is in production (e.g. on lxplus) and working, though slower compared to v2. The developers are not willing to maintain v2 further. Memory usage of v3 now OK at 16 MB, fixes issues like timeout.
- Stefan reminds that the LCG AA area deployment is a preproduction version, so it is not in production for LHCb; Andrea Manzi notes that it needs to be fixed.
- The other experiments remind their status:
- ATLAS moved to arcproxy and is happy
- ALICE is happy with VOMS client v3
- CMS uses whatever is on the WN, usually v3.
- Maria Alandes proposes to define actions offline.
- About the issues with VMs, Alessandro comments that currently it is not possible to instantiate servers with more than 2 cores due to general shortage of resources. Maite proposes to follow up offline with cloud managers.
Ongoing Task Forces and Working Groups
Tracking Tools Evolution TF
Following a discussion inside the GGUS development team we decided to propose the closing of this Task Force at today's WLCG Operations Coordination meeting. This proposal got accepted. The reasons for this decision are here:
https://twiki.cern.ch/twiki/bin/view/LCG/TrackingToolsEvolution#Wrap_up
- The GGUS team will continue to report at the WLCG 'daily' meetings.
- Maria Dimou proposes to invite Guenther Grein to GDB to present the CMS GGUS interface in case others are also interested.
- Maria Dimou comments that the last incomplete task in the TF task table (Fix GGUS - SNOW interface for Requests) will remain as a standing item for the developers; it is not worth keeping a TF open for it.
FTS3 Deployment TF
- Migration to FTS3 nearly completed for CMS
- Started to verify optimizer behaviour under transfer error conditions. Simulated 'unrecoverable' errors, now want to test with 'recoverable' errors.
- Can now monitor FTS3 transfers of user job output files in the Dashboard, as requested by CMS Computing Integration for CRAB3
- Maria Alandes asks to document the tests in the TF task table
gLExec Deployment TF
- 85 tickets closed and verified, 10 still open (no change)
- Deployment tracking page
- other activities on hold until proper resolution of Argus stability issues
- unreleased rpms with important fixes in use at various sites
- Maarten explains that the ARGUS rpms were not officially released (lacking release notes or validation) because of the drop of the SWITCH support level. They were distributed privately to the sites in GGUS tickets.
- Discussion if the rpms should be published in a repository. Maarten answers that publishing the rpms in this way will just hide the real issue which is the lack of support for ARGUS, Markus agrees. Do not push the sites to upgrade to this version.
- Markus comments that the issue cannot be solved by Operations, the MB needs to find organizations willing to commit to support. Open GGUS to MB?
Middleware Readiness WG
- The 5th MW Readiness WG meeting, held on July 2nd, led to adoption of the MW Package Reporter prototype, developed by Lionel Cons, by the pilot Volunteer Sites (Edinburgh and GRIF). Please check the Minutes and Actions linked from the Agenda.
- The MW Officer, Andrea Manzi, is actively working with these sites for the deployment of recent DPM components' versions. DPM is the pilot WLCG MW package used for the Readiness Verification effort. Progress is traced in a dedicated JIRA tracker.
- The developer of the MW Package Reporter, Lionel Cons, using feedback from the Edinburgh & GRIF experience, produced fixes and keeps fine-tuning the tool before further distribution to other Volunteer Sites.
- The CMS expert, Andrea Sciaba', following discussions internal to the experiment, concluded that the testing done by OSG, and in particular by the glideinWMS developers, is enough. This covers the discussion and action given to ATLAS and CMS at the July 2nd meeting. The original proposal was to use test instances of pilot factories, also validating against CREAM and ARC CEs.
- A presentation on the WG was given at the WLCG Workshop in Barcelona. Check the slides here.
- Book your diaries!! Next meeting on October 1st at CERN with audioconf and vidyo. Provisional Agenda here.
- Andrea Manzi reports that DPM 1.8.8 + memcached plugin was chosen as the first product to follow the verification workflow, all tests look fine.
Multicore Deployment
- ATLAS: restarted multicore production and up to now has had a pretty stable flow of jobs for the past 2 or 3 weeks.
- CMS and ATLAS have been running concurrently at 5 T1 sites. Here is the monitoring for ATLAS PIC, FZK, IN2P3, RAL and CNAF and CMS FZK, RAL, IN2P3, PIC, for this period.
- PIC farm, for example, has been consistently running CMS, ATLAS T1 and ATLAS T2 multicore jobs for three weeks. The principle of dynamic allocation of WNs, provided here by the mcfloat script, is showing good results, as jobs are running together reusing empty multicore slots, while the total number of WNs allocated to multicore jobs has been slowly but steadily increasing along the weeks with good overall farm utilization.
- UK: multicore was discussed in depth in the UK this week. SGE, HTCondor and Torque are the most used batch systems and there is a solution to try for each. One site for each batch system is currently running ATLAS jobs. The other sites asked to have links to the different solutions and more details on the configuration from the TF page, which is on the TODO list; in particular the solution for Torque (mcfloat script as used now by NIKHEF and PIC) needs to be explained in more detail. Italian sites asked for the same at the WLCG workshop.
- Alessandra answers to Maria that they are not yet opening GGUS tickets to missing ATLAS sites.
- Alessandra comments that the next priority is to improve the documentation. Pilot sites need to write the docs for their batch system, especially for Torque (most common) by NIKHEF and by Alessandra from the Tier-2 point of view. RAL provided docs for HTCondor and scattered docs are available for SGE. To be turned into a TF task.
SHA-2 Migration TF
- introduction of the new VOMS servers
- an EGI broadcast with a new timeline has been sent on July 1
- compliance of the WLCG infrastructure will be tested with the SAM preprod instances
- first for ALICE, since the early hours of July 23
- initial results for ALICE show:
- OK tests for a significant number of sites
- many failures due to simple configuration issues of CREAM and/or Argus
- no new showstopper so far
- while the new servers are not in production, all other connection attempts need to be refused
- from many sites such attempts were found to time out instead
- this affected CMS (GGUS:105820) and LHCb (GGUS:107014)
- on Wed the network team made some changes, which appear to have solved the matter
- RFC proxies
- no progress on the open issues due to other priorities
- Maria Alandes will provide a script to Maarten to open GGUS tickets in bulk to the sites to track the status of the introduction of the new VOMS servers.
- Maite comments that, given the vacation period, it seems late to open tickets now with a September 15th deadline; Maarten notes that the original deadline in the broadcast was July 15th.
- Markus suggests to provide an estimate of the work needed in the ticket: according to Maarten it is less than one day at small sites, but it depends on the site config.
- Alessandro agrees with the plan to switch to the new VOMS servers in the ATLAS SAM preprod instance.
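The distinction noted above between connection attempts that are refused and those that time out can be observed from a site with a plain TCP probe. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and the host name and port in the usage note are placeholders, not the real VOMS endpoints.

```python
import socket


def probe(host, port, timeout=5.0):
    """Classify a TCP connection attempt.

    Returns 'open' if the connection is accepted, 'refused' if the host
    actively rejects it, 'timeout' if no answer arrives in time, and
    'error' for anything else (e.g. a name resolution failure).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"
```

For example, `probe("voms.example.org", 8443)` (placeholder host and port) returning 'refused' would match the intended behaviour while the new servers are not in production, whereas 'timeout' would indicate the problem reported by CMS and LHCb.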
WMS Decommissioning TF
- Upgrading to HTCondor 8.2.1 resolved GAHP segmentation faults
- Work ongoing to improve SAM Condor probe jobs lifecycle - proper cleanup of jobs from the queue
- In a daily A/R site-based comparison, few differences remain for ATLAS (all understood)
- For CMS, in daily comparison most of the differences for T3s, only few for T2s (all understood)
- Worker node probes reporting different results (glexec, analysis)
- SAM publisher unable to start on the worker node
- Maite reminds that the WMS decommissioning should happen before the quattor decommissioning at the end of October; Maarten says that it looks feasible by September or early October.
IPv6 Validation and Deployment TF
- CMS: the current testbed will be switched from bare GridFTP to FTS3, using the KIT or the Imperial College FTS3 servers, which are IPv6-ready.
- ATLAS: the PanDA dev instance is already dual-stack and job submission tests will start after completing the migration to Big Panda and JEDI.
- LHCb: in the process of finding pure IPv6 nodes from which to run tests.
- ALICE: working on migrating AliEN to Perl 5.18. The currently used version (5.10) does not support IPv6.
- xrootd: tests of xrootd 4 at QMUL confirm that it works with IPv6 out of the box. The tests consisted of transferring files via xrdcp from lxplus-ipv6 (link).
HTTP Proxy Discovery TF
Network and Transfer Metrics WG
- Kick-off meeting doodle sent, proposed dates are in the last week of August and second week in September
Action list
- ONGOING on the WLCG monitoring team: status of the CondorG probes for SAM to be able to decommission SAM WMS
- ONGOING on the middleware officer: report about progress in CVMFS 2.1.19 client deployment
- NEW on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE. Report about showstoppers.
- NEW on the Operations Coordinators: Follow up with LHCb performance issues with voms-clients v3
AOB
- The next meeting is on August 21st.
--
MariaALANDESPRADILLO - 20 Jun 2014