WLCG Operations Coordination Minutes, June 18th 2015

Highlights

  • The discussion on the Information System future is now open. The inventory of Use Cases, dependencies and plans will continue in the coming months. To continue the discussion, a dedicated pre-GDB, a new TF or a discussion at the new strategy-making GDB (presumably starting in September) could be used.
  • Sites are reminded to enable multicore accounting.
  • The LFC can now be removed from the Baseline versions' table.


Agenda

Attendance

  • local: Alberto Aimar, Maria Alandes, Julia Andreeva, Marian Babik, Jérôme Belleman, David Cameron, Maria Dimou (minutes), Xavier Espinal, Alessandro di Girolamo, Oliver Gutsche, Oliver Keeble, Maarten Litmaath, Andrea Manzi, Stefan Roiser, Andrea Sciabà, Andrea Valassi.
  • remote: Stephen Burke, Alessandro Cavalli, Jeremy Coles, Alessandra Doria, Michael Ernst, Alessandra Forti, Thomas Hartmann, Felix Lee, David Mason, Rob Quick, Eygene Ryabinkin, Massimo Sgaravatto, Gareth Smith, Vincenzo Spinoso, Ron Trompert, Christoph Wissing, Antonio Yzquierdo.
  • apologies: Catherine Biscarat.

Operations News

  • No news.

Middleware News

  • Baselines:
    • WMS and L&B have been removed, as they are not used by the workflows of the WLCG experiments.
    • FTS 3.2.33 + gfal2 2.9.2, deployed 2 weeks ago, introduced an issue with SRM bringonline towards StoRM instances. A fixed version of gfal2 (v2.9.3) has been released and is already installed in all FTS deployments.

  • MW Issues:
    • Regarding the issue reported 2 weeks ago about the change of the Globus GSSAPI name compatibility mode released in globus-gssapi-gsi-11.16.1: a new version of the library (11.18.1) has been released that reverts to the previous behaviour, but only as a temporary measure. EGI operations have been contacted to prepare a joint broadcast to the sites, so that they can check whether their services could be affected by this future change.

  • T0 and T1 services
    • CERN
      • The LHCb and shared LFC instances will be decommissioned on 22nd June.
    • KIT
      • Updated cmssrm-kit to the latest patch version of the 2.11 branch, because support for NFSv4.1 was requested.
    • IN2P3
      • dCache upgrade to 2.10.32 on core and pool servers (16/06/2015)
    • RRC-KI-T1
      • dCache upgrade to 2.10.30

Tier 0 News

  • Efficiency meeting: the Cloud Team reported on the various improvements addressing I/O wait problems, some of which have been applied while others are still under investigation. ALICE did report some improvements, but there is still room for more. LHCb commented that the efficiency at CERN is not as good as at other sites. ATLAS have been investigating several efficiency aspects on the HammerCloud front and are in touch with the CERN Batch team to cross-check numbers.
  • ATLAS jointly with the Batch and Cloud teams are investigating I/O wait problems. Some improvements have already been applied and there are a few more ideas.

Tier 1 Feedback

Tier 2 Feedback

UK sites feedback about the Information System:

  • InfoSys is considered useful for service discovery. It is used by managers and users of smaller VOs (especially lcg-infosites) and for some migration campaigns.
    • It will not go away if the LHC VOs decide to stop using it, but that would reduce the ticket load.
    • Some people have multiple roles as sys admins and managers of smaller VOs, so they are both consumers and providers.
  • InfoSys contains far too much information that is not used.
    • The storage InfoSys is not as bad as that of the CEs and is used by several tools.
    • It is not clear why two schemas are still supported: this makes support more difficult and there are periodic tickets for problems with one or the other.
    • With future technologies such as cloud, it is not clear whether some of the dynamic entries that are published still make sense.
  • Having all the information in one place may make it easier to find, but there are some objections to having everything in the BDII.
    • Mixed content: a port for service discovery, for example, is not the same kind of information as the average HS06 value of a cluster used for accounting.
    • Static vs. dynamic information is another example.
  • Filling the schema(s) without YAIM is a pain, although that may be a one-off if done with another tool. It is still labour intensive due to the size of the schema(s).
    • The code is poorly documented and it is difficult to understand which parts fill the LDIF.
    • Features like subclusters in Glue 1.3 don't work or there are no clients able to use them.
  • It is not clear whether it could be replaced by another technology. This is the old debate of InfoSys vs. GOCDB for service discovery.

First discussion on the future of the Information System (InfoSys)

  • Alessandra Forti opened the discussion by presenting the (above) assembly of UK sites' feedback concerning the InfoSys usage today and the changes in the support effort if it goes away for the WLCG VOs.

  • Rob Quick, in his presentation (slides on the agenda), reminded the meeting that OSG makes the InfoSys available as a service to the VOs, which decide what use they make of it. The OSG will change nothing until USATLAS removes all dependencies. The best-case estimate today is that the OSG will be able to set the roadmap for the InfoSys deprecation in early 2016.

  • Maria Alandes presented (slides on the agenda) the Use Cases collected so far for the InfoSys usage in WLCG. Some Use Cases may still be missing, and people were prompted to report them. The possible future deprecation of the InfoSys may raise additional questions, e.g. about the future of GLUE. Comments by the participants included:
    • HTCondor is better understood in OSG than GLUE. (Rob)
    • AGIS now relies on the InfoSys. A merge of GOCDB, OIM and the BDII is desirable. In principle ATLAS desires to go the USATLAS (OSG) way. (Ale di Gi)
    • LHCb finds the InfoSys useful today for discovering new CEs appearing on the Grid. (Stefan)
    • CMS in principle doesn't use any information from the BDII, although this has to be confirmed. (Oli) In particular, Maria A. asked whether the glideinWMS Factory configuration file relies on the BDII to get the list of CEs.
    • ALICE (and all) need the InfoSys for SAM and the CERN IT C5 reports (non WLCG-specific) need it as well. (Maarten)

The discussion on the Information System future is now open. The inventory of Use Cases, dependencies and plans will continue in the coming months. To continue the discussion, a dedicated pre-GDB, a new TF or a discussion at the new strategy-making GDB (presumably starting in September) could be used.

Experiments Reports

ALICE

  • mostly high activity
  • CERN: issues due to incompatibility between latest CASTOR and xrd3cp
  • important request to all sites (posted on the alice-lcg-task-force list):
    • please start planning an upgrade of your Xrootd to v4.1
    • please coordinate that with the ALICE Xrootd experts as usual
    • this will help improve the use of third-party transfers
    • it also is needed for making the infrastructure ready for IPv6

No deadline is set yet for this.

ATLAS

  • Data taking: a lot of data recorded thanks to the long LHC duty cycle (some days up to 80%). Trigger rate sometimes up to 2 kHz. No showstopper issues.
  • Tier-0: several issues were observed during the first 2 weeks of data taking. There has been intense collaboration between ATLAS experts and the CERN IT batch and OpenStack experts. The batch and OpenStack teams made several improvements to the infrastructure, and job throughput has increased considerably over the past days. ATLAS is now trying to move the most I/O-demanding jobs to MCORE, currently with 4 cores. The first results seem encouraging and a task will be processed in this way.
  • A network issue at CERN last week caused considerable trouble in terms of file and metadata corruption. The cleanup is still in progress.
  • There is a backlog of data to transfer from CERN to BNL. After discussion with the FTS developers this was found to be caused by the FTS optimizer not pushing hard enough. We have fixed the number of active transfers on the channel, but the throughput is still lower than expected.

CMS

  • overview
    • data taking till beginning of this week, now in technical stop and 50ns scrubbing, afterwards 50ns high intensity run
    • MC production and processing in full swing, sustained 1 Billion digitized/reconstructed MC events per month
    • Many thanks to all the sites: Tier-1, Tier-2 and CERN
  • Request to sites
    • Update to Tier-1 CPU Fair Share Policy: 90% production role (was 95%), 10% pilot role (was 5%)
  • Very few issues:
    • EOS at CERN:
      • some file corruption coincided with the network problems on June 11th
      • bug in xrootd plugins that caused intermittent read authentication failures (workaround exists)
    • File transfer issues from FNAL to RAL, suspected to coincide with many CMS jobs on the RAL farm causing heavy I/O wait (WAITIO) on the storage nodes
  • Questions:
    • Any news debugging the network link problems between P5 and Wigner?

Xavi commented that the problems with storage were due to packet loss at Wigner, still not entirely understood.

LHCb

  • Run2 offline processing
    • validation of offline workflows successfully finished,
    • some data are still stored in the pit and need to be processed; as soon as that is done, LHCb will move to offline processing
  • Issues
    • Some very old files were stored on RAL storage without checksums and therefore could not be moved by data management. RAL is fixing those now.
    • Some problems with FTS transfers: files cannot be transferred because the transfers take too long and the proxy expires. Under investigation.
  • perfSONAR work for extracting information from WLCG sources is done, will move now to insertion into LHCbDIRAC

Stefan will open a GGUS ticket if the FTS permission errors take time to solve, or will add the ticket number here if one already exists.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

RFC proxies

  • SAM
    • RFC proxies OK on sam-alice-preprod since yesterday
    • the proxy renewal sensor was locally patched for a quick PoC
    • a proper patch will be incorporated shortly
  • CMS
    • PhEDEx instances are being switched to RFC proxies

Machine/Job Features

  • NTR

Middleware Readiness WG


  • The 11th meeting of the WG took place yesterday.
  • A lot of good work is ongoing at most Volunteer sites.
  • Special credit is due to Edinburgh and GRIF for their detailed DPM testing on behalf of ATLAS and CMS respectively.
  • Similarly, great effort is invested by TRIUMF and NDGF in testing multiple dCache versions for ATLAS and by PIC in dCache testing for CMS.
  • Fine-tuning of the configuration at CNAF for StoRM testing for ATLAS.
  • New pakiti-client version 3.0.1 is imminent in EPEL Stable. The updated documentation is available to all Volunteer Sites, together with a new configuration file that is needed because of the deployment of new PKG DB servers. This new pakiti-client version makes it possible to specify a tag (--tag option). MW Readiness nodes should start publishing their packages with the tag MWR. Andrea M. will contact the sites for this upgrade.
  • The MW Readiness App https://wlcg-mw-readiness.cern.ch/ is now available on a production instance. Check here the Baseline MW versions' management view.
  • EL7 support and the move to Java 8 are now urgent for ARGUS. The CERN testbed will be available very soon for testing under heavy load and other scenarios.
  • The next MW Readiness WG Vidyo meeting will take place on Wednesday September 16th at 4pm CEST. Please comment a.s.a.p. if this date is not good!

Multicore Deployment

  • Multicore accounting: several sites have not yet enabled it on some or all of their CEs. These are mostly CREAM-CEs, but some ARC-CEs appear here and there too. The APEL team has opened tickets for the NGIs. Here is a reminder of what the WLCG sites should do, and the list of computing elements. In case of any problem, contact the APEL team.

  • Andreas Gellrich from DESY reported by email problems with the accounting of multi-core jobs in CREAM as well as in ARC. This has been reported to the developers, e.g. NorduGrid, but is not solved (yet). John Gordon of the APEL team regularly tests things. DESY-HH operations, deploying PBS/torque, are affected as well. Related tickets are GGUS:112147, GGUS:114382 and http://bugzilla.nordugrid.org/show_bug.cgi?id=3457

Andrea S. suggested that the list of CEs (above) be sorted by site.

IPv6 Validation and Deployment TF


Squid Monitoring and HTTP Proxy Discovery TFs

  • No progress this week

Network and Transfer Metrics WG


  • perfSONAR status
    • It was proposed to establish WLCG-wide meshes for the top 100 sites (based on their storage contribution and geographical location). This would enable full mesh testing of latencies, traceroutes and bandwidth.
    • A potential bug affecting latency measurements for long-distance testing (US to Europe, Europe to Asia, etc.) was identified and reported to ESnet.
  • Currently evaluating the possibility of publishing perfSONAR results directly from the OSG collector (which populates the OSG/esmond datastore). A set of patches to extend the OSG collector was submitted for consideration.
  • Next meeting will be on 8th of July (https://indico.cern.ch/event/393101/), planning a detailed update on OSG datastore and FTS performance study.

In reply to a question by Andrea S., Marian said that most of the 100 sites mentioned in the WG's report have a perfSONAR instance.
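To give a feel for the scale of the proposed full mesh, the following Python sketch counts the site pairs it would involve. The site names are placeholders and the function name is hypothetical. Testing every pair of sites is an assumption made for illustration only, since the WG report does not state how the tests would be scheduled.

```python
from itertools import combinations


def mesh_pairs(sites):
    """Return every unordered pair of sites in a full mesh."""
    return list(combinations(sites, 2))


# Placeholder names standing in for the top 100 sites (not real site names).
sites = [f"site-{i:03d}" for i in range(100)]
pairs = mesh_pairs(sites)

# Unordered pairs grow as n * (n - 1) / 2.
print(f"unordered site pairs: {len(pairs)}")
# If each pair is measured in both directions, the number of tests doubles.
print(f"directed tests:       {2 * len(pairs)}")
```

The quadratic growth is the main design consideration: each additional site adds one test pair per existing site.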

HTTP Deployment TF

The 2nd meeting of the TF took place on 3rd June. The meeting focused on the functionality that the experiments would like to see delivered via HTTP and with what priority. The result is intended to be used by the storage providers and by the subsequent TF activity of validating HTTP storage deployments.

A rough draft of the conclusions (still to be validated by the TF) is available at https://twiki.cern.ch/twiki/bin/view/LCG/HTTPTFStorageRecommendations.

The mandate was modified slightly and agreed - available at https://twiki.cern.ch/twiki/bin/view/LCG/HTTPDeployment

The next meeting will be 15th July and will concentrate on monitoring and validation.

Action list

| Description | Responsible | Status | Comments |
| Organise further discussions on the InfoSys future | Maria Alandes | - | Conclusions should be reached before the September GDB. Check with Ian Bird if the InfoSys is an item for the new GDB. |
| Remove the LFC from the Baseline table | Andrea Manzi | - | To be done before the next meeting |
| Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | Ongoing | Details on GGUS:114076 |

Specific actions for sites

| Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| Some sites have still not enabled multicore accounting | All |  | Instructions here | a.s.a.p. | still partial |
| ALL ATLAS sites implementing a cap to their multicore resources (whether their configuration is dynamic just for a portion of nodes or it is a static partition) should review the cap to give 80% of the ATLAS production resources to multicore. As a reminder, the shares for ATLAS jobs are as follows: T1: 5% analysis and 95% production; T2: 50% analysis and 50% production. So multicore should get 80% of 95% at T1s and 80% of 50% at T2s. More info here | ATLAS | Multicore |  | None |  |
| LHCb T1s requested to make sure that all the RAW data will be stored on the same tape set in each tape system when it is feasible | LHCb | - | More details in GGUS:114013, GGUS:114014, GGUS:114015, GGUS:114016, GGUS:114017, GGUS:114018, GGUS:114019 |  |  |
| CMS requests an adjustment of the Tier-1 fair share target for the following VOMS roles: /cms/Role=production 90% (was 95%), /cms/Role=pilot 10% (was 5%). Note that for CMS SAM tests the role cms/Role=lcgadmin is used. It needs very little fair share but should be scheduled as soon as possible so that the tests do not time out. Overall at least 50% of the pledged T1 CPU resources should be reachable via multi-core pilots (this is as before, just mentioned for completeness) | CMS | - |  | None yet |  |
| CMS asks all T1 and T2 sites to provide Space Monitoring information to complete the picture on space usage at the sites. Please have a look at the general description and also some instructions for site admins. | CMS | - |  | None yet |  |
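
The share arithmetic in the ATLAS and CMS requests above can be checked with a short Python sketch. Only the percentages come from these minutes. The function and variable names are illustrative assumptions and do not correspond to any batch system configuration syntax.

```python
from fractions import Fraction


def multicore_share(production_share, multicore_fraction=Fraction(80, 100)):
    """Fraction of a site's ATLAS resources that multicore jobs should get."""
    return production_share * multicore_fraction


# ATLAS production shares per tier, as quoted in the table above.
atlas_production = {"T1": Fraction(95, 100), "T2": Fraction(50, 100)}
atlas_multicore = {
    tier: multicore_share(share) for tier, share in atlas_production.items()
}

# CMS Tier-1 fair-share targets (percent) per VOMS role, old and requested.
cms_old = {"/cms/Role=production": 95, "/cms/Role=pilot": 5}
cms_new = {"/cms/Role=production": 90, "/cms/Role=pilot": 10}

for tier, share in atlas_multicore.items():
    print(f"ATLAS {tier}: multicore target = {float(share):.0%} of ATLAS resources")

for role, target in cms_new.items():
    print(f"CMS {role}: {target}% (was {cms_old[role]}%)")
```

Exact fractions are used instead of floats so that the products (76% at T1s, 40% at T2s) are not affected by rounding.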

AOB

-- MariaDimou - 2015-06-15

Topic revision: r48 - 2018-02-28 - MaartenLitmaath
 