Minutes of Pre-GDB on WLCG Computing Operations - 15 January 2013

  • Local: Maria Girone, Andrea Sciabà, Philippe Charpentier, Ian Fisk, Simone Campana, Maarten Litmaath, Predrag Buncic, Julia Andreeva, Gonzalo Merino, Markus Schulz, Ian Collier, Fernando Barreiro, Alessandro Di Girolamo, Nicolò Magini, Daniele Spiga, Maite Barroso, Oliver Keeble, Fabrizio Furano, Domenico Giordano, Mattia Cinquilli, Maria Alandes, Ikuo Ueda, Michel Jouvin, Luca Dell'Agnello, Stefan Roiser, Jan Van Eldik, Ulrich Schwickerath, Marco Mascheroni, Manuel Guijarro, Wei-Jen, Mattias Wadenstein, Patrick Fuhrmann, José Antonio Coarasa, Luis Granado Cardoso ...
  • Remote: John Gordon, Cristina Aiftimiei, Peter Solagna, Peter Clarke, Pepe Flix, Alvaro Fernández, Mihnea Dulea, Claudio Grandi, Daniele Bonacorsi, Stephen Burke, Denise Heagerty, Massimo Sgaravatto, Christoph Wissing, Alessandro Cavalli, Tanya Levshina, Salman Toor, Alberto Aimar, Alison Parker

Network workshop summary (D. Bonacorsi)

Daniele presents his summary of the LHCONE point-to-point service workshop held in December. The workshop had a good mixture of network and experiment talks, with ample time for discussion. Among the various topics, the workshop discussed the similarities between the ATLAS and CMS DM/WM systems and ideas about how network status information and network provisioning could be plugged into the experiments' DM/WM systems. As Costin pointed out in a talk on the ALICE DM/WM, network information is becoming important as we approach the limits of network capacity. A talk by Artur described the ANSE project (Advanced Network Services for LHC Experiments), whose purpose is precisely to study how network resource allocation could be used to improve overall throughput by integrating network-aware tools into production workflows. Daniele found these ideas very promising and the timing good (using LS1 to implement them), though whether the development effort is worth the gain has to be carefully evaluated, taking into account the limited manpower available in ATLAS and CMS.

SHA-2 (M. Litmaath)

Maarten presented the current status of SHA-2 adoption. One important piece of news is that dCache confirmed that the patched JGlobus-2 works for them, which allows using SHA-2 certificates without having to use RFC proxies. SHA-2 certificates will be issued for production no earlier than August 1 (a more precise schedule will be defined only later, in the first half of 2013), when the EMI-1 middleware will have been phased out (on April 30, plus a 3-month grace period). However, for some services (CREAM, WMS, StoRM, dCache?) only EMI-3 will be fully SHA-2-enabled. OSG will phase out old versions in spring. Some experiment software and some central EGI/OSG/WLCG services may still need to become SHA-2-ready. EGI will verify middleware compliance using a SHA-2 test and check the infrastructure using SAM. This month the new CERN CA with SHA-2 will be ready (but it will not yet be in IGTF). In summary, the first milestone must be to have all relevant software working with SHA-2. Once this is the case, RFC support will come automatically in most cases and we can switch to RFC proxies later at our convenience, hopefully next year.

Maarten clarifies that everything will be backwards-compatible, so, for example, host certificates with SHA-1 will still work. Concerning the gLite WMS for SAM, keeping it running a bit longer just for that purpose would not be an issue. Patrick raises the issue of three CAs that put email addresses in the DNs, causing problems for dCache. Maarten answers that David Groep was informed three months ago, but it looks like at least one of these CAs will be around until the end of 2013. It might be necessary to modify JGlobus to support them. After the meeting Brian Bockelman found that the code should already be OK in that respect as well.

gLExec (M. Litmaath)

All EGI ROCs now submit the gLExec tests using the ops VO. Sites should declare in GOCDB which CEs are gLExec-enabled, and EGI will automatically submit tickets to sites that fail the tests. The number of sites claiming support for gLExec was 82 on January 14, lower than on November 13, when it was 86; such drops are typically due to reconfigurations. Concerning the Tier-1 sites, only a few fail tests for some of the VOs they support: BNL, SARA, IN2P3, KIT, ASGC (in most cases minor configuration problems that are very easy to fix). From the experiment point of view, CMS is ready to enable gLExec, LHCb is almost ready, ATLAS possibly in about two months, while ALICE will probably need many more months.

About the timeline, Ian F. points out that everything must work by the end of LS1, but in order to properly test everything (including the scalability of ARGUS and GUMS when they are hit by thousands of WNs) the deployment must be finished by 2013. Maria G. asks ATLAS and LHCb to define clear milestones; more generally, she asks for a timeline for experiments and sites to be ready within one month from now. Ian F. proposes to re-use a spreadsheet for milestones used by the Management Board. About LHCb, Stefan says that LHCb did not make much progress lately, but the required changes are easy.

CVMFS (S. Roiser)

The NFS export of CVMFS was tested and found to be very stable (just showing some network load if many jobs start in a short time) and will be released to production next week. There are just two pending issues: sites which are both a Tier-1 and a Tier-2, and a couple of ATLAS sites using LUSTRE and running in "disk-less" mode. ATLAS and LHCb have a target date of April 30 to stop installing software with the old method; CMS for now is strongly recommending sites to deploy CVMFS. ALICE will ramp up their deployment in the coming months.

The deployment status is very good, with many more sites with CVMFS than one month ago, and most of the other sites providing at least a tentative date. Only 11 sites did not even provide information. Some tests are being developed to be integrated with the experiment SAM tests.

Ian F. says that in CMS the idea is to test the health of CVMFS and remove the software tags. Gonzalo asks if ATLAS plans to get rid of the NFS shared area (a small one would still be needed), and Alessandro answers that it is not planned for now (but alternatives are considered).

Information System (A. Sciabà)

Andrea summarises the motivations for the renewed interest and activity on the Information System: adapting to the current use cases, the wish to consolidate parallel developments in the experiments, and following up on the TEG recommendations. Some of the past issues (instability, validity, accuracy of the information) are (or soon will be) mitigated by features such as caching in top BDIIs and a more thorough validation of the information. Current use cases have a very strong focus on resource discovery, while dynamic information will become less and less important, in particular with the decommissioning of the gLite WMS, and installed-software publication will become unnecessary with the adoption of CVMFS. Crucial challenges in the near future include publishing new types of services and non-Grid or non-WLCG resources (e.g. clouds). A possible approach could be to have a "WLCG aggregator" for the relevant information using existing sources (BDII, GOCDB, OIM, REBUS) and future sources (e.g. clouds). This would make it much easier for the experiments to access this information, as they would only need to add their own specific information.
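To make the aggregator idea concrete, here is a minimal, purely hypothetical sketch: the source names, record fields and endpoint below are invented for the example and do not reflect any actual BDII, GOCDB or REBUS schema.

```python
def aggregate(sources):
    """Merge service records from several information sources into one
    view, keyed by (site, service); each source adds its own fields to
    the same entry and is recorded in the provenance list."""
    merged = {}
    for source_name, records in sources.items():
        for rec in records:
            key = (rec["site"], rec["service"])
            entry = merged.setdefault(
                key, {"site": rec["site"], "service": rec["service"],
                      "provenance": []})
            entry.update({k: v for k, v in rec.items()
                          if k not in ("site", "service")})
            entry["provenance"].append(source_name)
    return merged

# Mocked inputs standing in for queries to BDII, GOCDB and REBUS.
sources = {
    "bdii":  [{"site": "CERN-PROD", "service": "SRM",
               "endpoint": "srm://srm.example.ch"}],
    "gocdb": [{"site": "CERN-PROD", "service": "SRM",
               "in_downtime": False}],
    "rebus": [{"site": "CERN-PROD", "service": "SRM",
               "pledge_tb": 10000}],
}
view = aggregate(sources)
```

An experiment would then query the single merged view instead of each source separately, adding only its own specific information on top.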

Markus asks if there is a deployment plan for this proposal. Andrea answers that it is far too early for that, as the idea was put forward only a few days ago. Markus then asks if this would also require the development of a new client; Andrea answers that this is a natural assumption if a new service is developed. Markus points out that this would mean work for the data management tools. Ian F. comments that this is a non-issue in WLCG, as all the experiments that use GFAL or lcg-utils completely bypass the BDII via the -b option.

Maria G. proposes to have, within one month from now, a more detailed timeline and a deployment scenario.

Data management changes (I. Fisk)

Ian F. illustrates the changes proposed by the experiments, which aim 1) to improve access to data or 2) to reduce the amount of resources needed for the same level of access. Disk-tape separation and storage federations are cases in point.

Disk-tape separation was requested by CMS of all Tier-1 sites: all data staging, migration to tape and clearing of disk caches would be explicitly triggered by the experiment, which would take care of managing the disk space. The implementation is at each site's discretion. LHCb is proposing something similar, while ATLAS and ALICE already use such approaches.

Patrick asks whether different endpoints for disk and tape are a requirement, or whether two paths in the same endpoint would be OK; Ian F. answers that it does not matter as long as there are two "PhEDEx endpoints". Gonzalo asks if ATLAS and LHCb plan to follow the same route. Simone explains that in ATLAS this separation has been in place for a long time (the concept of T1D1 was soon abandoned), but they still use automatic prestaging to a disk buffer of around 100 TB (for a 10% Tier-1) in front of the tape system for reprocessing, and there is no plan to change this. Philippe adds that LHCb is doing the same as ATLAS and that the only advantage he sees in the CMS paradigm is avoiding the occasional job failures due to files missing from disk. Ian F. maintains that the CMS experience shows that managing the disk cache would make things easier, and that this is no different from the way EOS is used.

Storage federations have been used by ALICE from the beginning and are being deployed by ATLAS and CMS, while LHCb is open to the possibility. An immediate advantage is fallback to other storage in case of failures, but in the longer term federations might allow controlled access to non-resident data on a larger scale, or self-healing storage. Currently, federations are xrootd-based, but HTTP federations might become an option. Another type of federation is the Tier-1 storage "cloud" that CMS proposes to implement by putting the Tier-1 worker nodes on the OPN, thus allowing transparent access to data at other Tier-1 sites; this strategy is the same as the one discussed at the Amsterdam workshop in 2011.

Ian F. clarifies that so far only KIT expressed concerns about putting the WNs on the OPN, due to a possible shortage of IP addresses; some sites have already implemented it. Gonzalo says that PIC would need to deploy a NAT in the OPN: doable but not trivial. Maria G. asks who should follow up on this: WLCG operations, the network WG, the experiments? Ian F. thinks it would be best if WLCG operations did it. Philippe sees no harm in that.

This would be beneficial to all VOs: it would reduce the need to fall back to tape and the need for data replication between Tier-1 sites. Something similar on LHCONE is also being considered. Eventually this might relieve some Tier-1 sites of the need for a tape archive facility and therefore reduce operational costs (but this is on a time scale longer than LS1). A realistic scenario could be to have CERN and four Tier-1 sites providing e.g. two copies of the data archived on tape.

It is possible that by 2015 users will take storage federations for granted and will expect to access data not too differently than how they access e.g. videos!

LHCb Data Management after LS1 (P. Charpentier)

Philippe summarises the current DM model in LHCb and a possible evolution. The replica catalogue will still be needed for job brokering and DM accounting, but it would not need to be as highly accurate as now and could allow files to be unavailable without the job failing. About the access and transfer protocols, there are no strong preferences between xrootd and HTTP/DAV, but he wonders why not prefer standard protocols, also considering the success of CVMFS. SRM would still be used for file staging and for knowing the amount of free and used space; space tokens could be dropped and replaced by different endpoints. The next steps would be to add GFAL2 and FTS3 support to the DIRAC DM, migrate to the DIRAC File Catalogue and investigate HTTP/DAV. It is still unclear how dynamic data caching could help in practice (that is, without ending up with all data replicated everywhere); data popularity information has been collected for several months, but some questions need to be answered, for example what metrics to use and whether to perform automatic or manual replication/deletion.
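As an illustration of the kind of policy question being raised, a toy replication/deletion heuristic might look like the following; the metric, thresholds and limits are invented for the example and are not LHCb's actual choices.

```python
def placement_decision(accesses_last_month, n_replicas,
                       hot_threshold=100, cold_threshold=5,
                       max_replicas=4):
    """Decide what to do with a dataset given a simple popularity metric.

    Hot datasets gain replicas (up to a cap); cold datasets lose extra
    replicas but always keep at least one copy; everything else stays.
    """
    if accesses_last_month >= hot_threshold and n_replicas < max_replicas:
        return "replicate"
    if accesses_last_month <= cold_threshold and n_replicas > 1:
        return "delete-replica"
    return "keep"
```

The open questions in the talk map directly onto the knobs here: which access metric to feed in, how to pick the thresholds, and whether such decisions should be applied automatically or proposed to an operator.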

Maria G. comments that, given the interest in data popularity, it would be useful to have common discussions. Domenico and Simone are happy to share the CMS and ATLAS experiences with the metrics, the way the information is used and the algorithms for data deletion and replication. Concerning dynamic data placement, Simone says that it helped a lot in ATLAS but still requires some fine tuning.

Maria G. asks what the deployment model for FTS3 is. Alessandro and Oliver answer that the stress testing still needs to be done before the deployment scenarios (of which there are many) can be discussed. The next step is to test the exchange of messages between FTS servers. Philippe asks for an estimate of when a model will be defined, and Maria G. supports having it by a defined date. Markus objects that having a single central FTS server might introduce a single point of failure even for transfers between external sites, but Ian F. and Philippe do not see it as a problem, considering that if CERN is offline, FTS is the least of our worries.

HTTP solutions (F. Furano)

Fabrizio describes the status of HTTP-based solutions in the middleware. Data access, transfer and federations can be done using HTTP and the challenge is to build quality components to fulfill the HEP requirements.

Work on DPM aims at integrating many backends (legacy pools, S3, HDFS, etc.) and supporting many data access frontends, in particular HTTP/DAV. Since version 1.8.4, DPM is based on DMLite, where the historical DPM API is now implemented on top of DAV; GridFTP-equivalent access is available, as is random I/O. File transfer performance over WAN via HTTP is becoming competitive. HTTP is also good for implementing federations. GFAL2 supports HTTP (and consequently so does FTS3). LFC and the Dynamic Federations complete the picture for providing a full implementation of federations.
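As a side note on the random-I/O point: the standard HTTP mechanism behind byte-level random access is the Range request header and the 206 Partial Content response. The following self-contained sketch demonstrates the exchange against a throwaway local server; none of this is DPM or DAV code, just the underlying protocol feature.

```python
import http.server
import threading
import urllib.request

DATA = b"0123456789" * 100  # a 1000-byte stand-in for a stored file

class RangeHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Parse a simple single-range header like "bytes=10-19".
        start, end = self.headers["Range"].split("=")[1].split("-")
        chunk = DATA[int(start):int(end) + 1]
        self.send_response(206)  # Partial Content
        self.send_header("Content-Range",
                         f"bytes {start}-{end}/{len(DATA)}")
        self.send_header("Content-Length", str(len(chunk)))
        self.end_headers()
        self.wfile.write(chunk)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), RangeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Read only bytes 10-19 of the "file", as a random-access client would.
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/file",
    headers={"Range": "bytes=10-19"})
with urllib.request.urlopen(req) as resp:
    status, chunk = resp.status, resp.read()
server.shutdown()
```

A full HEP client would add redirection handling and security on top of this, which is exactly the gap discussed below.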

A crucial point is the availability of a full client implementing redirection, security and "posix-like" functionality mapped onto DAV; none of the existing standard clients provides all that HEP needs (but they could be a starting point, e.g. libNEON).

Dirk explains that development of HTTP support in EOS had started but was halted waiting for a complete client to come out. Having a semantically complete HTTP implementation will require a lot of work (one could think of implementing it on top of xrootd). Predrag stresses that the tricky part is the client (and properly implementing things like caching): the server is less critical. He adds that one could extend the CVMFS model to data, but only when giving up on access control (that is, for read-only, public data).

CPU accounting of public cloud resources (F. Barreiro)

Fernando presents a proposal for job accounting on public (e.g. commercial) clouds. Currently, accounting can be done accurately on Grid resources, local batch resources and cloud resources where we can access the hypervisors (e.g. CERNVM), but not on public clouds, where no access beyond the VM is possible. The proposal is to use the accounting information collected by HammerCloud running standard jobs as a way to convert events/sec into HEPSPEC06. The conversion factor can be estimated from the database of submitted jobs on sites where the HEPSPEC06 values are known. As a by-product, this could be used to identify sites that publish an inaccurate HS06 value.
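The arithmetic of the proposed conversion can be sketched as follows; all numbers are invented for illustration and are not actual HammerCloud measurements.

```python
# Calibrate an HS06-per-(event/s) factor on sites whose HS06 ratings are
# known, then apply the average factor to the event rate measured on a
# cloud resource where no hypervisor-level accounting is possible.

reference_sites = [
    # (events/s per core in the standard benchmark job,
    #  published HS06 per core) -- values made up for the example
    (2.0, 10.0),
    (2.4, 11.5),
    (1.8,  9.2),
]

factors = [hs06 / rate for rate, hs06 in reference_sites]
factor = sum(factors) / len(factors)  # average HS06 per (event/s)

cloud_rate = 2.2  # events/s per core measured on the public cloud
estimated_hs06 = factor * cloud_rate  # inferred HS06 rating per core
```

Comparing `estimated_hs06` against a site's published rating is also how the by-product mentioned above (spotting inaccurate HS06 values) would work.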

Philippe comments that LHCb used MC production jobs as a private benchmark but found it to be accurate only at the 10-20% level, due to differences among individual WNs. Alessandro says that this is true, but having an overall average for the site is good enough. John comments that it is a known problem that sites with accounting problems are not sent tickets, and moreover that site performance measured using MC jobs was consistent over time at some sites but varied significantly at others. He believes that, to increase the accuracy of benchmark values, either sites should periodically verify them, or benchmarks should be run from the outside. Simone points out that the latter option is precisely what is being proposed.

Cloud Accounting (J. Gordon)

Since 2011 the EGI Federated Cloud Task Force has been looking into how to use multiple clouds via common interfaces and is working with the APEL team on accounting. There is a testbed comprising several WLCG sites and running a number of different cloud infrastructures, publishing accounting records to RAL using SSM. What is accounted for now is VM instantiation, but it would be easy to account for the usage inside the VMs in the same way as inside WNs. Alternatively, PanDA could generate usage records per workload (not just per pilot); in that case, to avoid double counting, the traditional processing of batch system accounting records would need to be disabled for ATLAS. Finally, John recommends not developing something orthogonal to EGI (taking into account that WLCG sites are involved).
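For illustration only, a per-VM usage record published over a message bus might have roughly this shape; the field names below are invented for the example and are not the actual APEL cloud usage record schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical per-VM record: what a site might publish after a VM ends.
record = {
    "Site": "EXAMPLE-SITE",
    "VMUUID": "c0ffee00-0000-4000-8000-000000000001",
    "LocalUserId": "atlas001",
    "StartTime": datetime(2013, 1, 10, 12, 0,
                          tzinfo=timezone.utc).isoformat(),
    "WallDurationSeconds": 86400,
    "CpuDurationSeconds": 80000,
    "CpuCount": 4,
}

# Serialise for publication to the central repository.
message = json.dumps(record)
```

Accounting for usage inside the VM, or per workload as suggested for PanDA, would amount to publishing analogous records at a finer granularity, which is where the double-counting concern comes from.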

Maria G. asks how coordination with OSG is planned. Tanya says that OSG plans to use APEL for cloud accounting, and that in any case they do not use any public cloud. Philippe asks why WLCG should get accounting information from commercial clouds: it is relevant for the experiment but not for WLCG. John says that it is because the WLCG management wants to have an overview of all the experiment activities. Ulrich warns against mixing two different kinds of accounting in APEL (namely, normal usage records and HammerCloud usage records); John says that the accounting portal could easily allow them to be distinguished. Maria G. asks if there are any milestones in EGI to guide us: John says that a document describing a test setup should be ready by the end of April. Maria G. concludes that the two approaches (official accounting and HC accounting) do not clash with each other and are complementary. Philippe insists that, even if for the experiments it is important to also account for external resources, and this is reported, the WLCG accounting system should not try to include them. Alessandro gives as an example a site that decides to provide some of its pledged resources via a commercial cloud; such a site would be interested in knowing if it is getting good value for the money. Ian F. adds that so far we have considered it important to have an official accounting not generated by the experiments, to avoid possible abuse. Philippe asks why we should trust the official accounting more, given that we know it is sometimes wrong. Ian F. answers that it was compared with the Dashboard information, and as long as the discrepancy is within 10%, it is fine. Maria G. concludes that we need to agree on a common path.

Testing the CERN Agile infrastructure (M. Cinquilli)

ATLAS and CMS have been testing the Agile infrastructure at CERN, based on OpenStack, similarly to what was done for StratusLab and LxCloud. This is particularly interesting both for CERN IT (as it provides a test system) and for the experiments (which can start using some extra resources). Mattia shows some recent results obtained by running different types of ATLAS and CMS jobs on a testbed of 200 VMs (4 cores each) per experiment; job efficiency was very high (1% failure rate), with a measured overhead with respect to LXBATCH between 7% and 21%. Work is ongoing at BNL to implement automatic provisioning of VMs in PanDA, while this feature is already available in glideinWMS through Condor.
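For reference, an overhead figure such as "between 7% and 21%" is simply the relative increase in job wall-clock time on the VMs compared to the LXBATCH baseline; the timings below are invented to reproduce the two endpoints of the quoted range.

```python
def overhead(t_vm, t_batch):
    """Relative slowdown of a job on a VM versus the bare-metal baseline."""
    return (t_vm - t_batch) / t_batch

# Invented wall-clock times (seconds) giving the 7% and 21% endpoints.
low = overhead(t_vm=3210, t_batch=3000)
high = overhead(t_vm=3630, t_batch=3000)
```

Note that this measures everything that differs between the two environments, not just virtualisation cost, which is the caveat Ian F. raises below.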

Ian F. comments that an overhead of ~20% seems a bit too high for jobs that are CPU-bound: values between 3% and 10% should be expected, as the I/O is local and very limited. It is possible that the comparison with LXBATCH is not the best way to measure such an overhead, for example if the WNs are different.

Sim@P1: how to profit from the ATLAS HLT farm during LS1 & after (A. Di Girolamo)

Alessandro describes the plan for using the HLT infrastructure for grid production via an OpenStack overlay. The next two months will be devoted to setting up a testbed and integrating it into PanDA. The setup is expected to be very similar to the one used for the OpenStack tests on Agile.

Use of HLT farm and Clouds in ALICE (P. Buncic)

ALICE intends to use the capacity of the HLT farm (currently equivalent to the whole US-ALICE share) as recommended by the C-RRB; by 2018 the capacity should grow to something equivalent to all of today's WLCG resources. The idea is to use CERNVM to deploy VM instances, potentially on several cloud infrastructures (the HLT, the CAF, Amazon etc.). Various adaptors have already been implemented (the ALICE HLT will use Libvirt). Testing of the deployment was successful and a CVMFS repository was set up. AliEn still needs to be adapted to use CVMFS. ALICE is looking forward to working together with the other experiments and the CERNVM team on a common solution.

Maria G. asks whether the people working on the HLT farms have a common discussion forum. José (for CMS) answers that they are in contact with ATLAS and have shared meetings, but so far their work has run in parallel to the IT OpenStack activities. Maria G. suggests avoiding such parallelism and trying to work together; the computing coordinators should say whether the WLCG operations coordination WG ought to be involved. Ian F. agrees and proposes to consider the possibility of even sharing the HLT resources. Predrag objects that HLT farms are really about partitioning, not sharing, and that in the past we had problems trying to build scalable solutions for everybody. Ian F. gives two motivations for sharing: 1) HLT resources are unique in the sense that they may have to be shut down for a long time, and 2) they are so large that we might find it difficult to saturate them by ourselves. Predrag sees uniform access to storage as the biggest obstacle to this proposal: if this is achieved, then it would be possible, otherwise it would be too difficult. Ian F. proposes to leave it as an agenda item anyway.

Processing Offline Data in the LHCb HLT farm (L. Granado Cardoso)

LHCb already uses the HLT farm for offline processing, so there is no need for a special setup. The WNs are on a private network and can access data from DIRAC using NAT masquerading. The control system is developed in PVSS and has the same look and feel as the DAQ control system. The system has already been tested successfully and is in use right now. At the moment there are no plans to use VMs, as it would be an unnecessary complication, but there is the intention of running some tests on virtualizing the HLT using OpenStack in 2013.

Status of the CMS online Cloud (J.A. Coarasa)

CMS has deployed a cloud overlay on its online farm using OpenStack as the cloud manager layer. A factory using Condor instantiates VMs of the appropriate type, and CVMFS is used to get the software. Data is staged in and out via xrootd. Over Christmas the infrastructure was tested by running Folding@Home jobs on more than 1300 VMs. In the next weeks a large-scale deployment of CERNVM will be tested, and the objective is to use the cluster as another CMS facility during LS1. CMS is willing to share its experience on how to deploy such a cloud layer with external sites, which might want to smoothly transition to a cloud infrastructure.

-- AndreaSciaba - 24-Jan-2013
