January 2017 GDB notes


Agenda

https://indico.cern.ch/event/578982/

Introduction (Ian Collier)

Slides: https://indico.cern.ch/event/578982/contributions/2418696/attachments/1393904/2124380/GDB-Introduction-2017-01-11.pdf

Reminder: there will be an F2F pre-GDB on the HEPIX Benchmarking Group in Feb. 2017. Please consider registering for and attending ISGC2017, where the March GDB will be co-located. Many important upcoming meetings are listed in the slides; the dCache meeting and the OSG All Hands meeting will be added to the list as well.

Questions

Remark: The HSF Workshop in SD is not only about software; it is about computing for the next 10 years. There is an effort to create a white paper before this meeting.

No further questions.

WLCG Workshop Update (I. Collier / A. Forti)

Slides: https://indico.cern.ch/event/578982/contributions/2423302/attachments/1393913/2124295/20170111_GDB_manchester.pdf

Information on the workshop logistics is available in the slides.

The budget will be in the 100-130 pounds range, no more than that (this will include the reception, meals and the gala dinner). People from SKA will most likely participate. Some visits could take place during the meeting.

Discussion

Dates: it was decided to hold the workshop on Monday 19th June (afternoon), Tuesday 20th and Wednesday 21st (full days), which leaves time to interact more with other communities. We need to think about the agenda to see whether it makes sense to finish by Wednesday afternoon or to extend a bit, which might be possible (at least using Thursday, since Friday is a holiday in many places). A call for co-locating other activities at the end of the week is another possibility to be discussed.

--> The extension will soon be decided and the agenda will be prepared accordingly, so people can plan the trip in advance.

HNSciCloud Status Update (H. Meinhard)

Slides: https://indico.cern.ch/event/578982/contributions/2418697/attachments/1393822/2124112/2017-01-11-GDB-HNSciCloud.pdf

Many computing challenges are foreseen in science: growing resource needs and fluctuating demand -> a potential solution is to use public clouds. There are some issues with using public clouds which need to be resolved. The current project aims to use this capacity in the same way as on-premises resources, and to work out how to organize the procurements in line with our usual procedures (tenders, etc.). The project also aims to overcome any legal impediments that might arise. CERN has made several procurements (of increasing size) since 2014, culminating in the joint project for Cloud Services procurement. See the slides for information on these projects.

HNSciCloud project: 10 procurers (>1.6 M) + 2 expert organizations on board. It is a lot more than procuring: it means integrating the services and using them transparently. Many science domains are included in the prototypes. The project is funded by H2020 (pre-commercial procurement), with a total volume >5 M. The companies are deeply involved in the integration and the exploitation of the services. The R&D component is (very) important in this project (it represents >50% of the budget).

Challenges are foreseen in the computing, storage, network and procurement areas. Ideas on caching at the provider's site? Data could be accessed via streaming or caching, but this should work transparently with the existing experiment services; this is particularly important for experiments that are not in WLCG. Storage is problematic on cloud services, so the network should be sufficient for the workflows that will be handled. In past examples CPU and network worked fine, but this project should make sure that the storage needs are well covered.

Project phases: preparation - tender - implementation; 4 designs (by the end of Jan. 2017 the companies will deliver the design documents for evaluation), then 3 prototypes, then 2 pilots. The phases extend up to Dec. 2018.

Achievements so far are explained in the talk. The project is delivering according to the original plan. Check the slides for details about the 4 consortia that have joined the project.

Envisaged solutions: for HTC and HPC (via a dedicated partner, since most of them don't have HPC capabilities). Storage solutions partly based on INDIGO (ONEdata) and some others (GPFS/SpectrumScale and NFS). AAI: Indigo AIM, direct support of SAML 2.0. All this is very preliminary, since these solutions need to be verified, integrated and tested.

30-Jan: deadline for the design reports. 23-Feb: final answer on the designs. Check the slides for details about the phase timelines and milestones.

Questions

(Q) J. Flix: how deeply are the procurers and the experiments involved in the design? (A) H. Meinhard: They are not. This is different: the development is on the companies' side, the researchers are not involved in this. (Q) M. Schulz: but were the experiments not involved in providing any feedback? (A) H. Meinhard: this has been presented to them in some meetings, and feedback was provided at the time.

(Q) M. Schulz: 4 consortia are available, 3 enter the next phase. Are all of them going to be interested in continuing? (A) H. Meinhard: there are indications that this will indeed happen.

(Q) J. Flix: What about the quality of the proposals? Could they turn out not to be of the quality we expect? (A) H. Meinhard: it is expected that the proposals will have sufficient quality.

Accounting (J. Andreeva)

Slides: https://indico.cern.ch/event/578982/contributions/2418698/attachments/1393883/2124229/WLCGAccountingGDB.11.01.2017.pdf

The WLCG Accounting TF was created in Apr. 2016. Check the twiki for the TF composition and the activities done so far. The main objectives were: validation of the WLCG accounting data, coordinating with the developers to deploy a consistent WLCG view, and generation of T1 and T2 reports by the EGI portal, among others. For space accounting, the objectives were to validate and implement the views, and to include the accounting of the opportunistic usage of resources as well.

Validation checks were carried out. They included comparing the CPU consumption from the experiments' p.o.v. with the EGI portal, taking into account that experiments account for payloads and EGI for pilots. Some checks were done consistently, taking the pilot inefficiencies into account. Wallclock time checks were done (taking into account that the batch systems scale times, while the times measured by the experiments are raw), and wallclock work checks were done as well (wallclock time multiplied by the number of processors), with care, since not all of the benchmarking factors were consistent between APEL and the experiment accounting systems.
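
As a rough illustration of the wallclock work comparison described above, the sketch below computes wallclock work as raw wallclock time multiplied by the number of processors and a benchmark factor, and then the relative difference between two accounting sources. The function names, the benchmark factor and all numbers are hypothetical and chosen only for illustration; this is not the TF's actual validation code.

```python
def wallclock_work(wallclock_hours, n_processors, benchmark_factor=1.0):
    """Wallclock work: raw wallclock time x processors x benchmark factor."""
    return wallclock_hours * n_processors * benchmark_factor


def relative_difference(portal_value, experiment_value):
    """Difference of the portal value relative to the experiment value."""
    if experiment_value == 0:
        raise ValueError("experiment value must be non-zero")
    return (portal_value - experiment_value) / experiment_value


# Hypothetical numbers for one site and one month (not real accounting data).
# The same benchmark factor is applied to both sides so that only the
# wallclock difference (payloads vs. pilots) shows up in the comparison.
experiment_work = wallclock_work(1000.0, 8, benchmark_factor=10.0)
portal_work = wallclock_work(1100.0, 8, benchmark_factor=10.0)

diff = relative_difference(portal_work, experiment_work)
print(f"portal vs experiment wallclock work: {diff:+.1%}")
```

If the two sources used different benchmark factors, the difference would mix benchmarking and pilot effects, which is why the notes stress checking the factors first.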

With these checks, and taking into account all of the sources of inefficiency, the TF concluded that the data provided by the EGI portal is trustworthy and reliable. The validation served to detect problematic sites; GGUS tickets were opened and these problems were corrected. Important for the sites: a FAQ page was created to help the sites publish their accounting information correctly.

An SSB page has been created and made public; it contains a view for sites to check whether they are publishing consistent values. New naming conventions were proposed by the TF as a result of these discussions.

T0 accounting: all types of T0 resources are now correctly accounted. This has been validated with some experiment representatives. CERN accounting data has been re-published since April 2016.

New EGI accounting portal: very good collaboration with the developer, who fixed many issues. So far, the feedback is very positive. The portal contains multiple changes and improvements. The generation of T1 reports has been moved to this EGI portal. The REBUS code has been re-used. Wallclock work is now being compared to the CPU pledges, as was decided in an MB early in 2016. The installed CPU capacity has been removed from the reports. If no more problems are observed, the new reports will become official starting from January 2017.

Space accounting, not based on SRM: the common formats were agreed with the experiments in the Data Management pre-GDB. The WLCG DM group will coordinate the space reports, and the collection, storage and visualization will be carried out in 2017 (this is out of the scope of the WLCG accounting TF).

Accounting of the opportunistic resources: two issues need to be addressed: a) topology b) benchmarking. CRIC will provide the topology. The HEPIX benchmarking group is following up the benchmarking of these resources. Experiment reports could as well be used.

Things to be followed up: new plots in the EGI accounting portal; benchmarking; how to publish the benchmarking information; HTCondor and APEL; improving the debugging of accounting problems (raw wallclock time published for all sites in APEL would certainly help).

Questions

(Q) J. Gordon: Accounting data for opportunistic resources is now on the experiments' side. Could this be used? (A) J. Andreeva: this implies some work with the experiments. We might need to make some prototypes. (Q) G. McCance: will this be done by the experiments or by APEL? (A) J. Andreeva: sites should be publishing this. CERN already has a prototype.

(Q) J. Flix: Could the T0 report be made public, showing which kinds of resources are used by the experiments, such as the resources used for T0, the ones used for experiment HW services, and so on? (A) M. Coelho: We are working on it.

(Q) J. Flix: What about the CRIC status? (A) J. Andreeva: this will speed up, since a new FTE will start working on it soon and the team of experts has been set up.

IPV6 (D. Kelsey)

Slides: https://indico.cern.ch/event/578982/contributions/2418700/attachments/1393936/2124385/Kelsey11jan17.pdf

Please check the list of upcoming meetings. Next F2F at CERN in Feb. The aim is to have CPU IPv6 by April 2017. Several things to track at the moment.

Update from the experiments: ATLAS, QMUL and Brunel have IPv6-only CPU already. Several sites have dual-stack storage. Rucio progress is fairly OK, PANDA shows slower progress. Some progress will be shown next week at the ATLAS sites jamboree. LHCb and CMS have made progress as well.

LHCOPN IPv6 is still missing in some T1 sites, including RAL (a lot of work is being done at the moment).

Check the slides for the ETF IPv6 status and the perfSONAR IPv6 status.

As a summary, T1s seem to be OK (only a few missing). A good number of Tier 2s run dual-stack (BUT MANY do not - maybe WLCG Ops could help on this). No show-stoppers identified to date, but still a lot of work ahead of us.
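
A minimal sketch of how a site could check whether a service name resolves as dual-stack is shown below, assuming that DNS resolution is an acceptable first indicator. The helper names are hypothetical, and a name resolving over both protocols does not prove that the service actually listens on IPv6; this is not the ETF or perfSONAR test logic.

```python
import socket


def classify_stack(families):
    """Classify a host from the set of address families it resolves to."""
    has_v4 = socket.AF_INET in families
    has_v6 = socket.AF_INET6 in families
    if has_v4 and has_v6:
        return "dual-stack"
    if has_v6:
        return "IPv6-only"
    if has_v4:
        return "IPv4-only"
    return "unresolved"


def resolved_families(hostname, port=443):
    """Return the set of address families the resolver reports for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr).
    return {info[0] for info in infos}


# Classification demonstrated on explicit inputs, so no network is needed here.
print(classify_stack({socket.AF_INET, socket.AF_INET6}))
print(classify_stack({socket.AF_INET}))
```

For a real endpoint one would call classify_stack(resolved_families(name)) with the site's own storage or CE host name.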

Questions

(Q) L. Betev: IPv6 on CPU is fine, but the storage should be dual-stack until everything is IPv6-only, and not only at the Tier-1s. (A) I. Collier: limited resources will work in this mode, then next year we should be more strict. (Remark) L. Betev: but the storage should be dual-stack in any case until the migration is done everywhere.

(Remark) M. Litmaath: WLCG Ops Coordination will find people to look after the IPv6 deployment everywhere, by means of a Task Force. This TF can launch a GGUS campaign elsewhere. But we do need twiki pages with how-tos, with some (or all) working solutions documented, so we can point sites to them. (agreed). I. Collier proposal: the April 2017 milestone will be fulfilled, but we need an Ops TF to follow this. Until that date, we need to get all the procedures documented. A possible co-located activity at the WLCG workshop could be the IPv6 deployment at the T2s. Part of the program could be some form of tutorial on IPv6 deployment at the sites. So the WLCG workshop will be a good place to have a facilitating session on IPv6 deployment, focused particularly on T2s, so that the sites are encouraged to participate and gain this knowledge. Documentation and how-tos should be publicly available to the sites by that time.

pre-GDB network: summary (M. Babik / S. McKee / E. Martelli)

Slides: https://indico.cern.ch/event/578982/contributions/2418711/attachments/1394050/2124515/Report_on_the_Pre-GDB_on_Networking.pdf

Very well attended pre-GDB: 48 in the room - 25 remotely. 3 sessions occurred.

Site and Ops session

LHCOPN: 180 PB moved in 2016 -> +70% as compared to 2015. Some sites saw congestion and upgraded their WAN networks. LHCONE: the network is expanding, in particular in Asia.

WLCG Network Throughput WG: perfSONAR deployment status at the sites, analytics on the collected data, and many other things.

Check the Site reports highlights slide for some site details that were presented in the session.

Experiments session

ATLAS: Presentation made in the form of a Google doc (check the link). No large network increase is expected during Run2. The current network is OK. ATLAS maintains a network 'closeness' matrix, and there is a plan to include more network info in their DM services. For Run3: x10 increase in data transfers. SDN should be explored to determine the benefits for ATLAS.

CMS: DDM relies on reliable networks. WAN reads are now 20%, this might increase to 50%. Run3: x5-x10 factors are expected.

LHCb: monitoring the network is of strong interest for LHCb. They expect x10 for Run3.

ALICE: Run3 network usage is expected to be similar to today's; any increase will come from the increased data volume.

Future network capabilities session

DTNs for LHCONE: reference implementation done by ESnet, HW measurements made by Caltech. Alien waves: using other operators' infrastructure. This is being adopted by NorduNet.

Next gen. of SDNs: 400G will be ready by Run3; would this be enough? Data Transfer Nodes (DTNs) demonstrated very high transfer rates with the latest hardware. NGenIA vision project: integrated resources under SDNs, with smart middleware to orchestrate the data flows among DTNs. The goal is to provide stable and predictable data rates, even using machine learning techniques.

ESnet plans: 1/2 of the traffic is LHC related. Traffic has grown 10x every 4-5 years, for the last 30 years! The ESnet6 project has just started, with the aim of increasing the capacity x10. This will be in production by 2020. The plan for capacity upgrades on the paths has been derived from the historical data.
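
To put the "10x every 4-5 years" figure in perspective, the short sketch below converts it into an equivalent yearly growth factor and into the cumulative growth over 30 years. It assumes steady exponential growth, which is a simplification; the function names are illustrative only.

```python
def annual_growth_factor(factor, period_years):
    """Yearly multiplier equivalent to growing by `factor` every `period_years`."""
    return factor ** (1.0 / period_years)


def cumulative_growth(factor, period_years, span_years):
    """Total multiplier after `span_years` of the same steady growth."""
    return factor ** (span_years / period_years)


for period in (4, 5):
    yearly = annual_growth_factor(10, period)
    total = cumulative_growth(10, period, 30)
    print(f"10x every {period} years: x{yearly:.2f} per year, "
          f"x{total:.1e} over 30 years")
```

Under this assumption the yearly factor lies between roughly 1.6 and 1.8, i.e. traffic grows by about 60-80% per year.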

Peak utilization drives network Ops and planning. This opens the possibility of thinking about how to exploit the network better.

Questions still to be addressed: priorities for the next years. Importance of monitoring, measuring, analytics, alerting, SDNs? Missing capabilities from the experiments' p.o.v.? How to address/balance security vs ease-of-use vs capability? These priorities should be identified and carried out by someone/somewhere: formal mandate from the MB?

The meeting was very useful. Maybe this should happen more often (once a year).

Questions

(Q) J. Flix: Looking at the network utilization, it seems we over-provision, judging by the ratio of peak to average rates. Where do we find the proper compromise when planning network upgrades? (A) E. Martelli, S. McKee, L. Betev, B. Bockelman, M. Wadestein: SDNs would help. Peak demands occur for some workflows and/or bulk transfers. There is plenty of room to improve, but this is very difficult to estimate. We need better ways to handle occasional peaks.

(Remark) M. Litmaath: an MB talk is welcome and needed. (A) E. Martelli: the 4 in the group will prepare the talk for the MB. They will soon ask for it to be scheduled.

-- JosepFlix - 2017-01-12
