WLCG Workshop, Okinawa, April 11-12, 2015

Agenda

https://indico.cern.ch/event/345619/other-view?view=standard

WLCG Status

Tokyo T2 Site Report - T. Nakamura

ICEPP regional analysis center, located at Tokyo University, created in 2007

  • ATLAS only
  • Upgraded every 3 years: next upgrade at end of 2015
    • Goal: reach 7-8 PB by 2018
  • 46 kHS06, 2.5 PB pledged (3.2 PB including LocalGroupDisk)
    • Storage : DPM

WN: 10 GbE connection

WAN: 10 Gb/s connection

  • Can be saturated by FTS3
  • Japan connected to NY by a 10 Gb/s line + a new 10 Gb/s line through Osaka and Washington DC
    • Also a backup line through LA
  • International connection to be shared in the future with Belle2 and ITER
    • 100 Gb/s upgrade planned in 2016 : LA
    • 2 more 10 Gb/s to GEANT in 2016
    • 20 Gb/s for Tokyo T2 in 2016

WLCG Status and Readiness for Run 2 - J. Flix

Operations Coordination well established since its creation in 2012

  • 1.5 FTE
  • Manages operational issues and service deployment in synergy with OSG and EGI
  • Discusses experiment plans and requests
  • Defines and follows up actions
    • Task forces and WGs created as needed: ~10 FTEs
  • Meetings: fortnightly meeting + shorter meeting twice a week

Achievements in the last year

  • Federated Xrootd deployment for FAX and AAA
  • perfSONAR deployment
  • Coordination of SHA2 readiness effort + replacement of VOMRS by VOMS-Admin
  • FTS3 testing and deployment
  • WMS decommissioning
  • Multicore job support shared with single core jobs
    • High efficiency of resource utilization maintained

WLCG critical services list updated for Run2

  • Used by T0 to allocate the appropriate support effort
  • May be extended to cover T1s and T2s in the future: some critical services hosted outside CERN

glexec deployment done and moving to production

  • Already done in CMS
  • In progress in ATLAS

Other ongoing activities

  • Squid monitoring
  • Machine/job features: pass information to running jobs
  • IPv6 validation and deployment
  • http deployment in WLCG
  • MW readiness WG: readiness assessment using experiment workflows
    • WLCG MW Officer role created: defines baselines
    • Expert from all partners participating
  • Network and transfer metrics
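The machine/job features mechanism mentioned above is file-based: the batch system exposes one small file per key in directories advertised to the job. A minimal reader sketch (the MACHINEFEATURES/JOBFEATURES variable names follow the MJF proposal; the `hs06` key is an example, not a guaranteed filename):

```python
import os

def read_feature(base_dir, key, default=None):
    """Read one machine/job feature: each key is a small text file in the directory."""
    try:
        with open(os.path.join(base_dir, key)) as f:
            return f.read().strip()
    except OSError:
        return default

# A running job would locate the directories via environment variables, e.g.:
#   mf = os.environ.get("MACHINEFEATURES")   # per-machine values such as hs06
#   jf = os.environ.get("JOBFEATURES")       # per-job values such as wall limits
```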

WLCG Operational Costs

  • Feedback from experiments + site survey end of 2014
  • Huge effort by all experiments to optimize their computing models and achieve physics objectives despite the budget restrictions in many countries
    • Improved SW
    • Multicore jobs
    • Optimized usage of disks
    • T2s used for some workflow activities previously handled by T1s
    • Exploit new type of resources: HPC, clouds

Main challenges foreseen for Run 2

  • MW support
  • Migration to new batch systems
    • Also pass job parameters to batch systems
  • Migration to a new major OS version
  • IPv6 + more demanding networks
  • Cloud resources (community and commercial) part of standard WLCG resources and operations
  • Unsecured budgets... in particular for personnel

Need to start thinking about long-term planning for HL-LHC: expect significant changes in the computing models

Discussion

  • Jeff T.: unsecured budget for personnel is a major threat. Can live with lower HW resources than expected but cannot live without the people to run the infrastructure and support the experiments
  • Philippe C.: it is time to stop speaking about T1 and T2. The MONARC model is dead: experiments look only at functions/services delivered by sites
    • Ian B.: agree technically but politically we need to continue to deal with the difference. A requirement from funding agencies.
    • Pepe: T1s are committing to higher level of support, impact on manpower

WLCG Security - R. Wartel

Identity federation status

  • A lot of apps impacted and requiring changes: Vidyo, online CA, VOMS
  • Some progress: communities and projects better organized, Code of Conduct should help establish trust, policy work well received, AARC (H2020) bringing some hope...
    • AARC: 2 years, 19 partners, outreach/training, technical and policy work. Priorities: working international authN, harmonization of attributes

Global computing: also adopted by criminal organisations!

  • Cybercrime is highly profitable and the risks are minimal
  • Specialized markets
  • Malware-as-a-service
  • Strong consolidation of the underground economy: severe competition between a handful of exploit kits (EK)
    • Huge progress in time-to-market for exploits: a vulnerability can be exploited in EKs a few hours after being identified
  • Email remains the leading source of compromise: 90+% of breaches caused by spear phishing
    • Targeted phishing: 70% efficiency
    • Exploits launched before antivirus updates: typically 24h ahead of an antivirus update

Old security ("medieval fortress") approach does not work anymore

  • Landscape has changed: datacenter security and laptop security are equally important, main attacks target both and all platforms
  • Need to focus on procedure and people
  • Protect both services and people
  • Windigo example: ongoing for 4 years, 30K servers compromised in the last 2 years (including big names!), a full ecosystem of advanced malware

New threats: ransomware, doxing

  • Recently happened in HEP communities: multiple staff targeted, including death threats
  • Our community exposed: very open, a lot of articles with personal information...

Need to learn and adapt

  • International collaboration is our main asset
  • Don't overlook mobile device security/protection
  • Incident handling as part of normal operations
    • Importance of traceability
    • Also has a cost
  • Global adversaries require dedicated WLCG experts
    • Sites will deal with traceability requests
  • Global incident response: need appropriate legal, policy and technical tools
    • Also remove community/organisation boundaries

DB Services during Run 2 - L. Canali

Oracle: remains the proven solution for high-availability, concurrent transaction DBs

  • New NetApp backend (FAS8060) with more memory and SSDs: improved performance
  • Oracle production version at CERN is 11.2.0.4
    • Preparing upgrade to 12.1.0.2
  • New critical databases
    • QPSR (Quench Protection): 150 krows/s, 1 Mrows/s achieved during stress testing
    • SCADAR: WinCC/PVSS archiving
  • Replication evolution: moved to GoldenGate for ATLAS conditions DB replication to T1s, Active Data Guard for online-to-offline replication
  • CERN 24x7 piquet during Run 1: will restart in May 2015
    • The need will be reevaluated in 2016

Trends: servers with more cores and memory, SSDs becoming affordable

  • Consolidate HW to reduce management costs
    • Balance with the ability to upgrade services one by one: also adding the possibility to run several Oracle instances on the same box
  • Review applications to take advantage of workloads that can fit in memory
    • Reduce application complexity

DB on demand: self-service for provisioning, management of backups

  • Mainly MySQL (85%) but also Postgres and Oracle
  • Monitoring tools provided to users to troubleshoot performance problems

Scale-out DBs: shared-nothing architectures targeting high performance and low cost

  • Backend at CERN: Hadoop
  • Query engines: both SQL (declarative) and imperative (MapReduce and Spark)
  • Currently offloading some Oracle DBs to Hadoop for DBs that are "write once, read many": LHC logs, SCADA, CMS popularity
    • Data warehouse, reporting and analytics

Network Update - T. Cass

CERN news

  • WiFi: Campus-wide 802.11ac rollout planned in 2016/7
    • Controller-based "wave 2" solution with one access point per 3 offices
    • Market survey in progress to select the HW in December 2015 or early 2016
  • CERN network: remove the strong difference between LCG and GPN networks, move Campus network off GPN
    • LCG, GPN and other networks in the future as subsets of the computer center network
  • Data center connections: 260 Gb/s to experiment pits, 200 Gb/s to Wigner, 210 Gb/s to LHCOPN, 100 Gb/s to LHCONE, 80 Gb/s to GP internet
  • ESnet extended to CERN (and Amsterdam)

Ethernet roadmap: 1 Tb/s after 2020...

LHCOPN: main paths remain, backup paths moved to LHCONE

LHCONE: moving to a global infrastructure for HEP (Belle2, Auger...)

  • Many common sites
  • Working on extension to South America (Argentina for Auger), Africa and Middle East
  • AUP now agreed/available

IPv6 deployment in WLCG progressing but T2s readiness remains a concern

  • Only 20% of T2s currently ready and 20% with a plan in the next 2 years
  • Experiments working on IPv6 support: CMS wants AAA sites to support IPv6 during 2015, ATLAS requesting T1s and T2Ds to provide dual-stack perfSONAR instances
    • IPv6 WNs expected soon at several places

SDN

  • Project started with Brocade in OpenLab
  • Focus on expected network evolution and new use cases we could face in the future
    • CDN4LHC: reduce load on long-distance links, improved performance for poorly connected sites. Network of cache servers based on peering, IPv6 multicast
    • SDN may play a role in implementing these new strategies

WLCG Monitoring - J. Andreeva

See slides

Batch Systems - I. Collier

RAL is running 560 WNs, 12K cores

  • 40-60K jobs submitted every day

Started looking at Torque/MAUI alternatives in August 2012

  • Scalability, reliability, high-availability, dynamic resources
  • Concentrated on open-source solutions
    • Open-source Grid Engine: long-term future uncertain, community not very active
    • SLURM: found various issues in our use case
  • HTCondor chosen as replacement

Step by step migration to HTCondor

  • Started a new CE with decommissioned HW
  • Tested with a first VO: ATLAS
    • Hardened the configuration
  • Then CMS, LHCb, ALICE
  • Adaptation of operation tools: monitoring, accounting...
  • After validation by all the LHC VOs, migrated 50% of the capacity
    • 1 year after starting investigation
    • Migration completed 2 months later

Very good experience over the past 2 years: very stable operation

  • No change needed to the configuration when ramping up the number of WNs
  • Higher job start rate
  • Easy upgrades
  • Strong and good community support

Main HTCondor features used

  • Hierarchical accounting groups to achieve fairshare
    • dteam/ops treated as high priority jobs: also possible to flag other groups
  • partitionable slots + defrag daemon
    • defrag daemon: configuration updated by a cron job to do only what is necessary
    • Also use a sort expression to ensure that multicore jobs are considered before single core jobs
  • HA central managers
    • Several collectors running concurrently: submit machines and WNs report to all of them
    • One active negotiator: managed by condor_had
  • PID namespace + mount under scratch
  • cgroups for CPU and memory
    • memory cgroups: issues found, disabled until fixed
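Most of these features map onto a handful of condor_config knobs. A minimal sketch (group names and values are illustrative, not RAL's production settings):

```
# Hierarchical accounting groups for fairshare (group names illustrative)
GROUP_NAMES = group_atlas, group_atlas.prod, group_cms, group_dteam
GROUP_QUOTA_DYNAMIC_group_atlas = 0.6
GROUP_QUOTA_DYNAMIC_group_cms = 0.3
# dteam/ops treated as high priority
GROUP_PRIO_FACTOR_group_dteam = 1.0

# One partitionable slot per WN, carved up on demand
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%,mem=100%
SLOT_TYPE_1_PARTITIONABLE = True

# Sort expression: consider multicore jobs before single-core ones
NEGOTIATOR_PRE_JOB_RANK = RequestCpus

# Defrag daemon (central manager) drains slots to make room for multicore;
# a cron job would tune these values to do only what is necessary
DAEMON_LIST = $(DAEMON_LIST) DEFRAG
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_CONCURRENT_DRAINING = 10

# cgroup-based limits (memory enforcement off, as in the talk)
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = none

# HA central managers: several collectors, one active negotiator via
# condor_had (see HAD_LIST / REPLICATION_LIST in the HTCondor manual)
```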

Monitoring historically based on Nagios and Ganglia

  • Used startd cron to implement a WN health check and prevent jobs from starting in case of problems, with per-VO granularity
    • Information published into a ClassAd: easy to check with different tools
  • Nagios checks: mainly for condor_master daemon on all machines
    • Alarming differs depending on the type of machine
  • condor_gangliad: collect information from Condor ClassAds into Ganglia: one dedicated machine
  • Recently added ElasticSearch for displaying/analysing completed jobs
    • Source: condor history files
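The startd cron health check above can be sketched as a condor_config fragment (macro names follow HTCondor's startd cron mechanism; the script path and attribute names are illustrative, not RAL's actual configuration):

```
# Periodic WN health check via startd cron (path illustrative)
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) HEALTHCHECK
STARTD_CRON_HEALTHCHECK_EXECUTABLE = /usr/local/bin/wn_healthcheck
STARTD_CRON_HEALTHCHECK_PERIOD = 600
# The script prints ClassAd attributes to stdout, which are merged into
# the slot ad; one attribute per VO gives the per-VO granularity:
#   NODE_HEALTHY_ATLAS = True
#   NODE_HEALTHY_CMS = False
# Refuse to start jobs of a VO whose check failed
START = $(START) && ifThenElse(x509UserProxyVOName =?= "atlas", \
          NODE_HEALTHY_ATLAS =?= True, True)
```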

Integration with cloud (OpenNebula): allows taking advantage of unused cloud resources

  • About to move it into production: absence of static membership helps a lot

Future plans

  • Move WNs to SL7 and run SL6 WNs in a chroot environment using NAMED_CHROOT functionality in HTCondor
  • Simplification of WN configuration/installation using CVMFS grid.cern.ch: 800 fewer RPMs
  • Use PID namespaces as an alternative to pool accounts: should provide the same level of job isolation and traceability

Volunteer Computing - L. Field

Motivation

  • Free resources: 100K hosts achievable...
  • Community engagement: outreach channel, offering people a chance to participate

But also many challenges

  • Cost of using free resources: integration, operations
  • Attracting/retaining volunteers: advertisement, engagement
  • Low level of assurance: anyone can register

Virtualization opens a way to address the challenges: tested with Test4Theory

  • Vacuum model is very close to BOINC model: VM started by BOINC starts an agent/pilot to connect to experiment central queue

vLHC@home: CERN BOINC central service for several projects

  • ATLAS@Home started 2 years ago without any effort to attract volunteers: already 5K volunteers, 2nd simulation site
  • vLHC@home includes a Drupal portal as the common entry point for all projects
  • DataBridge: a scalable and efficient service to download job parameters and upload job results
  • A common platform allows coordinating/sharing the outreach effort and the development/operation costs

WLCG Operational Costs - A. Forti

First summary of the survey presented at March GDB, concentrating on FTEs

MW Support

  • CEs and SEs concentrating most of the negative feedback about deployment and troubleshooting: poor documentation, lack of log mining tools
  • YAIM future unclear but many sites still relying on it
  • Sites would like to see the WLCG-specific services reduced
  • Concerns about ARGUS and Torque/Maui support

Torque/MAUI: sites waiting for recommendations

Toward a "simple T2"

  • Recommend ARC CE: no need for an APEL box, simpler to configure/manage
  • Simplify/reduce the number of information published into the BDII
    • Also missing a YAIM replacement to fill the values in the BDII
  • Keep up with the work to reduce the number of storage protocols

Virtualization/clouds: provide sites with a few images containing everything required, without the site having to configure specific services

  • Containers are another promising technology to relieve sites from configuring WLCG-specific services

Storage

  • http-based federation: stick to industry standards
  • Ceph more and more attractive: need to properly support it as a storage system in WLCG

Monitoring: should better advertise the SAM integration into local Nagios

  • Local monitoring is essential to catch problems before the jobs are affected
  • SAM tests: lack of documentation about the errors detected, time wasted googling; also not possible to manually rerun the tests after fixing the problems
    • Also the issue of experiment specific pages being protected: sometimes difficult to access existing information

WLCG OpsCoord should be the channel for making requests to sites

  • Many sites support several VOs: direct requests risk conflicting requirements
  • Also need to ensure that there are not too many urgent requirements put on sites

Sites asking for clearer OpsCoord directions

  • Proposal: add a 'site actions' section to the minutes
  • Should also try to consolidate the information available at one entry point

Also a need for site-to-site communication

  • Mailing list: risk of overlap with lcg-rollout
  • Open and searchable wikis
  • WLCG Jamborees?
  • HEPiX and GDB

Improve participation of sites in OpsCoord meetings, including TFs/WGs

  • Start a bit later (4pm) to allow a larger US participation
  • Asian participation

Experiment Session

ALICE - M. Litmaath

Run 2: more detectors added, new LHC conditions will lead to a doubled event rate

  • Also result in a 25% increase of event size
  • Efforts concentrated on improving SW performance

Simulation is still mainly based on G3 for performance reasons

  • G4 is still 2x slower...
  • ... but also gives access to multicore resources (multithreaded capabilities in v10)
    • v10 validation has started

Distributed computing and analysis: no news, good news! Things work and continue to grow

  • Migration to CVMFS fully completed
  • AliEn: new ARC interface, ongoing consolidation work
  • Testing "opportunistic" use of HLT by AliEn
  • Analysis: organised analysis (trains) now more than half of the analysis load
    • Individual analysis down 50% in 1 year

Data popularity monitored: almost no inactive data left

Run 2 preparation

  • Re-commissioning in progress, in particular with cosmics data
  • Reprocessing of RAW from Run 1 with the latest SW

CPU efficiency quite constant now for all types of jobs: ~80%

  • Several times > 70K concurrent jobs

Several new sites joined or about to join

  • including a few candidate T1s: UNAM (Mexico), São Paulo
    • After KISTI and Russian KI
  • Several T2s with a significant increase of their resources: Hiroshima, Torino, COMSTAT

Storage changes

  • Xrootd v4: IPv6 support
  • EOS: 4 external sites running it now...
  • Xrootd proxy to allow xrootd access from clusters without outbound connectivity: GSI, HPC clusters

RFC proxies required on VOBOX to move to latest openssl

  • Didn't manage to get legacy proxy working
  • Work in progress, done at many sites
  • Latest AliEn on VOBOX as soon as migration is completed

SAM3: will use a new A/R formula

  • Basically: (any working CE or the VOBOX) AND all SEs
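The new A/R formula is easy to state as a boolean expression; a toy sketch (function and argument names invented for illustration):

```python
def alice_site_ok(ce_ok_list, vobox_ok, se_ok_list):
    """SAM3-style availability: any working CE (or the VOBOX) AND all SEs OK."""
    compute_ok = any(ce_ok_list) or vobox_ok
    storage_ok = all(se_ok_list)
    return compute_ok and storage_ok
```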

R&D work in progress

  • Ceph as an xrootd backend
  • Virtual Analysis Facility (VAF): Proof on Demand in a cloud
    • Using HTCondor, Elastiq, CernVM Online

ATLAS - A. Filipcic

New computing model: less difference between T1 and T2

  • Long-lived data stored on well connected stable SEs: 90% availability required
    • Sites with availability < 80% not considered for data placement
  • Jobs executed everywhere: intermediate datasets left distributed over SEs
  • New services (JEDI, Prodsys-2, Rucio) all in production since December 2014
    • New system faster and more flexible

A lot of effort put on monitoring tools tailored to ATLAS needs

  • Generic tools generally not matching ATLAS needs: FTS3 dashboard, WLCG transfer... and FAX dashboard

Analysis share: 5% at T1s, 50% at T2s

Production: most jobs with 2 GB/core and 6-12 hours

  • Some specific jobs require extreme resources: 4-8 GB, 4-6 days
  • AthenaMP: works with up to 32 cores, improves memory footprint
    • Unsupported job types: merge jobs (but fast, low memory), event generation
    • Initialisation/completion done with a single core: ~15 min. Should allow up to 96% efficiency for a 6-8h job duration
  • ATLAS targets 80% of its resources being able to run multicore jobs but is currently at 50%
  • Jobs monitor their RSS usage against the configured limit and terminate if they go over
  • Plan to use more opportunistic resources in the future: ATLAS production system can now handle them transparently
    • For non I/O intensive workloads
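The quoted ~96% AthenaMP efficiency follows from simple arithmetic: with ~15 min of single-core initialisation/completion, the fraction of wall time in which all cores can be busy for a 6-8h job is:

```python
def parallel_efficiency(total_hours, serial_minutes=15):
    """Fraction of wall time spent in the parallel phase, assuming the
    ~15 min single-core init/finalisation quoted in the talk."""
    total_min = total_hours * 60
    return (total_min - serial_minutes) / total_min

# For the quoted 6-8h jobs this lands around the 96% figure:
# parallel_efficiency(6) ≈ 0.958, parallel_efficiency(8) ≈ 0.969
```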

Data access through WAN: job overflow

  • Level controlled by JEDI

Shift changes: new Computing Run Coordinator created

  • Fewer requirements on knowledge of ATLAS SW
  • Call to site admins as volunteers: counted as a Class-2 shift

Sites invited to attend weekly ADC meeting

CMS - C. Wissing

Event rate from 400 Hz to 1 kHz: computing during Run 2 will be resource constrained

  • Also increased pileup

Multicore processing: not only for improving the memory footprint but also to stay within the 48h window for RECO jobs

  • Prompt RECO is a 4-thread application

Lowering site boundaries

  • Data federation to allow remote access to data
  • Compute resources in one global HTCondor pool used for production and analysis
    • Sharing between prod and analysis done by HTCondor instead of being configured at sites
    • Local fairshare configuration: see slides
  • Provisioning of opportunistic resources (cloud, HPC...) through HTCondor glideinWMS too

HLT for processing and production: HLT size larger than all T1s, configured as an OpenStack cloud

  • Data storage: direct access to EOS
  • Network connection upgraded from 60 Gb/s to 120 Gb/s

To face the resource constraint expected, working on being able to use any non pledged resources: clouds, HPC, ...

Simplified management of disk space at sites

  • All spaces previously managed by groups now transferred into centrally managed space: central space is 60% of the pledged disk space
  • Still 40% of the pledged space that is basically unregistered and unmanaged: deploying a space management service at sites

Disk/tape separation achieved: will allow using T1s for chaotic user analysis

  • No risk to trigger tape access
  • Recent tape exercise done: all T1s achieved performance far beyond expectations

New data format: mini-AOD

  • To replace group ntuples
  • 50 kB/event
  • Should satisfy 80% of the analysis cases

LHCb - S. Roiser

LHCb uses "intensity leveling" with a reduced luminosity constant over a fill: with the new LHC conditions, this will reduce the pileup

New trigger scheme

  • 1 MHz after the HW L0 trigger
  • HLT1: real time partial reconstruction, buffered to disks
  • HLT2: (slightly) deferred full event selection, with calibration data. Output very close to offline reconstruction
    • Output: 12.5 kHz
    • 10 kHz (with some parked events) intended to go through the offline reconstruction
    • 2.5 kHz TurboStream: events that can be directly processed by physics analysis, without going through offline reconstruction

Offline reconstruction: now supposed to be the final processing, no reprocessing foreseen before end of Run 2

  • Done using the same calibration/alignment as in HLT
  • Longer retention expected for stripping output: compensated by more physics moved to MDST
  • Some T2s used for reconstruction: no more tight coupling between a T2 and a T1

Analysis can be run at T0, T1s or T2Ds

  • A small fraction of the LHCb workload but the highest priority in the central task queue

Can take advantage of any computing infrastructure, virtualized or not: all environments served by the same pilot infrastructure connected to DIRAC

Data storage: no more direct processing from tape caches/disk buffers

  • Data copied from tapes (disk buffers) to disk only storage through FTS3
  • Should lead to a reduction of tape disk buffers
  • LFC replaced by DIRAC File Catalog: bookkeeping unchanged
  • Data popularity monitored

Data access through "Gaudi federation"

  • List of replicas created for each analysis job, with the ability to fail over to a non-local replica in case of problems
  • SRM used for tape interactions and for writing to storage (job output, data replication)
  • Xrootd: used for reading only. SRM-less.
  • http/DAV: deployed at all sites, could be an alternative protocol to SRM
    • http federation started: development in progress for doing data consistency checks
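The failover logic described above amounts to an ordered walk over the replica list; a generic sketch of the idea (not actual Gaudi code; names invented):

```python
def open_with_failover(replicas, try_open):
    """Try each replica in turn (local first), falling back to remote copies.
    `replicas` is an ordered list of URLs; `try_open` opens one or raises."""
    errors = []
    for url in replicas:
        try:
            return try_open(url)
        except IOError as exc:
            errors.append((url, exc))
    raise IOError("all replicas failed: %s" % errors)
```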

Computing at HL-LHC Timescale

Introduction - I. Bird

CERN Council decided that HL-LHC should apply to be an ESFRI project

  • Opportunity for new sources of funding
  • ESFRI didn't exist when LHC project was started

Planning towards HL-LHC: need to agree on common baselines and expectations

  • Need to discuss potentially controversial topics: will the current computing models scale? physics costs vs. computing costs?
  • Distributed computing is here to stay
  • General-purpose x86 Linux is coming to an end: more efficient to specialize
    • GPU, HPC, ARM...
    • Still a role to play but only for some specialized workflows/workloads
  • Datacenters: O(100) is not very efficient, concentrate on O(10) large data facilities with associated computing resources
    • Potential role for commercial providers
  • "T2s role": today providing > 50% of the computing resources and the engagement of a lot of skilled people. Don't want to lose that
    • A lot of workflows still appropriate for this kind of resources

Also the recognized need for evolving/reengineering SW: HEP SW Foundation

WLCG must think about its role in a HEP-wide infrastructure serving future HEP projects (ILC, Belle2, FCC...), Intensity Frontier experiments and other related sciences (astro-particles...)

  • Need a common repository/library of proven tools and MW to allow reuse of existing solutions: HSF can help with this
    • We also need to adopt standards every time this is possible
  • Need strong input from experiments
  • Failing to do this quickly will lead to too high costs: we are under the pressure of funding agencies and other bodies who are following this closely and asking hard questions...

ALICE View - T. Chujo

At HL-LHC, from 40 MHz collisions to 50 MHz

  • No more data selection in the trigger: continuous read out. More than 1 TB/s from the detectors.
  • x100 output to storage: 100 GB/s to storage, 13 GB/s to computing centers at Run 3
  • Needs for local data storage higher than anticipated

Common HW and SW teams for DAQ, HLT and offline

  • O2 facility for online (synchronous) processing

Synchronous reconstruction (online) followed by offline, asynchronous refined reconstruction with quality control

  • zero suppression, compression: TPC still accounts for 60% of event size
  • Asynchronous reconstruction of raw data at T0 + T1s
  • Asynchronous reconstruction of MC data at T2s

Analysis at Analysis Facilities: input is AOD produced by T0/T1 (from raw data) and T2 (from MC data)

New approach already exercised during Run 1.

ATLAS - G. Stewart

ATLAS upgrade will happen during LS3

  • Replacement of the inner detector
  • Rate increase: x10
    • x10 in raw storage: 75 PB/year

Impact on the different workflows

  • Event generation and simulation not really affected
    • Simulation is scaling with energy, a lot of places that are good target for concurrency/vectorization (GeantV)
    • Event Generation : CPU intensive, a good candidate for HPC (some preliminary work at Argonne)
  • Digitization: linear scaling with pileup
  • Reconstruction: factorial scaling with pileup
  • Analysis: linear scaling with pileup

Integrated Simulation Framework: a framework that combines different simulation engines, using the most appropriate one for each part of the event simulation

  • Including fast simulation

Tracking: the key component at the heart of the battle against combinatorics!

  • Currently highly serialized to allow early rejection of poor candidates and avoid wasting CPU cycles
  • Need to probably sacrifice some serial efficiency to benefit from more concurrency: but quickly hitting memory limits...
  • Deep learning may play a role in the future

Framework: GaudiHive, multi-threaded version of Gaudi

  • No easy migration path from the current framework: progressive plan towards Run 3, including a lot of training

Analysis: moving to train model

  • Required to be smarter with I/O

Computing evolution foreseen

  • More disks but also more tapes to manage more efficiently derived data
  • New computing resource types: classic WLCG sites will remain a key part with bigger facilities
    • Smaller sites to move to lighter MW like BOINC?
  • Details are uncertain concerning the HW but multi-threading, data oriented design, parallel algorithms will be the keys for the success

The new generation of data management and workload management tools developed for Run 2 has been designed with the scalability/flexibility required for HL-LHC

Discussion

  • Jeff: what about future programming languages? Any chance to move to something other than C++?
    • Graeme: I don't think so. Would be a 100 M$ effort at least. I don't think FAs will buy/fund this. Fortunately, C++ is improving.

CMS - D. Lange

2 phase upgrade

  • Run 3: deal with high pileup
  • Run 4: deal with extreme pileup
    • Planning 5-7.5 kHz of events

Computing resource needs estimate for Run 4: x200 compared to Run 2

  • Depending on the exact estimate for HW evolution and SW improvements: 3-15x deficit
  • Almost the same for storage
  • Reconstruction is the key part to optimize
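The 3-15x range can be reproduced with back-of-the-envelope arithmetic; the growth and gain figures below are illustrative assumptions, not CMS's actual inputs:

```python
def run4_deficit(needed_factor=200, years=12, hw_gain_per_year=1.2, sw_gain=3.0):
    """How far resources fall short if needs grow by `needed_factor` while
    flat-budget HW improves `hw_gain_per_year` annually and SW improves by
    `sw_gain` overall (all assumed values, for illustration only)."""
    return needed_factor / (hw_gain_per_year ** years * sw_gain)

# Optimistic vs pessimistic assumptions roughly bracket the quoted 3-15x:
# run4_deficit(hw_gain_per_year=1.25, sw_gain=4)   ≈ 3.4
# run4_deficit(hw_gain_per_year=1.15, sw_gain=2.5) ≈ 15.0
```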

Multithreaded CMSSW framework being commissioned now

  • Short term (Run 2): will allow processing higher trigger rates
  • Longer term: explore new approaches based on more parallelism
    • Have ported CMS track reconstruction to Intel Phi

Analysis: miniAOD format, 5-10x smaller than Run 1 format

  • Potential for big analysis improvements in Run 2
  • Basis for R&D toward more I/O performant analysis data models

CMS R&D active and organized around weekly meetings, open outside CMS

  • Working with TechLab for benchmarking new HW architectures
  • CMS members actively involved in HSF

Discussion

  • Need to take into account that new ideas will be required for reconstruction and other activities, and new ideas always have lower performance at the beginning. This may negatively offset the gains achieved with existing algorithms.

LHCb - M. Cattaneo

LHCb upgrades will be completed at Run 3

  • In fact LHCb runs at a "reduced luminosity" that doesn't require HL-LHC: need to redesign several sub-detectors
  • Goal: improve precision
  • Software trigger at 40 MHz: 2 levels HLT with a fully reconstructed event produced by online

Need to be ready in 2020: no time for major changes in technology

  • R&D based on existing experience
  • Run 2 as a testbed for new ideas
  • New TDR in 2018

Reconstruction

  • HLT1 will run something very close to current offline reconstruction at event rate (30 MHz)
  • HLT1 data buffered to disk for deferred processing with calibration/alignment data: will already be used in Run 2
    • No need for offline reconstruction
    • Redefine RAW data as reconstructed data?
  • Doing reconstruction in only one place allows for HW optimizations but the code must continue to run on x86 architectures for MC events
  • 2.5-5 GB/s to storage
  • Changing the game for skimming and analysis: all events written out by reconstruction are interesting and will be analysed
    • For analyses interested only in the decay of the triggering signal, a new smaller format: TurboStream

Offline resources: mainly for simulation

  • Active work in progress for fast MC: LHCb relies on a high volume of simulation (x50 expected)

Storage model: few datacenters both for tapes and disks

  • Tapes: 3 sites would be enough
  • Disks: no need for many smaller sites but recognize that this can be an important sociological/funding issue

Current distributed model for CPU works well for LHCb

  • Also includes more opportunistic resources
  • No need for coupling with data

Discussion

Simone (SC): all experiments share the goal of moving to a limited number of datacenters with CPU everywhere. How do we move forward? The main issue is probably the sociological one: how to reduce the number of datacenters and keep the expertise/know-how that is spread across all the sites?

  • P. Charpentier (PC): we are facing the problem that many sites and funding agencies consider that doing physics is running analysis and don't understand the importance of running MC.
    • Need to convince FAs and sites that you can have a very valuable contribution to physics without operating storage
  • Ian Bird (IB): agree with Philippe but FAs are not obscure bodies but people close to us. It's up to us to explain what we consider a useful contribution to physics.

D. Britton: we must be careful about consolidation. In the UK, by getting the involvement of institutions like universities, we were able to deliver 300% of the resources directly funded by the project (GridPP). Consolidating on fewer centers may lead to a reduction of what is delivered.

  • G. Stewart (GS): operational costs come mainly from storage. Can probably maintain a very diverse/distributed contribution to CPU while consolidating storage

L. Sexton-Kennedy: some steps of the analysis require local storage, it's a small amount but end users may need it at their sites

HW Trends - B. Panzer

Semiconductor market saturating: no growth anymore

Server market small but very profitable: 99% Intel

  • HEP is 0.3% of the server market, which is becoming a niche...
  • Not many companies with the ability to spend a large part of their revenue in R&D: Intel spends 25%
  • Most IC companies are fab-less: only 4 companies with leading edge fabs
  • A few companies dominate the different markets: Intel (processor graphics) Samsung (disks, memories...)
    • Not necessarily competing with each others

May have reached the point where HS06/$ will no longer improve: the cost of producing new generations at the pace of Moore's law has increased significantly

  • Impact on server prices not yet clear, as the processor market is highly profitable. But the price/performance ratio improvement may be only 10%
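To put the quoted 10% in perspective, a quick back-of-envelope comparison (the ~25% historical rate below is an assumption for illustration, not a number from the talk) shows how much longer it takes to double capacity per dollar at the slower rate:

```python
import math

# Years needed to double HS06/$ at a given annual price/performance
# improvement rate, from (1 + rate)^years = 2.
def years_to_double(annual_rate):
    return math.log(2) / math.log(1 + annual_rate)

historical = years_to_double(0.25)  # ~25%/year, an assumed historical figure
slowed = years_to_double(0.10)      # ~10%/year, the figure quoted in the talk
# historical ~3.1 years, slowed ~7.3 years: flat budgets buy far less growth
```

At 10%/year, doubling takes more than twice as long as at the assumed historical rate, which is why flat hardware budgets no longer translate into the capacity growth experiments were used to.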

Microserver developments: currently dominated by ARM but Intel is coming

  • XeonD adopted by Facebook instead of ARM: easier software port
  • The game may change if Samsung buys AMD...

Still a few new architectures in the field, but nothing has materialized yet...

GPUs for HPC: a very small market (10K units), financed by integrated graphics cards whose market (revenue) is decreasing

Tape drive: LTO now has 96% of the market, 1 cent/GB

Memory: a new technology called "memory stacks" is coming that will improve performance by x15

  • Volatile DRAM market
  • NAND Flash: reached the limit with 2D, going 3D
  • Disruptive technologies complicated: many projects dropped due to the cost of producing them

Disks: 6 TB and 8 TB available but future unclear

  • Several technologies available to produce higher-density disks, but costs are very high and it is impossible to predict what will happen
  • SSD vs. HDD cost/size: x3 to x25. Not an affordable replacement.
    • Even producing the same volume of storage as HDDs with SSDs would be a huge investment: 0.5 T$!
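The 0.5 T$ figure is plausible from a back-of-envelope check; the input numbers below (annual HDD capacity shipped and per-GB flash cost) are illustrative assumptions, not figures from the talk:

```python
# Back-of-envelope check of the ~0.5 T$ figure (inputs are assumptions):
# replacing a year of shipped HDD capacity with NAND flash.
hdd_shipped_gb = 500e9   # ~500 EB/year of HDD capacity shipped, assumption
ssd_cost_per_gb = 1.0    # ~$1/GB flash cost circa 2015, assumption
total_cost = hdd_shipped_gb * ssd_cost_per_gb
# total_cost ~ 5e11 dollars, i.e. ~0.5 trillion dollars
```

With these assumed inputs the order of magnitude matches the quoted investment, independent of the exact per-GB price.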

Several knobs for savings in the total envelope of systems

  • Storage: should speak about "storage units", defined as a combination of space and performance
    • Significant gains possible by providing large space without high performance, for example for simulation
  • Current Haswell can execute 32 instructions per cycle while HEP code typically uses 1: improving the SW is the main path for savings...
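The "improve the SW" point is largely about exploiting the SIMD and superscalar units that sit idle in typical HEP code; a minimal sketch of the scalar vs. vectorized pattern, using numpy as a stand-in for compiler auto-vectorization:

```python
import numpy as np

def scalar_sum_of_squares(values):
    # One element at a time: the pattern that leaves most of a
    # modern core's execution ports idle.
    total = 0.0
    for v in values:
        total += v * v
    return total

def vectorized_sum_of_squares(values):
    # Lets the library process many elements per cycle via SIMD units.
    a = np.asarray(values, dtype=float)
    return float(np.dot(a, a))
```

Both functions compute the same result; the difference is how many of the available instruction slots per cycle actually get used.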

New Storage Technologies - L. Mascetti

See slides.

Resource Provisioning: Clouds - L. Field

An extension of the pilot job paradigm: the pilot VM

Need for consolidation in the way clouds are used between VOs

  • CernVM: a VM image (including OS) through CVMFS
    • CVMFS is already a requirement
  • Capacity management: the vacuum model is a robust generic approach (the VO is not responsible for starting the VM; the VM pops up and connects to the VO central queue)
    • Removes the need for all the frameworks to know about all resources types: parameter space reduction
  • Monitoring: fabric management is the responsibility of the capacity manager. Should be common to all VOs.
  • Accounting: need to map jobs (VO view) to resources offered to the VO (VMs, site view)
    • Need a unique solution for all VOs: need to give a unified view to resource/budget holders
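The accounting problem described above is essentially a join of two views on a common VM identifier; a hypothetical sketch (record field names are assumptions for illustration, not any real accounting schema):

```python
# Hypothetical sketch: join VO-side job records (job view) with
# site-side VM records (resource view) on a shared VM identifier,
# so resource/budget holders get one unified picture.
def unify_accounting(jobs, vms):
    vm_by_id = {vm["vm_id"]: vm for vm in vms}
    unified = []
    for job in jobs:
        vm = vm_by_id.get(job["vm_id"], {})
        unified.append({
            "job_id": job["job_id"],
            "vo": job["vo"],
            "site": vm.get("site", "unknown"),   # site view
            "cpu_hours": job["cpu_hours"],       # VO view
        })
    return unified
```

The key design point is the shared identifier: without a VM id recorded on both sides, the two views cannot be reconciled into one report.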

Commercial clouds: a lot of different initiatives

  • Helix Nebula
  • Microsoft Azure Pilot to start soon with CERN Openlab
  • Amazon/BNL joint project for ATLAS and CMS
    • New Scientific Computing group at AWS
  • PICSE
  • European Science Cloud Pilot: H2020 PCP proposal
    • Buyers group: organizations that are members of WLCG

CPU Resource Provisioning towards 2022 - A. McNab

Virtual "grid with pilot jobs" site

  • The site manages only the virtual infrastructure: nothing VO specific
    • A lot simpler than managing WNs
  • VO is managing its execution environment: CernVM
  • Building a "virtual grid" is just starting a VM with a pilot job
    • With the Vacuum model, starting the VM is no longer handled by a VO central service: simplification

Vacuum model: only a small user_data file must be supplied by the site to define what to run and when.

3 VM lifecycle managers implementing the Vacuum model: Vac, Vcycle, HTCondor Vacuum

  • Vac: standalone implementation (IaaC). No IaaS involved.
  • Vcycle: Vac features for an IaaS cloud. Currently supporting OpenStack.
    • Can be run centrally or by a site
  • HTCondor Vacuum: injects jobs which create VMs that coexist with normal jobs

Vac and Vcycle implement target shares to enable dynamic sharing of resources
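One simple way to implement target shares is to always start the next VM for the experiment furthest below its configured share; a much-simplified sketch of that idea (illustrative only, not the actual Vac/Vcycle code):

```python
# Simplified target-share scheduler (illustrative, not the actual
# Vac/Vcycle implementation): pick the experiment whose running
# fraction is furthest below its configured target share.
def next_vo_to_start(targets, running):
    # targets: {vo: fractional target share, summing to 1}
    # running: {vo: number of VMs currently running}
    total = sum(running.values()) or 1  # avoid division by zero at start-up
    def deficit(vo):
        return targets[vo] - running.get(vo, 0) / total
    return max(targets, key=deficit)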

Software Evolution - L. Sexton-Kennedy

Almost all experiments are dealing with the complexity of events by increasing the granularity of the detectors: ILD, LAr, may be CMS...

  • Need for a global collaboration in HEP: a challenge in itself

How HEP SW Foundation can help: mechanism to facilitate collaboration around SW

  • Collaboration is the only affordable way to address the challenges

SLAC workshop: the real kick-off meeting for HSF

  • Good non EU participation
  • Many non HEP/IF experiments represented
  • Community and project views: different focus, no conflict
  • Decided to adopt the Apache model: bottom-up, project-based, do-ocracy
    • Transparency essential
    • Darwinian approach: HSF provides an infrastructure to projects, users decide projects that survive

Several WGs formed and have started to work

  • Training: consensus that it must be the initial focus; several types of training needed; learn from other initiatives like SW Carpentry
  • Packaging and Building WG: define a build protocol to orchestrate the combination of various SW projects
    • Role of new technologies like Docker
    • Allow adoption by existing projects
    • Discussions in the issues of the GitHub HSF/packaging repository
  • Licensing WG: many SW projects without a license...
    • An open-source license is mandatory to participate in HSF
    • Build upon CERN recent work on the topic
  • SW Project WG: work on incubator idea
  • Development Tools: give access to tools/platforms available at certain labs, like CERN TechLab, FNAL...
  • Communication and Exchange WG: SW Knowledge Base
    • Everybody can contribute, add their project, make a review... just request an account

More information: see http://hepsoftwarefoundation.org

  • Everybody interested is welcome to join: see the web site for mailing list addresses

WLCG Partners

OSG - R. Quick

Current status of OSG

  • Ready for LHC Run 2
  • Ready to embrace the Intensity Frontier as a new major stakeholder
  • Ready to make a big leap forward in shrinking the geek gap in data analysis
  • Ready to work with bioinformatics to move it to DHTC through science gateways

Current usage figures: 75% HEP, 67% LHC

  • Footprints on 120 campuses
  • Strong opportunistic usage
    • OSG considers addressing long-tail science needs as part of its mission

Recent changes in OSG leaders: see slides

HTCondorCE: after the lessons learned from 10 years of GRAM, turns a CE into a particular configuration of HTCondor

  • Ownership by HTCondor team, some contributions by OSG
    • Authz by voms/gsi, support for multiple batch systems...
  • Easily delivered in OSG stack
  • Since Dec. 2014: default OSG-CE
    • GRAM-CE support expected to end in July 2016

OSG CA planning to transition to CI-Logon

  • CI-Logon: an NCSA project in conjunction with XD and XSEDE
  • OSG CA will be accredited by IGTF
  • Hope to complete the transition early 2016: smooth transition expected apart from DN changes

OSG as a service: OSG-Connect gateway

  • Abstract complexities of using DHTC
  • Generic service easily customized for each community
  • Starting collaboration with EGI competence centers

Data movement remains in the hands of big communities: no serious effort in OSG to offer a full-fledged data service for long-tail science

HPC resources: collaboration with XSEDE; currently mainly XSEDE offloading work to OSG, but working on the other direction too.

  • Good will on both sides

Network Services at OSG and WLCG - S. McKee

Network monitoring for WLCG through a standard open source tool: perfSONAR

  • 260 instances deployed
  • Wide deployment by WLCG was a driver for significant improvements in v3.4
    • Current version is 3.4.2 and addresses all the known issues in 3.4
  • Huge progress made in perfSONAR config management with meshes: central configuration of instances
    • Dynamic reconfiguration is possible
    • perfSONAR instances can participate in more than one mesh
  • Instances monitored by OMD: https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk/
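A central mesh is essentially a declarative list of hosts grouped by site plus the tests every pair of members should run; a much-simplified sketch expressed as a Python dict (field names here are illustrative assumptions, not the real perfSONAR MeshConfig schema):

```python
# Much-simplified mesh description (illustrative field names, NOT the
# actual perfSONAR MeshConfig schema): hosts grouped per site, plus
# the tests that every pair of member hosts should run.
mesh = {
    "description": "Example WLCG latency/bandwidth mesh",
    "sites": {
        "SITE-A": ["ps-a.example.org"],
        "SITE-B": ["ps-b.example.org"],
    },
    "tests": [
        {"type": "throughput", "interval_hours": 6},
        {"type": "one-way-latency", "packets_per_second": 10},
    ],
}

# Central configuration means each instance just downloads every mesh
# file it appears in and merges the resulting test sets, which is what
# lets one host participate in several meshes at once.
all_hosts = sorted(h for hosts in mesh["sites"].values() for h in hosts)
```

The benefit over per-host configuration is that adding a site to the mesh file automatically reconfigures every participating instance.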

MadDash for metrics visualisation: http://psmad.grid.iu.edu/maddash-webui/

Several concrete examples where perfSONAR infrastructure helped to isolate and fix tricky problems

  • See slides for a recent example between AGLT2 and SARA

OSG working on exploiting the perfSONAR data to raise alarm from a central archive: PuNDIT

Future plan

  • Build tools above perfSONAR to help diagnose/troubleshoot topology issues
  • Datastore access through ActiveMQ for application to use the data for their decisions
    • Pilot planned with FTS
    • Also a plan to integrate data from perfSONAR into FTS metrics in the SSB dashboard

Belle 2 - T. Hara

Belle 2: 50 ab-1

  • 100 PB of raw data
    • compared to 1 PB for Belle
  • Should start in 2018

Need to adopt world-wide distributed computing to meet the computing challenges

  • Distributed Computing infrastructure should be ready mid-2017
  • Next KEK computing system replacement planned mid-2016
    • Phases like for Tokyo T2

Computing resources very close to those of ATLAS or CMS

  • More for tapes
  • 3 main data sites: KEK, PNNL, GridKA/DESY

Belle 2 joined LHCONE

  • Similar requirements to LHC for networks
  • Many Belle 2 sites are WLCG sites
  • Achieved 1 GB/s between KEK and PNNL

Significant overlap between WLCG and Belle 2 sites: exposed to the MW diversity in WLCG

  • Using DIRAC as the experiment framework
  • Catalogs: not yet decided between DFC and AMGA+LFC
  • Developing their own monitoring tools above DIRAC: in particular to do site testing
  • Using GGUS

Wants to be an observer in WLCG and participate in GDBs/WLCG workshops

  • Ian/Michel: welcome to participate in GDBs!

Wrap-Up

Follow-up of discussions to be announced later

  • Several topics will be discussed during GDB
  • A specific initiative to progress on our infrastructure evolution

Next workshop in 9 months: details to be announced later

Thanks to the CHEP organizers for making this meeting possible!

-- MichelJouvin - 2015-04-12
