Site

Site and Endpoints
What is the site name? RAL-LCG2
Which endpoint URLs do your archival systems expose? srm-${experiment}.gridpp.rl.ac.uk ; root://${various}.gridpp.rl.ac.uk/
How is tape storage selected for a write (choice of endpoint, specification of a spacetoken, namespace prefix)? It's done by path. The path maps to a file class, which maps to a tape pool.
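The chain below is a purely illustrative Python sketch of that path-based selection; the real mapping is CASTOR server-side configuration that clients cannot query, and the namespace prefixes, file-class names and pool names used here are hypothetical. The point is only that the directory a file is written under decides which tape pool it lands on.

    # Illustrative only: CASTOR's path -> file class -> tape pool mapping is
    # server-side configuration; the prefixes and names below are hypothetical.
    FILE_CLASS_BY_PREFIX = {
        "/castor/example.rl.ac.uk/prod/atlas/rawTape/": "atlas_raw",
        "/castor/example.rl.ac.uk/prod/cms/rawTape/":   "cms_raw",
    }
    TAPE_POOL_BY_FILE_CLASS = {
        "atlas_raw": "atlasRawTapePool",
        "cms_raw":   "cmsRawTapePool",
    }

    def tape_pool_for(path):
        """Return the (hypothetical) tape pool a write to `path` would select."""
        for prefix, file_class in FILE_CLASS_BY_PREFIX.items():
            if path.startswith(prefix):
                return TAPE_POOL_BY_FILE_CLASS[file_class]
        return None  # path is not under a tape-backed file class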
Queue
What limits should clients respect? (RAL's response here is pretty much like CERN's because we also run CASTOR)
---> Max number of outstanding requests (in number of files or data volume): unlimited
---> Max submission rate for recalls or queries: ~10 Hz
---> Min/Max bulk request size (srmBringOnline or equivalent), in files or data volume: 1-1000 files. Perhaps we should check whether anyone has actually submitted 1000 (see the sketch below).
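As an example of staying inside those limits, the sketch below (Python, not RAL-provided tooling) splits a recall list into bulks of at most 1000 files and throttles submissions to roughly 10 Hz. The submit_bring_online callable is a hypothetical wrapper around whatever SRM client the experiment already uses (e.g. gfal2).

    import time

    MAX_FILES_PER_BULK = 1000   # upper bound on bulk request size quoted above
    MAX_SUBMISSION_HZ = 10      # ~10 Hz submission rate for recalls/queries

    def chunked(items, size):
        """Yield successive slices of at most `size` items."""
        for i in range(0, len(items), size):
            yield items[i:i + size]

    def submit_recalls(surls, submit_bring_online):
        """Submit srmBringOnline bulks of at most 1000 files at no more than ~10 Hz.

        `submit_bring_online` is a hypothetical callable that takes a list of
        SURLs, issues one bulk srmBringOnline, and returns a request token.
        """
        min_interval = 1.0 / MAX_SUBMISSION_HZ
        tokens = []
        for chunk in chunked(surls, MAX_FILES_PER_BULK):
            started = time.monotonic()
            tokens.append(submit_bring_online(chunk))
            # Throttle so the overall submission rate stays under ~10 Hz.
            elapsed = time.monotonic() - started
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
        return tokens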
Should clients back off under certain circumstances? YES
---> How is this signalled to the client? Via SRM, SRM_INTERNAL_ERROR is returned at request level and SRM_FILE_BUSY at file level; xrootd stalls the client. It can also happen through administrative processes, i.e. sysadmins communicating with the experiments.
---> For which operations? Potentially all
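A hedged sketch of how a client might honour those signals follows; poll_status is a hypothetical callable (standing in for the experiment's actual SRM client) that returns the request-level return code and the per-file codes for a request token. Waits grow exponentially with jitter, capped at ten minutes, and the loop gives up after the 24-hour recall timeout suggested under Timeouts below.

    import random
    import time

    RETRYABLE_REQUEST_CODES = {"SRM_INTERNAL_ERROR"}  # request-level back-off signal
    RETRYABLE_FILE_CODES = {"SRM_FILE_BUSY"}          # file-level back-off signal

    def poll_with_backoff(poll_status, token, max_wait=24 * 3600):
        """Poll a recall request, backing off while the SRM signals overload.

        `poll_status` is a hypothetical callable returning
        (request_code, list_of_file_codes) for a request token.
        """
        wait = 30.0
        deadline = time.monotonic() + max_wait
        while time.monotonic() < deadline:
            request_code, file_codes = poll_status(token)
            busy = (request_code in RETRYABLE_REQUEST_CODES
                    or any(code in RETRYABLE_FILE_CODES for code in file_codes))
            if not busy:
                return request_code, file_codes
            # Back off: exponential wait with jitter, capped at 10 minutes.
            time.sleep(wait + random.uniform(0, wait / 2))
            wait = min(wait * 2, 600.0)
        raise TimeoutError("recall did not complete within the suggested 24 hours")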
Is it advantageous to group requests by a particular criterion (e.g. tape family, date)? YES
---> What criterion? Files recalled together should be on the same tape. For WLCG and GridPP VOs, users are expected to do this themselves; for facilities (e.g. climate), another service "above" CASTOR will aggregate files into reasonably sized chunks that can (and will) be recalled together.
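For VOs that have to do this grouping themselves, one simple heuristic (an assumption here, not something CASTOR exposes) is to bulk-recall files by parent directory, since data written together tends to end up on the same tapes:

    from collections import defaultdict
    from urllib.parse import urlparse
    import posixpath

    def group_by_directory(surls):
        """Group SURLs by parent directory as a rough proxy for tape co-location."""
        groups = defaultdict(list)
        for surl in surls:
            path = urlparse(surl).path
            groups[posixpath.dirname(path)].append(surl)
        return groups

Each resulting group can then be submitted as one bulk srmBringOnline request.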
Prioritisation
Can you handle priority requests? YES
---> How is this requested? In practice, administratively. Typically, to prioritise recalls for a given user/VO, we will allocate more drives. If a lot of data needs to be recalled (petabytes), CASTOR admins can help reschedule recalls to make them more efficient.
Protocol support
Are there any unsupported or partially supported operations (e.g. pinning)? Pinning is not supported.
Timeouts
What timeouts do you recommend? Our experience is that the client times out before we do. 24 hours is suggested for recalls.
Do you have hardcoded or default timeouts? No.
Operations and metrics
Can you provide total sum of data stored by VO in the archive to 100TB accuracy? YES; we can be much more accurate than that: with our earlier/current information provider we can report down to the byte level (but it's expensive, so we do it only once every 24 hours).
Can you provide space occupied on tapes by VO (includes deleted data, but not yet reclaimed space) to 100TB accuracy? YES
How do you allocate free tape space to VOs? There's an "infinite" tape pool of free tapes. Free tapes are essentially allocated as needed, but we then track usage administratively, e.g. keeping an eye on when we (or the VO) need to buy more tapes, whether a VO is using too many tapes, etc.
What is the frequency with which you run repack operations to reclaim space on tapes after data deletion? Depends on user requirements and deletion rates. Weekly.
Recommendations for clients
Recommendation 1  
---> Information required by users to follow advice: For WLCG/GridPP-approved experiments/VOs, we have a weekly meeting (using Vidyo) which they are strongly recommended to join. We also have mailing lists and, within the T1, lists of contacts for every VO.
Recommendation 2  
Buffer Management
Should a client stop submitting recalls if the available buffer space reaches a threshold? To a first approximation, no. We automatically garbage-collect the least recently used data in the cache. Our policy is to make the cache big enough that this isn't a problem (ATLAS and CMS have 640 TB each). If a user needs to recall hundreds of TB, they would usually talk to the operators anyway. The cache is shared between ingest and recall; we have the ability to separate it if needed.
---> How can a client determine the buffer used and free space? They can't. We hold the buffer at 70% full; if it goes higher than that, it means we either have a garbage-collection problem or an unexpected tape robot outage.
---> What is the threshold (high water mark)? 70% full
---> When should the client restart submission (low water mark)? We can't stop the client and don't have a meaningful low water mark, because garbage collection acts as required to keep the cache at 70% full. As mentioned above, huge recalls should be done in collaboration with RAL admins. If we could stop the client, restart would depend on the size of the client's recall relative to the cache size and the amount of other recall/migration activity.
If the client does not have to back off on a full buffer, and you support pinning, how is the buffer managed? We don't support pinning.
---> Is data moved from buffer to another local disk, either by the HSM or by an external agent? Not by us. The users can manually copy data to other local disk resources, such as CASTOR d1t0 or ECHO.
Additional questions
Should any other questions appear in subsequent iterations of this survey?  
-- OliverKeeble - 2018-01-30