Storage Management TEG: Questionnaire Level 1 - Elisa Lanciotti
This twiki collects the input of Elisa Lanciotti. Please answer the questions below. For more information, please refer to the Storage TEG main twiki.
Question 1
- In your view, what are the 3 main current issues in Storage Management (SM)?
My answer:
Regarding weak points in storage management, I think the most serious operational issue encountered by LHCb concerns data access. Typically during reprocessing campaigns, many jobs try to access data that have been staged on the disk cache, reading the input files through a protocol rather than downloading them locally, and we often observe that the disk servers cannot handle so many open connections. This is a disk I/O problem, and it limits the number of concurrent jobs that can be launched for the reprocessing. We would need more spindles per disk server. Other people from LHCb (Philippe and Stefan) have surely mentioned this problem as well.
Secondly, I would like to mention some issues related to the reliability of the SRM interface:
It has been noticed many times (not only by LHCb, but also by ATLAS) that at dCache sites the real content of the dCache pools and the content of the SRM db are inconsistent. This causes operational problems for the experiment; for example, the space usage reported by the SLS sensors might be wrong.
It also affects the SE vs LFC consistency checks that every VO performs: sites provide storage dumps of the space tokens, but not all the files existing at the sites are listed there!
I reported on this issue at the Tier1 service coordination meeting of 03.11.2011, where I also mentioned GGUS tickets, e.g. 72114 and 75158.
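The SE vs LFC consistency check mentioned above can be sketched as a comparison of two sets of file paths. This is only a minimal illustration: the function name, the result keys, and the example paths are hypothetical, and it assumes that both the storage dump and the catalogue listing have already been reduced to one path per entry (real dumps and LFC listings have their own formats).

```python
def compare_dump_with_catalogue(dump_paths, catalogue_paths):
    """Compare a site storage dump with a catalogue listing.

    Both arguments are iterables of file paths (one path per entry).
    Returns a dict with two sorted lists:
      'only_in_dump'      - on storage but unknown to the catalogue
      'only_in_catalogue' - registered in the catalogue but absent from the dump
    """
    # Normalise entries: drop surrounding whitespace and empty lines.
    dump = {p.strip() for p in dump_paths if p.strip()}
    catalogue = {p.strip() for p in catalogue_paths if p.strip()}
    return {
        "only_in_dump": sorted(dump - catalogue),
        "only_in_catalogue": sorted(catalogue - dump),
    }


# Hypothetical example paths, for illustration only.
example_dump = ["/lhcb/data/file_a.raw", "/lhcb/data/file_b.raw"]
example_catalogue = ["/lhcb/data/file_b.raw", "/lhcb/data/file_c.raw"]
result = compare_dump_with_catalogue(example_dump, example_catalogue)
```

Note that if the storage dump is incomplete, as described above, an entry in `only_in_catalogue` does not prove that the file is lost: it may exist at the site and simply be missing from the dump.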
In general, we would like storage systems to enforce their internal consistency, and the information provided through SRM to be more reliable.
Cases like the incident that happened at Gridka last week (2 files physically lost from tape, but still reported as existing by SRM!) should also not happen. See GGUS ticket 75922 for details.
Question 2
- What is the greatest future challenge which would greatly impact the SM sector?
My answer:
Put here your answer
Question 3
- What is your site/experiment/middleware currently working on in SM?
My answer:
Put here your answer
Question 4
- What are the big developments that you would like to see from your site/experiment/storage system in the next 5 years?
My answer:
This is not easy to answer. Of course, in general experiments want storage systems to be more reliable and stable.
A possible practical improvement would be a solution for the typical incident in which a disk server or a tape is temporarily (or permanently) off-line.
One week ago a tape of LHCb raw data went off-line at CERN, and this triggered some considerations on what to do in case it could not be recovered. We realized that we are not prepared for a quick and simple solution. Of course the data can be recovered, as raw data have a second replica at another site, but staging the replicas at that site and then transferring them is an expensive, time-consuming manual operation.
It would be very helpful if the storage system itself could provide a tape backup, an exact mirror of the lost tape.
The same holds for disks, although I suppose that there it is more expensive.
Question 5
- In your experience and area of competence, what are the (up to) 3 main successes in SM so far?
My answer:
I think standardization: with the same client (gfal, lcg_util) you can deal with files hosted on storage systems of different types.
Question 6
- In your experience and area of competence, what are the (up to) 3 main failures or things you would like to see changed in SM so far?
My answer:
Put here your answer
That's it!
Thanks! Feel free to edit again at any time, until the date of the kick-off meeting.
--
DanieleBonacorsi - November 2011