May 2011 Reports
31st May 2011 (Tuesday)
Experiment activities:
- A lot of data. Processing and reprocessing are running.
New GGUS (or RT) tickets:
Issues at the sites and services
30th May 2011 (Monday)
Experiment activities:
- A lot of data. Processing and reprocessing are running. Certification of new Dirac version.
New GGUS (or RT) tickets:
Issues at the sites and services
27th May 2011 (Friday)
Experiment activities:
- Waiting for beam and data. Processing and reprocessing are running. Certification of the new Dirac version.
New GGUS (or RT) tickets:
Issues at the sites and services
26th May 2011 (Thursday)
Experiment activities:
- No beam, no data. Processing and reprocessing are running. Certification of the new Dirac version. The TransferAgent got stuck again, but all files from the backlog have now been transferred.
New GGUS (or RT) tickets:
Issues at the sites and services
- T0
- T1
- GRIDKA CREAM CE (GGUS:70835)
- IN2P3 LFC RO mirror is down; as a result most MC jobs from French sites failed to upload their output data files.
- T2
25th May 2011 (Wednesday)
Experiment activities:
- Data Taking is active. Processing and reprocessing are running. Certification of the new Dirac version. The TransferAgent got stuck overnight; as a result no data was transferred from the pit to Castor (under investigation).
New GGUS (or RT) tickets:
Issues at the sites and services
24th May 2011 (Tuesday)
Experiment activities:
- Data Taking is active. Processing and reprocessing are running.
New GGUS (or RT) tickets:
Issues at the sites and services
23rd May 2011 (Monday)
Experiment activities:
- Data Taking is active. Processing and reprocessing are running.
New GGUS (or RT) tickets:
Issues at the sites and services
20th May 2011 (Friday)
Experiment activities:
- Data Taking is active again. Validation of a new reconstruction is OK.
New GGUS (or RT) tickets:
Issues at the sites and services
- T0
- T1
- CERN : problem with CE ce130.cern.ch (GGUS:70748) and with ce203 and ce207 (GGUS:70730)
- CNAF : gridmap file problem. Yesterday at 12:35 CEST the mapping of Ricardo suddenly changed at CNAF from pillhcb003 to pillhcb027. The reason is not known; possibly the hard links in /etc/grid-security/gridmapdir were deleted (for all the experiments, not only LHCb). From that point the GRAM job state files of the running jobs (owned by pillhcb003) could no longer be managed by pillhcb027, and this caused the aborts. New jobs should not be affected; why this happened remains to be understood (an illustrative sketch of the gridmapdir mapping is given below).
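As an illustration only, here is a minimal Python sketch of the pool-account mapping scheme a gridmapdir implements: a DN is tied to one pool account by a hard link, and once those hard links are gone the next request leases a different free account (e.g. pillhcb003 becoming pillhcb027). The pillhcb file-name prefix and the DN encoding are assumptions made for this example; the real LCMAPS/gridmapdir code differs in detail.

    import os
    import urllib.parse

    def lease_pool_account(dn, gridmapdir="/etc/grid-security/gridmapdir",
                           prefix="pillhcb"):
        """Simplified illustration of gridmapdir pool-account leasing."""
        # An existing lease is a hard link named after the (encoded) DN that
        # shares an inode with exactly one pool-account file.
        link_name = os.path.join(gridmapdir, urllib.parse.quote(dn.lower(), safe=""))
        if os.path.exists(link_name):
            inode = os.stat(link_name).st_ino
            for entry in os.listdir(gridmapdir):
                path = os.path.join(gridmapdir, entry)
                if entry.startswith(prefix) and os.stat(path).st_ino == inode:
                    return entry
        # No lease any more (e.g. the hard links were deleted): take the first
        # free pool-account file (link count 1) and tie the DN to it, so the
        # same DN can suddenly map to a different account.
        for entry in sorted(os.listdir(gridmapdir)):
            path = os.path.join(gridmapdir, entry)
            if entry.startswith(prefix) and os.stat(path).st_nlink == 1:
                os.link(path, link_name)
                return entry
        raise RuntimeError("no free pool account in %s" % gridmapdir)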
19th May 2011 (Thursday)
Experiment activities:
- Data Taking is active again. Validation of a new reconstruction is OK. The LFC problem has been fixed by changing the timeout and the time between two retries (a generic sketch of this retry pattern is shown at the end of this entry).
New GGUS (or RT) tickets:
Issues at the sites and services
- T0
- T1
- SARA : Space token re-organisation done and in place (problems during the night because the new tokens were put in production before we had modified our Configuration Service).
- T2
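The report does not name the exact catalogue client options that were changed, but the LFC fix above amounts to tuning two knobs: a per-attempt timeout and the wait between two retries. A generic Python sketch of that pattern, with a hypothetical lookup callable standing in for the real catalogue query, is:

    import time

    def query_with_retries(lookup, path, timeout=30, retries=3, retry_interval=60):
        """Call a catalogue lookup with a per-attempt timeout, sleeping
        retry_interval seconds between consecutive attempts.  `lookup` is a
        hypothetical callable standing in for the real LFC client method."""
        last_error = None
        for attempt in range(1, retries + 1):
            try:
                return lookup(path, timeout=timeout)
            except Exception as exc:  # e.g. connection or request timeout
                last_error = exc
                if attempt < retries:
                    time.sleep(retry_interval)
        raise RuntimeError("catalogue lookup for %s failed after %d attempts: %s"
                           % (path, retries, last_error))

Raising the timeout makes a single attempt more tolerant of a slow server, while lengthening retry_interval spaces the attempts out so a briefly overloaded catalogue is not hammered.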
18th May 2011 (Wednesday)
Experiment activities:
- Data Taking is active again. Validation of a new reconstruction is ongoing. We have a major issue: 9k jobs are in status "checking" / "input data resolution" and this number is increasing. We are trying to identify the problem but have not yet found its cause.
New GGUS (or RT) tickets:
Issues at the sites and services
- T0
- T1
- SARA : CVMFS cache increased to 6 GB. Space token re-organisation done.
- PIC : the full-partition problem on wms01 and wms02 was due to a cron job that was not running.
- T2
17th May 2011 (Tuesday)
Experiment activities:
- Data Taking is active again. Validation of a new reconstruction ongoing
New GGUS (or RT) tickets:
Issues at the sites and services
16th May 2011 (Monday)
Experiment activities:
- Data Taking is active again. Validation of a new reconstruction
New GGUS (or RT) tickets:
Issues at the sites and services
- T0
- T1
- T2
- SAM / SE tests failing for GridKa and SARA. Currently being investigated by the site admins.
13th May 2011 (Friday)
Experiment activities:
- Technical stop : no data taking
New GGUS (or RT) tickets:
Issues at the sites and services
12th May 2011 (Thursday)
Experiment activities:
- Technical stop : no data taking
New GGUS (or RT) tickets:
Issues at the sites and services
- T0
- T1
- RAL : diskserver migration finished
- CERN : diskserver migration ongoing
- T2
11th May 2011 (Wednesday)
Experiment activities:
- Technical stop : no data taking
- GGUS ticket opened against GGUS because we were not able to submit a TEAM ticket (GGUS:70459, now fixed)
New GGUS (or RT) tickets:
Issues at the sites and services
- T0
- T1
- NIKHEF : jobs killed due to memory usage. The memory limit will be increased to 5 GB.
- IN2P3 : Pilot jobs aborted during the night but now back to normal.
- RAL : diskserver migration ongoing
- CERN : diskserver migration ongoing
- T2
10th May 2011 (Tuesday)
Experiment activities:
- Technical stop : no data taking
- MC productions on most T1/T2 sites
New GGUS (or RT) tickets:
Issues at the sites and services
- T0
- VOMS intervention
- Replication stream : Oracle intervention
- T1
- NIKHEF : jobs killed due to memory usage. Could the size limit be increased to 5 GB?
- SARA : SRM intervention
- PIC : interventions in network equipment and firmware updates
- T2
9th May 2011 (Monday)
Experiment activities:
- Technical stop : no data taking
- MC productions on most T1/T2 sites
New GGUS (or RT) tickets:
Issues at the sites and services
- T0
- T1
- SARA problem with aborted pilots for reconstruction jobs (GGUS:70170).
- T2
6th May 2011 (Friday)
Experiment activities:
- RAW reconstruction of current data almost completed, Stripping/Merging jobs in progress.
- Data removal / archiving postponed because of backlogs in data management processes.
- MC productions on most T1/T2 sites
New GGUS (or RT) tickets:
Issues at the sites and services
- T0
- T1
- SARA problem with aborted pilots for reconstruction jobs (GGUS:70170).
- RAL jobs looping over 3 unavailable files (GGUS:70158).
- RAL space token renaming started yesterday.
- T2
5th May 2011 (Thursday)
Experiment activities:
- RAW data distribution and FULL reconstruction are going on at most Tier-1s.
- Cleaning of old data to be started (~ 1/2 PB)
- A lot of MC continues to run.
New GGUS (or RT) tickets:
Issues at the sites and services
- T0
- T1
- SARA problem with aborted pilots for reconstruction jobs (GGUS:70170).
- RAL jobs looping over 3 unavailable files (GGUS:70158).
- RAL has increased the disk pools that were reported low yesterday.
- The change of space token names has been done at GridKa.
- IN2P3 : downtime tonight because of a dCache problem.
- T2
4th May 2011 (Wednesday)
Experiment activities:
- RAW data distribution and FULL reconstruction are going on at most Tier-1s.
- A lot of MC continues to run.
New GGUS (or RT) tickets:
Issues at the sites and services
- T0
- T1
- SARA problem with aborted pilots for reconstruction jobs (GGUS:70170).
- RAL jobs looping over 3 unavailable files (GGUS:70158).
- T2
3rd May 2011 (Tuesday)
Experiment activities:
- RAW data distribution and FULL reconstruction are going on at most Tier-1s.
- A lot of MC continues to run.
New GGUS (or RT) tickets:
Issues at the sites and services
- T0
- T1
- SARA problem with aborted pilots for reconstruction jobs (GGUS:70170).
- RAL jobs looping over 3 unavailable files (GGUS:70158).
- PIC : SAM SRM tests working again as of tonight. There was a problem due to changes in the space token names.
- CNAF : the same change of space token names happened there; the SAM SRM tests were changed today.
- T2
2nd May 2011 (Monday)
Experiment activities:
- RAW data distribution and FULL reconstruction are going on at most Tier-1s.
- A lot of MC continues to run.
New GGUS (or RT) tickets:
Issues at the sites and services
- T0
- T1
- SARA problem with aborted pilots for reco jobs (GGUS:70170). Problem fixed this morning.
- RAL jobs looping over 3 unavailable files (GGUS:70158).
- T2
--
RobertoSantinel - 02-Dec-2010