ATLAS DDM Operations at SARA, 2007

December 10

Many errors "Reason [DESTINATION error during PREPARATION phase: [REQUEST_TIMEOUT] failed to prepare Destination file in 180 seconds] Source Host []" on SARADISK; no successful transfers.

December 7

Scheduled downtime of SARA-MATRIX due to problems with pnfs. Start 2007-12-06, 18:43:00 [UTC], End: 2007-12-07, 18:43:00 [UTC].

November 29

dq2_cleanup is no longer in .../GRID/ddm/pro03/; I will have to use a different tool to clean up DS from aborted tasks.

November 28

Scheduled (?) downtime for SARA-MATRIX from 9:10 to 13:00 UTC. Needed for the dCache upgrade.

November 21

The scheduled downtime for SARA-MATRIX, in effect since 19.11., was extended until 22.11.

November 2

Obsolete data deletion succeeded on all sites without a single error! Number of files deleted on each site:
IHEP_aborted_ds_20071030.list_20071102_1608.log: 69
ITEP_aborted_ds_20071030.list_20071102_1608.log: 65
JINR_aborted_ds_20071030.list_20071102_1608.log: 60
NIKHEF_aborted_ds_20071030.list_20071102_1608.log: 1195
SARADISK_aborted_ds_20071030.list_20071102_1608.log: 1294
SARATAPE_aborted_ds_20071030.list_20071102_1608.log: 0
SINP_aborted_ds_20071030.list_20071102_1608.log: 7
These DS belonged to the obsolete tasks listed in the mail of 30.10.2007.
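The per-site counts above can be totalled mechanically. The following is a minimal sketch, assuming the summary lines keep the form "<SITE>_aborted_ds_<...>.log: <count>"; the helper name deleted_per_site is hypothetical and not part of any DQ2 tool.

```python
import re

# Summary lines copied from the log entry above.
SUMMARY = """\
IHEP_aborted_ds_20071030.list_20071102_1608.log: 69
ITEP_aborted_ds_20071030.list_20071102_1608.log: 65
JINR_aborted_ds_20071030.list_20071102_1608.log: 60
NIKHEF_aborted_ds_20071030.list_20071102_1608.log: 1195
SARADISK_aborted_ds_20071030.list_20071102_1608.log: 1294
SARATAPE_aborted_ds_20071030.list_20071102_1608.log: 0
SINP_aborted_ds_20071030.list_20071102_1608.log: 7
"""

# Site name, then the fixed "_aborted_ds_" marker, then the count after ".log:".
LINE = re.compile(r"^(\w+?)_aborted_ds_\S+\.log:\s*(\d+)$")


def deleted_per_site(text):
    """Return {site: number of deleted files} (hypothetical helper)."""
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = deleted_per_site(SUMMARY)
total = sum(counts.values())
print(f"{total} files deleted on {len(counts)} sites")
```

For the numbers of this entry the total is 2690 files, almost all of them at NIKHEF and SARADISK.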

November 1

Lots of errors for transfers to SARA tape. Reported in GGUS ticket 28553 and also by Pedro in Savannah.

October 26

DS belonging to obsolete tasks were deleted:
IHEP_aborted_ds_20071025.list_20071026_0518.log: 1
ITEP_aborted_ds_20071025.list_20071026_0518.log: 1
SARADISK_aborted_ds_20071025.list_20071026_0518.log: 3
Only 5 files in total. No errors.

October 25

SINP downtime finished 3 days ago: 2007-10-22, 16:00:00 [UTC]. There is a new downtime since today, but only the CE should be down. The SE responds to srm-get-metadata. Open the channels:
glite-transfer-channel-set -S Active  -s STAR-SINP
glite-transfer-channel-set -S Active  -s SINP-STAR
And check they are active:
glite-transfer-channel-list -s STAR-SINP
Channel: STAR-SINP
Between: * and RU-MOSCOW-SINP-LCG2
State: Active
Contact: (null)
Bandwidth: 0
Nominal throughput: 0
Number of files: 5, streams: 5
Number of VO shares: 1
VO 'atlas' share is: 100
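
Checking the state by eye is fine for one channel; for several channels the saved output can be parsed. This is only a sketch that works on text already captured from glite-transfer-channel-list and does not call the command itself. The helper names parse_channel_listing and channel_is_active are hypothetical.

```python
# Output of "glite-transfer-channel-list -s STAR-SINP", copied from the entry above.
LISTING = """\
Channel: STAR-SINP
Between: * and RU-MOSCOW-SINP-LCG2
State: Active
Contact: (null)
Bandwidth: 0
Nominal throughput: 0
Number of files: 5, streams: 5
Number of VO shares: 1
VO 'atlas' share is: 100
"""


def parse_channel_listing(text):
    """Split each "key: value" line at the first colon (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def channel_is_active(text):
    """True if the listing reports "State: Active"."""
    return parse_channel_listing(text).get("State") == "Active"


fields = parse_channel_listing(LISTING)
active = channel_is_active(LISTING)
print(fields["Channel"], "active" if active else "NOT active")
```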

October 8

There are errors for NIKHEF on the dashboard pages: State from FTS: Failed; Retries: 3; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] the server sent an error response: 550 550 rfio write failure: No space left on device.

But there should be enough space for the production role:

                              CAPACITY 24.00T FREE 7.75T ( 32.3%)
/export/data/vg0/lv0 CAPACITY 6.00T FREE 61.83M (  0.0%)
/export/data/vg0/lv1 CAPACITY 6.00T FREE 3.81T ( 63.6%)
/export/cache5 CAPACITY 1.88T FREE 61.69G (  3.2%) RDONLY
/export/cache6 CAPACITY 1.88T FREE 131.00G (  6.8%) RDONLY
/export/cache7 CAPACITY 1.88T FREE 259.58G ( 13.5%) RDONLY
/export/data/vg0/lv0 CAPACITY 6.00T FREE 193.21M (  0.0%)
/export/data/vg0/lv1 CAPACITY 6.00T FREE 3.94T ( 65.6%)
I must do more debugging.
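
As a first debugging step, the listing above can be scanned for individual writable filesystems that are nearly full even though the pool as a whole still has free space. This is a minimal sketch, not a diagnosis: it assumes one entry per line, binary units for M/G/T, and an arbitrary 1 GB threshold. The names parse_listing and nearly_full are hypothetical.

```python
import re

# Assumption: the M/G/T suffixes are binary multiples.
UNIT = {"M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

# Values copied from the listing above, one entry per line; host names are absent.
LISTING = """\
                              CAPACITY 24.00T FREE 7.75T ( 32.3%)
/export/data/vg0/lv0 CAPACITY 6.00T FREE 61.83M (  0.0%)
/export/data/vg0/lv1 CAPACITY 6.00T FREE 3.81T ( 63.6%)
/export/cache5 CAPACITY 1.88T FREE 61.69G (  3.2%) RDONLY
/export/cache6 CAPACITY 1.88T FREE 131.00G (  6.8%) RDONLY
/export/cache7 CAPACITY 1.88T FREE 259.58G ( 13.5%) RDONLY
/export/data/vg0/lv0 CAPACITY 6.00T FREE 193.21M (  0.0%)
/export/data/vg0/lv1 CAPACITY 6.00T FREE 3.94T ( 65.6%)
"""

ENTRY = re.compile(
    r"(?:(?P<fs>/\S+)\s+)?CAPACITY\s+(?P<cap>[\d.]+)(?P<cap_unit>[MGT])"
    r"\s+FREE\s+(?P<free>[\d.]+)(?P<free_unit>[MGT])"
    r"\s+\(\s*(?P<pct>[\d.]+)%\)(?:\s+(?P<status>[A-Z]+))?\s*$"
)


def parse_listing(text):
    """Return one dict per line; 'fs' is None for the pool summary line."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.search(line)
        if m:
            entries.append({
                "fs": m.group("fs"),
                "capacity": float(m.group("cap")) * UNIT[m.group("cap_unit")],
                "free": float(m.group("free")) * UNIT[m.group("free_unit")],
                "percent_free": float(m.group("pct")),
                "status": m.group("status"),  # e.g. "RDONLY", or None if writable
            })
    return entries


def nearly_full(entries, min_free=UNIT["G"]):
    """Writable filesystems with less than min_free bytes left."""
    return [e for e in entries
            if e["fs"] and e["status"] is None and e["free"] < min_free]


entries = parse_listing(LISTING)
for entry in nearly_full(entries):
    print(entry["fs"], f'{entry["free"] / UNIT["M"]:.2f}M free')
```

With the numbers above this flags the two /export/data/vg0/lv0 entries, which are writable but have only 61.83M and 193.21M free. Whether writes landing there explain the "No space left on device" errors is not established here.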

October 3

Site services on the SARA VOBox at CERN kept crashing and finally stopped completely. Pedro managed to solve it; here are his explanation and instructions on what to do: "MySQL was simply not working properly. It didn't accept any more connections. when the agents were running you could see the error. to fix you should stop and start the agents. when you start the agents they try to make a database connection which because MySQL is still blocking new connections (this behaviour I haven't seen before). in this case, you should login to the 'site' MySQL machine, stop and start MySQL, go back to VOBox and start the agents."

A new unscheduled downtime for SINP, from 3.10. to 10.10.

October 2

Savannah bug: a file is registered in the LFC, but its size on disk is 0:
lcg-lr guid:0433D26E-891B-DC11-BDFE-00112FCCC3FB

lfc-ls -l /grid/atlas/dq2/trig1_misal1_mc12/AOD/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498._00479.pool.root.2
-rw-rw-r--   1 18992    1475                5302218 Jun 16 00:18 /grid/atlas/dq2/trig1_misal1_mc12/AOD/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498._00479.pool.root.2

srm-get-metadata srm://
WARNING: SRM_PATH is defined, which might cause a wrong version of srm client to be executed
RequestFileStatus         SURL :srm://
                     size :0
                     owner :1900
                     group :1213
                     permMode :420
                     checksumType :adler32
                     checksumValue :00000001
                     isPinned :false
                     isPermanent :true
                     isCached :true
                     state :
                     fileId :0
                     TURL :
                     estSecondsToStart :0
                     sourceFilename :
                     destFilename :
                     queueOrder :0
Delete it from disk and from the catalogue:
srm-advisory-delete srm://
lcg-uf --vo atlas guid:0433D26E-891B-DC11-BDFE-00112FCCC3FB srm://
I do not have permission to delete it from the DS.

September 28

Broadcast message from Ron: The network problems we were having were solved at about 10 pm last night. However, some storage nodes seemed to be in a peculiar state after the network problems. Today, we have tied up those loose ends and we are passing SAM tests again.

September 20

SINP SE downtime finished (17.9.), but the site downtime continues and the SE does not work.

A brief problem, announced by EGEE broadcast, was solved.

Files from aborted DS deleted. All aborted DS were only on SARADISK. 42 files deleted, no errors.

September 19

It was decided that SARA should get all M4 RAW data.

September 13

SINP has been in unscheduled downtime since yesterday, until 26.9. Jurriaan added me as a channel manager for the NL T2s, and I set the channels to and from SINP to inactive:
glite-transfer-channel-set -S Inactive  -s STAR-SINP
glite-transfer-channel-set -S Inactive  -s SINP-STAR

September 12

I am deleting files from aborted DS (parallel delete from all NL sites). The SINP SE does not respond; GGUS ticket 26743 has been created.

Some files from PNPI cannot be copied; FTS reports the error "Destination and source file sizes don't match!!". The problem was reported by Stephane via Savannah. GGUS ticket 26749 has been created.

September 5

The distribution of M4 ESD DS to IHEP, ITEP, JINR and PNPI is going quite well: M4 DS panda monitor

September 4

All transfers to ITEP stayed in status "ready". Restarting the services on FTS helped.

August 28

List of aborted DS contained 16 DS. In total 81 files were deleted using the procedure with dq2_cleanup, no errors. Log file on afs: /afs/

List of aborted DS (from August 21) contained 3 DS. 3 files were deleted from SARADISK, no errors.

August 23

The channel STAR-ITEP is now working for transfers to
gt-stat-sara -l a7a7440d-50b2-11dc-97f4-93d4a533c78d
Request ID:     a7a7440d-50b2-11dc-97f4-93d4a533c78d
Status:         Finished
Channel:        STAR-ITEP
Client DN:      /DC=cz/DC=cesnet-ca/O=Institute of Physics of the Academy of Sciences of the CR/CN=Jiri Chudoba
Submit time:    2007-08-22 13:21:56.146
Files:          1
Priority:       3
VOName:         atlas
        Done:           0
        Active:         0
        Pending:        0
        Ready:          0
        Canceled:       0
        Failed:         0
        Finishing:      0
        Finished:       1
        Submitted:      0
        Hold:           0
        Waiting:        0
  Source:      srm://
  Destination: srm://
  State:       Finished
  Retries:     0
  Reason:      (null)
  Duration:    18

August 20

Further tests of FTS channel STAR - ITEP:
glite-transfer-submit -v -p ftspwd -s srm:// srm://
Server supports delegation, however a MyProxy passphrase was given: will use MyProxy legacy mode.
gt-stat-sara ff5d2fcd-4ee0-11dc-97f4-93d4a533c78d
- after several minutes still waiting 
The same transfer to proceeds very fast:
glite-transfer-submit -v -p ftspwd -s srm:// srm://
gt-stat-sara 75e0ff15-4ee1-11dc-97f4-93d4a533c78d
- finished within 1 minute

August 18

Check SARA-ITEP channel. Transfer from a UI to ITEP:
 srmcp file:////mnt/raid4_atlas/chudoba/transfer/file.1KB srm://
- OK
A transfer to SARADISK:
 srmcp file:////mnt/raid4_atlas/chudoba/transfer/file.1KB  srm://
- now hangs
A transfer to NIKHEF:
 srmcp -debug=true file:////mnt/raid4_atlas/chudoba/transfer/file.1KB  srm://
- OK
Delete the file from ITEP and submit a transfer request from NIKHEF:
srm-advisory-delete srm://
myproxy-init -d -s
glite-transfer-submit -v -p ... -s srm:// srm://
- not finished, I lost the connection again. I will check when I am back in Prague.

August 2

Migration of to a new hardware.

July 31

FTS server was migrated to

July 26

LFC server change. The Oracle database server and the LFC were moved from two old dual-Xeon nodes to two new dual-core, dual-CPU Xeon machines with 4 GB of memory, two power supplies each and hardware RAID1 system disks. This makes everything more reliable than it was before.

July 17

Unscheduled intervention due to problems with /pnfs. FTS channel CERN-SARA set inactive. Announced at 9:10 via broadcast, back online at 13:44 (broadcast announcement).

June 27

FTS: VO manager role granted to me: /DC=cz/DC=cesnet-ca/O=Institute of Physics of the Academy of Sciences of the CR/CN=Jiri Chudoba (disappeared during an upgrade?)

June 22

Corrupted files unregistered from the LFC (in total 24660 files, 28 were already unregistered earlier - by whom?).

June 20

dCache at SARA upgraded to 1.7.0-36 (13:00 - 15:00)

June 19

A number of SARA's dcache pool nodes suffered from running out of disk space on the root file system. (13:10 - 13:45)

June 11

SARA router maintenance (18:00 - 19:30).

June 6

Maintenance of the Oracle db: LFC and FTS down, 14:00 - 18:00, broadcast only a few minutes beforehand. Announced back online at 16:40.

May 30

Scheduled maintenance for oracle server at SARA, announced 8:55, started 9:00, finished ??. The FTS on and LFC at were affected.

May 24

Still LFC problems: The LFC seems to crash every so many minutes (10:36). LFC has been running stably for the last 3.5 hours. (15:29).

May 23

LFC back online (15:26).

May 19

LFC mu11 still down, no way to do a cleanup ...

Answer to Yevgenij, he was complaining that (DPM) is still used. We must migrate to (dCache).

May 18

LFC again down: "Due to a disk problem of the oracle server at SARA the Oracle database is down for the moment. This problem affects the LFC on and the FTS server."

Yesterday we managed to get a dump of files stored at tbn18 and registered at mu11. Out of 45028 files, which were not found by the first version of the script (uses the same path at LFC as for SURL with a change /dpm/ -> /grid/atlas), only 47 were not registered.

Page with a description of the procedures used after file losses: AtlasDDMLostFiles.

May 17

Since May 14, many errors in config/TIER2S/subscriptions.log concerning proxy retrieval. Still not understood. Ron issued an EGEE broadcast about SARA's internal network problems, later prolonged until Friday 18 May. Transfers using the FTS server at CERN (config/SARA/subscriptions.log) issued their last error due to a missing proxy on 2007-05-14 22:52:49.

May 15

LFC mu11 down. GGUS ticket 21985.

Numbers about corrupted files at NIKHEF:

  • 78308 md5sum_correct.rec
  • 21169 md5sum_corrupted.rec
  • 32426 md5sum_missing.rec
  • 45028 missing_files.rec
  • 176931 total
The missing files were not found in the LFC because different conventions had been used to construct the LFN from the SURL. After a correction, only 47 files were not found in the LFC.
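
A quick arithmetic check of the numbers above, as a sketch: it confirms that the four record files add up to the quoted total and derives the corrupted fraction among the files whose md5sum could be compared. The interpretation of the categories follows the file names only and is an assumption.

```python
# Line counts of the .rec files, copied from the list above.
records = {
    "md5sum_correct": 78308,
    "md5sum_corrupted": 21169,
    "md5sum_missing": 32426,
    "missing_files": 45028,
}
reported_total = 176931

computed_total = sum(records.values())

# Files for which a checksum comparison was possible at all.
compared = records["md5sum_correct"] + records["md5sum_corrupted"]
corrupted_fraction = records["md5sum_corrupted"] / compared

print(f"sum of categories: {computed_total} (reported: {reported_total})")
print(f"corrupted among compared files: {corrupted_fraction:.1%}")
```

The four categories do sum to 176931, and roughly one in five of the files with a comparable md5sum is corrupted.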

May 5

Decommissioning of SE. The data stored on the SE can now be accessed through gridftp. The TURLs now start with: gsi

April 24

PNPI site is in ToA. I added it into crontab.

April 23

I got a list of possibly corrupted files at tbn18 together with their md5sums. Jan Kubalec is going to compare md5sums values with values stored in the LFC.

Emails from the vobox are still not reaching their recipients, although GGUS ticket 14014 was closed again.

April 20

FTS channels STAR-PNPI and PNPI-STAR tested. OK for small files, but a 1 GB file failed. Transfers from PNPI to NIKHEF failed too. After an increase of the timeout value on the FTS server I was able to copy 1 GB from SARADISK

April 19

A long list of possibly corrupted files at NIKHEF. They may be corrupted due to a bad enclosure. The list contains 176931 files. Three randomly chosen files are a very small sample, but I found 1 corrupted and 2 not corrupted. Sizes were 79 KB and 80 MB (the two non-corrupted files) and 100 MB (the corrupted one). I downloaded them, computed their md5sums and compared them with the values stored in the LFC.
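
The check described above (download, compute md5sum, compare with the catalogue value) can be sketched as follows. It assumes the file is already on local disk and the catalogue checksum is available as a hex string; fetching either from the grid is not shown. The function names are hypothetical, and the demo uses a small temporary file instead of a real data file.

```python
import hashlib
import os
import tempfile


def md5sum(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_corrupted(path, catalogue_md5):
    """True if the local copy does not match the checksum from the catalogue."""
    return md5sum(path) != catalogue_md5.strip().lower()


# Demo on a throwaway file; the md5 of b"hello" is 5d41402abc4b2a76b9719d911017c592.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name
try:
    intact = not is_corrupted(demo_path, "5d41402abc4b2a76b9719d911017c592")
    detected = is_corrupted(demo_path, "0" * 32)
finally:
    os.remove(demo_path)

print("matching checksum accepted:", intact)
print("wrong checksum detected:", detected)
```

Reading in chunks keeps memory use flat, which matters for files of 100 MB and more.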

April 5

Some sites were missing in the FTS configuration file services.xml. They were inserted yesterday. Many other sites have been missing since then.

March 25

Error from March 19 still there. It is due to full disks at SARA.

Errors for transfer to NIKHEF:
Transfer failed. ERROR the server sent an error response: 550 550 rfio write failure: No space left on device.
dpm-qryconf shows some space on all ATLAS pools (minimum 600 GB).

Cleaning of ITEP continues. A provided list of the files (20156 files) present at ITEP on 28.01.2007 did not help: some files were listed twice. 4323 files were not known to the LFC, 16193 had no replica at ITEP, so these cannot be unregistered.

I did a complete "integrity" check. I identified 38590 missing files at ITEP and am now running lcg-uf to delete them from the catalogue. It is very slow; I hope the cleanup will be finished by tomorrow.
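
The core of such an integrity check is a comparison of two lists: what the storage dump contains and what the catalogue claims to be at the site. A minimal sketch with set operations follows. It assumes both lists have already been reduced to comparable path strings; the function name, the result keys and the sample paths are hypothetical.

```python
def compare_listings(storage_paths, catalogue_paths):
    """Compare a storage dump with the catalogue replicas for one site."""
    storage_list = list(storage_paths)
    on_storage = set(storage_list)      # a set also removes duplicate dump lines
    in_catalogue = set(catalogue_paths)
    return {
        # lines that appear more than once in the dump
        "duplicates_in_dump": len(storage_list) - len(on_storage),
        # present on storage but not known to the catalogue
        "unknown_to_catalogue": sorted(on_storage - in_catalogue),
        # registered in the catalogue but missing on storage (candidates to unregister)
        "missing_on_storage": sorted(in_catalogue - on_storage),
    }


# Hypothetical sample data, for illustration only.
dump = ["/data/a", "/data/b", "/data/b", "/data/c"]
catalogue = ["/data/b", "/data/c", "/data/d"]

result = compare_listings(dump, catalogue)
print(result)
```

Only the entries under "missing_on_storage" are candidates for unregistering from the catalogue; files unknown to the catalogue need a separate decision.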

March 21

Added to services.xml. Errors "No site found for host" do not appear anymore.

March 19

This error started to appear at 15:33 in SARA/subscription.log
Pool manager error: Best pool too high : 2.0E8
No transfers to SARADISK since then.

March 15

An LFC error caused by an upgrade was corrected. The last LFC upgrade included a schema change, which was not applied at first because configure_node broke down.

March 14

List of lost ATLAS files provided. There are 1275 lost files.

March 12

Another loss of ATLAS files at SARA was reported.

March 7

FTS channels for PNPI were established, endpoint is: srm://
Channels are:

February 28

A broadcast from Ron: on 1.3. the tape backend will not be available from 10:30 to 11:30 due to maintenance.

A new endpoint for the NDGF T1 was announced: srm:// Check whether a channel to SARA exists.

February 2

The standard non-VOMS certificate used until now was replaced by a new one (from Mario Lassnig).

Here are new entries in the crontab:

35 0,12 * * * [ -e "$HOME/.profile" ] && . $HOME/.profile; . /etc/profile; voms-proxy-init -confile /opt/glite/etc/vomses -valid 96:00 -cert $HOME/x509up_u23311 -key $HOME/x509up_u23311 -out $HOME/x509up_u23311.voms -voms atlas:/atlas/Role=production
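
The schedule field "35 0,12 * * *" runs the renewal at 00:35 and 12:35 every day, while the proxy is requested with -valid 96:00. A short sketch of what that implies for the safety margin; the start time is an arbitrary example and the helper next_runs is hypothetical.

```python
from datetime import datetime, timedelta

RUN_MINUTE = 35                       # minute field of the crontab entry
RUN_HOURS = (0, 12)                   # hour field of the crontab entry
PROXY_LIFETIME = timedelta(hours=96)  # from "-valid 96:00"


def next_runs(start, count):
    """Return the next `count` run times strictly after `start`."""
    runs = []
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while len(runs) < count:
        for hour in RUN_HOURS:
            candidate = day.replace(hour=hour, minute=RUN_MINUTE)
            if candidate > start and len(runs) < count:
                runs.append(candidate)
        day += timedelta(days=1)
    return runs


runs = next_runs(datetime(2007, 2, 2, 10, 0), 3)
interval = runs[1] - runs[0]
# Validity still left on the previous proxy when the next renewal runs.
margin = PROXY_LIFETIME - interval

for run in runs:
    print(run.isoformat(sep=" "))
print("remaining validity at the next renewal:", margin)
```

With a 12-hour renewal interval and a 96-hour lifetime, the previous proxy still has 84 hours left at each renewal, so a few failed cron runs do not immediately interrupt the transfers.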

January 31

DQ2 monitoring moves from the "classical" system to a dashboard system developed by ARDA.

January 26

New fs was added to tbn18, which was previously full:
                              CAPACITY 12.77T FREE 1.24T (  9.7%)

January 24

All lost files were unregistered from the SARA LFC.

January 21

The NDGF-SARA channel is configured and the agent is running and there is also a NIKHEF-STAR and STAR-NIKHEF channel with associated agents.

January 12

Lost files on DPM (human error). We got a list of lost files - 3604. All files were registered in the SARA LFC. Later, entries about these files were removed from DPM db.

January 8

File srm:// cannot be copied. Problem reported.

-- JiriChudoba - 08 Jan 2007

Topic revision: r42 - 2007-12-10 - JiriChudoba