Project Plan Integrated Monitoring

  • Participants: Dirk Duellmann, Witek Pokorski, Dennis Waldron, Theodoros Rekatsinas, Jacek Wojcieszuk

General monitoring categories:

General statistics

  • File request percentages (pie chart) - share of requests for resident files, inter-pool copies and tape recalls
  • Requests timeseries (timeseries) - number of requests as timeseries
  • Requests per pool (histogram) - number of requests per pool
  • Pool transactions (table) - file copies between pools
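
As a rough illustration of the file request percentages plot, the following minimal sketch computes the share of each request category from a list of request states. The state labels are assumptions: only 'TapeRecall' appears in this page's DDL, while 'DiskHit' and 'DiskCopy' are hypothetical names for the other two categories, and `request_percentages` is an illustrative helper, not part of the monitoring schema.

```python
from collections import Counter

# Assumed state labels: only 'TapeRecall' is confirmed by the page's DDL;
# 'DiskHit' and 'DiskCopy' are hypothetical names for the other categories.
STATES = ("DiskHit", "DiskCopy", "TapeRecall")


def request_percentages(states):
    """Return the percentage of requests falling into each category."""
    counts = Counter(states)
    total = sum(counts[s] for s in STATES)
    if total == 0:
        # No requests in the window: report zero for every category.
        return {s: 0.0 for s in STATES}
    return {s: 100.0 * counts[s] / total for s in STATES}


# Made-up sample: 6 resident-file requests, 3 inter-pool copies, 1 tape recall.
sample = ["DiskHit"] * 6 + ["DiskCopy"] * 3 + ["TapeRecall"] * 1
shares = request_percentages(sample)
```

The resulting dictionary can be fed directly to whatever plotting layer draws the pie chart.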

File request statistics

  • Top ten users (histogram) - number of requests per user (top ten)
  • Prestaged / non-prestaged tape recalled files (histogram) - distribution per pool
  • Prestaged / non-prestaged requests per user (histogram) - distribution per user

File migration statistics

  • Files migrated to tape (histogram) - number of files migrated to tape per pool
  • Files migrated timeseries (timeseries) - number of files migrated per unit of time

Latency statistics

  • Request latency distribution (histogram)
  • Migration latency distribution (histogram)

File system statistics

  • File size distribution (histogram)

Tape statistics

  • Top ten tapes (histogram) - the ten most used (requested) tapes
  • Top ten mounters (histogram) - the ten users whose activity prompts the most tape mounts
  • Tape mount efficiency - distribution of the number of files recalled per tape mount

Garbage collection statistics

  • File age distribution of garbage collected files (histogram)
  • Size distribution of garbage collected files (histogram)
  • Files requested after garbage collection (histogram) - number of files requested within a given interval after garbage collection
  • Number of file accesses before garbage collection

Specific monitoring items under discussion

  • List of bad tapes/files
  • Access protocol used, per user

Monitoring Schema Details

Figure: monDBSchema.png (CASTOR monitoring DB schema)

Entities

  • Requests Table (subreqid, timestamp, reqid, nsfileid, filesize, svcclass, username, state, totallatency, filename)
    The table is populated by the "reqs" procedure, which runs every 5 minutes.
    File requests fall into three subcategories: Disk Hits (the file was already in the cache),
    Disk Copies (the file was copied from a server in another svcclass to the user's server) and
    Tape Recalls (the file was recalled from tape). We count the last two subcategories as disk misses.
    Source Tables: castor_dlf.dlf_messages, castor_dlf.dlf_str_param_values

  • DiskHits Table (subreqid, fileage, number of accesses, number of copies)
    This entity is a subclass of the Requests entity and holds complementary information for every disk hit.

  • DiskCopy Table (subreqid, readlatency, original pool, target pool, number of copies in all svcclasses)
    The table is populated by the "diskcopyproc" procedure, which runs every 5 minutes.
    Source Tables: requests, castor_dlf.dlf_messages, castor_dlf.dlf_str_param_values

  • TapeRecall Table (subreqid, readlatency, tapeid, tape mount status)
    The table is populated by the "taperecallproc" procedure, which runs every 5 minutes.
    Source Tables: castor_dlf.dlf_messages, castor_dlf.dlf_num_param_values, castor_dlf.dlf_str_param_values

  • Migration Table (subreqid, timestamp, reqid, nsfileid, filesize, svcclass, username, totallatency, filename)
    The table is populated by the "migs" procedure, which runs every 5 minutes.
    Source Tables: castor_dlf.dlf_messages, castor_dlf.dlf_str_param_values

  • GCFiles Table (timestamp, fileage, filesize, nsfileid)
    The table is populated by the "gcfilesproc" procedure, which runs every 5 minutes. We keep information for every deleted file.
    Source Tables: castor_dlf.dlf_messages, castor_dlf.dlf_num_param_values

  • SvcClass_Map Table (svcclass)
    Contains the names of all svcclasses in the castor2 instance.

  • Xrootd Table (timestamp,message_type,message_string,message_int,servername)

  • ConfigSchema Table (expiry, reqsmaxtime, dhmaxtime, dcmaxtime, trmaxtime, gcmaxtime, totalmaxtime, migsmaxtime)
    This table holds the expiry period (in days) for the data in our database and the maximum timestamp
    recorded for each of the tables above, so that each 5-minute sliding time window resumes from exactly
    the point where the previous run stopped. This technique minimizes the chance of losing dlf messages
    because of job scheduler inaccuracies.
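
The checkpointing idea behind ConfigSchema can be sketched as follows. This is a minimal illustration, not the actual PL/SQL: the `run_window` helper, the message dictionaries and the strict "greater than" comparison are assumptions. Only the column name `reqsmaxtime` and the 5-minute period come from this page.

```python
from datetime import datetime, timedelta


def run_window(messages, last_max, now):
    """Process messages with last_max < timestamp <= now.

    Returns the batch and the new maximum timestamp to store back in the
    configuration row. If the batch is empty the checkpoint is unchanged.
    """
    batch = [m for m in messages if last_max < m["timestamp"] <= now]
    new_max = max((m["timestamp"] for m in batch), default=last_max)
    return batch, new_max


t0 = datetime(2009, 4, 30, 12, 0)
# One made-up message per minute for 12 minutes.
messages = [{"id": i, "timestamp": t0 + timedelta(minutes=i)} for i in range(1, 13)]

config = {"reqsmaxtime": t0}  # stands in for the ConfigSchema row

# First run fires on time, 5 minutes in.
first, config["reqsmaxtime"] = run_window(
    messages, config["reqsmaxtime"], t0 + timedelta(minutes=5))

# Second run fires late (12 minutes in instead of 10). Because it starts
# from the stored maximum rather than from "now minus 5 minutes", the
# messages from minutes 6 and 7 are still picked up.
second, config["reqsmaxtime"] = run_window(
    messages, config["reqsmaxtime"], t0 + timedelta(minutes=12))
```

A window computed as "now minus 5 minutes" would have skipped the messages that arrived while the scheduler was late; resuming from the stored maximum avoids that gap.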

Materialized Views

  • Req_Del (timestamp, dif)
    This view stores the request timestamp and the time interval since deletion (dif, in hours) for files
    that were requested and recalled from tape after being deleted. Only files requested within 24 hours
    of their deletion are included.
    DDL Code:
    CREATE MATERIALIZED VIEW REQ_DEL
    REFRESH FORCE ON COMMIT
    AS
    SELECT a.timestamp,
           ROUND((a.timestamp - b.timestamp) * 24, 5) dif  -- date difference is in days; * 24 gives hours
    FROM requests a, gcfiles b
    WHERE a.nsfileid = b.nsfileid
      AND a.state = 'TapeRecall'
      AND a.timestamp > b.timestamp            -- requested after the deletion
      AND (a.timestamp - b.timestamp) <= 1;    -- within one day of the deletion
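
To make the join logic of the view easier to check against sample data, here is a minimal Python sketch of the same selection. The `req_del` function and the sample rows are illustrative assumptions. The filter conditions mirror the DDL above, with the interval expressed in hours and rounded to 5 decimals.

```python
from datetime import datetime, timedelta


def req_del(requests, gcfiles):
    """Return (request timestamp, hours since deletion) pairs, as in REQ_DEL."""
    rows = []
    for r in requests:
        if r["state"] != "TapeRecall":
            continue  # only tape recalls are of interest
        for g in gcfiles:
            if g["nsfileid"] != r["nsfileid"]:
                continue
            delta = r["timestamp"] - g["timestamp"]
            # Requested strictly after deletion and at most one day later.
            if timedelta(0) < delta <= timedelta(days=1):
                rows.append((r["timestamp"], round(delta.total_seconds() / 3600, 5)))
    return rows


# Made-up sample data.
gcfiles = [
    {"nsfileid": 1, "timestamp": datetime(2009, 4, 30, 10, 0)},
    {"nsfileid": 2, "timestamp": datetime(2009, 4, 28, 10, 0)},
    {"nsfileid": 3, "timestamp": datetime(2009, 4, 30, 15, 0)},
]
requests = [
    # Recalled 3.5 hours after deletion: included.
    {"nsfileid": 1, "state": "TapeRecall", "timestamp": datetime(2009, 4, 30, 13, 30)},
    # Recalled two days after deletion: outside the 24-hour limit.
    {"nsfileid": 2, "state": "TapeRecall", "timestamp": datetime(2009, 4, 30, 10, 0)},
    # Requested before the file was deleted: excluded.
    {"nsfileid": 3, "state": "TapeRecall", "timestamp": datetime(2009, 4, 30, 14, 0)},
    # Not a tape recall ('DiskHit' is a hypothetical state label): excluded.
    {"nsfileid": 1, "state": "DiskHit", "timestamp": datetime(2009, 4, 30, 11, 0)},
]
rows = req_del(requests, gcfiles)
```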

Other Procedures

  • TotalLatProc
    This procedure runs every 5 minutes and locates the "Job started" messages in dlf, extracting
    the total latency from the "totalwaittime" summary message. The "totallatency" field is updated in
    both the Migration and Requests tables.

  • NewPartitions
    This procedure runs daily and creates a new partition for every partitioned table.
    Each run creates the partition for the following day, not for the current one.

  • CleanOldData
    This procedure runs once per day and deletes all data older than the expiry period.
    In practice, the old data is removed by simply dropping the expired partitions.
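
The daily partition rotation performed by NewPartitions and CleanOldData can be sketched as below. This is a simplified model under stated assumptions: one partition per day represented by its date, a strict "older than" cutoff, and hypothetical function names. The real procedures operate on Oracle partitions.

```python
from datetime import date, timedelta


def new_partitions(partitions, today):
    """Add the partition for tomorrow (not today), as NewPartitions is described."""
    partitions.add(today + timedelta(days=1))


def clean_old_data(partitions, today, expiry_days):
    """Drop whole partitions older than the expiry period; return the dropped days."""
    cutoff = today - timedelta(days=expiry_days)
    expired = sorted(d for d in partitions if d < cutoff)
    partitions.difference_update(expired)
    return expired


today = date(2009, 4, 30)
# Made-up starting state: daily partitions from 20 April to 30 April.
partitions = {date(2009, 4, 20) + timedelta(days=i) for i in range(11)}

new_partitions(partitions, today)
dropped = clean_old_data(partitions, today, expiry_days=7)
```

Creating the partition one day ahead means incoming rows always have a partition waiting for them, and dropping a partition removes a full day of data without row-by-row deletes.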

Topic attachments

  • monDBSchema.png (PNG, r1, 205.7 K, 2009-04-30 17:19, TheodoreRekatsinas) - CASTOR monitoring DB schema
  • monSchema.jpg (JPEG, r1, 58.0 K, 2008-11-11 11:19, TheodoreRekatsinas) - Monitoring Database Schema
Topic revision: r12 - 2009-04-30 - TheodoreRekatsinas
 