Quarterly plan for May-July 2011
Job monitoring area
Historical view
ATLAS
- Add a new sorting attribute, namely the new definition of the generic activity (end of June)
- Add # of users metric (end of June)
- Changes to the UI following the suggestions of Torre (end of June)
This development was postponed. Priorities changed because of a serious performance degradation
of the ATLAS Job monitoring Dashboard. The application was tuned and validated for many months on the
integration DB server and was then moved to a production DB server on the 24th of May. After the migration the performance of the
panda collector dropped by a factor of 3-4 compared to the integration DB. The developers and the DBAs put a lot of effort into debugging and understanding
the reason. For the moment the main explanation from the DBAs is that production is a shared service, so the application shares the cache
with other applications. It was also noticed that on the production server the same application does much more reading than it did on integration, but the reason for this behaviour is not clear. Since the investigation took too long, the Dashboard team decided to rewrite the panda collector,
which was not foreseen in the plan for this quarter. This work is in progress.
CMS
- Adapt the new version of the historical view redesigned for ATLAS to CMS (end of August). This includes: 1) changing the smry tables by adding new sorting attributes (data type, CMSSW version), 2) adding the # of users metric, 3) adding site/application failures to the interactive-like plot, 4) redoing the UI
In progress
Data collectors
CMS
- New version of the CMS job monitoring collectors. First prototype (end of July, production in September)
A first prototype that works at the task granularity level has been developed. Tests on the integration database are being performed; a few issues have already been discovered and are currently being solved. The prototype shows a performance improvement thanks to the use of pure bulk SQL queries in the PL/SQL code. Most of this development will also be used at the job granularity level, with some differences due to the different amount of information in the jobs and tasks database tables. The development also showed that, given the current use cases, some changes at the database schema level (such as denormalizing some information) could simplify the development and speed up the queries.
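The difference between row-by-row processing and a bulk (set-based) statement can be sketched as follows. This is only an illustration under stated assumptions: it uses SQLite from Python instead of Oracle PL/SQL, and the table and column names (`job`, `task_summary`, `task_id`, `status`) are hypothetical, not the actual Dashboard schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical job and summary tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job (job_id INTEGER PRIMARY KEY, task_id INTEGER, status TEXT)"
    )
    conn.execute(
        "CREATE TABLE task_summary "
        "(task_id INTEGER PRIMARY KEY, n_jobs INTEGER, n_failed INTEGER)"
    )
    rows = [(1, 10, "done"), (2, 10, "failed"), (3, 11, "done")]
    # executemany sends all rows in one call instead of one INSERT per row.
    conn.executemany("INSERT INTO job VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def update_task_summary_bulk(conn):
    """Recompute the per-task counters with a single set-based statement.

    The aggregation runs inside the database, so no job rows are
    fetched into the client and written back one by one.
    """
    conn.execute("DELETE FROM task_summary")
    conn.execute(
        """
        INSERT INTO task_summary (task_id, n_jobs, n_failed)
        SELECT task_id,
               COUNT(*),
               SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END)
        FROM job
        GROUP BY task_id
        """
    )
    conn.commit()
```

In PL/SQL the comparable constructs would be set-based `INSERT ... SELECT` statements or bulk binds; the sketch only shows the general idea of letting the database process many rows per statement.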
Consistency checks
- Consistency checks between CMS Dashboard and WMAgent (end of June)
The tool for regular consistency checks, which publishes the results on the web, is deployed (when?):
http://dashb-cms-stats.cern.ch/wmagent/WMAgentDashboardDaily.html
For the moment we see good consistency. This was achieved by constantly watching the consistency and implementing changes on both the Dashboard and the WMAgent sides. Some changes on the WMAgent side are still work in progress.
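The kind of comparison such a consistency check performs can be sketched as below. This is a hypothetical illustration, not the deployed tool: the function name, the per-task job counts used as input and the 1% relative tolerance are all assumptions.

```python
def compare_job_counts(dashboard, wmagent, tolerance=0.01):
    """Return the tasks whose job counts disagree between the two sources.

    `dashboard` and `wmagent` map a task name to its number of jobs.
    A task is reported when the relative difference exceeds `tolerance`;
    a task missing from one source counts as 0 jobs there.
    """
    mismatches = {}
    for task in sorted(set(dashboard) | set(wmagent)):
        d = dashboard.get(task, 0)
        w = wmagent.get(task, 0)
        reference = max(d, w)
        if reference == 0:
            continue  # nothing reported by either side
        if abs(d - w) / reference > tolerance:
            mismatches[task] = (d, w)
    return mismatches
```

For example, `compare_job_counts({"a": 100, "b": 50}, {"a": 100, "b": 40, "c": 5})` reports tasks `b` and `c`, while `a` is consistent.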
UI performance
- Move job table into archive. Leave in the job table only data for last 30 days. Benchmark whether this gives performance improvement. (end of June)
This change will first be implemented for ATLAS, since the main performance tuning is currently being done on the ATLAS instance.
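A minimal sketch of the archiving step, assuming a job table with a completion timestamp. It uses SQLite from Python for illustration only; the table names `job` and `job_archive` and the column `finished_at` are hypothetical, and the real implementation on Oracle may well rely on partitioning instead.

```python
import sqlite3
from datetime import datetime, timedelta


def archive_old_jobs(conn, now, keep_days=30):
    """Move jobs older than `keep_days` from `job` to `job_archive`.

    Timestamps are assumed to be stored as ISO-8601 strings with
    second precision, so string comparison matches time order.
    Returns the number of rows moved.
    """
    cutoff = (now - timedelta(days=keep_days)).isoformat(timespec="seconds")
    # The connection context manager commits both statements together,
    # or rolls both back if either one fails.
    with conn:
        conn.execute(
            "INSERT INTO job_archive SELECT * FROM job WHERE finished_at < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM job WHERE finished_at < ?", (cutoff,)
        ).rowcount
    return moved
```

Doing the copy and the delete in one transaction avoids losing or duplicating rows if the step is interrupted. Whether the smaller job table really improves UI performance is exactly what the planned benchmark has to show.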
Task monitoring
ATLAS
- Validate production monitoring by the ATLAS production team (done end of June)
- Implement a long list of feature requests coming from the first round of validation (partially done end of July, more requests came, in progress) (Savannah tickets with detailed descriptions of the feature requests to be added)
The first round of changes was deployed before the ATLAS SW week. More requests have come in and are being implemented.
- Introduce the task concept and properly implement it in the ATLAS analysis task monitoring (end of July). This includes: 1) modifications in the ATLAS job monitoring triggers, 2) changes on the DAO and UI levels
In progress. The collector, triggers and DAOs are modified. A task API for the bookkeeping of the analysis users was developed (additional task, not foreseen in the quarterly plan). The UI changes are a bit delayed due to the vacation period.
- Start work on the functionality which would allow resubmitting and killing jobs from the task monitoring UI. First step: authentication and authorisation (end of August).
hBrowse framework
- Development of new functionalities to meet the requirements of Atlas Task Monitoring and UI for analysis users applications (DONE - Mid of July - all changes have been released)
- New charting methods (highcharts and others) /general need - DONE/
- Complex table headers /atlas request - DONE/
- Added possibility to add small tables along with charts /for Datasets Distribution - DONE/
- Added filters summary (hovering the mouse over the filters frame label displays a summary of the selected filters) /atlas request - DONE/
- Added the possibility to select the number of records the user wants to see on one page /atlas request - DONE/
- etc.
Datasets distribution
- Development of new datasets distribution application (DONE - Mid of June, not deployed yet)
- Application will work similarly to task monitoring and UI for analysis users
- Separate client and server parts
- Client will use hBrowse framework
- Application will supply 4 main views:
- EVNT Replications
- Validation Replication
- Pacballs
- Conditions DB
- Application is now in testing and waiting for BETA deployment
Site monitoring
- CMS topology and downtime collectors (end June). Done Mid-July.
- Color definition (end June). Done End-July
- New plotting library (end July). Postponed until the end of August.
- Database repartition (end July). Done in the testing instance End-June. Deployed for ATLAS, LHCb, ALICE end of July. To be deployed for CMS mid of August.
- Incorporate alarm system (end July). Done End-June. Now we get alarms when the collectors are not running properly and when the data registered in the SSB is old
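The "data is old" condition behind such an alarm can be sketched as a simple freshness check. This is a hypothetical illustration, not the SSB alarm code: the function name, the metric names and the two-hour threshold are assumptions.

```python
from datetime import datetime, timedelta


def find_stale_metrics(last_update, now, max_age=timedelta(hours=2)):
    """Return the names of metrics whose latest data is older than `max_age`.

    `last_update` maps a metric name to the time of its most recent record.
    """
    return sorted(
        name for name, timestamp in last_update.items() if now - timestamp > max_age
    )
```

A monitoring job could run this periodically and raise an alarm for every name returned. A collector that has stopped running shows up the same way, because its metrics stop being refreshed.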
SUM
- Validation of the new version of SUM which uses APIs from the new SAM components (mid of July)
- This item depends on the experiments providing ATP feeder
Transfer monitoring
ATLAS DDM Dashboard
- Migration of all servers to new hardware and SLC5. (mid-June) COMPLETED 10-Jun-2011
- 1.0 Bug fix releases. DEPLOYED 27-Jun-2011
- 2.0 M1 release. (mid-May) DEPLOYED 13-May-2011
- Plots of transfer statistics by source, destination and activity.
- Filtering by activity.
- Filtering of source and destination by tier / cloud / site / token.
- Grouping of source and destination by tier / cloud / site / token.
- 2.0 M2 release. (end-July) DEPLOYED 18-Jul-2011
- QA and tuning.
- Transpose: sources <--> destinations.
- Customizable historical plots.
Global Transfer Monitoring System
- Agree with experiments (ATLAS and CMS) on the set of attributes to be reported from the FTS instances (mid of June) AGREED 16-Jun-2011
- Development of the data transfer consumer (end of August) ONGOING
Google Earth
- New collectors for ATLAS job monitoring info (middle of June)
Deployed end of June
Handling of the Dashboard cluster
- Migration of some of the Dashboard servers which still run on SLC4 hosts to SLC5 (middle of June)
COMPLETED ON THE 13th JUNE. As of 26/7, only 3 (out of 45) machines are still running SLC4. These last hosts will remain until the experiments start using aliases (instead of hardcoding the name of the machine). The last three machines should be migrated before the end of August.
Dashboard documentation
- Updating documentation, namely the DAO part for Oracle and the new visualization (client-side jQuery-based UIs)
Done. End of June
Tasks performed but not foreseen in the quarterly plan
As stated at the beginning of the job monitoring part, we experienced serious problems with the ATLAS Job monitoring after the migration to the production DB instance. To solve the problem, many actions were taken which were not foreseen in the initial plan and which delayed some of the job monitoring developments:
- Restarting the collector with appropriate monitoring on the integration DB instance
- Monitoring of the indexes and removing the ones which were not used
- Rewriting a lot of SQL statements used in the stored procedures
- A lot of modifications implemented in the current Panda collector version
- Finally, the decision to completely redesign the collector was taken. The first part of the redesign (handling tasks) was accomplished by the end of July. The second part should be ready by the middle of August.
Participating in the conferences/meetings/workshops
- Active participation in the ATLAS SW week. Many Dashboard-related and Tier3 monitoring presentations
- Participation in the WLCG workshop. Site monitoring talk.
--
JuliaAndreeva - 01-Jun-2011