Difference: ATLASAnalytics (1 vs. 59)

Revision 59 (2018-06-11) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AtlasDistributedComputing"
<!--  
Line: 47 to 47
 

Data visualization

Changed:
<
<
>
>
 
  • UC ES visualizations - a Google doc with a list of links and explanations of all the Kibana dashboards and Jupyter notebooks at UC.

Data analysis

Revision 58 (2017-10-23) - FedericaLegger

Line: 1 to 1
 
META TOPICPARENT name="AtlasDistributedComputing"
<!--  
Line: 29 to 29
 

Clusters and Resources

  • analytix cluster at CERN, access to hadoop: To get access you must be in the ai-hadoop-users e-group, which is managed by CERN-IT; open a SNOW ticket to the Hadoop Service, providing your lxplus user name.
Changed:
<
<
>
>
 

Data collection

Revision 57 (2017-08-23) - FedericaLegger

Line: 1 to 1
 
META TOPICPARENT name="AtlasDistributedComputing"
<!--  
Line: 26 to 26
 
Added:
>
>

Clusters and Resources

  • analytix cluster at CERN, access to hadoop: To get access you must be in the ai-hadoop-users e-group, which is managed by CERN-IT; open a SNOW ticket to the Hadoop Service, providing your lxplus user name.

 

Data collection

  • Sources

Revision 56 (2017-06-30) - FedericaLegger

Line: 1 to 1
 
META TOPICPARENT name="AtlasDistributedComputing"
<!--  
Line: 19 to 19
 
  • Provide an open platform with documented collections and tools to broaden participation in ADC Analytics.

Organisation

Changed:
<
<
>
>
  • Meeting: ADC Analytics Weekly (Indico), usually Wednesdays 16:00-18:00 CEST (40-R-402) - on a biweekly basis: one week is dedicated to Monitoring, the following week to Analytics
 

Revision 55 (2016-12-05) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AtlasDistributedComputing"
<!--  
Line: 41 to 41
 

Data visualization

Changed:
<
<
>
>
 

Revision 54 (2016-09-15) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AtlasDistributedComputing"
<!--  
Line: 44 to 44
 
Added:
>
>
  • UC ES visualizations - a Google doc with a list of links and explanations of all the Kibana dashboards and Jupyter notebooks at UC.
 

Data analysis

  • Go here to see all notebooks of the ATLASMINER JIRA tickets

Revision 53 (2016-08-26) - MarioLassnig

Line: 1 to 1
 
META TOPICPARENT name="AtlasDistributedComputing"
<!--  
Line: 42 to 42
 

Data visualization

Changed:
<
<
>
>
 

Data analysis

Revision 52 (2016-06-10) - MarioLassnig

Line: 1 to 1
Changed:
<
<
META TOPICPARENT name="AdcMonitoring"
>
>
META TOPICPARENT name="AtlasDistributedComputing"
 
<!--  
-->

Revision 51 (2016-06-10) - JoaquinIgnacioBogadoGarcia

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 70 to 70
 
Added:
>
>
 

Related efforts

  • WLCG Machine Learning Demonstrator Indico

Revision 50 (2016-06-10) - MarioLassnig

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 46 to 46
 

Data analysis

Changed:
<
<
  • Go here to see the notebooks of the ATLASMINER JIRA tickets
>
>
  • Go here to see all notebooks of the ATLASMINER JIRA tickets
 
Added:
>
>

Completed notebooks

  • For a given RSE, show the size vs. age distribution Link
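A sketch of the kind of computation behind such a notebook (the input file, column names and RSE name below are hypothetical illustrations, not the actual notebook's code):

<verbatim>
# Hypothetical sketch: size vs. age distribution of replicas at one RSE,
# starting from a replica dump with columns 'rse', 'bytes', 'created_at'.
import pandas as pd

reps = pd.read_csv("replicas_dump.csv", parse_dates=["created_at"])
reps = reps[reps.rse == "MWT2_DATADISK"]  # hypothetical RSE name

age_days = (pd.Timestamp.now() - reps.created_at).dt.days
bins = [0, 90, 180, 270, 365, 10 * 365]  # age bins in days
by_age = reps.groupby(pd.cut(age_days, bins))["bytes"].sum()
print(by_age / 1e12)  # volume per age bin, in TB
</verbatim>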
 

Dedicated projects

Revision 49 (2016-06-09) - MarioLassnig

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 23 to 23
 
Changed:
<
<
>
>
 

Data collection

Line: 46 to 46
 

Data analysis

Added:
>
>
 
Changed:
<
<
  • GPU-enabled Jupyter server https://shodan.cern.ch:7643/notebooks/ (playground)
  • SWAN Jupyter https://swan002.cern.ch/ (in beta, will become next master)
>
>
 

Dedicated projects

Revision 48 (2016-06-08) - MarioLassnig

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 45 to 45
 
Changed:
<
<

Custom data analysis

>
>

Data analysis

 

Dedicated projects

Revision 47 (2016-06-08) - MarioLassnig

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 41 to 41
 

Data visualization

Changed:
<
<
>
>
 

Revision 46 (2016-06-08) - MarioLassnig

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 9 to 9
 
Changed:
<
<

Introduction

>
>

Overview

  The ATLAS Analytics effort is focused on creating systems which provide ATLAS Distributed Computing (ADC) with new capabilities for understanding distributed systems and overall operational performance. These capabilities include:
Changed:
<
<
  • Correlate information from multiple systems (PanDA, Rucio, FTS, Dashboards, Tier0, PilotFactory)
  • Predictive Analytics: Execute arbitrary data mining or machine learning algorithms over aggregated data
>
>
  • Correlate information from multiple systems (PanDA, Rucio, FTS, Dashboards, Tier0, PilotFactory, ...)
  • Predictive Analytics: Execute arbitrary data mining or machine learning algorithms over raw and aggregated data
 
  • Ability to host new third party analytics services on a scalable compute platform
Changed:
<
<
  • Must be flexible enough to satisfy variety of use cases for different user roles
    • e.g. production and site managers may be interested in analysis of a pattern of job failures while resource managers may be interested in computing model scenarios.
>
>
  • Satisfy a variety of use cases for different user roles for ad-hoc analytics
 
  • Provide an open platform with documented collections and tools to broaden participation in ADC Analytics.
Changed:
<
<

Architecture

>
>

Organisation

 
Changed:
<
<
At right is a high level layout of a platform intended to provide guidance for several functions:
>
>

Data collection

 
Changed:
<
<
  • Acquisition, filtering and upload of data sources into a repository
  • Hadoop cluster for analysis of multiple data sources to create reduced collections for higher level analytics
  • Serve repository collections in multiple formats to external clients
  • Makes collected sources available for export by external users
  • Host analytics services on the platform such as ElasticSearch, Logstash, Kibana, etc.

Whitepaper and Survey

Organizational details

Meetings

civais-v2.png

Hadoop Clusters

  • Two clusters are provided by CERN IT for general purpose use: lxhadoop and analytix.
  • CERN IT recommends using analytix only (!)
  • Hadoop accounts are requested from Rainer dot Toebbicke at cern dot ch
  • There is a dedicated /atlas/analytics home directory.
  • All ATLAS accounts have access via the appropriate zp group.
  • When at CERN (or proxying to CERN) you may monitor your Hadoop jobs here: http://p01001533040197.cern.ch:8088/cluster
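The clusters are driven with the standard Hadoop shell commands; a minimal Python sketch for working with the /atlas/analytics area (assuming a configured Hadoop client, e.g. on lxplus; the file name is a hypothetical example):

<verbatim>
# Minimal sketch: drive 'hdfs dfs' from Python via the shell client.
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' command and return its output."""
    return subprocess.check_output(["hdfs", "dfs"] + list(args))

# List the shared ATLAS analytics area.
print(hdfs("-ls", "/atlas/analytics"))

# Upload a local file ('mydata.csv' is a hypothetical example).
hdfs("-put", "mydata.csv", "/atlas/analytics/mydata.csv")
</verbatim>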

Analytics clusters

CloudLab ElasticSearch cluster

  • Only ElasticSearch.
  • Kibana access is fully open at: http://atlas-kibana.mwt2.org:5601
  • API access is read-only from everywhere and read/write from CERN and the University of Chicago.
  • The only host to be used for indexing is this one: http://cl-analytics.mwt2.org:9200
  • HQ and Marvel are accessible from CERN and the University of Chicago. If you need access, ask Ilija Vukotic for it.
  • Backup of the indexed data is described here.
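A minimal sketch of talking to the cluster with the elasticsearch-py client (the index name and document are hypothetical; per the list above, use cl-analytics.mwt2.org only for indexing, while read-only queries work from everywhere):

<verbatim>
# Minimal sketch using the elasticsearch-py client.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://cl-analytics.mwt2.org:9200"])

# Index one hypothetical document (write access: CERN / UChicago only).
es.index(index="my_test_index", doc_type="doc",
         body={"site": "MWT2", "jobs_failed": 3})

# Read-only query, allowed from everywhere.
res = es.search(index="my_test_index",
                body={"query": {"match": {"site": "MWT2"}}})
print(res["hits"]["total"])
</verbatim>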

ElasticSearch (CERN IT)

ElasticSearch (CERN ATLAS)

  • ATLAS Analytics cluster service card.
  • These machines are used to:
    • sqoop data from Oracle into Hadoop. Hadoop is set up in exactly the same way as on lxhadoop.
    • flume data directly or via AMQ
    • run map/reduce code (Pig, Java, ...), but for longer jobs lxhadoop is preferred.
    • send data to ElasticSearch at Clemson
    • run the CERN ElasticSearch cluster. Please keep in mind that the cluster is rather slow, since storage is not local and the machines do not have enough memory. Access its Kibana through SSO at https://aianalytics01.cern.ch/

Description of data sources and import to the platform

>
>
 
Changed:
<
<
>
>
 
Changed:
<
<
>
>

Data visualization

 
Changed:
<
<

Analyses

>
>

Custom data analysis

 
Added:
>
>
 
Changed:
<
<

Sub-projects

>
>

Dedicated projects

 
Deleted:
<
<
  • Rucio dump mining
 
Added:
>
>
 
Changed:
<
<
>
>
 
Deleted:
<
<

Visualization / Kibana

 

Tutorials and How-To's

Line: 114 to 65
 
Changed:
<
<

Hadoop Documentation

Hadoop components and commands

  • Hadoop is a framework designed to process large data sets (its main components are hdfs and mapreduce)
  • hdfs (Hadoop Distributed File System)
  • mapreduce is a software framework for writing applications that process data on hdfs
  • Hadoop shell commands
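To make the mapreduce model concrete, here is a minimal word-count sketch for Hadoop Streaming (one Python file acting as both mapper and reducer; the input and output paths in the comment are hypothetical):

<verbatim>
#!/usr/bin/env python
# wordcount.py - mapper and reducer in one file for Hadoop Streaming.
# Hypothetical invocation:
#   hadoop jar hadoop-streaming.jar \
#     -input /atlas/analytics/some_logs -output /atlas/analytics/wc_out \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#     -file wordcount.py
import sys

def mapper():
    # Emit one "word<TAB>1" line per word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

def reducer():
    # Streaming sorts by key, so counts for a word arrive contiguously.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
</verbatim>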

Tools to transfer data in and out of Hadoop

  • HBase is a key/value store optimized for random read/write and scalability
  • Sqoop is a tool for transferring bulk data between Hadoop and other datastores such as relational DBs
  • Flume is a service to collect and transfer large amounts of data into Hadoop
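For illustration, a Sqoop bulk import of one Oracle table into HDFS could look like the following sketch (every connection detail, credential and table name below is a placeholder, not a real endpoint):

<verbatim>
# Hypothetical Sqoop import driven from Python.
import subprocess

subprocess.check_call([
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@//dbhost.example.cern.ch:1521/SERVICE",
    "--username", "reader",
    "--password-file", "/user/me/.password",   # kept in HDFS, mode 400
    "--table", "SOME_TABLE",
    "--target-dir", "/atlas/analytics/some_table",
    "-m", "4",                                  # parallel map tasks
])
</verbatim>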

Languages

  • Spark is a fast and general engine for large-scale data processing which can run on top of the Hadoop Distributed File System (amongst others).
  • Pig is a higher-level language that produces mapreduce programs to be used on a Hadoop cluster.
  • R, what is R? A higher-level language designed explicitly for statistical analysis and visualization.
  • python for data analysis: Python also has libraries for statistical analysis, although not yet as extensive as the R ones.
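As a taste of Spark from Python, a minimal PySpark sketch that counts Rucio trace lines per day (the trace location follows the paths mentioned on this page, but the record layout assumed here is hypothetical):

<verbatim>
# Minimal PySpark sketch (Spark 1.x style API, matching the era of this page).
from pyspark import SparkContext

sc = SparkContext(appName="trace-count")
traces = sc.textFile("/user/rucio01/traces/")

# Count lines per day, assuming an ISO timestamp in the first field.
per_day = (traces.map(lambda line: line.split()[0][:10])
                 .map(lambda day: (day, 1))
                 .reduceByKey(lambda a, b: a + b))
for day, n in per_day.collect():
    print(day, n)
sc.stop()
</verbatim>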

Other data analytics tools

  • ELK is an analytics framework composed of a search engine (ElasticSearch), a log collector (Logstash) and a visualization engine (Kibana).

Plots for Scrutiny reports

The plots for the scrutiny reports show the usage of all ATLAS data on the grid by volume. To create these plots, a couple of bash scripts and Pig jobs take various inputs from Hadoop, such as the Rucio traces and database dumps, and compute the volume of accessed data over different time intervals. The code can be found on GitHub.

The jobs run every Monday morning and are limited to official ATLAS data (data* and mc*) on DATADISK.

Inputs:

  • Rucio traces (/user/rucio01/traces/)
  • Database dumps (/user/rucio01/dumps/"date"/)

Outputs:

  • Five CSV files: four for the accesses in the last 3, 6, 9 and 12 months, and one for all time. Each CSV file has 2 columns: the first is the number of accesses, the second the volume of data. The first two rows are special cases: the first row is always '-1' and counts the volume of data older than the given number of months, while the second is always '0' and counts everything younger than the given number of months.
  • List of datasets with zero accesses
>
>

Related efforts

  • WLCG Machine Learning Demonstrator Indico
  • IT Analytics Working Group Indico, TWiki
  • WLCG Analytics Platform TWiki
 
Changed:
<
<
Output location:
>
>

Attic

 
Changed:
<
<

Related efforts

>
>
Historical information goes into the ATLASAnalyticsAttic.
 

Line: 167 to 80
 
<!-- For significant updates to the topic, consider adding your 'signature' (beneath this editing box) -->
Major updates:
-- IlijaVukotic - 2014-11-11
Changed:
<
<
-- MarioLassnig - 2016-04-11
>
>
-- MarioLassnig - 2016-06-08
 
<!-- Person responsible for the page: 
Either leave as is - the creator's name will be inserted; 
Or replace the complete REVINFO tag (including percentages symbols) with a name in the form TwikiUsersName -->
Changed:
<
<
Responsible: %REVINFO{"$wikiusername" rev="1.1"}%
>
>
Responsible: IlijaVukotic, MarioLassnig
 
<!-- Once this page has been reviewed, please add the name and the date e.g. StephenHaywood - 31 Oct 2006 -->
Last reviewed by: Never reviewed
Deleted:
<
<
META FILEATTACHMENT attachment="civais-v2.png" attr="" comment="" date="1418142732" name="civais-v2.png" path="civais-v2.png" size="105208" user="rgardner" version="1"
META FILEATTACHMENT attachment="civais-panda.png" attr="" comment="" date="1418147536" name="civais-panda.png" path="civais-panda.png" size="78831" user="rgardner" version="1"

Revision 45 (2016-06-03) - ThomasBeermann

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 138 to 138
 

Plots for Scrutiny reports

Changed:
<
<
The scrutiny reports need a plot that shows the usage of all Atlas data on the grid. For that a couple of scripts and Pig jobs have been written that take various inputs from the traces and dumps on Hadoop to compute the volume of accessed data over different time intervals. The code can be found on Github
>
>
The plots for the scrutiny reports show the usage of all ATLAS data on the grid by volume. To create these plots, a couple of bash scripts and Pig jobs take various inputs from Hadoop, such as the Rucio traces and database dumps, and compute the volume of accessed data over different time intervals. The code can be found on GitHub.
 
Changed:
<
<
The jobs run every Monday morning.
>
>
The jobs run every Monday morning and are limited to official ATLAS data (data* and mc*) on DATADISK.
  Inputs:
Changed:
<
<
  • traces
  • db dumps
>
>
  • Rucio traces (/user/rucio01/traces/)
  • Database dumps (/user/rucio01/dumps/"date"/)
  Outputs:
Changed:
<
<
  • Five csv files, Four of them for the accesses in the last 3, 6, 9, 12 months and one csv file for all time. Each csv file as 2 columns. The first is the number of accesses and the second is the volume of data. There is a special case for the first and second row. The first row is always '-1' and counts the volume of data that is older than the number of months and the second is always '0' and it counts everything that is younger then months.
>
>
  • Five CSV files: four for the accesses in the last 3, 6, 9 and 12 months, and one for all time. Each CSV file has 2 columns: the first is the number of accesses, the second the volume of data. The first two rows are special cases: the first row is always '-1' and counts the volume of data older than the given number of months, while the second is always '0' and counts everything younger than the given number of months.
 
  • List of datasets with zero accesses
Changed:
<
<
Scrutiny plots Zero access datasets
>
>
Output location:
 

Related efforts

Revision 44 (2016-06-03) - ThomasBeermann

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 136 to 136
 Other data analytics tools
  • ELK is an analytics framework composed by a search engine (ElasticSearch), a log collector (Logstash) and a visualization engine (Kibana).
Added:
>
>

Plots for Scrutiny reports

The scrutiny reports need a plot that shows the usage of all Atlas data on the grid. For that a couple of scripts and Pig jobs have been written that take various inputs from the traces and dumps on Hadoop to compute the volume of accessed data over different time intervals. The code can be found on Github

The jobs run every Monday morning.

Inputs:

  • traces
  • db dumps

Outputs:

  • Five csv files, Four of them for the accesses in the last 3, 6, 9, 12 months and one csv file for all time. Each csv file as 2 columns. The first is the number of accesses and the second is the volume of data. There is a special case for the first and second row. The first row is always '-1' and counts the volume of data that is older than the number of months and the second is always '0' and it counts everything that is younger then months.
  • List of datasets with zero accesses

Scrutiny plots Zero access datasets
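A sketch of reading one of these CSV files with pandas, including the two special rows described above (the file name is a hypothetical example):

<verbatim>
# Hypothetical sketch: inspect one scrutiny CSV (e.g. the 3-month window).
import pandas as pd

df = pd.read_csv("accesses_3m.csv", header=None,
                 names=["n_accesses", "volume"])

# Special rows: -1 = volume older than the window, 0 = everything younger.
older = df.loc[df.n_accesses == -1, "volume"].sum()
younger = df.loc[df.n_accesses == 0, "volume"].sum()

# Volume accessed at least once within the window.
accessed = df.loc[df.n_accesses > 0, "volume"].sum()
print(older, younger, accessed)
</verbatim>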

 

Related efforts

Revision 43 (2016-04-11) - MarioLassnig

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 11 to 11
 

Introduction

Changed:
<
<
The ATLAS Analytics effort is focused on creating systems which provide ATLAS ADC with new capabilities for understanding distributed systems and overall operational performance. These capabilities include:
  • Correlate information from multiple systems (Panda, FTS, dashboards)
  • Execute arbitrary data mining or machine learning algorithms over aggregated data
>
>
The ATLAS Analytics effort is focused on creating systems which provide ATLAS Distributed Computing (ADC) with new capabilities for understanding distributed systems and overall operational performance. These capabilities include:
  • Correlate information from multiple systems (PanDA, Rucio, FTS, Dashboards, Tier0, PilotFactory)
  • Predictive Analytics: Execute arbitrary data mining or machine learning algorithms over aggregated data
 
  • Ability to host new third party analytics services on a scalable compute platform
  • Must be flexible enough to satisfy variety of use cases for different user roles
Changed:
<
<
    • E.g. production and site managers may be interested in analysis of a pattern of job failures while resource managers may be interested in computing model scenarios.
  • Provide an open platform with documented collections and tools to broaden participation in ADC analytics.
>
>
    • e.g. production and site managers may be interested in analysis of a pattern of job failures while resource managers may be interested in computing model scenarios.
  • Provide an open platform with documented collections and tools to broaden participation in ADC Analytics.
 

Architecture

Line: 32 to 33
 
Added:
>
>
 

Organizational details

Changed:
<
<
>
>
 

Meetings

Line: 44 to 48
  civais-v2.png
Changed:
<
<

Hadoop Cluster

>
>

Hadoop Clusters

 
Changed:
<
<
  • One needs an account on lxhadoop.cern.ch. One logs into it through lxplus.
  • Request the account from Rainer dot Toebbicke at cern dot ch.
  • All the data should go into /atlas/analytics All ATLAS people have access.
>
>
  • Two clusters are provided by CERN IT for general purpose use: lxhadoop and analytix.
  • CERN IT recommends using analytix only (!)
  • Hadoop accounts are requested from Rainer dot Toebbicke at cern dot ch
  • There is a dedicated /atlas/analytics home directory.
  • All ATLAS accounts have access via the appropriate zp group.
 

Analytics clusters

Deleted:
<
<

CERN

  • ATLAS Analytics cluster service card.
  • these machines are used to:
    • sqoop data from Oracle into hadoop. Hadoop is set up in exactly the same way as on lxhadoop.
    • flume data directly or via AMQ
    • can be used to run map/reduce codes (pig, java,...), but for longer jobs lxhadoop is the preferred.
    • send data to ElasticSearch at Clemson
    • run CERN ElasticSearch cluster. Please keep in mind that the cluster is rather slow due to storage not being local and machines having not enough memory. Access its Kibana through SSO at https://aianalytics01.cern.ch/
 
Changed:
<
<

CloudLab ElasticSearch cluster

>
>

CloudLab ElasticSearch cluster

 
  • Only ElasticSearch.
  • Kibana access is fully open at : http://atlas-kibana.mwt2.org:5601
  • API access will be read only from everywhere and read/write from CERN and University of Chicago.
Line: 69 to 67
 
  • HQ, Marvel accessible from CERN and University of Chicago. If you need access ask Ilija Vukotic for it.
  • Backup of the indexed data is described here.
Added:
>
>

ElasticSearch (CERN IT)

ElasticSearch (CERN ATLAS)

  • ATLAS Analytics cluster service card.
  • these machines are used to:
    • sqoop data from Oracle into hadoop. Hadoop is set up in exactly the same way as on lxhadoop.
    • flume data directly or via AMQ
    • can be used to run map/reduce codes (pig, java,...), but for longer jobs lxhadoop is the preferred.
    • send data to ElasticSearch at Clemson
    • run CERN ElasticSearch cluster. Please keep in mind that the cluster is rather slow due to storage not being local and machines having not enough memory. Access its Kibana through SSO at https://aianalytics01.cern.ch/
 

Description of data sources and import to the platform

Line: 84 to 98
 

Sub-projects

Added:
>
>
 
Changed:
<
<

Visualization

>
>

Visualization / Kibana

 
Changed:
<
<

Tutorials and how-to's

>
>

Tutorials and How-To's

 
Changed:
<
<

Hadoop tools Docs

>
>

Hadoop Documentation

  Hadoop components and commands
  • Hadoop framework designed to process large data sets (main components are hdfs and mapreduce)
Line: 107 to 122
 
Changed:
<
<
Tools to transfer data in and out of hadoop
  • Hadoop database optimized for random read/write and scalability
  • sqoop for transfering bulk of data between hadoop and other datastores such as relational DB
  • flume service to collect and transfer large amount of data in hadoop
>
>
Tools to transfer data in and out of Hadoop
  • HBase is a key/value store optimized for random read/write and scalability
  • Sqoop for transferring bulk data between Hadoop and other datastores such as relational DBs
  • Flume is a service to collect and transfer large amount of data in hadoop
  Languages
Added:
>
>
  • Spark is a fast and general engine for large-scale data processing which can run on top of the Hadoop Distributed File System (amongst others).
 
  • Pig is a higher level language that produces mapreduce programs to be used on an hadoop cluster.
  • R, what is R? higher level language designed explicitely for statistical analysis and visualization.
  • python for data analysis python has also libraries for statistical analysis although not yet as extensive as the R ones.

Other data analytics tools

Changed:
<
<
  • ELK an analytic framework composed by a search engine (Elasticsearch), a log collector (Logstash) and a visualization engine (Kibana).
>
>
  • ELK is an analytics framework composed of a search engine (ElasticSearch), a log collector (Logstash) and a visualization engine (Kibana).
 

Related efforts

Line: 132 to 148
 
<!-- For significant updates to the topic, consider adding your 'signature' (beneath this editing box) -->
Major updates:
Changed:
<
<
-- IlijaVukotic - 2014-11-11
>
>
-- IlijaVukotic - 2014-11-11
-- MarioLassnig - 2016-04-11
 
<!-- Person responsible for the page: 
Either leave as is - the creator's name will be inserted; 

Revision 41 (2016-02-19) - FedericaLegger

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 88 to 88
 
Changed:
<
<
>
>
 

Visualization

Revision 40 (2016-02-16) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 63 to 63
 

CloudLab ElasticSearch cluster

  • Only ElasticSearch.
Changed:
<
<
>
>
 
  • API access will be read only from everywhere and read/write from CERN and University of Chicago.
Added:
>
>
 
  • HQ, Marvel accessible from CERN and University of Chicago. If you need access ask Ilija Vukotic for it.
  • Backup of the indexed data is described here.
Line: 87 to 88
 
Changed:
<
<
>
>
 

Visualization

Changed:
<
<
>
>
 

Tutorials and how-to's

Revision 39 (2016-02-11) - FedericaLegger

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 87 to 87
 
Changed:
<
<
>
>
 

Visualization

Revision 38 (2015-12-16) - MariaGrigorieva

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 18 to 19
 
    • E.g. production and site managers may be interested in analysis of a pattern of job failures while resource managers may be interested in computing model scenarios.
  • Provide an open platform with documented collections and tools to broaden participation in ADC analytics.
Deleted:
<
<
 

Architecture

At right is a high level layout of a platform intended to provide guidance for several functions:

Line: 29 to 27
 
  • Hadoop cluster for analysis of multiple data sources to create reduced collections for higher level analytics
  • Serve repository collections in multiple formats to external clients
  • Makes collected sources available for export by external users
Changed:
<
<
  • Host analytics services on the platform such as ElasticSearch, Logstash, Kibana, etc.
>
>
  • Host analytics services on the platform such as ElasticSearch, Logstash, Kibana, etc.

 

Whitepaper and Survey

Changed:
<
<
>
>
 

Organizational details

Changed:
<
<
>
>
 

Meetings

Changed:
<
<




civais-v2.png

>
>
 
Added:
>
>
civais-v2.png
 

Hadoop Cluster

Line: 102 to 86
 
Added:
>
>
 

Visualization

Line: 134 to 118
 Other data analytics tools
  • ELK an analytic framework composed by a search engine (Elasticsearch), a log collector (Logstash) and a visualization engine (Kibana).
Deleted:
<
<
 

Related efforts

Changed:
<
<
>
>
 

Revision 37 (2015-12-15) - MaksimGubin

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 91 to 91
 
Added:
>
>
 

Analyses

Revision 36 (2015-12-01) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 82 to 82
 
  • Kibana access is fully open: http://cl-analytics.mwt2.org:5601
  • API access will be read only from everywhere and read/write from CERN and University of Chicago.
  • HQ, Marvel accessible from CERN and University of Chicago. If you need access ask Ilija Vukotic for it.
Added:
>
>
  • Backup of the indexed data is described here.
 

Description of data sources and import to the platform

Revision 35 (2015-11-19) - FedericaLegger

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 99 to 99
 
Added:
>
>
 

Visualization

Revision 34 (2015-11-18) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 107 to 107
 
Changed:
<
<
>
>
 

Hadoop tools Docs

Revision 33 (2015-11-18) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 107 to 107
 
Changed:
<
<
>
>
 

Hadoop tools Docs

Revision 32 (2015-11-18) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 107 to 107
 
Changed:
<
<
  • Recording of the Vidyo tutorial part 1.
>
>
 

Hadoop tools Docs

Revision 31 (2015-11-16) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 98 to 98
 

Sub-projects

Changed:
<
<
>
>
 

Visualization

Revision 30 (2015-11-11) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 101 to 101
 

Visualization

Changed:
<
<
>
>
 

Tutorials and how-to's

Revision 29 (2015-11-11) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 97 to 97
 

Sub-projects

Added:
>
>
 

Visualization

Deleted:
<
<
 

Tutorials and how-to's

Added:
>
>
 

Revision 28 (2015-11-10) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 105 to 105
 

Tutorials and how-to's

Added:
>
>
  • Recording of the Vidyo tutorial part 1.
 

Hadoop tools Docs

Revision 27 (2015-11-04) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 87 to 87
 
Added:
>
>
 

Revision 26 (2015-11-03) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  

Revision 25 (2015-11-02) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 59 to 59
 
Changed:
<
<

Hadoop Cluster access

>
>

Hadoop Cluster

 
Changed:
<
<
  • One needs an account on lxhadoop.cern.ch
>
>
  • One needs an account on lxhadoop.cern.ch. One logs into it through lxplus.
 
  • Request the account from Rainer dot Toebbicke at cern dot ch.
Changed:
<
<
  • /atlas area now exists. All ATLAS people have access.
  • Load data using hdfs-put.
>
>
 

Analytics clusters

CERN

Changed:
<
<
>
>
  • ATLAS Analytics cluster service card.
  • these machines are used to:
    • sqoop data from Oracle into hadoop. Hadoop is set up in exactly the same way as on lxhadoop.
    • flume data directly or via AMQ
    • can be used to run map/reduce codes (pig, java,...), but for longer jobs lxhadoop is the preferred.
    • send data to ElasticSearch at Clemson
    • run CERN ElasticSearch cluster. Please keep in mind that the cluster is rather slow due to storage not being local and machines having not enough memory. Access its Kibana through SSO at https://aianalytics01.cern.ch/
 

CloudLab ElasticSearch cluster

Changed:
<
<
  • Only ElasticSearch.
  • Kibana access is fully open: cl-analytics.mwt2.org:5601
>
>
 
  • API access will be read only from everywhere and read/write from CERN and University of Chicago.
  • HQ, Marvel accessible from CERN and University of Chicago. If you need access ask Ilija Vukotic for it.

Line: 81 to 89
 
Deleted:
<
<
 

Analyses

Line: 96 to 103
 

Tutorials and how-to's

Added:
>
>
 

Hadoop tools Docs

Revision 24 (2015-11-02) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 59 to 59
 
Deleted:
<
<

ElasticSearch platform related

 

Hadoop Cluster access

  • One needs an account on lxhadoop.cern.ch
  • Request the account from Rainer dot Toebbicke at cern dot ch.
Changed:
<
<
  • /atlas area now exists. Send email to Ilija for access.
>
>
  • /atlas area now exists. All ATLAS people have access.
 
  • Load data using hdfs-put.
Added:
>
>

Analytics clusters

CERN

 
Added:
>
>

CloudLab ElasticSearch cluster

  • Only ElasticSearch.
  • Kibana access is fully open: cl-analytics.mwt2.org:5601
  • API access will be read only from everywhere and read/write from CERN and University of Chicago.
  • HQ, Marvel accessible from CERN and University of Chicago. If you need access ask Ilija Vukotic for it.
 

Description of data sources and import to the platform

Line: 82 to 87
 
Added:
>
>

Sub-projects

 

Visualization

Revision 23 (2015-11-02) - JoaquinIgnacioBogadoGarcia

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 76 to 76
 
Added:
>
>
 

Analyses

Revision 22 (2015-10-14) - RobertGardner

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 59 to 59
 
Added:
>
>

ElasticSearch platform related

 

Hadoop Cluster access

Revision 21 (2015-06-04) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 66 to 66
 
  • Request the account from Rainer dot Toebbicke at cern dot ch.
  • /atlas area now exists. Send email to Ilija for access.
  • Load data using hdfs-put.
Added:
>
>
 

Description of data sources and import to the platform

Line: 80 to 81
 

Visualization

Added:
>
>
 

Tutorials and how-to's

Revision 20 (2015-06-01) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 74 to 74
 
Changed:
<
<

Analysis codes

>
>

Analyses

 

Visualization

Revision 19 (2015-02-12) - PeterLove

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 69 to 69
 

Description of data sources and import to the platform

Added:
>
>
 

Revision 18 (2015-02-11) - RobertGardner

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 16 to 16
 
  • Ability to host new third party analytics services on a scalable compute platform
  • Must be flexible enough to satisfy variety of use cases for different user roles
    • E.g. production and site managers may be interested in analysis of a pattern of job failures while resource managers may be interested in computing model scenarios.
Added:
>
>
  • Provide an open platform with documented collections and tools to broaden participation in ADC analytics.
 
Added:
>
>

Architecture

 At right is a high level layout of a platform intended to provide guidance for several functions:
Added:
>
>
 
  • Acquisition, filtering and upload of data sources into a repository
  • Hadoop cluster for analysis of multiple data sources to create reduced collections for higher level analytics
  • Serve repository collections in multiple formats to external clients
Line: 25 to 30
 
  • Serve repository collections in multiple formats to external clients
  • Makes collected sources available for export by external users
  • Host analytics services on the platform such as ElasticSearch, Logstash, Kibana, etc.
Deleted:
<
<
civais-v2.png
 
Changed:
<
<

Organizational details

>
>

Whitepaper and Survey

Organizational details

 
Changed:
<
<

Meetings

>
>

Meetings

 
Added:
>
>



civais-v2.png

 

Hadoop Cluster access

  • One needs an account on lxhadoop.cern.ch
Line: 83 to 102
 Other data analytics tools
  • ELK an analytic framework composed by a search engine (Elasticsearch), a log collector (Logstash) and a visualization engine (Kibana).
Deleted:
<
<

Whitepaper and Survey

 

Related efforts

Revision 17 (2015-02-09) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 34 to 34
 

Organizational details

Changed:
<
<
>
>
 

Meetings

Revision 16 (2015-02-05) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 34 to 34
 

Organizational details

Changed:
<
<
>
>
 

Meetings

Revision 15 (2015-02-04) - RobertGardner

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 56 to 56
 

Analysis codes

Added:
>
>

Visualization

 

Tutorials and how-to's

Revision 14 (2015-01-23) - RobertGardner

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
<!--  
Line: 37 to 37
 

Meetings

Added:
>
>
 

Revision 13 (2015-01-21) - RobertGardner

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"
Added:
>
>
<!--  
-->
 

ATLAS Analytics

<!-- this line is optional -->

Revision 12 (2014-12-19) - AlessandraForti

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"

ATLAS Analytics

<!-- this line is optional -->
Line: 28 to 28
 

Organizational details

Changed:
<
<
  • egroup: atlas-adc-data-analytics
>
>
 
Line: 56 to 56
 

Hadoop tools Docs

Changed:
<
<
Tools to interact with hadoop
>
>
Hadoop components and commands
  • Hadoop framework designed to process large data sets (main components are hdfs and mapreduce)
 
  • hdfs (Hadoop Distributed File System)
Changed:
<
<
>
>

Tools to transfer data in and out of hadoop

 
  • sqoop for transfering bulk of data between hadoop and other datastores such as relational DB
  • flume service to collect and transfer large amount of data in hadoop
Line: 69 to 73
 
  • python for data analysis python has also libraries for statistical analysis although not yet as extensive as the R ones.

Other data analytics tools

Changed:
<
<
  • ELK (Elasticsearch,Logstash,Kibana)
>
>
  • ELK an analytic framework composed by a search engine (Elasticsearch), a log collector (Logstash) and a visualization engine (Kibana).
 

Whitepaper and Survey

Line: 82 to 86
 
Deleted:
<
<
 

Revision 11 (2014-12-19) - AlessandraForti

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"

ATLAS Analytics

<!-- this line is optional -->
Line: 40 to 41
 
  • One needs an account on lxhadoop.cern.ch
  • Request the account from Rainer dot Toebbicke at cern dot ch.
  • /atlas area now exists. Send email to Ilija for access.
Changed:
<
<
  • Load data using hdfs-put other commands to interact with hadoop canbe found in the hadoop documentation.

>
>
  • Load data using hdfs-put.
 

Description of data sources and import to the platform

Line: 54 to 54
 

Tutorials and how-to's

Added:
>
>

Hadoop tools Docs

Tools to interact with hadoop

Languages

  • Pig is a higher level language that produces mapreduce programs to be used on an hadoop cluster.
  • R, what is R? higher level language designed explicitely for statistical analysis and visualization.
  • python for data analysis python has also libraries for statistical analysis although not yet as extensive as the R ones.

Other data analytics tools

  • ELK (Elasticsearch,Logstash,Kibana)
 

Whitepaper and Survey

Line: 63 to 79
 

Related efforts

Changed:
<
<
>
>
 

Revision 10 (2014-12-18) - AlessandraForti

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"

ATLAS Analytics

<!-- this line is optional -->
Line: 40 to 40
 
  • One needs an account on lxhadoop.cern.ch
  • Request the account from Rainer dot Toebbicke at cern dot ch.
  • /atlas area now exists. Send email to Ilija for access.
Changed:
<
<
  • Load data using hdfs-put.
>
>
  • Load data using hdfs-put; other commands to interact with Hadoop can be found in the hadoop documentation.
 

Description of data sources and import to the platform

Revision 9 (2014-12-17) - PeterLove

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"

ATLAS Analytics

<!-- this line is optional -->
Line: 47 to 47
 
Added:
>
>
 

Analysis codes

Revision 8 (2014-12-17) - RobertGardner

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"

ATLAS Analytics

<!-- this line is optional -->
Line: 37 to 37
 

Hadoop Cluster access

Added:
>
>
  • One needs an account on lxhadoop.cern.ch
  • Request the account from Rainer dot Toebbicke at cern dot ch.
  • /atlas area now exists. Send email to Ilija for access.
  • Load data using hdfs-put.

 

Description of data sources and import to the platform

Revision 7 (2014-12-09) - RobertGardner

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"

ATLAS Analytics

<!-- this line is optional -->
Line: 6 to 6
 

Introduction

Added:
>
>
The ATLAS Analytics effort is focused on creating systems which provide ATLAS ADC with new capabilities for understanding distributed systems and overall operational performance. These capabilities include:
  • Correlate information from multiple systems (Panda, FTS, dashboards)
  • Execute arbitrary data mining or machine learning algorithms over aggregated data
  • Ability to host new third party analytics services on a scalable compute platform
  • Must be flexible enough to satisfy variety of use cases for different user roles
    • E.g. production and site managers may be interested in analysis of a pattern of job failures while resource managers may be interested in computing model scenarios.

At right is a high level layout of a platform intended to provide guidance for several functions:
  • Acquisition, filtering and upload of data sources into a repository
  • Hadoop cluster for analysis of multiple data sources to create reduced collections for higher level analytics
  • Serve repository collections in multiple formats to external clients
  • Makes collected sources available for export by external users
  • Host analytics services on the platform such as ElasticSearch, Logstash, Kibana, etc.
civais-v2.png
 

Organizational details

Meetings

Changed:
<
<
>
>
 

Hadoop Cluster access

Changed:
<
<

Data description

>
>

Description of data sources and import to the platform

 

Analysis codes

Changed:
<
<

Tutorial

>
>

Tutorials and how-to's

Whitepaper and Survey

 

Related efforts

Changed:
<
<
>
>
 


Line: 45 to 73
 Last reviewed by: Never reviewed

\ No newline at end of file

Added:
>
>
META FILEATTACHMENT attachment="civais-v2.png" attr="" comment="" date="1418142732" name="civais-v2.png" path="civais-v2.png" size="105208" user="rgardner" version="1"
META FILEATTACHMENT attachment="civais-panda.png" attr="" comment="" date="1418147536" name="civais-panda.png" path="civais-panda.png" size="78831" user="rgardner" version="1"

Revision 6 (2014-11-26) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"

ATLAS Analytics

<!-- this line is optional -->
Line: 23 to 23
 

Analysis codes

Added:
>
>

Tutorial

 

Related efforts

Revision 5 (2014-11-25) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"

ATLAS Analytics

<!-- this line is optional -->
Line: 12 to 12
 

Meetings

Changed:
<
<
>
>
 

Hadoop Cluster access

Revision 4 (2014-11-19) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"

ATLAS Analytics

<!-- this line is optional -->
Line: 17 to 17
 

Hadoop Cluster access

Data description

Added:
>
>
 

Analysis codes

Revision 3 (2014-11-18) - RobertGardner

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"

ATLAS Analytics

<!-- this line is optional -->
Changed:
<
<

Introduction

>
>

Introduction

 
Changed:
<
<

Organizational details

>
>

Organizational details

 
Line: 14 to 14
 

Meetings

Changed:
<
<

Hadoop Cluster access

>
>

Hadoop Cluster access

 
Changed:
<
<

Data description

>
>

Data description

 
Changed:
<
<

Analysis codes

>
>

Analysis codes

 
Changed:
<
<

Related efforts

>
>

Related efforts

 
Added:
>
>


 
<!-- For significant updates to the topic, consider adding your 'signature' (beneath this editing box) -->
Major updates:

Revision 2 (2014-11-12) - IlijaVukotic

Line: 1 to 1
 
META TOPICPARENT name="AdcMonitoring"

ATLAS Analytics

<!-- this line is optional -->
Line: 21 to 21
 

Analysis codes

Related efforts

Changed:
<
<
>
>
 
<!-- For significant updates to the topic, consider adding your 'signature' (beneath this editing box) -->
Major updates:

Revision 1 (2014-11-11) - IlijaVukotic

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="AdcMonitoring"

ATLAS Analytics

<!-- this line is optional -->

Introduction

Organizational details

Meetings

Hadoop Cluster access

Data description

Analysis codes

Related efforts


<!-- For significant updates to the topic, consider adding your 'signature' (beneath this editing box) -->
Major updates:
-- IlijaVukotic - 2014-11-11

<!-- Person responsible for the page: 
Either leave as is - the creator's name will be inserted; 
Or replace the complete REVINFO tag (including percentages symbols) with a name in the form TwikiUsersName -->
Responsible: IlijaVukotic
<!-- Once this page has been reviewed, please add the name and the date e.g. StephenHaywood - 31 Oct 2006 -->
Last reviewed by: Never reviewed
 