HBaseForPanda
General information
- Hadoop/HBase namenode -- lxfssm4401
- Our development node -- voatlas132. It is accessible from lxvoadm and has outbound connectivity.
voatlas132
- The staging area is /data (402 GB).
- voatlas132 has an old Java (1.4), too outdated to compile and run HBase/Hadoop clients. Java 1.7.0 was unpacked into /usr/share/java-1.7.0; only this Java should be used for compiling and running:
export PATH=/usr/share/java-1.7.0/bin:$PATH
- The current working area is a directory under Karpenko's home. Relocating it elsewhere or (better) setting up a code repository should be considered.
- To compile and run an HBase client, the relevant jars and configuration files from the namenode must be present on the development node. All the Hadoop and HBase jars from lxfssm4401 were copied to the working area (the ~/work/hbase_client_test/hadoop_hbase_jars directory). The full configuration is not needed: to connect to the remote HBase server the client only requires hbase-site.xml with the ZooKeeper quorum address in it and a log4j properties file. These files were placed under the ~/work/hbase_client_test/min_configuration directory. Logging on the client side is currently completely disabled (all levels set to OFF).
- Consequently, relative to the working area, the compilation classpath is hadoop_hbase_jars/*:min_configuration/ and the run-time classpath is hadoop_hbase_jars/*:min_configuration/:. (note the trailing dot for the current directory). For example:
export PATH=/usr/share/java-1.7.0/bin:$PATH
javac -cp "hadoop_hbase_jars/*:min_configuration/" ClientExample.java
java -cp "hadoop_hbase_jars/*:min_configuration/:." ClientExample
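For reference, a minimal sketch of what such a client could look like (not necessarily the actual ClientExample.java in the working area). The table name, column family and values are placeholders, and the 0.92-style HBase API from the copied jars is assumed; hbase-site.xml is picked up automatically from min_configuration/ on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml (ZooKeeper quorum) from the classpath
        Configuration conf = HBaseConfiguration.create();
        // 'test_table' with column family 'cf' must already exist on the server
        HTable table = new HTable(conf, "test_table");
        try {
            // Write one cell
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
            table.put(put);
            // Read it back
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"))));
        } finally {
            table.close();
        }
    }
}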
Things we need to decide
- Database schema:
- Primary index? PANDAID?
- Secondary indices (if any)?
- How are we getting the data to voatlas132? What does it look like once it is on voatlas132?
- How do we store the job information? Just one long formatted string ("1341532850,244000303,"TRIUMF-productionPilots",...and so on...")? The same data, but JSON-formatted ("PANDAID": "111111111", "TASKID": "222222", ...and so on...)? Or the data for each job split across several columns? (See the sketch after this list.)
- How to name the table(s) for storing job information? 'panda_jobs_info'?
- How to name the column families and columns (if we need many families and columns)?
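To make the storage options a bit more concrete, here is a purely hypothetical sketch, reusing the open HTable from the client example above. The 'data' column family, the column names and the values are placeholders, not proposed decisions.

// Option A: one row per job, keyed by PANDAID, whole record kept in a single cell
Put blob = new Put(Bytes.toBytes("111111111"));                 // row key = PANDAID
blob.add(Bytes.toBytes("data"), Bytes.toBytes("json"),
         Bytes.toBytes("{\"PANDAID\":\"111111111\",\"TASKID\":\"222222\"}"));
table.put(blob);

// Option B: same row key, but each field stored in its own column
Put split = new Put(Bytes.toBytes("111111111"));
split.add(Bytes.toBytes("data"), Bytes.toBytes("TASKID"), Bytes.toBytes("222222"));
split.add(Bytes.toBytes("data"), Bytes.toBytes("computingSite"), Bytes.toBytes("TRIUMF-productionPilots"));
table.put(split);

Option A keeps the client simple and writes one cell per job; Option B makes individual fields addressable (and filterable) on the server side at the cost of more cells per row.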
Staging data from S3 to voatlas132
s3manager.py can download a CSV file with the jobs for a whole day. The day can be given as a YYYYMMDD command-line argument; otherwise the default is 4 days back from today. A simple cron job can therefore fetch the 4-day-old data every day:
16 21 * * * root python /data/staging_test/s3downloaded/s3manager.py >> /data/staging_test/s3logs/s3fetching.log 2>&1
Thrift
Thrift tutorial
thrift-0.8.0 has been installed on voatlas132. It was configured and built to support Python only; it will need to be rebuilt if we want to support more languages. Hbase.thrift was taken from the Cloudera 4 tarball (the CDH 4 distribution is currently installed on the namenode) and compiled.
Some performance numbers
Blunt insertion of information about 984710 jobs (916 MB; reading a CSV file line-by-line and immediately putting each line into HBase, inserting the strings as they are, with no additional processing) takes around 30 minutes with the client-side write buffer disabled and 1-2 minutes with the buffer enabled.
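For reference, "client buffer" here refers to the HTable client-side write buffer; toggling it looks roughly like this (the buffer size is only illustrative, the default is 2 MB, and conf and the table name are as in the example above):

// With autoFlush(true) (the default) every put() is a separate RPC;
// with autoFlush(false) puts accumulate locally and are sent when the buffer fills up.
HTable table = new HTable(conf, "test_table");
table.setAutoFlush(false);                        // enable client-side buffering
table.setWriteBufferSize(8 * 1024 * 1024);        // e.g. 8 MB instead of the 2 MB default
// ... table.put(...) for every line read from the CSV file ...
table.flushCommits();                             // send whatever is still sitting in the buffer
table.close();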
Blunt insertion via Thrift also takes around 30 minutes with one-by-one puts. With batched mutations (10,000 at a time) it takes around 2 minutes, i.e. the times are entirely comparable with the pure Java client, at least for this simple example. Increasing the batch to 50k or even 100k mutations gives no further gain in execution time; the number of mutations per batch could perhaps even be decreased without losing performance.
Inserting data into HDFS
Should we need to insert data into HDFS manually from the staging area on voatlas132:
cat SomeFile | ssh username@lxfssm4401 "hadoop fs -put - /Destination/Path/Inside/HDFS"
Major updates:
-- MaximPotekhin - 22 Aug 2012
Responsible: MaximPotekhin