Installing CDH3

This is written up all over the place, but I'm jotting it down as a memo. The environment: CDH3 in pseudo-distributed mode on CentOS 5.6, running in VirtualBox 4.0.8 on Mac OS X 10.6.7. You'll want to make the disk bigger than the default 8 GB, because you can't easily grow it later!

To grow it, you have to fiddle around with LVM, as described in
可搬性疑似仮想アプライアンスサーバーシステム構想 « Midnightjapan
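As a rough sketch, growing the volume comes down to something like the following. VolGroup00/LogVol00 are the CentOS 5 default LVM names and /dev/sda3 is a hypothetical new partition carved out of the enlarged virtual disk; adjust both to your setup.

```shell
# Run as root, after enlarging the VirtualBox disk and creating a new
# LVM partition (/dev/sda3, hypothetical) on the freed-up space.
pvcreate /dev/sda3                                # register it as a physical volume
vgextend VolGroup00 /dev/sda3                     # add it to the existing volume group
lvextend -l +100%FREE /dev/VolGroup00/LogVol00    # grow the logical volume
resize2fs /dev/VolGroup00/LogVol00                # grow the ext3 filesystem to match
```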

Incidentally, if you try to put data into HDFS while it's short on disk space, you get this error:

could only be replicated to 0 nodes, instead of 1
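When that error shows up, a quick way to confirm it really is a space problem is to look at what the DataNode reports as remaining capacity (both are standard 0.20-era commands):

```shell
# "DFS Remaining" near zero on the single DataNode explains the
# "replicated to 0 nodes" error in pseudo-distributed mode.
hadoop dfsadmin -report
df -h        # and check that the underlying filesystem is in fact full
```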

Anyway, on to the installation steps.

The official documentation is here:
https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation#CDH3Installation-InstallingCDH3onRedHatSystems

  • Installing the JDK

Download it from
Java SE - Downloads | Oracle Technology Network | Oracle
and install it as root. The second command below registers the Cloudera CDH3 yum repository for the next step.

[root@localhost ~]# sh jdk-6u25-linux-i586-rpm.bin
[root@localhost ~]# curl http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo > /etc/yum.repos.d/cloudera-cdh3.repo
  • Installing CDH3

Installing it adds hdfs and mapred users. If a hadoop user already exists, it apparently gets renamed.

[root@localhost ~]# yum install hadoop-0.20-conf-pseudo hadoop-0.20-native

hadoop-0.20-native is required to work with compressed files.

If you skip it and then try to use, say, SequenceFile with compression from Hive, you get scolded like this:

java.lang.IllegalArgumentException: SequenceFile doesn't work with GzipCodec without native-hadoop code! 

Reference:
Hiveとか - ‡A Case Of Identity‡
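For the record, the kind of Hive session that trips over this looks roughly like the following. The table names seq_test and src are made up for illustration; the SET properties are the standard ones for compressed SequenceFile output.

```shell
# Illustrative only. With hadoop-0.20-native installed this writes a
# gzip-compressed SequenceFile; without it, the INSERT fails with the
# GzipCodec error quoted above.
hive -e "
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET io.seqfile.compression.type=BLOCK;
CREATE TABLE seq_test (line STRING) STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE seq_test SELECT line FROM src;
"
```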

  • Verifying the install
[root@localhost ~]# yum list installed | grep hadoop
hadoop-0.20.noarch                       0.20.2+923.21-1               installed
hadoop-0.20-conf-pseudo.noarch           0.20.2+923.21-1               installed
hadoop-0.20-datanode.noarch              0.20.2+923.21-1               installed
hadoop-0.20-jobtracker.noarch            0.20.2+923.21-1               installed
hadoop-0.20-namenode.noarch              0.20.2+923.21-1               installed
hadoop-0.20-native.i386                  0.20.2+923.21-1               installed
hadoop-0.20-secondarynamenode.noarch     0.20.2+923.21-1               installed
hadoop-0.20-tasktracker.noarch           0.20.2+923.21-1               installed
  • Starting the services
[root@localhost ~]# for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
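If you also want the daemons to come back after a reboot, the usual CentOS approach should work, since the packages install ordinary init scripts (a sketch, not from the original post):

```shell
# Enable every installed Hadoop daemon at boot time.
for service in /etc/init.d/hadoop-0.20-*; do
  chkconfig "$(basename "$service")" on
done
```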
  • Verifying the services are running
[root@localhost ~]# /usr/java/latest/bin/jps
28374 NameNode
28628 Jps
28563 TaskTracker
28444 SecondaryNameNode
28248 DataNode
28312 JobTracker
Next, set JAVA_HOME and HADOOP_HOME in .bashrc so the java and hadoop commands are on the PATH:

[wyukawa@localhost ~]$ cat .bashrc
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

# User specific aliases and functions
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/usr/lib/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
[wyukawa@localhost ~]$ source .bashrc
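After sourcing, a quick sanity check that both commands resolve from the new PATH:

```shell
# Both should print version information if PATH was set correctly.
java -version
hadoop version
```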
  • Checking that HDFS works
[wyukawa@localhost ~]$ hadoop dfs -ls /
Found 1 items
drwxr-xr-x   - mapred supergroup          0 2011-06-14 22:59 /var
[wyukawa@localhost ~]$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 4 2000
Number of Maps  = 4
Samples per Map = 2000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Starting Job
11/06/14 23:03:49 INFO mapred.FileInputFormat: Total input paths to process : 4
11/06/14 23:03:49 INFO mapred.JobClient: Running job: job_201106142259_0001
11/06/14 23:03:50 INFO mapred.JobClient:  map 0% reduce 0%
11/06/14 23:03:58 INFO mapred.JobClient:  map 50% reduce 0%
11/06/14 23:04:05 INFO mapred.JobClient:  map 75% reduce 0%
11/06/14 23:04:06 INFO mapred.JobClient:  map 100% reduce 0%
11/06/14 23:04:15 INFO mapred.JobClient:  map 100% reduce 100%
11/06/14 23:04:16 INFO mapred.JobClient: Job complete: job_201106142259_0001
11/06/14 23:04:16 INFO mapred.JobClient: Counters: 23
11/06/14 23:04:16 INFO mapred.JobClient:   Job Counters 
11/06/14 23:04:16 INFO mapred.JobClient:     Launched reduce tasks=1
11/06/14 23:04:16 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=27579
11/06/14 23:04:16 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/06/14 23:04:16 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/06/14 23:04:16 INFO mapred.JobClient:     Launched map tasks=4
11/06/14 23:04:16 INFO mapred.JobClient:     Data-local map tasks=4
11/06/14 23:04:16 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16112
11/06/14 23:04:16 INFO mapred.JobClient:   FileSystemCounters
11/06/14 23:04:16 INFO mapred.JobClient:     FILE_BYTES_READ=94
11/06/14 23:04:16 INFO mapred.JobClient:     HDFS_BYTES_READ=948
11/06/14 23:04:16 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=264577
11/06/14 23:04:16 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
11/06/14 23:04:16 INFO mapred.JobClient:   Map-Reduce Framework
11/06/14 23:04:16 INFO mapred.JobClient:     Reduce input groups=2
11/06/14 23:04:16 INFO mapred.JobClient:     Combine output records=0
11/06/14 23:04:16 INFO mapred.JobClient:     Map input records=4
11/06/14 23:04:16 INFO mapred.JobClient:     Reduce shuffle bytes=112
11/06/14 23:04:16 INFO mapred.JobClient:     Reduce output records=0
11/06/14 23:04:16 INFO mapred.JobClient:     Spilled Records=16
11/06/14 23:04:16 INFO mapred.JobClient:     Map output bytes=72
11/06/14 23:04:16 INFO mapred.JobClient:     Map input bytes=96
11/06/14 23:04:16 INFO mapred.JobClient:     Combine input records=0
11/06/14 23:04:16 INFO mapred.JobClient:     Map output records=8
11/06/14 23:04:16 INFO mapred.JobClient:     SPLIT_RAW_BYTES=476
11/06/14 23:04:16 INFO mapred.JobClient:     Reduce input records=8
Job Finished in 27.29 seconds
Estimated value of Pi is 3.14100000000000000000

With that, things should basically just work, but if you get the error below, the user has no write permission on hadoop.tmp.dir, so grant it.

Exception in thread "main" java.io.IOException: Permission denied 
        at java.io.UnixFileSystem.createFileExclusively(Native Method) 
        at java.io.File.checkAndCreate(File.java:1704) 
        at java.io.File.createTempFile(File.java:1792) 
        at org.apache.hadoop.util.RunJar.main(RunJar.java:146) 

hadoop.tmp.dir is set to /var/lib/hadoop-0.20/cache/${user.name}:

[root@localhost ~]# cat /usr/lib/hadoop/conf/core-site.xml 
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>

  <property>
     <name>hadoop.tmp.dir</name>
     <value>/var/lib/hadoop-0.20/cache/${user.name}</value>
  </property>

  <!-- OOZIE proxy user setting -->
  <property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
  </property>

</configuration>

So you're fine if it looks like this:

[root@localhost ~]# ls -l /var/lib/hadoop-0.20/
total 8
drwxrwxrwt 6 root hadoop 4096 Jun 14 23:03 cache
[root@localhost ~]# ls -l /var/lib/hadoop-0.20/cache
total 32
drwxr-xr-x 3 hdfs    hdfs    4096 Jun  7 23:18 hadoop
drwxr-xr-x 3 hdfs    hdfs    4096 Jun 14 22:59 hdfs
drwxr-xr-x 3 mapred  mapred  4096 Jun 14 22:59 mapred
drwxrwxr-x 2 wyukawa wyukawa 4096 Jun 14 23:04 wyukawa
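If the cache directory has lost that sticky, world-writable mode (the drwxrwxrwt above), restoring it as root is enough; creating the per-user directory by hand also works (the user name here is just the one from this post):

```shell
# Restore the packaged permissions on the shared cache directory...
chmod 1777 /var/lib/hadoop-0.20/cache
# ...or create the per-user tmp directory explicitly.
mkdir -p /var/lib/hadoop-0.20/cache/wyukawa
chown wyukawa:wyukawa /var/lib/hadoop-0.20/cache/wyukawa
```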