Installing CDH3
This is written up all over the place, but I'm noting it down anyway. The environment: CDH3 installed in pseudo-distributed mode on CentOS 5.6, running on VirtualBox 4.0.8 on Mac OS X 10.6.7. You should make the disk bigger than the default 8 GB, because it isn't easy to grow later.
To grow it, you have to wrestle with LVM, as described in:
可搬性疑似仮想アプライアンスサーバーシステム構想 « Midnightjapan
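In outline, that LVM dance looks roughly like the sketch below. This is a hedged sketch, not the linked procedure: VolGroup00/LogVol00 are just the CentOS 5 default names and /dev/sda3 is a hypothetical new partition, so check vgdisplay/lvdisplay on your own machine first.

```shell
# Sketch only: after enlarging the VirtualBox disk and creating a new
# partition on the extra space, fold it into the root logical volume.
pvcreate /dev/sda3                                # hypothetical new partition
vgextend VolGroup00 /dev/sda3                     # add it to the volume group
lvextend -l +100%FREE /dev/VolGroup00/LogVol00    # grow the logical volume
resize2fs /dev/VolGroup00/LogVol00                # grow ext3 to match
```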
Incidentally, if you push data into HDFS while short on disk space, you get this error:
could only be replicated to 0 nodes, instead of 1
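You can keep an eye on capacity before it comes to that. These are stock commands, shown as a reminder rather than taken from the original session:

```shell
hadoop dfsadmin -report   # HDFS capacity and per-datanode usage
df -h                     # the local disks backing it
```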
Anyway, on to the installation.
- Installing the JDK
Download it from
Java SE - Downloads | Oracle Technology Network | Oracle
and install it as root.
[root@localhost ~]# sh jdk-6u25-linux-i586-rpm.bin
Then register the Cloudera CDH3 yum repository:
[root@localhost ~]# curl http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo > /etc/yum.repos.d/cloudera-cdh3.repo
- Installing CDH3
Installing it adds the hdfs and mapred users. If a hadoop user already exists, it apparently gets renamed.
[root@localhost ~]# yum install hadoop-0.20-conf-pseudo hadoop-0.20-native
hadoop-0.20-native is needed to work with compressed files.
Without it, trying to use a compressed SequenceFile in Hive, for example, fails with:
java.lang.IllegalArgumentException: SequenceFile doesn't work with GzipCodec without native-hadoop code!
Reference:
Hiveとか - ‡A Case Of Identity‡
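For reference, the Hive side of compressed SequenceFile output looks something like this. The property names are the standard ones for Hadoop 0.20-era Hive, but the table names are made up for illustration:

```shell
# Run via the hive CLI; logs_seq/logs_text are hypothetical tables.
hive -e "
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET io.seqfile.compression.type=BLOCK;
CREATE TABLE logs_seq (line STRING) STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE logs_seq SELECT line FROM logs_text;
"
```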
- Checking the installation
[root@localhost ~]# yum list installed | grep hadoop
hadoop-0.20.noarch                    0.20.2+923.21-1    installed
hadoop-0.20-conf-pseudo.noarch        0.20.2+923.21-1    installed
hadoop-0.20-datanode.noarch           0.20.2+923.21-1    installed
hadoop-0.20-jobtracker.noarch         0.20.2+923.21-1    installed
hadoop-0.20-namenode.noarch           0.20.2+923.21-1    installed
hadoop-0.20-native.i386               0.20.2+923.21-1    installed
hadoop-0.20-secondarynamenode.noarch  0.20.2+923.21-1    installed
hadoop-0.20-tasktracker.noarch        0.20.2+923.21-1    installed
- Starting the services
[root@localhost ~]# for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
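If you also want the daemons back after a reboot, enabling the init scripts with chkconfig should do it (my addition, not part of the original steps):

```shell
# Enable each Hadoop init script at boot (CentOS/RHEL style)
for service in /etc/init.d/hadoop-0.20-*; do
  chkconfig $(basename $service) on
done
```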
- Checking that the services are running
[root@localhost ~]# /usr/java/latest/bin/jps
28374 NameNode
28628 Jps
28563 TaskTracker
28444 SecondaryNameNode
28248 DataNode
28312 JobTracker
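If you want to script that check, something like this works. This is a hypothetical helper, not part of CDH; the daemon names are the ones from the jps output above.

```shell
# check_daemons: report whether the five pseudo-distributed daemons
# appear in the given jps output (pass "$(jps)" as the argument).
check_daemons() {
  required="NameNode DataNode SecondaryNameNode JobTracker TaskTracker"
  missing=""
  for d in $required; do
    # -w: "NameNode" must match as a whole word, so it does not
    # falsely match the SecondaryNameNode line
    printf '%s\n' "$1" | grep -qw "$d" || missing="$missing $d"
  done
  if [ -z "$missing" ]; then
    echo "all daemons running"
  else
    echo "missing:$missing"
  fi
}
```

For example: check_daemons "$(/usr/java/latest/bin/jps)"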
- Setting environment variables
[wyukawa@localhost ~]$ cat .bashrc
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

# User specific aliases and functions
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/usr/lib/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
[wyukawa@localhost ~]$ source .bashrc
- Checking that HDFS works
[wyukawa@localhost ~]$ hadoop dfs -ls /
Found 1 items
drwxr-xr-x   - mapred supergroup          0 2011-06-14 22:59 /var
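For a slightly fuller smoke test you can round-trip a file. The paths here are arbitrary examples, not from the original session:

```shell
# Round-trip a small file through HDFS
echo "hello hadoop" > /tmp/hello.txt
hadoop dfs -mkdir input                  # relative path = the user's HDFS home
hadoop dfs -put /tmp/hello.txt input/
hadoop dfs -cat input/hello.txt          # prints the file contents back
hadoop dfs -rmr input                    # clean up
```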
- Checking that MapReduce works
[wyukawa@localhost ~]$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 4 2000
Number of Maps  = 4
Samples per Map = 2000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Starting Job
11/06/14 23:03:49 INFO mapred.FileInputFormat: Total input paths to process : 4
11/06/14 23:03:49 INFO mapred.JobClient: Running job: job_201106142259_0001
11/06/14 23:03:50 INFO mapred.JobClient:  map 0% reduce 0%
11/06/14 23:03:58 INFO mapred.JobClient:  map 50% reduce 0%
11/06/14 23:04:05 INFO mapred.JobClient:  map 75% reduce 0%
11/06/14 23:04:06 INFO mapred.JobClient:  map 100% reduce 0%
11/06/14 23:04:15 INFO mapred.JobClient:  map 100% reduce 100%
11/06/14 23:04:16 INFO mapred.JobClient: Job complete: job_201106142259_0001
11/06/14 23:04:16 INFO mapred.JobClient: Counters: 23
11/06/14 23:04:16 INFO mapred.JobClient:   Job Counters
11/06/14 23:04:16 INFO mapred.JobClient:     Launched reduce tasks=1
11/06/14 23:04:16 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=27579
11/06/14 23:04:16 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/06/14 23:04:16 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/06/14 23:04:16 INFO mapred.JobClient:     Launched map tasks=4
11/06/14 23:04:16 INFO mapred.JobClient:     Data-local map tasks=4
11/06/14 23:04:16 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16112
11/06/14 23:04:16 INFO mapred.JobClient:   FileSystemCounters
11/06/14 23:04:16 INFO mapred.JobClient:     FILE_BYTES_READ=94
11/06/14 23:04:16 INFO mapred.JobClient:     HDFS_BYTES_READ=948
11/06/14 23:04:16 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=264577
11/06/14 23:04:16 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
11/06/14 23:04:16 INFO mapred.JobClient:   Map-Reduce Framework
11/06/14 23:04:16 INFO mapred.JobClient:     Reduce input groups=2
11/06/14 23:04:16 INFO mapred.JobClient:     Combine output records=0
11/06/14 23:04:16 INFO mapred.JobClient:     Map input records=4
11/06/14 23:04:16 INFO mapred.JobClient:     Reduce shuffle bytes=112
11/06/14 23:04:16 INFO mapred.JobClient:     Reduce output records=0
11/06/14 23:04:16 INFO mapred.JobClient:     Spilled Records=16
11/06/14 23:04:16 INFO mapred.JobClient:     Map output bytes=72
11/06/14 23:04:16 INFO mapred.JobClient:     Map input bytes=96
11/06/14 23:04:16 INFO mapred.JobClient:     Combine input records=0
11/06/14 23:04:16 INFO mapred.JobClient:     Map output records=8
11/06/14 23:04:16 INFO mapred.JobClient:     SPLIT_RAW_BYTES=476
11/06/14 23:04:16 INFO mapred.JobClient:     Reduce input records=8
Job Finished in 27.29 seconds
Estimated value of Pi is 3.14100000000000000000
That should basically be everything, but if you get the error below, it's because you don't have write permission on hadoop.tmp.dir, so grant it.
Exception in thread "main" java.io.IOException: Permission denied
        at java.io.UnixFileSystem.createFileExclusively(Native Method)
        at java.io.File.checkAndCreate(File.java:1704)
        at java.io.File.createTempFile(File.java:1792)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:146)
hadoop.tmp.dir is set to /var/lib/hadoop-0.20/cache/${user.name}:
[root@localhost ~]# cat /usr/lib/hadoop/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/lib/hadoop-0.20/cache/${user.name}</value>
  </property>
  <!-- OOZIE proxy user setting -->
  <property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
  </property>
</configuration>
So it's fine as long as it looks like this:
[root@localhost ~]# ls -l /var/lib/hadoop-0.20/
total 8
drwxrwxrwt 6 root hadoop 4096 Jun 14 23:03 cache
[root@localhost ~]# ls -l /var/lib/hadoop-0.20/cache
total 32
drwxr-xr-x 3 hdfs    hdfs    4096 Jun  7 23:18 hadoop
drwxr-xr-x 3 hdfs    hdfs    4096 Jun 14 22:59 hdfs
drwxr-xr-x 3 mapred  mapred  4096 Jun 14 22:59 mapred
drwxrwxr-x 2 wyukawa wyukawa 4096 Jun 14 23:04 wyukawa
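As a scripted check, the key points are that cache exists, is writable, and has the sticky bit set, just like /tmp. This is a hypothetical helper of my own, not anything CDH ships:

```shell
# cache_ok: succeed if the directory exists, is writable by the current
# user, and has the sticky bit set (the drwxrwxrwt pattern above)
cache_ok() {
  [ -d "$1" ] && [ -w "$1" ] && [ -k "$1" ]
}
```

For example: cache_ok /var/lib/hadoop-0.20/cache && echo OK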