Two notes on how the hadoop fs command behaves

The timestamps shown by ls and stat differ by 9 hours.

$ hadoop fs -ls /user/hive/warehouse/hoge/
Found 1 items
-rw-r--r--   3 hadoop supergroup        189 2011-10-24 17:45 /user/hive/warehouse/hoge/sequencefile
$ hadoop fs -stat /user/hive/warehouse/hoge/sequencefile
2011-10-24 08:45:29

Nine hours should ring a bell. It's the time zone: JST is UTC+9.

The hadoop fs command calls FsShell behind the scenes, so let's look at its source. The version is CDH3u0.

  public static final SimpleDateFormat dateForm = 
    new SimpleDateFormat("yyyy-MM-dd HH:mm");
  protected static final SimpleDateFormat modifFmt =
    new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  static final int BORDER = 2;
  static {
    modifFmt.setTimeZone(TimeZone.getTimeZone("UTC"));
  }

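hadoop fs -ls uses dateForm, and hadoop fs -stat uses modifFmt. Only modifFmt is set to UTC, so dateForm formats in the JVM's default time zone (JST on this machine).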

I'll set aside the obvious objection that SimpleDateFormat isn't thread-safe, lol.
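
The nine-hour gap can be reproduced without Hadoop. The following is only a sketch: the class and method names (TimestampDemo, lsStyle, statStyle) are made up for illustration and are not part of FsShell, the local time zone is passed in explicitly as Asia/Tokyo rather than taken from the JVM default, and Locale.ROOT is added so the output does not depend on the machine's locale. The epoch value corresponds to 2011-10-24 08:45:29 UTC, the time shown by -stat above.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class TimestampDemo {
    // Same pattern as dateForm (used by -ls); the zone is a parameter here,
    // whereas FsShell leaves it at the JVM default.
    static String lsStyle(long millis, TimeZone local) {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm", Locale.ROOT);
        f.setTimeZone(local);
        return f.format(new Date(millis));
    }

    // Same pattern as modifFmt (used by -stat), pinned to UTC.
    static String statStyle(long millis) {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f.format(new Date(millis));
    }

    public static void main(String[] args) {
        long mtime = 1319445929000L; // 2011-10-24 08:45:29 UTC, in milliseconds
        System.out.println(lsStyle(mtime, TimeZone.getTimeZone("Asia/Tokyo"))); // 2011-10-24 17:45
        System.out.println(statStyle(mtime));                                   // 2011-10-24 08:45:29
    }
}
```

The same instant is printed twice; only the formatter's time zone differs, which matches what -ls and -stat show.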

Unlike hadoop fs -cat, hadoop fs -text can also show the contents of a SequenceFile

The help text says this:

-cat <src>:     Fetch all files that match the file pattern <src>
                and display their content on stdout.

-text <src>:    Takes a source file and outputs the file in text format.
                The allowed formats are zip and TextRecordInputStream.

The source that runs for text looks like this. It's easiest to read from the bottom up.

  private class TextRecordInputStream extends InputStream {
    SequenceFile.Reader r;
    WritableComparable key;
    Writable val;

    DataInputBuffer inbuf;
    DataOutputBuffer outbuf;

    public TextRecordInputStream(FileStatus f) throws IOException {
      r = new SequenceFile.Reader(fs, f.getPath(), getConf());
      key = ReflectionUtils.newInstance(r.getKeyClass().asSubclass(WritableComparable.class),
                                        getConf());
      val = ReflectionUtils.newInstance(r.getValueClass().asSubclass(Writable.class),
                                        getConf());
      inbuf = new DataInputBuffer();
      outbuf = new DataOutputBuffer();
    }

    public int read() throws IOException {
      int ret;
      if (null == inbuf || -1 == (ret = inbuf.read())) {
        if (!r.next(key, val)) {
          return -1;
        }
        byte[] tmp = key.toString().getBytes();
        outbuf.write(tmp, 0, tmp.length);
        outbuf.write('\t');
        tmp = val.toString().getBytes();
        outbuf.write(tmp, 0, tmp.length);
        outbuf.write('\n');
        inbuf.reset(outbuf.getData(), outbuf.getLength());
        outbuf.reset();
        ret = inbuf.read();
      }
      return ret;
    }
  }

  private InputStream forMagic(Path p, FileSystem srcFs) throws IOException {
    FSDataInputStream i = srcFs.open(p);

    // check codecs
    CompressionCodecFactory cf = new CompressionCodecFactory(getConf());
    CompressionCodec codec = cf.getCodec(p);
    if (codec != null) {
      return codec.createInputStream(i);
    }

    switch(i.readShort()) {
      case 0x1f8b: // RFC 1952
        i.seek(0);
        return new GZIPInputStream(i);
      case 0x5345: // 'S' 'E'
        if (i.readByte() == 'Q') {
          i.close();
          return new TextRecordInputStream(srcFs.getFileStatus(p));
        }
        break;
    }
    i.seek(0);
    return i;
  }

  void text(String srcf) throws IOException {
    Path srcPattern = new Path(srcf);
    new DelayedExceptionThrowing() {
      @Override
      void process(Path p, FileSystem srcFs) throws IOException {
        if (srcFs.isDirectory(p)) {
          throw new IOException("Source must be a file.");
        }
        printToStdout(forMagic(p, srcFs));
      }
    }.globAndProcess(srcPattern, srcPattern.getFileSystem(getConf()));
  }

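text calls forMagic, and if the file header is SEQ the file is treated as a SequenceFile and the TextRecordInputStream class is used.
Its constructor shows that SequenceFile.Reader does the actual reading.
From the read method you can see that each record is written as one line: the key, a tab, then the value.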

key1<TAB>value1
key2<TAB>value2
key3<TAB>value3
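
The header check in forMagic can be tried out on plain byte arrays. This is a rough sketch, not the FsShell code: MagicSniffer, Kind, and sniff are hypothetical names, the codec lookup by file extension is left out, and the length guards are my addition (the original simply calls readShort on the stream).

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class MagicSniffer {
    enum Kind { GZIP, SEQUENCE_FILE, PLAIN }

    // Classifies a file by its first bytes, mirroring the switch in forMagic.
    static Kind sniff(byte[] head) throws IOException {
        if (head.length < 2) {
            return Kind.PLAIN; // too short to carry a two-byte magic number
        }
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(head));
        switch (in.readShort()) {
            case 0x1f8b: // gzip magic number (RFC 1952)
                return Kind.GZIP;
            case 0x5345: // 'S' 'E'
                if (head.length >= 3 && in.readByte() == 'Q') {
                    return Kind.SEQUENCE_FILE;
                }
                return Kind.PLAIN;
            default:
                return Kind.PLAIN;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(sniff(new byte[] {0x1f, (byte) 0x8b, 8}));               // GZIP
        System.out.println(sniff("SEQ".getBytes(StandardCharsets.US_ASCII)));       // SEQUENCE_FILE
        System.out.println(sniff("hello".getBytes(StandardCharsets.US_ASCII)));     // PLAIN
    }
}
```

In FsShell the PLAIN case corresponds to seeking back to offset 0 and returning the raw stream, so -text behaves like -cat for ordinary files.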

In an earlier post, "HiveのSequenceFileとかパーティションとか - wyukawa's blog", I wrote that when you use SequenceFile with Hive the key can be anything.

When Hive runs an INSERT ... SELECT, it uses a BytesWritable with no value in it as the key. The key is therefore empty in hadoop fs -text output, and each line starts with a tab.

If you set a value in the key, hadoop fs -text shows the key as well. A SELECT does not show it.

With the string sample set as the key, the behavior looks like this. This assumes the hoge table has two string columns (with values a and b).

$ hive -e "select * from hoge"
a	b
$ hadoop fs -text /user/hive/warehouse/hoge/sequencefile
sample    a       b
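
The effect of the key on each line can be mimicked with plain strings. This is a simplified sketch: RecordLine and format are hypothetical names, the real code calls toString() on Writable objects and writes bytes, and the value is assumed here to render as "a", a tab, then "b".

```java
public class RecordLine {
    // Builds one output line the way TextRecordInputStream.read() does:
    // key text, a tab, value text, a newline.
    static String format(String key, String value) {
        return key + "\t" + value + "\n";
    }

    public static void main(String[] args) {
        String value = "a\tb"; // assumed rendering of the two string columns
        System.out.print(format("", value));       // empty key: the line starts with a tab
        System.out.print(format("sample", value)); // key set: the line starts with "sample"
    }
}
```

The tab is written whether or not the key has any content, which is why an empty key still leaves a leading tab.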