Welcome to the ISC18 Student Cluster Competition.

TensorFlow 1.7.1 on Ubuntu 16.04 was used for this document.

  1. Download the TensorFlow source code

    $ git clone https://github.com/tensorflow/tensorflow
    $ git -C tensorflow checkout r1.7   # run the checkout inside the cloned repository


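  2. Install TensorFlow dependencies

    # Run the apt commands as root (or prefix them with sudo).
    # Bazel is installed from its own APT repository, which must be configured first.
    $ apt install bazel
    $ apt install python-numpy python-dev python-pip python-wheel
    $ pip install six numpy wheel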
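
    Before starting the long Bazel build, it can help to confirm that the Python packages installed above are importable. The following is a minimal sketch, not part of the official build procedure; the helper name missing_modules is hypothetical.

```python
from __future__ import print_function

import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Packages installed with pip in this step.
    missing = missing_modules(["six", "numpy", "wheel"])
    if missing:
        print("Missing Python packages:", ", ".join(missing))
    else:
        print("All Python build dependencies are importable.")
```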


  3. Build TensorFlow

    $ cd tensorflow  # cd to the top-level directory created by git clone
    $ ./configure    # Enable CUDA (GPU) and VERBS support when prompted
    $ bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
    $ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
    $ pip install /tmp/tensorflow_pkg/tensorflow-1.7.1-cp27-cp27mu-linux_x86_64.whl   # For TensorFlow 1.7.1
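
    The wheel filename above encodes the TensorFlow version and the Python it was built for, so it will differ if you build with another interpreter. As an illustration, the sketch below splits the name into its PEP 427 fields; the helper parse_wheel_name is hypothetical and assumes the filename has no optional build tag.

```python
from __future__ import print_function

import os


def parse_wheel_name(path):
    """Split a wheel filename into its PEP 427 components.

    Assumes no optional build tag is present, which holds for the
    filename used in this step.
    """
    base = os.path.basename(path)
    if not base.endswith(".whl"):
        raise ValueError("not a wheel file: %s" % base)
    parts = base[:-len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError("unexpected wheel name: %s" % base)
    keys = ["distribution", "version", "python_tag", "abi_tag", "platform_tag"]
    return dict(zip(keys, parts))


if __name__ == "__main__":
    info = parse_wheel_name(
        "/tmp/tensorflow_pkg/tensorflow-1.7.1-cp27-cp27mu-linux_x86_64.whl")
    for key in sorted(info):
        print("%-13s %s" % (key, info[key]))
```

    Here cp27 means CPython 2.7, and cp27mu is the CPython 2.7 ABI built with pymalloc and wide Unicode. If your build produces a different name, list /tmp/tensorflow_pkg and install the file you find there.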


  4. Validate your installation

    $ python
    import tensorflow as tf
    hello = tf.constant('Hello, TensorFlow!')
    sess = tf.Session()
    print(sess.run(hello))
    
    If the system outputs the following, then you are ready to begin writing TensorFlow programs:
    
    Hello, TensorFlow!
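
    Since the build was configured for GPUs, it may also be worth checking that the installed package was compiled with CUDA and can see a GPU. This is a sketch only, assuming the tf.test helpers shipped with TensorFlow 1.x; the function name check_tensorflow is hypothetical.

```python
from __future__ import print_function


def check_tensorflow():
    """Return basic facts about the installed TensorFlow, or None if absent."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return {
        "version": tf.__version__,
        # True if this binary was compiled with CUDA support.
        "built_with_cuda": tf.test.is_built_with_cuda(),
        # True if at least one GPU is visible to TensorFlow at runtime.
        "gpu_available": tf.test.is_gpu_available(),
    }


if __name__ == "__main__":
    info = check_tensorflow()
    if info is None:
        print("TensorFlow is not importable in this Python environment.")
    else:
        for key in sorted(info):
            print("%-16s %s" % (key, info[key]))
```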


  5. Download the TensorFlow models and benchmarks repositories

    $ git clone https://github.com/tensorflow/models.git
    $ git clone https://github.com/tensorflow/benchmarks.git


  6. Convert ImageNet data to TFRecord format

    First, create a login at http://image-net.org and make sure that your hard disk has at least 500 GB of free space for downloading and storing the data.
    This example uses DATA_DIR=/imagenet-data:
    $ DATA_DIR=/imagenet-data
    $ cd models/research/inception
    $ bazel build //inception:download_and_preprocess_imagenet
    $ bazel-bin/inception/download_and_preprocess_imagenet "${DATA_DIR}" 
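
    The download and conversion take a long time, so it may be worth checking the 500 GB requirement up front. The sketch below is one way to do that with os.statvfs; the helper names and the fallback to /imagenet-data when DATA_DIR is not exported are assumptions made for illustration.

```python
from __future__ import print_function

import os

REQUIRED_GB = 500  # free space recommended in this step


def free_space_gb(path):
    """Return the free space available to the current user at path, in GB."""
    stats = os.statvfs(path)
    return stats.f_bavail * stats.f_frsize / float(10 ** 9)


def has_enough_space(path, required_gb=REQUIRED_GB):
    """Return True if path has at least required_gb of free space."""
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    # DATA_DIR must be exported to be visible here; otherwise use the default.
    data_dir = os.path.abspath(os.environ.get("DATA_DIR", "/imagenet-data"))
    # The directory may not exist yet, so probe its closest existing parent.
    probe = data_dir
    while not os.path.exists(probe):
        probe = os.path.dirname(probe)
    print("Free space at %s: %.1f GB" % (probe, free_space_gb(probe)))
    if not has_enough_space(probe):
        print("Warning: less than %d GB free for %s" % (REQUIRED_GB, data_dir))
```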


  7. Run the TensorFlow benchmark using GPUs

    $ DATA_DIR=/imagenet-data
    $ TRAIN_DIR=/imagenet-train
    $ cd benchmarks/scripts/tf_cnn_benchmarks
    $ python tf_cnn_benchmarks.py \
            --data_format=NCHW --batch_size=64 \
            --model=vgg16 --optimizer=momentum --variable_update=replicated \
            --nodistortions --gradient_repacking=8 --num_gpus=2 \
            --num_epochs=10  --weight_decay=1e-4 --data_dir=$DATA_DIR --use_fp16 \
            --train_dir=$TRAIN_DIR --print_training_accuracy=true
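
    The --batch_size flag is per GPU, so the global batch size is batch_size times num_gpus, and the number of training steps follows from --num_epochs. The sketch below is a back-of-the-envelope estimate, not the benchmark's own code. It assumes the ILSVRC-2012 training set of 1,281,167 images and truncates the partial final batch, which reproduces the 20018 batches shown for 2 epochs in the sample output of step 8.

```python
from __future__ import print_function

# ILSVRC-2012 training set size (assumed dataset for this estimate).
IMAGENET_TRAIN_IMAGES = 1281167


def global_batch_size(per_gpu_batch, num_gpus):
    """Images processed per training step across all GPUs."""
    return per_gpu_batch * num_gpus


def num_batches(num_epochs, per_gpu_batch, num_gpus,
                num_images=IMAGENET_TRAIN_IMAGES):
    """Estimate the number of training steps, dropping a partial last batch."""
    batch = global_batch_size(per_gpu_batch, num_gpus)
    return int(float(num_epochs) * num_images / batch)


if __name__ == "__main__":
    print(global_batch_size(64, 2))   # 128
    print(num_batches(10, 64, 2))     # 100091 steps for the command above
    print(num_batches(2, 64, 2))      # 20018 steps, as in the sample output
```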


  8. Sample output (from a run with --num_epochs=2)

    TensorFlow:  1.7
    Model:       vgg16
    Dataset:     imagenet
    Mode:        training
    SingleSess:  False
    Batch size:  128 global
                 64 per device
    Num batches: 20018
    Num epochs:  2.00
    Devices:     ['/gpu:0', '/gpu:1']
    Data format: NCHW
    Layout optimizer: False
    Optimizer:   momentum
    Variables:   replicated
    AllReduce:   None
    ==========
    Generating model
    Running warm up
    Done warm up
    Step    Img/sec total_loss      top_1_accuracy  top_5_accuracy
    1       images/sec: 605.9 +/- 0.0 (jitter = 0.0)        7.774   0.000   0.000
    10      images/sec: 598.9 +/- 3.7 (jitter = 8.8)        7.774   0.000   0.000
    20      images/sec: 600.9 +/- 2.0 (jitter = 4.8)        7.774   0.000   0.000
    30      images/sec: 600.5 +/- 1.6 (jitter = 6.1)        7.774   0.000   0.000
    ...
    19990   images/sec: 577.4 +/- 0.3 (jitter = 7.0)        4.494   0.234   0.453
    20000   images/sec: 577.5 +/- 0.3 (jitter = 7.0)        4.583   0.195   0.438
    20010   images/sec: 576.7 +/- 0.3 (jitter = 7.0)        4.828   0.219   0.422
    ----------------------------------------------------------------
    total images/sec: 576.26
    ----------------------------------------------------------------
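
    To compare runs, the throughput figures can be pulled out of a saved log. The following is a minimal sketch that assumes the log lines look exactly like the sample output above; the regular expressions and the function name parse_benchmark_log are illustrative, not part of tf_cnn_benchmarks.

```python
from __future__ import print_function

import re

# Per-step lines: "<step>   images/sec: <rate> +/- ..."
STEP_RE = re.compile(r"^\s*(\d+)\s+images/sec:\s+([\d.]+)")
# Final summary line: "total images/sec: <rate>"
TOTAL_RE = re.compile(r"^\s*total images/sec:\s+([\d.]+)")


def parse_benchmark_log(text):
    """Return ([(step, images_per_sec), ...], total_images_per_sec or None)."""
    steps = []
    total = None
    for line in text.splitlines():
        match = STEP_RE.match(line)
        if match:
            steps.append((int(match.group(1)), float(match.group(2))))
            continue
        match = TOTAL_RE.match(line)
        if match:
            total = float(match.group(1))
    return steps, total


# A few lines copied from the sample output above.
SAMPLE = """\
Step    Img/sec total_loss      top_1_accuracy  top_5_accuracy
1       images/sec: 605.9 +/- 0.0 (jitter = 0.0)        7.774   0.000   0.000
20010   images/sec: 576.7 +/- 0.3 (jitter = 7.0)        4.828   0.219   0.422
total images/sec: 576.26
"""

if __name__ == "__main__":
    steps, total = parse_benchmark_log(SAMPLE)
    print("last step:", steps[-1])
    print("total images/sec:", total)
```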