TensorFlow (ISC18)

Welcome to the ISC18 Student Cluster Competition.

This document was written using TensorFlow 1.7.1 on Ubuntu 16.04.

Download the TensorFlow source code

$ git clone -b r1.7 https://github.com/tensorflow/tensorflow   # clone and check out the r1.7 branch in one step

Install TensorFlow dependencies

$ apt install bazel    # Bazel is not in the stock Ubuntu 16.04 repositories; add the Bazel APT repository first
$ apt install python-numpy python-dev python-pip python-wheel
$ pip install six numpy wheel
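
Before starting the long Bazel build, it can help to confirm that the Python packages installed above are importable by the interpreter that will be used for the build. The following is a minimal sketch; the helper name missing_modules is hypothetical and not part of TensorFlow.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # six, numpy and wheel are the packages installed with pip above.
    print(missing_modules(["six", "numpy", "wheel"]))
```

An empty list means all three packages were found.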

Build TensorFlow

$ cd tensorflow  # cd to the top-level directory created
$ ./configure    # Enable CUDA (GPU) and VERBS (RDMA) support when prompted
$ bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$ pip install /tmp/tensorflow_pkg/tensorflow-1.7.1-cp27-cp27mu-linux_x86_64.whl   # For TensorFlow 1.7.1
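
The wheel file name produced by build_pip_package depends on the Python interpreter selected during ./configure, so it may differ from the one shown above. As an illustration, the sketch below splits a wheel file name into the fields defined by the wheel naming convention (PEP 427). The function name parse_wheel_name is hypothetical.

```python
def parse_wheel_name(filename):
    """Split a wheel file name into its PEP 427 fields.

    Format: {name}-{version}[-{build}]-{python}-{abi}-{platform}.whl
    """
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError("not a wheel file: %r" % filename)
    parts = filename[:-len(suffix)].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel file name: %r" % filename)
    return {
        "name": name,
        "version": version,
        "build": build,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


if __name__ == "__main__":
    info = parse_wheel_name("tensorflow-1.7.1-cp27-cp27mu-linux_x86_64.whl")
    print(info["python"], info["abi"], info["platform"])
```

For the file above, cp27 means CPython 2.7 and cp27mu is the matching ABI tag (pymalloc, wide-unicode build). pip refuses to install a wheel whose tags do not match the running interpreter.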

Validate your installation

$ python
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))

If the system outputs the following, then you are ready to begin writing TensorFlow programs:

Hello, TensorFlow!

Download the TensorFlow models and benchmarks

$ git clone https://github.com/tensorflow/models.git
$ git clone https://github.com/tensorflow/benchmarks.git

Convert ImageNet data to TFRecord format

First, create an account at http://image-net.org and make sure that your disk has at least 500 GB of free space for downloading and storing the data.
Here we use DATA_DIR=/imagenet-data:
$ DATA_DIR=/imagenet-data
$ cd models/research/inception
$ bazel build //inception:download_and_preprocess_imagenet
$ bazel-bin/inception/download_and_preprocess_imagenet "${DATA_DIR}"
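
The 500 GB requirement mentioned above can be checked before starting the download. This is a minimal sketch using os.statvfs from the Python standard library (Unix only). The helper names free_gb and has_enough_space are hypothetical, and GB is taken to mean 10^9 bytes.

```python
import os

REQUIRED_GB = 500  # free-space figure quoted in the text above


def free_gb(path):
    """Free space (in GB, 10**9 bytes) available to the current user
    on the filesystem that holds `path`."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / 1e9


def has_enough_space(path, required_gb=REQUIRED_GB):
    """True if the filesystem holding `path` has at least `required_gb` free."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    # The path must already exist; if DATA_DIR has not been created yet,
    # check its parent directory instead.
    path = "/"
    print("%.1f GB free, enough: %s" % (free_gb(path), has_enough_space(path)))
```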

Run the TensorFlow benchmark using GPUs

$ DATA_DIR=/imagenet-data
$ TRAIN_DIR=/imagenet-train
$ cd benchmarks/scripts/tf_cnn_benchmarks
$ python tf_cnn_benchmarks.py \
        --data_format=NCHW --batch_size=64 \
        --model=vgg16 --optimizer=momentum --variable_update=replicated \
        --nodistortions --gradient_repacking=8 --num_gpus=2 \
        --num_epochs=10  --weight_decay=1e-4 --data_dir=$DATA_DIR --use_fp16 \
        --train_dir=$TRAIN_DIR --print_training_accuracy=true
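
The relationship between --batch_size, --num_gpus and --num_epochs can be worked out by hand. The sketch below assumes the ILSVRC-2012 training set size of 1,281,167 images and assumes the batch count is rounded down; this matches the "Batch size: 128 global" and "Num batches: 20018" lines reported for 2.00 epochs in the sample output, but it is not taken from the benchmark source. The function names are hypothetical.

```python
IMAGENET_TRAIN_IMAGES = 1281167  # ILSVRC-2012 training set size


def global_batch_size(per_device_batch, num_gpus):
    """Images processed per training step across all GPUs."""
    return per_device_batch * num_gpus


def num_batches(num_epochs, per_device_batch, num_gpus,
                num_examples=IMAGENET_TRAIN_IMAGES):
    """Training steps needed for `num_epochs` passes over the data.

    Assumption: a partial final batch is dropped (rounded down).
    """
    batch = global_batch_size(per_device_batch, num_gpus)
    return int(num_epochs * num_examples / float(batch))


if __name__ == "__main__":
    print(global_batch_size(64, 2))   # 128
    print(num_batches(2, 64, 2))      # 20018
    print(num_batches(10, 64, 2))     # 100091
```

With the flags shown above (--batch_size=64, --num_gpus=2, --num_epochs=10), this estimate gives roughly 100,000 steps.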

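Sample output (the header reports "Num epochs: 2.00", so this log comes from a run with --num_epochs=2 rather than the 10 epochs in the command above)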

TensorFlow:  1.7
Model:       vgg16
Dataset:     imagenet
Mode:        training
SingleSess:  False
Batch size:  128 global
             64 per device
Num batches: 20018
Num epochs:  2.00
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Layout optimizer: False
Optimizer:   momentum
Variables:   replicated
AllReduce:   None
==========
Generating model
Running warm up
Done warm up
Step    Img/sec total_loss      top_1_accuracy  top_5_accuracy
1       images/sec: 605.9 +/- 0.0 (jitter = 0.0)        7.774   0.000   0.000
10      images/sec: 598.9 +/- 3.7 (jitter = 8.8)        7.774   0.000   0.000
20      images/sec: 600.9 +/- 2.0 (jitter = 4.8)        7.774   0.000   0.000
30      images/sec: 600.5 +/- 1.6 (jitter = 6.1)        7.774   0.000   0.000
...
19990   images/sec: 577.4 +/- 0.3 (jitter = 7.0)        4.494   0.234   0.453
20000   images/sec: 577.5 +/- 0.3 (jitter = 7.0)        4.583   0.195   0.438
20010   images/sec: 576.7 +/- 0.3 (jitter = 7.0)        4.828   0.219   0.422
----------------------------------------------------------------
total images/sec: 576.26
----------------------------------------------------------------
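
To compare runs, the per-step lines and the final throughput can be extracted from a saved log. The following is a minimal sketch based only on the line layout shown in the sample output above; it assumes the two accuracy columns are present (that is, --print_training_accuracy=true was used). The function name parse_log is hypothetical.

```python
import re

# Matches lines such as:
# 10      images/sec: 598.9 +/- 3.7 (jitter = 8.8)        7.774   0.000   0.000
STEP_RE = re.compile(
    r"^(\d+)\s+images/sec:\s+([\d.]+)\s+\+/-\s+([\d.]+)\s+"
    r"\(jitter = ([\d.]+)\)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)$"
)
TOTAL_RE = re.compile(r"^total images/sec:\s+([\d.]+)$")


def parse_log(text):
    """Return (steps, total) parsed from tf_cnn_benchmarks-style output.

    steps is a list of dicts, one per step line; total is the final
    'total images/sec' value, or None if that line is missing.
    """
    steps = []
    total = None
    for raw in text.splitlines():
        line = raw.strip()
        m = STEP_RE.match(line)
        if m:
            steps.append({
                "step": int(m.group(1)),
                "images_per_sec": float(m.group(2)),
                "uncertainty": float(m.group(3)),
                "jitter": float(m.group(4)),
                "total_loss": float(m.group(5)),
                "top_1": float(m.group(6)),
                "top_5": float(m.group(7)),
            })
            continue
        m = TOTAL_RE.match(line)
        if m:
            total = float(m.group(1))
    return steps, total
```

For example, dividing the returned total by the number of GPUs gives the average per-GPU throughput, which for the run above is 576.26 / 2 = 288.13 images/sec.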