Table of Contents
  • PLEASE SEE UPDATES IN SECTION 1.6

Introduction

Language understanding is an ongoing challenge and one of the most relevant and influential research areas across industries.

...

Code Block
root@tessa002:/workspace/nvidia-examples/bert/data/download# mkdir -p squad/v1.1
root@tessa002:/workspace/nvidia-examples/bert/data/download# cd squad/v1.1/
root@tessa002:/workspace/nvidia-examples/bert/data/download/squad/v1.1# wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
root@tessa002:/workspace/nvidia-examples/bert/data/download/squad/v1.1# wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
root@tessa002:/workspace/nvidia-examples/bert/data/download/squad/v1.1# wget https://github.com/allenai/bi-att-flow/archive/master.zip
root@tessa002:/workspace/nvidia-examples/bert/data/download/squad/v1.1# unzip master.zip
root@tessa002:/workspace/nvidia-examples/bert/data/download/squad/v1.1# cp bi-att-flow-master/squad/evaluate-v1.1.py .
root@tessa002:/workspace/nvidia-examples/bert/data/download/squad/v1.1# cd /workspace/nvidia-examples/bert
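As a quick sanity check after downloading, the SQuAD v1.1 JSON layout (data → paragraphs → qas) can be walked to count the question/answer pairs. This is a minimal sketch using a tiny hypothetical record inlined in place of the real train-v1.1.json file:

```python
# Minimal SQuAD v1.1-shaped record (hypothetical sample, not the real dataset)
sample = {
    "version": "1.1",
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "Sundar Pichai was appointed CEO of Google in 2015.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Who was appointed CEO of Google?",
                            "answers": [{"text": "Sundar Pichai", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ],
}

def count_questions(dataset: dict) -> int:
    """Count question/answer pairs across all articles and paragraphs."""
    return sum(
        len(paragraph["qas"])
        for article in dataset["data"]
        for paragraph in article["paragraphs"]
    )

print(count_questions(sample))  # → 1
```

On the real files, load them first with `json.load(open("train-v1.1.json"))`; the dev set should report 10,570 questions.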

1.2.5   Start fine tuning

BERT representations can be fine-tuned with just one additional output layer to build a state-of-the-art question-answering system. From within the container, you can use the following script to run fine-tuning for SQuAD.

Note: consider logging results with “2>&1 | tee $LOGFILE” for submissions to the judges.

For SQuAD 1.1 FP16 training with XLA using a DGX-1 with (8) V100 32GB GPUs, run:

Code Block
bash scripts/run_squad.sh 10 5e-6 fp16 true 8 384 128 base 1.1 data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_model.ckpt 1.1

For SQuAD 1.1 FP16 training with XLA using (4) T4 16GB GPUs, run:

Code Block
bash scripts/run_squad.sh 10 5e-6 fp16 true 4 384 128 base 1.1 data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_model.ckpt 1.1

1.2.6  Verify results

Code Block
INFO:tensorflow:-----------------------------
I0326 01:25:43.144953 140630939256640 run_squad.py:1127] -----------------------------
INFO:tensorflow:Total Inference Time = 88.62 for Sentences = 10840
I0326 01:25:43.145423 140630939256640 run_squad.py:1129] Total Inference Time = 88.62 for Sentences = 10840
INFO:tensorflow:Total Inference Time W/O Overhead = 75.86 for Sentences = 10824
I0326 01:25:43.145554 140630939256640 run_squad.py:1131] Total Inference Time W/O Overhead = 75.86 for Sentences = 10824
INFO:tensorflow:Summary Inference Statistics
I0326 01:25:43.145649 140630939256640 run_squad.py:1132] Summary Inference Statistics
INFO:tensorflow:Batch size = 8
I0326 01:25:43.145738 140630939256640 run_squad.py:1133] Batch size = 8
INFO:tensorflow:Sequence Length = 384
I0326 01:25:43.145867 140630939256640 run_squad.py:1134] Sequence Length = 384
INFO:tensorflow:Precision = fp16
I0326 01:25:43.145962 140630939256640 run_squad.py:1135] Precision = fp16
INFO:tensorflow:Latency Confidence Level 50 (ms) = 55.79
I0326 01:25:43.146052 140630939256640 run_squad.py:1136] Latency Confidence Level 50 (ms) = 55.79
INFO:tensorflow:Latency Confidence Level 90 (ms) = 57.03
I0326 01:25:43.146145 140630939256640 run_squad.py:1137] Latency Confidence Level 90 (ms) = 57.03
INFO:tensorflow:Latency Confidence Level 95 (ms) = 57.29
I0326 01:25:43.146225 140630939256640 run_squad.py:1138] Latency Confidence Level 95 (ms) = 57.29
INFO:tensorflow:Latency Confidence Level 99 (ms) = 58.62
I0326 01:25:43.146308 140630939256640 run_squad.py:1139] Latency Confidence Level 99 (ms) = 58.62
INFO:tensorflow:Latency Confidence Level 100 (ms) = 286.80
I0326 01:25:43.146387 140630939256640 run_squad.py:1140] Latency Confidence Level 100 (ms) = 286.80
INFO:tensorflow:Latency Average (ms) = 56.07
I0326 01:25:43.146471 140630939256640 run_squad.py:1141] Latency Average (ms) = 56.07
INFO:tensorflow:Throughput Average (sentences/sec) = 142.68
I0326 01:25:43.146564 140630939256640 run_squad.py:1142] Throughput Average (sentences/sec) = 142.68
INFO:tensorflow:-----------------------------
I0326 01:25:43.146645 140630939256640 run_squad.py:1143] -----------------------------
INFO:tensorflow:Writing predictions to: /results/tf_bert_finetuning_squad_base_fp16_gbs40_200326010711/predictions.json
I0326 01:25:43.146801 140630939256640 run_squad.py:431] Writing predictions to: /results/tf_bert_finetuning_squad_base_fp16_gbs40_200326010711/predictions.json
INFO:tensorflow:Writing nbest to: /results/tf_bert_finetuning_squad_base_fp16_gbs40_200326010711/nbest_predictions.json
I0326 01:25:43.146886 140630939256640 run_squad.py:432] Writing nbest to: /results/tf_bert_finetuning_squad_base_fp16_gbs40_200326010711/nbest_predictions.json
{"exact_match": 78.0321665089877, "f1": 86.34229152935384}


Note: part of your final score includes these results:

{"exact_match": 78.0321665089877, "f1": 86.34229152935384}

1.2.7  (Optional) Alternative method with Lambda Labs

Code Block
root@tessa002:/workspace# mkdir lambdal
root@tessa002:/workspace# cd lambdal
root@tessa002:/workspace/lambdal# git clone https://github.com/lambdal/bert
root@tessa002:/workspace/lambdal# cd bert
root@tessa002:/workspace/lambdal/bert# mpirun -np 4 -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib --allow-run-as-root python3 run_squad_hvd.py --vocab_file=/workspace/nvidia-examples/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/vocab.txt --bert_config_file=/workspace/nvidia-examples/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_config.json --init_checkpoint=/workspace/nvidia-examples/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_model.ckpt --do_train=True --train_file=/workspace/nvidia-examples/bert/data/download/squad/v1.1/train-v1.1.json --do_predict=True --predict_file=/workspace/nvidia-examples/bert/data/download/squad/v1.1/dev-v1.1.json --train_batch_size=12 --learning_rate=3e-5 --num_train_epochs=2.0 --max_seq_length=384 --doc_stride=128 --output_dir=/results/lambdal/squad1/squad_base/ --horovod=true

Look for output similar to the following:

Code Block
I0326 05:55:19.917063 140421161031488 run_squad_hvd.py:747] Writing predictions to: /results/lambdal/squad1/squad_base/predictions.json
INFO:tensorflow:Writing nbest to: /results/lambdal/squad1/squad_base/nbest_predictions.json
I0326 05:55:19.917179 140421161031488 run_squad_hvd.py:748] Writing nbest to: /results/lambdal/squad1/squad_base/nbest_predictions.json

...

Code Block
{
    "56ddde6b9a695914005b9628": [
        {
            "text": "Sundar Pichai",
            "probability": 0.6877274611974046,
            "start_logit": 7.016119003295898,
            "end_logit": 6.917689323425293
        },
        {
            "text": "Sundar Pichai was appointed CEO of Google, replacing Larry Page",
            "probability": 0.27466839794889614,
            "start_logit": 7.016119003295898,
            "end_logit": 5.999861240386963
        },
        {
            "text": "Larry Page",
            "probability": 0.02874494871571203,
            "start_logit": 4.759016513824463,
            "end_logit": 5.999861240386963
        },

Note: part of your final score is based on inference over unseen data, which will be provided by the judges on the day of the challenge.

Scores will be derived from the nbest_predictions.json output for each question on the context.
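A short sketch of how the top answer and its probability can be pulled out of nbest_predictions.json per question. It assumes the id-to-candidate-list layout shown in the excerpt above, with a trimmed two-candidate sample inlined in place of the real file:

```python
import json

# nbest_predictions.json maps each question id to a best-first list of
# candidate answers (layout assumed from the excerpt above; sample trimmed).
nbest_json = """
{
    "56ddde6b9a695914005b9628": [
        {"text": "Sundar Pichai", "probability": 0.6877274611974046,
         "start_logit": 7.016119003295898, "end_logit": 6.917689323425293},
        {"text": "Larry Page", "probability": 0.02874494871571203,
         "start_logit": 4.759016513824463, "end_logit": 5.999861240386963}
    ]
}
"""

nbest = json.loads(nbest_json)
for qid, candidates in nbest.items():
    # Take the candidate with the highest probability for each question.
    top = max(candidates, key=lambda c: c["probability"])
    print(f"{qid}: {top['text']} (p={top['probability']:.3f})")
# → 56ddde6b9a695914005b9628: Sundar Pichai (p=0.688)
```

Against the real output, replace the inlined string with `json.load(open(".../nbest_predictions.json"))`.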

1.3  Challenge Limitation

  • Must stick to pre-defined model (BERT-Base, Uncased)

  • Teams can locally cache (on SSD) starting model weights and dataset

  • HuggingFace implementation (TensorFlow/PyTorch) is the official standard. Use of other implementations, or modifications to the official one, is subject to approval.

  • Teams are allowed to explore different optimizers (SGD/Adam etc.) or learning rate schedules, or any other techniques that do not modify model architecture.

  • Entire model must be fine-tuned (cannot freeze layers)

  • You must provide all scripts and methodology used to achieve results

1.4  Teams must produce

  • Training scripts with their full training routine and command lines and output

  • Evaluation-only script for verification of result. Final evaluation is on a fixed sequence length (128 tokens).

  • Final model ckpt and inference files

  • Team’s training scripts and methodology, command line and logs of runs

  • run_squad.py predictions.json and nbest_predictions.json

...


1.5  Final Scoring

The judges will score with the standard evaluate-v1.1.py from SQuAD 1.1, e.g.:

 {"exact_match": 81.01229895931883, "f1": 88.61239393038589}

Final scores come from unseen data of multiple questions; predictions from file, using the standard run_squad.py.
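The judging metrics follow the standard SQuAD evaluation. The following is a minimal re-implementation sketch of the exact-match and token-level F1 logic, simplified from evaluate-v1.1.py (the real script additionally takes the maximum over multiple ground-truth answers per question and averages over the whole dataset):

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match_score(prediction: str, ground_truth: str) -> bool:
    """Exact match after normalization."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match_score("The Sundar Pichai", "Sundar Pichai"))       # → True
print(round(f1_score("Sundar Pichai, CEO", "Sundar Pichai"), 2))     # → 0.8
```

Note how normalization makes "The Sundar Pichai" an exact match, while a prediction with extra tokens still earns partial F1 credit.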

1.6  UPDATES (June 8, 2020)

In past discussions we had questions on training BERT from scratch; this is beyond the scope of this competition and is not allowed. You will need to use the BERT-BASE model file as outlined in section 1.2.3 of the guidelines.

We allow changing/modifying the output layer and the addition of extra layers.
We allow ensemble techniques.

We must disallow integration of dev-set data into the training dataset; the SQuAD 1.1 datasets must remain unchanged and unaugmented.

We must disallow integrating additional external data into the training dataset for this competition, because there is not enough time to verify that dev-set data is not inadvertently part of such an acquired dataset augmentation.

We allow any hyperparameters, e.g. learning rate, optimizer, drop-out, etc.
We will also allow setting the random seed. This will reduce the variance between training runs.
The F1 score will be used as the score for team ranking.

Teams should submit their best 5 runs; please upload your runs in separate folders containing ckpt files, logs, etc. We will average the top 3 of the 5 F1 scores for your final score.
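The top-3-of-5 averaging can be sketched as follows (the five run scores here are hypothetical, for illustration only):

```python
# Hypothetical F1 scores from a team's 5 submitted runs
run_f1_scores = [88.61, 88.02, 87.95, 88.40, 87.10]

# Keep the best 3 runs and average them for the final score.
top3 = sorted(run_f1_scores, reverse=True)[:3]
final_score = sum(top3) / len(top3)
print(round(final_score, 2))  # → 88.34
```

Here the three best runs (88.61, 88.40, 88.02) average to 88.34, which would be the team's final score.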

We will use the F1 as the quality metric for scoring/ranking. We will not round the score computed by evaluate-v1.1.py.

The judges will score with the standard evaluate-v1.1.py from SQuAD 1.1 as outlined in section 1.

 i.e. {"exact_match": 81.01229895931883, "f1": 88.61239393038589}

Final scores come from unseen data of multiple questions; predictions from file, using the standard run_squad.py as outlined in section 1.2.5 of the SQuAD 1.1 with TensorFlow BERT-BASE Guidelines.

We will use the probability score on unseen inference data (to be provided as test_input.json no later than June 10th) as a secondary ranking in the event of a tie in the average F1 scoring of your top training runs.