*** Update ***
We plan to move to BERT-LARGE to suit the V100 DGX system on the NSCC cluster. More details to follow.
Introduction
Language understanding is an ongoing challenge and one of the most relevant and influential areas across any industry.
Bidirectional Encoder Representations from Transformers (BERT) is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. When BERT was originally published it achieved state-of-the-art performance in eleven natural language understanding tasks.
BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.
1 SQuAD 1.1 with TensorFlow BERT-BASE
1.1 About the application and benchmarks
This guide is to be used as a starting point. It does not provide detailed guidance on optimizations and additional tuning. Please follow the guidelines in the Challenge Limitation section (1.3) of this document.
1.1.1 About BERT-BASE
BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library. The architecture of BERT is almost identical to the original Transformer. A good reference guide for its implementation is “The Annotated Transformer.”
The developer team denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. The team primarily report results on two model sizes:
· BERT-BASE (L=12, H=768, A=12, Total Parameters=110M) and
· BERT-LARGE (L=24, H=1024, A=16, Total Parameters=340M)
BERT-BASE contains 110M parameters and BERT-LARGE contains 340M parameters.
For the purposes of this challenge, we will be using BERT-BASE.
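The parameter totals quoted above can be roughly reproduced from L and H alone. The sketch below is an illustration, not part of the BERT code base: the function name is hypothetical, and the vocabulary size (30522), maximum position count (512), segment vocabulary size (2) and feed-forward size (4 x H) are assumptions taken from the uncased models' usual configuration. The number of attention heads A does not change the count, because the heads only partition the hidden size.

```python
def bert_parameter_count(num_layers, hidden_size, vocab_size=30522,
                         max_positions=512, type_vocab_size=2):
    """Approximate trainable parameters of a BERT encoder with its pooler.

    Default sizes are assumptions matching the uncased BERT configuration.
    """
    h = hidden_size
    ffn = 4 * h          # assumed intermediate (feed-forward) size
    layer_norm = 2 * h   # scale and bias vectors

    # Token, position and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab_size) * h + layer_norm
    # Query, key, value and output projections (weights plus biases).
    attention = 4 * (h * h + h)
    # Two dense layers: H -> 4H and 4H -> H (weights plus biases).
    feed_forward = (h * ffn + ffn) + (ffn * h + h)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    # Pooler: one dense layer applied to the first token.
    pooler = h * h + h
    return embeddings + num_layers * per_layer + pooler


if __name__ == "__main__":
    print("BERT-BASE :", bert_parameter_count(12, 768))
    print("BERT-LARGE:", bert_parameter_count(24, 1024))
```

Under these assumptions the totals come out close to the rounded 110M and 340M figures.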
1.1.2 About SQuAD 1.1
The Stanford Question Answering Dataset (SQuAD) is a popular question answering benchmark dataset. BERT (at the time of the release) obtains state-of-the-art results on SQuAD with almost no task-specific network architecture modifications or data augmentation. However, it does require semi-complex data pre-processing and post-processing to deal with (a) the variable-length nature of SQuAD context paragraphs, and (b) the character-level answer annotations which are used for SQuAD training. This processing is implemented and documented in run_squad.py.
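To illustrate point (a), a long context paragraph is typically cut into overlapping windows so that each window fits the maximum sequence length. The following is a minimal sketch of that idea, assuming the window logic resembles the one in run_squad.py; the function name is hypothetical, and the reservation of 3 positions for special tokens is an assumption. The defaults mirror the max_seq_length=384 and doc_stride=128 values used later in this guide.

```python
def split_into_windows(num_doc_tokens, num_query_tokens,
                       max_seq_length=384, doc_stride=128):
    """Return (start, length) spans covering a tokenized context paragraph."""
    # Assumption: 3 positions are reserved for special tokens around the
    # question and the context window.
    max_tokens_for_doc = max_seq_length - num_query_tokens - 3
    if max_tokens_for_doc <= 0:
        raise ValueError("question leaves no room for context tokens")

    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reached the end of the paragraph
        # Advance by the stride so consecutive windows overlap.
        start += min(length, doc_stride)
    return spans


if __name__ == "__main__":
    for span in split_into_windows(num_doc_tokens=1000, num_query_tokens=10):
        print(span)
```

A smaller doc_stride produces more windows per paragraph, which increases both training and inference cost.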
1.2 Running SQuAD 1.1 fine tuning and inference
1.2.1 Using Docker and NVIDIA Docker Image
docker pull nvcr.io/nvidia/tensorflow:20.02-tf1-py3
docker images
REPOSITORY                  TAG             IMAGE ID       CREATED       SIZE
nvcr.io/nvidia/tensorflow   20.02-tf1-py3   0c7b70421b78   7 weeks ago   9.49GB
Example of how to run the container:
Usage: docker run [OPTIONS] IMAGE [COMMAND] [ARG...]
docker run -it --net=host -v bigdata:/bigdata 0c7b70421b78
1.2.2 Download the benchmark codes
Note: if you are using the docker container above, you already have the code and examples in /workspace/nvidia-examples/bert/ and can skip this step.
The NVIDIA BERT code is a publicly available implementation of BERT. Its fine-tuning code uses Horovod to implement efficient multi-GPU training with NCCL.
[~]# git clone https://github.com/NVIDIA/DeepLearningExamples.git
You may use other implementations, optimize and tune; but you must use the BERT-Base uncased pre-trained model for the purposes of this challenge.
Some other examples include:
https://github.com/lambdal/bert - This is a fork of the original (Google's) BERT implementation, with added Multi-GPU support with Horovod.
1.2.3 Download BERT-BASE model file
The BERT-BASE, Uncased model has 12 layers, a hidden size of 768, 12 attention heads and 110M parameters. Its download link can be found at https://github.com/google-research/bert
We will create the directories and download the model to:
/workspace/nvidia-examples/bert/data/download/google_pretrained_weights
root@tessa002:/workspace/nvidia-examples/bert/data# mkdir -p download/google_pretrained_weights
root@tessa002:/workspace/nvidia-examples/bert/data/download# cd download/google_pretrained_weights/
root@tessa002:/workspace/nvidia-examples/bert/data/download/google_pretrained_weights# wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
root@tessa002:/workspace/nvidia-examples/bert/data/download/google_pretrained_weights# unzip uncased_L-12_H-768_A-12.zip
Archive: uncased_L-12_H-768_A-12.zip
 creating: uncased_L-12_H-768_A-12/
 inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.meta
 inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001
 inflating: uncased_L-12_H-768_A-12/vocab.txt
 inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.index
 inflating: uncased_L-12_H-768_A-12/bert_config.json
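Before starting a long fine-tuning run, it may be worth confirming that the archive unpacked completely. The sketch below is a hypothetical helper, not part of any BERT repository. It checks for the five file names shown in the unzip listing above, and the directory path in the usage section is the one used in this guide.

```python
import os

# File names taken from the unzip listing of uncased_L-12_H-768_A-12.zip.
EXPECTED_FILES = (
    "bert_config.json",
    "vocab.txt",
    "bert_model.ckpt.index",
    "bert_model.ckpt.meta",
    "bert_model.ckpt.data-00000-of-00001",
)


def missing_model_files(model_dir):
    """Return the expected files that are not present in model_dir."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(model_dir, name))]


if __name__ == "__main__":
    model_dir = ("/workspace/nvidia-examples/bert/data/download/"
                 "google_pretrained_weights/uncased_L-12_H-768_A-12")
    missing = missing_model_files(model_dir)
    print("all files present" if not missing else "missing: %s" % missing)
```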
1.2.4 Download the SQuAD 1.1 dataset
To run on SQuAD, you will first need to download the dataset. The SQuAD website no longer appears to link to the v1.1 datasets, but the necessary files (train-v1.1.json, dev-v1.1.json and evaluate-v1.1.py) can be fetched with the commands below.
We will download these to: /workspace/nvidia-examples/bert/data/download/squad/v1.1
root@tessa002:/workspace/nvidia-examples/bert/data/download# mkdir -p squad/v1.1
root@tessa002:/workspace/nvidia-examples/bert/data/download# cd squad/v1.1
root@tessa002:/workspace/nvidia-examples/bert/data/download/squad/v1.1# wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
root@tessa002:/workspace/nvidia-examples/bert/data/download/squad/v1.1# wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
root@tessa002:/workspace/nvidia-examples/bert/data/download/squad/v1.1# wget https://github.com/allenai/bi-att-flow/archive/master.zip
root@tessa002:/workspace/nvidia-examples/bert/data/download/squad/v1.1# unzip master.zip
root@tessa002:/workspace/nvidia-examples/bert/data/download/squad/v1.1# cp bi-att-flow-master/squad/evaluate-v1.1.py .
root@tessa002:/workspace/nvidia-examples/bert/data/download/squad/v1.1# cd /workspace/nvidia-examples/bert
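A quick way to sanity-check the downloaded files is to load them and count what they contain. The sketch below assumes the SQuAD v1.1 layout of data -> paragraphs -> qas; the helper names are hypothetical and the path is the one used in this guide.

```python
import json


def load_squad(path):
    """Load a SQuAD-style JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def squad_counts(dataset):
    """Return (articles, paragraphs, questions) for a SQuAD-style dict."""
    articles = dataset["data"]
    paragraphs = [p for article in articles for p in article["paragraphs"]]
    questions = [qa for paragraph in paragraphs for qa in paragraph["qas"]]
    return len(articles), len(paragraphs), len(questions)


if __name__ == "__main__":
    base = "/workspace/nvidia-examples/bert/data/download/squad/v1.1/"
    for name in ("train-v1.1.json", "dev-v1.1.json"):
        print(name, squad_counts(load_squad(base + name)))
```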
1.2.5 Start fine tuning
BERT representations can be fine-tuned with just one additional output layer for a state-of-the-art Question Answering system. From within the container, you can use the following script to run fine-tuning for SQuAD.
Note : consider logging results with “2>&1 | tee $LOGFILE” for submissions to judges
For SQuAD 1.1 FP16 training with XLA using a DGX-1 with (8) V100 32G, run:
bash scripts/run_squad.sh 10 5e-6 fp16 true 8 384 128 base 1.1 data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_model.ckpt 1.1
For SQuAD 1.1 FP16 training with XLA using (4) T4 16GB GPUs, run:
bash scripts/run_squad.sh 10 5e-6 fp16 true 4 384 128 base 1.1 data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_model.ckpt 1.1
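The two commands differ only in the GPU count, which changes the global batch size and therefore the number of optimizer steps. The sketch below is a rough estimate resting on several assumptions: that the first argument (10) is the per-GPU batch size, that the final argument (1.1) is the number of epochs, that the step count is computed as int(examples / global batch size x epochs), and that the SQuAD v1.1 training set holds 87,599 questions. The function names are hypothetical.

```python
# Assumption: number of question-answer examples in train-v1.1.json.
SQUAD_V11_TRAIN_EXAMPLES = 87599


def global_batch_size(per_gpu_batch_size, num_gpus):
    """Examples consumed per optimizer step across all GPUs."""
    return per_gpu_batch_size * num_gpus


def estimated_train_steps(num_examples, global_bs, epochs):
    """Assumed step formula: truncate examples / batch size * epochs."""
    return int(num_examples / global_bs * epochs)


if __name__ == "__main__":
    for gpus in (4, 8):
        gbs = global_batch_size(10, gpus)
        steps = estimated_train_steps(SQUAD_V11_TRAIN_EXAMPLES, gbs, 1.1)
        print("GPUs=%d global batch=%d steps=%d" % (gpus, gbs, steps))
```

For the 4-GPU command this gives a global batch size of 40 and 2408 steps, which appears consistent with the gbs40 results directory and the model.ckpt-2408 checkpoint that show up later in this guide.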
1.2.6 Verify results
INFO:tensorflow:-----------------------------
I0326 01:25:43.144953 140630939256640 run_squad.py:1127] -----------------------------
INFO:tensorflow:Total Inference Time = 88.62 for Sentences = 10840
I0326 01:25:43.145423 140630939256640 run_squad.py:1129] Total Inference Time = 88.62 for Sentences = 10840
INFO:tensorflow:Total Inference Time W/O Overhead = 75.86 for Sentences = 10824
I0326 01:25:43.145554 140630939256640 run_squad.py:1131] Total Inference Time W/O Overhead = 75.86 for Sentences = 10824
INFO:tensorflow:Summary Inference Statistics
I0326 01:25:43.145649 140630939256640 run_squad.py:1132] Summary Inference Statistics
INFO:tensorflow:Batch size = 8
I0326 01:25:43.145738 140630939256640 run_squad.py:1133] Batch size = 8
INFO:tensorflow:Sequence Length = 384
I0326 01:25:43.145867 140630939256640 run_squad.py:1134] Sequence Length = 384
INFO:tensorflow:Precision = fp16
I0326 01:25:43.145962 140630939256640 run_squad.py:1135] Precision = fp16
INFO:tensorflow:Latency Confidence Level 50 (ms) = 55.79
I0326 01:25:43.146052 140630939256640 run_squad.py:1136] Latency Confidence Level 50 (ms) = 55.79
INFO:tensorflow:Latency Confidence Level 90 (ms) = 57.03
I0326 01:25:43.146145 140630939256640 run_squad.py:1137] Latency Confidence Level 90 (ms) = 57.03
INFO:tensorflow:Latency Confidence Level 95 (ms) = 57.29
I0326 01:25:43.146225 140630939256640 run_squad.py:1138] Latency Confidence Level 95 (ms) = 57.29
INFO:tensorflow:Latency Confidence Level 99 (ms) = 58.62
I0326 01:25:43.146308 140630939256640 run_squad.py:1139] Latency Confidence Level 99 (ms) = 58.62
INFO:tensorflow:Latency Confidence Level 100 (ms) = 286.80
I0326 01:25:43.146387 140630939256640 run_squad.py:1140] Latency Confidence Level 100 (ms) = 286.80
INFO:tensorflow:Latency Average (ms) = 56.07
I0326 01:25:43.146471 140630939256640 run_squad.py:1141] Latency Average (ms) = 56.07
INFO:tensorflow:Throughput Average (sentences/sec) = 142.68
I0326 01:25:43.146564 140630939256640 run_squad.py:1142] Throughput Average (sentences/sec) = 142.68
INFO:tensorflow:-----------------------------
I0326 01:25:43.146645 140630939256640 run_squad.py:1143] -----------------------------
INFO:tensorflow:Writing predictions to: /results/tf_bert_finetuning_squad_base_fp16_gbs40_200326010711/predictions.json
I0326 01:25:43.146801 140630939256640 run_squad.py:431] Writing predictions to: /results/tf_bert_finetuning_squad_base_fp16_gbs40_200326010711/predictions.json
INFO:tensorflow:Writing nbest to: /results/tf_bert_finetuning_squad_base_fp16_gbs40_200326010711/nbest_predictions.json
I0326 01:25:43.146886 140630939256640 run_squad.py:432] Writing nbest to: /results/tf_bert_finetuning_squad_base_fp16_gbs40_200326010711/nbest_predictions.json
{"exact_match": 78.0321665089877, "f1": 86.34229152935384}
Note : part of your final score includes these results:
{"exact_match": 78.0321665089877, "f1": 86.34229152935384}
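The inference statistics in the log are related to each other, which gives a simple consistency check when comparing runs. The sketch below uses hypothetical helper names and assumes that the reported throughput is the batch size divided by the average batch latency. The same figure can be recovered from the "W/O Overhead" totals.

```python
def throughput_from_latency(batch_size, avg_latency_ms):
    """Sentences per second implied by the average latency of one batch."""
    return batch_size / (avg_latency_ms / 1000.0)


def throughput_from_totals(num_sentences, total_seconds):
    """Sentences per second implied by total sentences and total time."""
    return num_sentences / total_seconds


if __name__ == "__main__":
    # Values copied from the log above.
    print(round(throughput_from_latency(8, 56.07), 2))
    print(round(throughput_from_totals(10824, 75.86), 2))
```

Both calculations land on roughly 142.68 sentences/sec, the value reported as "Throughput Average".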
1.2.7 (Optional) Alternative method with Lambda Labs
root@tessa002:/workspace# mkdir lambdal
root@tessa002:/workspace# cd lambdal
root@tessa002:/workspace/lambdal# git clone https://github.com/lambdal/bert
root@tessa002:/workspace/lambdal# cd bert
root@tessa002:/workspace/lambdal/bert# mpirun -np 4 -H localhost:4 -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib --allow-run-as-root \
    python3 run_squad_hvd.py \
    --vocab_file=/workspace/nvidia-examples/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/vocab.txt \
    --bert_config_file=/workspace/nvidia-examples/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_config.json \
    --init_checkpoint=/workspace/nvidia-examples/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_model.ckpt \
    --do_train=True \
    --train_file=/workspace/nvidia-examples/bert/data/download/squad/v1.1/train-v1.1.json \
    --do_predict=True \
    --predict_file=/workspace/nvidia-examples/bert/data/download/squad/v1.1/dev-v1.1.json \
    --train_batch_size=12 \
    --learning_rate=3e-5 \
    --num_train_epochs=2.0 \
    --max_seq_length=384 \
    --doc_stride=128 \
    --output_dir=/results/lambdal/squad1/squad_base/ \
    --horovod=true
Look for output similar to the following:
I0326 05:55:19.917063 140421161031488 run_squad_hvd.py:747] Writing predictions to: /results/lambdal/squad1/squad_base/predictions.json
INFO:tensorflow:Writing nbest to: /results/lambdal/squad1/squad_base/nbest_predictions.json
I0326 05:55:19.917179 140421161031488 run_squad_hvd.py:748] Writing nbest to: /results/lambdal/squad1/squad_base/nbest_predictions.json
To check score:
root@tessa002:/workspace/lambdal/bert# python /workspace/nvidia-examples/bert/data/download/squad/v1.1/evaluate-v1.1.py /workspace/nvidia-examples/bert/data/download/squad/v1.1/dev-v1.1.json /results/lambdal/squad1/squad_base/predictions.json
{"exact_match": 78.1929990539262, "f1": 86.51319484763773}
Note : part of your final score includes these results:
{"exact_match": 78.1929990539262, "f1": 86.51319484763773}
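For readers unfamiliar with the two metrics, the sketch below shows how exact match and F1 are commonly computed for a single prediction against a single reference answer. It is a simplified illustration modeled on the SQuAD v1.1 evaluation approach, not a copy of evaluate-v1.1.py. The official script additionally takes the best score over all reference answers for a question and averages over the dataset, so use the official script for any reported numbers.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    """True when both answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def f1_score(prediction, ground_truth):
    """Token-level F1 between a prediction and one reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    overlap = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Sundar Pichai.", "sundar pichai"))
    print(f1_score("Larry Page and Sergey Brin", "Larry Page"))
```

F1 gives partial credit for an answer span that overlaps the reference, which is why it is consistently higher than exact match in the results above.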
1.2.8 Example predict Q&A on real data
An example of running Q&A prediction on real data is available at github.com/google-research/bert
Note : This is the method that judges will use to score unseen data
root@tessa002:/workspace/nvidia-examples/bert# cd /workspace
root@tessa002:/workspace# git clone https://github.com/google-research/bert.git
root@tessa002:/workspace# cd bert
1.2.9 Create a sample input file
Create a simple input file and save it as test_input.json in JSON format (note each "id", which is used to look up the answer later).
The vi editor should handle the JSON formatting; if pasting mangles the indentation, switch to paste mode (:set paste -> [paste text] -> :set nopaste):
{
  "version": "v1.1",
  "data": [
    {
      "title": "your_title",
      "paragraphs": [
        {
          "qas": [
            {
              "question": "Who is current CEO?",
              "id": "56ddde6b9a695914005b9628",
              "is_impossible": ""
            },
            {
              "question": "Who founded google?",
              "id": "56ddde6b9a695914005b9629",
              "is_impossible": ""
            },
            {
              "question": "when did IPO take place?",
              "id": "56ddde6b9a695914005b962a",
              "is_impossible": ""
            }
          ],
          "context": "Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock. They incorporated Google as a privately held company on September 4, 1998. An initial public offering (IPO) took place on August 19, 2004, and Google moved to its headquarters in Mountain View, California, nicknamed the Googleplex. In August 2015, Google announced plans to reorganize its various interests as a conglomerate called Alphabet Inc. Google is Alphabet's leading subsidiary and will continue to be the umbrella company for Alphabet's Internet interests. Sundar Pichai was appointed CEO of Google, replacing Larry Page who became the CEO of Alphabet."
        }
      ]
    }
  ]
}
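As an alternative to hand-editing, the same file can be generated with a short script, which avoids quoting and indentation mistakes. This is a minimal sketch: the function name and the generated "q0", "q1", ... ids are hypothetical, and the "is_impossible" field is omitted on the assumption that it is only read for SQuAD 2.0 style input.

```python
import json


def build_squad_input(title, context, questions, version="v1.1"):
    """Build a SQuAD-style dict holding one context and several questions."""
    qas = [{"id": "q%d" % index, "question": question}
           for index, question in enumerate(questions)]
    return {
        "version": version,
        "data": [{
            "title": title,
            "paragraphs": [{"context": context, "qas": qas}],
        }],
    }


if __name__ == "__main__":
    payload = build_squad_input(
        title="your_title",
        context="Google was founded in 1998 by Larry Page and Sergey Brin.",
        questions=["Who founded google?"],
    )
    with open("test_input.json", "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)
```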
1.2.10 Run run_squad.py
Run run_squad.py with --do_predict=True using the fine-tuned model checkpoint:
root@tessa002:/workspace/bert# python3 run_squad.py \
    --vocab_file=/workspace/nvidia-examples/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/vocab.txt \
    --bert_config_file=/workspace/nvidia-examples/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_config.json \
    --init_checkpoint=/results/tf_bert_finetuning_squad_base_fp16_gbs40_200326010711/model.ckpt-2408 \
    --do_train=False \
    --max_query_length=30 \
    --do_predict=True \
    --predict_file=test_input.json \
    --predict_batch_size=16 \
    --max_seq_length=384 \
    --doc_stride=128 \
    --output_dir=/results/squad1/squad_test/
Note: If you are using the alternative method from Lambda Labs, you will need to use that checkpoint:
root@tessa002:/workspace/lambdal/bert# python3 run_squad.py \
    --vocab_file=/workspace/nvidia-examples/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/vocab.txt \
    --bert_config_file=/workspace/nvidia-examples/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_config.json \
    --init_checkpoint=/results/lambdal/squad1/squad_base/model.ckpt-3649 \
    --do_train=False \
    --max_query_length=30 \
    --do_predict=True \
    --predict_file=test_input.json \
    --predict_batch_size=16 \
    --max_seq_length=384 \
    --doc_stride=128 \
    --output_dir=/results/lambdal/squad1/squad_test/
You should see output similar to the following:
I0326 02:11:40.096473 140685488179008 run_squad.py:1259] Processing example: 0
INFO:tensorflow:prediction_loop marked as finished
I0326 02:11:40.165820 140685488179008 error_handling.py:101] prediction_loop marked as finished
INFO:tensorflow:prediction_loop marked as finished
I0326 02:11:40.166095 140685488179008 error_handling.py:101] prediction_loop marked as finished
INFO:tensorflow:Writing predictions to: /results/squad1/squad_test/predictions.json
I0326 02:11:40.166555 140685488179008 run_squad.py:745] Writing predictions to: /results/squad1/squad_test/predictions.json
INFO:tensorflow:Writing nbest to: /results/squad1/squad_test/nbest_predictions.json
I0326 02:11:40.166669 140685488179008 run_squad.py:746] Writing nbest to: /results/squad1/squad_test/nbest_predictions.json
1.2.11 Check correctness in file : predictions.json
{
  "56ddde6b9a695914005b9628": "Sundar Pichai",
  "56ddde6b9a695914005b9629": "Larry Page and Sergey Brin",
  "56ddde6b9a695914005b9630": "September 4, 1998",
  "56ddde6b9a695914005b9631": "CEO",
  "56ddde6b9a695914005b9632": "Alphabet Inc"
}
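Because predictions.json is keyed by question id, reading it next to the input file makes the answers easier to check. The sketch below pairs each question text with its predicted answer. The function name is hypothetical, and a question whose id is absent from the predictions maps to None rather than raising an error.

```python
import json


def answers_by_question(input_data, predictions):
    """Map each question text in a SQuAD-style input to its predicted answer."""
    result = {}
    for article in input_data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # .get() yields None when the id has no prediction.
                result[qa["question"]] = predictions.get(qa["id"])
    return result


if __name__ == "__main__":
    with open("test_input.json", encoding="utf-8") as handle:
        input_data = json.load(handle)
    with open("/results/squad1/squad_test/predictions.json",
              encoding="utf-8") as handle:
        predictions = json.load(handle)
    for question, answer in answers_by_question(input_data, predictions).items():
        print("%s -> %s" % (question, answer))
```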
1.2.12 Check accuracy in file: nbest_predictions.json
{
  "56ddde6b9a695914005b9628": [
    {
      "text": "Sundar Pichai",
      "probability": 0.6877274611974046,
      "start_logit": 7.016119003295898,
      "end_logit": 6.917689323425293
    },
    {
      "text": "Sundar Pichai was appointed CEO of Google, replacing Larry Page",
      "probability": 0.27466839794889614,
      "start_logit": 7.016119003295898,
      "end_logit": 5.999861240386963
    },
    {
      "text": "Larry Page",
      "probability": 0.02874494871571203,
      "start_logit": 4.759016513824463,
      "end_logit": 5.999861240386963
    },
Note : Part of your final score is based on inference on unseen data, which will be provided by the judges on the day of the challenge.
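The "probability" field appears to be a softmax over the candidates' combined scores (start_logit + end_logit). The sketch below works under that assumption, with a hypothetical function name. Because the listing above is truncated, the absolute probabilities cannot be reproduced from three entries, but the ratios between candidates can be checked.

```python
import math


def nbest_probabilities(entries):
    """Softmax over start_logit + end_logit for a list of n-best candidates."""
    scores = [entry["start_logit"] + entry["end_logit"] for entry in entries]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


if __name__ == "__main__":
    # Logits copied from the first two candidates in the listing above.
    candidates = [
        {"start_logit": 7.016119003295898, "end_logit": 6.917689323425293},
        {"start_logit": 7.016119003295898, "end_logit": 5.999861240386963},
    ]
    probs = nbest_probabilities(candidates)
    print("ratio from logits      :", probs[0] / probs[1])
    print("ratio from nbest file  :", 0.6877274611974046 / 0.27466839794889614)
```

A large gap between the first and second probability suggests the model is confident in its top answer, while near-equal values suggest ambiguous spans.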
Scores will be derived from the nbest_predictions.json output for each question on the context.
1.3 Challenge Limitation
Must stick to pre-defined model (BERT-Base, Uncased)
Teams can locally cache (on SSD) starting model weights and dataset
The HuggingFace implementation (TensorFlow/PyTorch) is the official standard. Use of any other implementation, or modification of the official one, is subject to approval.
Teams are allowed to explore different optimizers (SGD/Adam etc.) or learning rate schedules, or any other techniques that do not modify model architecture.
Teams are not allowed to modify any model hyperparameters or add additional layers.
Entire model must be fine-tuned (cannot freeze layers)
You must provide all scripts and methodology used to achieve results
1.4 Teams must produce
Training scripts with their full training routine and command lines and output
Evaluation-only script for verification of result. Final evaluation is on a fixed sequence length (128 tokens).
Final model ckpt and inference files
Team’s training scripts and methodology, command line and logs of runs
run_squad.py predictions.json and nbest_predictions.json
1.5 Final Scoring
The judges will score with the standard evaluate-v1.1.py from SQuAD 1.1
e.g. {"exact_match": 81.01229895931883, "f1": 88.61239393038589}
Final scores are computed on unseen data consisting of multiple questions; predictions are generated from a file using the standard run_squad.py