Table of Contents
  • PLEASE SEE UPDATES IN SECTION 1.6

Introduction

Language understanding remains an open challenge and is one of the most relevant and influential areas across industries.

...

  • Training scripts with their full training routine and command lines and output

  • Evaluation-only script for verifying results. Final evaluation is on a fixed sequence length (128 tokens).

  • Final model ckpt and inference files

  • Team’s training scripts and methodology, command line and logs of runs

  • run_squad.py predictions.json and nbest_predictions.json

...

Final scores from unseen data of multiple questions; prediction from file, using standard run_squad.py

1.6  UPDATES (June 8, 2020)

Earlier discussions raised questions about training BERT from scratch; this is beyond the scope of this competition and is not allowed. You must use the BERT-BASE model file as outlined in section 1.2.3 of the guidelines.

We allow changing or modifying the output layer and adding layers.
We allow ensemble techniques.

We must disallow integrating dev-set data into the training dataset; the SQuAD 1.1 datasets must remain unchanged and unaugmented.

We must also disallow integrating additional external data into the training dataset for this competition, because there is not enough time to verify that dev-set data is not inadvertently part of the acquired data.

We allow any hyper-parameter settings, e.g. learning rate, optimizer, dropout, etc.
We will also allow setting the random seed. This will reduce the variance between training runs.
The F1 score will be used as the score for team ranking.

Teams should submit their best 5 runs. Please upload your runs in separate folders containing the ckpt, logs, etc. The top 3 of the 5 F1 scores will be averaged to give your final score.
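
As an illustration of the averaging rule above, here is a minimal sketch. The function name final_score and the example F1 values are hypothetical and are not competition results; the sketch only assumes that each run yields one F1 value and that no rounding is applied.

```python
def final_score(f1_scores, top_k=3):
    """Average the top_k highest F1 scores, without rounding.

    f1_scores: one F1 value per submitted run (5 runs are expected).
    top_k: how many of the best runs to average (3 per the rules above).
    """
    if len(f1_scores) < top_k:
        raise ValueError("need at least %d runs, got %d" % (top_k, len(f1_scores)))
    best = sorted(f1_scores, reverse=True)[:top_k]
    return sum(best) / top_k


# Hypothetical F1 values for five runs (illustration only, not real results).
example_runs = [88.10, 88.45, 87.90, 88.30, 88.25]
example_final = final_score(example_runs)  # mean of 88.45, 88.30 and 88.25
```

The two weakest runs are simply dropped, so a single poor run does not lower the final score.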

We will use F1 as the quality metric for scoring and ranking. We will not round the score computed from the output of evaluate-v1.1.py.

The judges will score with the standard evaluate-v1.1.py from SQuAD 1.1, as outlined in section 1.5 of the SQuAD 1.1 with Tensorflow BERT-BASE Guidelines.
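
For readers unfamiliar with the metric, the sketch below shows a token-overlap F1 in the spirit of what the SQuAD 1.1 evaluation computes. It is an illustration only and not a replacement for evaluate-v1.1.py, which remains the sole scoring tool. The function names here are chosen for this example, and the normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) are stated as assumptions about how answers are compared.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, ground_truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, ground_truths):
    """Score a prediction against every reference answer; keep the best."""
    return max(token_f1(prediction, truth) for truth in ground_truths)
```

For example, token_f1("the cat sat", "cat sat down") has precision 1 and recall 2/3, giving an F1 of 0.8 for that single question.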

If teams tie on the average F1 score, we will use the probability score on unseen inference data, scored against your top training run, as a secondary ranking. The unseen data (test_input.json) will be provided no later than June 10th.
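
The ranking order described above can be sketched as a two-key sort. This is a hedged illustration: the TeamResult type, its field names, and the example values are hypothetical, and the sketch assumes that a higher probability score is better, which the rules here do not spell out.

```python
from typing import NamedTuple


class TeamResult(NamedTuple):
    name: str
    avg_f1: float            # unrounded average of the top 3 F1 scores
    probability_score: float  # secondary score on the unseen inference data


def rank_teams(results):
    """Sort by average F1 first, then by probability score to break ties.

    Assumption: higher is better for both values.
    """
    return sorted(
        results,
        key=lambda r: (r.avg_f1, r.probability_score),
        reverse=True,
    )


# Hypothetical teams (illustration only): A and B tie on average F1,
# so the secondary score decides their order.
example = [
    TeamResult("A", 88.0, 0.90),
    TeamResult("B", 88.0, 0.95),
    TeamResult("C", 87.5, 0.99),
]
ranking = [team.name for team in rank_teams(example)]
```

The secondary score only matters when the primary scores are exactly equal, which is why the F1 average is kept unrounded.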