# MRPC Paraphrase Modeling with a BERT-based Sequence Classification Transformer
The task is to decide whether one sentence is a "paraphrase" of another sentence.

## MRPC Data Set

Microsoft released a paraphrasing data set in which each example pairs two sentences and records whether one is a paraphrase of the other. The label is 0 or 1: 1 means the pair is a paraphrase and 0 means it is not.

### Install HuggingFace libraries to help us with the task
www.huggingface.co

In [None]:
!pip install datasets
!pip install evaluate
!pip install transformers
!pip install accelerate 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (213 kB)

In [None]:
# Use MRPC data set 

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

Downloading and preparing dataset glue/mrpc to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

### Inspect the data set

In [None]:
num_examples = 10
for index in range(num_examples):
  example = raw_datasets["train"][index]  # one record: sentence1, sentence2, label, idx
  print(f"sentence 1: {example['sentence1']} \nsentence 2: {example['sentence2']}\n{example['label']}\n\n\n")

sentence 1: Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . 
sentence 2: Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .
1



sentence 1: Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion . 
sentence 2: Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .
0



sentence 1: They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added . 
sentence 2: On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .
1



sentence 1: Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 . 
sentence 2: Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .
0



sentence 1: The stock rose $ 2.11 , or about 

### Tokenizing the Datasets

Tokenizing converts the text into tokens, which are then mapped to integer IDs so that the model can consume them as tensors.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenize each sentence pair into one combined sequence; truncation=True caps it at the model's maximum length
def tokenize_function(example):
  return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


# Tokenize the data sets
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
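
`DataCollatorWithPadding` pads each batch to the length of its longest sequence when the batch is assembled, rather than padding the whole data set to one fixed length. The following is a minimal plain-Python sketch of that idea, not the library's implementation. `pad_batch` is a hypothetical helper, it returns lists rather than tensors, it skips `token_type_ids`, and it assumes a padding ID of 0 (the `pad_token_id` in the BERT config).

```python
def pad_batch(features, pad_token_id=0):
    """Pad every example in a batch to the longest sequence in that batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch


# Two made-up examples of different lengths
features = [
    {"input_ids": [101, 2027, 2018, 102]},
    {"input_ids": [101, 2027, 102]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch keeps short batches short, which usually saves computation compared with padding everything to the longest sentence in the data set.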




## Sub-word Tokens

In [None]:
tokens = tokenizer.tokenize(raw_datasets["train"]["sentence1"][2])
print(tokens)
print(len(tokens))

['they', 'had', 'published', 'an', 'advertisement', 'on', 'the', 'internet', 'on', 'june', '10', ',', 'offering', 'the', 'cargo', 'for', 'sale', ',', 'he', 'added', '.']
21
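
Every word in the sentence above happens to be in the vocabulary, so none of them is split. For words that are not, a WordPiece-style tokenizer falls back to greedy longest-match-first splitting and marks continuation pieces with `##`. Here is a minimal sketch of that procedure. The function `wordpiece_split` and the tiny vocabulary are made up for illustration; this is not the real `bert-base-uncased` vocabulary or the library's code.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lower-cased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example only
toy_vocab = {"token", "##ization", "##izer", "para", "##phrase", "the"}

print(wordpiece_split("tokenization", toy_vocab))
print(wordpiece_split("paraphrase", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```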


## Convert Sub-word Tokens to Indices

In [None]:
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)
print(len(input_ids))

[2027, 2018, 2405, 2019, 15147, 2006, 1996, 4274, 2006, 2238, 2184, 1010, 5378, 1996, 6636, 2005, 5096, 1010, 2002, 2794, 1012]
21
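
`convert_tokens_to_ids` only maps tokens to IDs. The `tokenizer(sentence1, sentence2)` call used in `tokenize_function` also wraps the pair as `[CLS] A [SEP] B [SEP]` and returns `token_type_ids` and an `attention_mask`. The sketch below mimics that layout in plain Python. `encode_pair` is a hypothetical helper, and the special-token IDs 101 (`[CLS]`) and 102 (`[SEP]`) are assumed for `bert-base-uncased`.

```python
CLS_ID, SEP_ID = 101, 102  # assumed special-token IDs for bert-base-uncased


def encode_pair(ids_a, ids_b):
    """Hypothetical helper that lays out a sentence pair as [CLS] A [SEP] B [SEP]."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS] + A + [SEP]; segment 1 covers B + [SEP]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Nothing is padded yet, so every position is a real token
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


ids_a = [2027, 2018, 2405]  # 'they', 'had', 'published' from the output above
ids_b = [2006, 2238, 2184]  # 'on', 'june', '10' from the output above
encoded = encode_pair(ids_a, ids_b)
print(encoded["input_ids"])
print(encoded["token_type_ids"])
```

The segment IDs are what let the model tell the first sentence from the second within the single combined sequence.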


## Load Hyper-parameters and load the model


In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer", per_device_train_batch_size=56) # "test-trainer" is the output directory where checkpoints are saved

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Set up the Trainer for the fine-tuning/training

In [None]:
from transformers import Trainer 

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator
)


In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx. If sentence1, sentence2, idx are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 56
  Total train batch size (w. parallel, distributed & accumulation) = 56
  Gradient Accumulation steps = 1
  Total optimization steps = 198
  Number of trainable parameters = 109483778
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=198, training_loss=0.3242632548014323, metrics={'train_runtime': 43.5622, 'train_samples_per_second': 252.605, 'train_steps_per_second': 4.545, 'total_flos': 468804289553040.0, 'train_loss': 0.3242632548014323, 'epoch': 3.0})

In [None]:
!ls 

sample_data  test-trainer


In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
print(predictions.metrics)

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, sentence2, sentence1. If idx, sentence2, sentence1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 408
  Batch size = 8


(408, 2) (408,)
{'test_loss': 0.368180513381958, 'test_runtime': 0.7868, 'test_samples_per_second': 518.581, 'test_steps_per_second': 64.823}
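
The prediction array has shape `(408, 2)`: one row per validation example and one logit per class. A small sketch of how such a row can be turned into class probabilities and a predicted label is given here. The logits are made-up values, and `softmax` and `predict_label` are illustrative helpers rather than part of the Trainer API.

```python
import math


def softmax(logits):
    """Turn a row of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the index of the largest logit (0 = not a paraphrase, 1 = paraphrase)."""
    return max(range(len(logits)), key=lambda i: logits[i])


example_logits = [[-1.2, 2.3], [0.8, -0.5]]  # made-up values, not model output
for row in example_logits:
    print(predict_label(row), [round(p, 3) for p in softmax(row)])
```

Softmax does not change which class has the highest score, so taking the argmax of the raw logits gives the same label.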


### Add Compute Metrics to the Trainer!

In [None]:
import evaluate
import numpy as np

def compute_metrics(eval_preds):
  # Load the metric that belongs to the GLUE MRPC task
  metrics = evaluate.load("glue", "mrpc")
  # eval_preds is a tuple of (logits, labels) for the evaluated examples
  logits, labels = eval_preds
  # The predicted class is the index of the largest logit
  predictions = np.argmax(logits, axis=-1)
  # The data set's own metric decides which scores to report
  return metrics.compute(predictions=predictions, references=labels)

In [None]:
training_args = TrainingArguments(
    "test-trainer",
    evaluation_strategy="epoch",  # evaluate on the validation set after every epoch
    per_device_train_batch_size=28,
    fp16=True,  # mixed-precision training (needs a GPU)
    num_train_epochs=5,
)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer( 
    model,
    training_args,
    train_dataset = tokenized_datasets["train"],
    eval_dataset = tokenized_datasets["validation"],
    data_collator = data_collator,
    compute_metrics = compute_metrics
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/0a6aa9128b6194f4f3c4db429b6cb4891cdb421b/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embeddi

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx. If sentence1, sentence2, idx are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3668
  Num Epochs = 5
  Instantaneous batch size per device = 28
  Total train batch size (w. parallel, distributed & accumulation) = 28
  Gradient Accumulation steps = 1
  Total optimization steps = 655
  Number of trainable parameters = 109483778


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.381436 | 0.830882 | 0.884422 |
| 2 | No log | 0.420805 | 0.828431 | 0.871324 |
| 3 | No log | 0.528647 | 0.857843 | 0.901024 |
| 4 | 0.254300 | 0.659423 | 0.855392 | 0.899145 |
| 5 | 0.254300 | 0.694873 | 0.860294 | 0.901213 |


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx. If sentence1, sentence2, idx are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8


Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argu

TrainOutput(global_step=655, training_loss=0.199941076031168, metrics={'train_runtime': 63.279, 'train_samples_per_second': 289.827, 'train_steps_per_second': 10.351, 'total_flos': 746610508465920.0, 'train_loss': 0.199941076031168, 'epoch': 5.0})
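
The "Total optimization steps" values in the two training logs (198 and 655) follow from the number of examples, the batch size, and the number of epochs. The sketch below reproduces that arithmetic. It assumes a single device, one gradient accumulation step, and a last partial batch that is kept, which matches what the logs report. `total_optimization_steps` is an illustrative helper, not a Trainer function.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Batches per epoch (rounded up) times the number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# First run: 3668 examples, batch size 56, 3 epochs
first_run = total_optimization_steps(3668, 56, 3)
# Second run: 3668 examples, batch size 28, 5 epochs
second_run = total_optimization_steps(3668, 28, 5)
print(first_run, second_run)
```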

In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
print(predictions.metrics)

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, sentence2, sentence1. If idx, sentence2, sentence1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 408
  Batch size = 8


(408, 2) (408,)
{'test_loss': 0.666273832321167, 'test_accuracy': 0.8602941176470589, 'test_f1': 0.9025641025641027, 'test_runtime': 1.6117, 'test_samples_per_second': 253.154, 'test_steps_per_second': 31.644}
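
The `test_accuracy` and `test_f1` entries come from the GLUE MRPC metric used in `compute_metrics`. To make clear what those two numbers measure, here is a plain-Python sketch of accuracy and of F1 for the positive class (label 1). `accuracy_and_f1` is a hypothetical helper and the predictions and labels are made up; the `evaluate` library's own implementation may differ in detail.

```python
def accuracy_and_f1(predictions, references):
    """Accuracy over all examples and F1 for the positive class (label 1)."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # true positives
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # false positives
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # false negatives
    accuracy = sum(1 for p, r in pairs if p == r) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}


preds = [1, 0, 1, 1, 0, 1]  # made-up predictions
refs = [1, 0, 0, 1, 1, 1]   # made-up labels
scores = accuracy_and_f1(preds, refs)
print(scores)
```

F1 is reported alongside accuracy because it balances precision and recall on the paraphrase class, so it stays informative when the two classes are not equally frequent.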


### Check the model files saved in the test-trainer directory

In [None]:
!ls 

sample_data  test-trainer


In [None]:
!ls test-trainer/checkpoint-500

config.json   pytorch_model.bin  scaler.pt     trainer_state.json
optimizer.pt  rng_state.pth	 scheduler.pt  training_args.bin


In [None]:
!cat test-trainer/checkpoint-500/config.json

{
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
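
The values in this config are enough to account for the "Number of trainable parameters" reported in the training logs. The sketch below adds up the weights layer by layer. It assumes the usual BERT layout (embeddings with a LayerNorm, identical encoder layers, a pooler, and one linear classification head) and `num_labels=2` as passed to `from_pretrained` earlier. `bert_classifier_param_count` is an illustrative helper, not a library function.

```python
def bert_classifier_param_count(vocab_size, hidden_size, num_hidden_layers,
                                intermediate_size, max_position_embeddings,
                                type_vocab_size, num_labels):
    """Count parameters of a BERT encoder plus a linear classification head."""
    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias

    layer_norm = 2 * hidden_size  # scale and shift vectors

    # Word, position and segment embeddings, followed by a LayerNorm
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size + layer_norm
    # Query, key, value and output projections, followed by a LayerNorm
    attention = 4 * linear(hidden_size, hidden_size) + layer_norm
    # Two-layer feed-forward block, followed by a LayerNorm
    feed_forward = (linear(hidden_size, intermediate_size)
                    + linear(intermediate_size, hidden_size) + layer_norm)
    encoder = num_hidden_layers * (attention + feed_forward)
    pooler = linear(hidden_size, hidden_size)
    classifier = linear(hidden_size, num_labels)
    return embeddings + encoder + pooler + classifier


# Values copied from the config.json above; num_labels=2 comes from the model-loading call
total = bert_classifier_param_count(
    vocab_size=30522, hidden_size=768, num_hidden_layers=12,
    intermediate_size=3072, max_position_embeddings=512,
    type_vocab_size=2, num_labels=2,
)
print(total)
```

With these values the sum is 109,483,778, the same figure the Trainer prints as the number of trainable parameters.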
