## MRPC Model Training without using the Trainer Class in Hugging Face

### MRPC Data Set
The Microsoft Research Paraphrase Corpus (MRPC), released by Microsoft, is a paraphrase data set in which each example pairs two sentences and labels whether one is a paraphrase of the other. The label is 1 if the pair is a paraphrase and 0 if it is not.

### Install HuggingFace libraries to help us with the task
www.huggingface.co

In [2]:
!pip install datasets
!pip install evaluate
!pip install transformers
!pip install accelerate 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (213 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)

In [3]:
# LOAD MRPC data set 

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

Downloading and preparing dataset glue/mrpc to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

### Inspect the data set

In [4]:
num_examples = 10
for index in range(num_examples):
  example = raw_datasets["train"][index]  # Fetch the row once instead of three times
  print(f"sentence 1: {example['sentence1']} \nsentence 2: {example['sentence2']}\n{example['label']}\n\n\n")

sentence 1: Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . 
sentence 2: Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .
1



sentence 1: Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion . 
sentence 2: Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .
0



sentence 1: They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added . 
sentence 2: On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .
1



sentence 1: Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 . 
sentence 2: Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .
0



sentence 1: The stock rose $ 2.11 , or about 

In [5]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)




### Example of Tokenization

In [6]:
# WordPiece Tokenization that's used by BERT

sentence = "I love the Seahawks team. They are the bestest team! I am flabbergasterd at their comebacks"
tokens = tokenizer.tokenize(sentence)
print(tokens)

['i', 'love', 'the', 'seahawks', 'team', '.', 'they', 'are', 'the', 'best', '##est', 'team', '!', 'i', 'am', 'fl', '##ab', '##berg', '##aster', '##d', 'at', 'their', 'comeback', '##s']


In [7]:
print(tokenizer.convert_tokens_to_ids(tokens))

[1045, 2293, 1996, 21390, 2136, 1012, 2027, 2024, 1996, 2190, 4355, 2136, 999, 1045, 2572, 13109, 7875, 4059, 24268, 2094, 2012, 2037, 12845, 2015]
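
The `##` pieces in the output above (for example `'best', '##est'`) come from WordPiece's greedy longest-match-first rule: starting at the beginning of a word, take the longest prefix found in the vocabulary, then repeat on the remainder, marking continuation pieces with `##`. The following is a minimal sketch of that rule only. It assumes a tiny hypothetical vocabulary (`toy_vocab`) instead of BERT's real one of roughly 30,000 entries, and `wordpiece_tokenize` is an illustrative helper, not the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word (illustrative only)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen to reproduce two splits from the output above.
toy_vocab = {"best", "##est", "comeback", "##s", "team"}

print(wordpiece_tokenize("bestest", toy_vocab))    # ['best', '##est']
print(wordpiece_tokenize("comebacks", toy_vocab))  # ['comeback', '##s']
print(wordpiece_tokenize("zebra", toy_vocab))      # ['[UNK]']
```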


### Sub-word Tokens / WordPiece Tokenization

In [8]:
tokens = tokenizer.tokenize(raw_datasets["train"]["sentence1"][2])
print(tokens)
print(len(tokens))

['they', 'had', 'published', 'an', 'advertisement', 'on', 'the', 'internet', 'on', 'june', '10', ',', 'offering', 'the', 'cargo', 'for', 'sale', ',', 'he', 'added', '.']
21


### Convert Sub-word Tokens to Indices

In [63]:
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)
print(len(input_ids))

[2027, 2018, 2405, 2019, 15147, 2006, 1996, 4274, 2006, 2238, 2184, 1010, 5378, 1996, 6636, 2005, 5096, 1010, 2002, 2794, 1012]
21
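
Conceptually, converting tokens to indices is a dictionary lookup in the tokenizer's vocabulary, with a fallback id for tokens that are not in it. The sketch below assumes a hypothetical five-entry vocabulary whose ids are copied from the output above, plus id 100 for `[UNK]` as in `bert-base-uncased`. The standalone `convert_tokens_to_ids` function here is an illustration, not the tokenizer method itself.

```python
# Hypothetical miniature vocabulary; the real one is stored in the checkpoint's vocab.txt.
toy_vocab = {"[UNK]": 100, "they": 2027, "had": 2018, "published": 2405, ".": 1012}


def convert_tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its id, falling back to the unknown-token id."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


ids = convert_tokens_to_ids(["they", "had", "published", "zzz", "."], toy_vocab)
print(ids)  # [2027, 2018, 2405, 100, 1012]
```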


### Pair of sentences tokenization

In [9]:
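def tokenize_function_1(example):
    # tokenize() returns only the sub-word strings of both sentences: no [CLS]/[SEP], no ids
    return tokenizer.tokenize(example["sentence1"], example["sentence2"], truncation=True)

def tokenize_function_2(example):
    # Calling the tokenizer directly returns model-ready inputs: input_ids, token_type_ids, attention_mask
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)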

In [11]:
sentence_1 = raw_datasets["train"]["sentence1"][5] # Sentence 1 of example 5
sentence_2 = raw_datasets["train"]["sentence2"][5] # Sentence 2 of example 5
example = {"sentence1": sentence_1, "sentence2": sentence_2}
print(sentence_1)
print(sentence_2)
print(tokenize_function_1(example))

Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .
With the scandal hanging over Stewart 's company , revenue the first quarter of the year dropped 15 percent from the same period a year earlier .
['revenue', 'in', 'the', 'first', 'quarter', 'of', 'the', 'year', 'dropped', '15', 'percent', 'from', 'the', 'same', 'period', 'a', 'year', 'earlier', '.', 'with', 'the', 'scandal', 'hanging', 'over', 'stewart', "'", 's', 'company', ',', 'revenue', 'the', 'first', 'quarter', 'of', 'the', 'year', 'dropped', '15', 'percent', 'from', 'the', 'same', 'period', 'a', 'year', 'earlier', '.']


In [12]:
tokenized = tokenize_function_2(example)
print(tokenized)
# decode() only needs the input ids; token_type_ids and attention_mask are model inputs, not text
print(tokenizer.decode(tokenized["input_ids"]))

{'input_ids': [101, 6599, 1999, 1996, 2034, 4284, 1997, 1996, 2095, 3333, 2321, 3867, 2013, 1996, 2168, 2558, 1037, 2095, 3041, 1012, 102, 2007, 1996, 9446, 5689, 2058, 5954, 1005, 1055, 2194, 1010, 6599, 1996, 2034, 4284, 1997, 1996, 2095, 3333, 2321, 3867, 2013, 1996, 2168, 2558, 1037, 2095, 3041, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] revenue in the first quarter of the year dropped 15 percent from the same period a year earlier. [SEP] with the scandal hanging over stewart's company, revenue the first quarter of the year dropped 15 percent from the same period a year earlier. [SEP]
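
The three lists in the output above follow a fixed layout for a sentence pair: `[CLS]` (id 101), the first sentence, `[SEP]` (id 102), the second sentence, and a final `[SEP]`. `token_type_ids` is 0 for the first segment (including `[CLS]` and the first `[SEP]`) and 1 for the second, and `attention_mask` is 1 for every real token. A minimal sketch of that layout, assuming already-converted ids and no truncation; `build_pair_inputs` is a hypothetical helper, not part of the library:

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble BERT-style inputs for a pair of already-tokenized sentences."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS] + sentence A + first [SEP]; segment 1 covers sentence B + last [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)  # nothing is padding yet
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# First two ids of each sentence from the example above ("revenue in", "with the scandal").
encoded = build_pair_inputs([6599, 1999], [2007, 1996, 9446])
print(encoded["input_ids"])       # [101, 6599, 1999, 102, 2007, 1996, 9446, 102]
print(encoded["token_type_ids"])  # [0, 0, 0, 0, 1, 1, 1, 1]
```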


### Tokenizing Sentence Pairs with the BERT Tokenizer

In [13]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer) # Pads every example in a batch to the length of the longest one, so the batch forms a rectangular tensor




In [14]:
tokenized_datasets # tokenizer(data) => Applying tokenizer directly on data converts the sentences into input_ids like above. Pairs of sentences are tokenized jointly

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

### Prep Data for the Model and Use Data Loaders for Batches

In [69]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"]) # Remove columns the model does not accept
tokenized_datasets = tokenized_datasets.rename_column("label", "labels") # The model expects the target under the name "labels"
tokenized_datasets.set_format("torch") # Return PyTorch tensors
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

In [70]:
from torch.utils.data import DataLoader 

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=56, collate_fn = data_collator
)

eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=56, collate_fn = data_collator
)

In [71]:
for batch in train_dataloader:
  break 
{k: v.shape for k,v in batch.items()}

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'labels': torch.Size([56]),
 'input_ids': torch.Size([56, 83]),
 'token_type_ids': torch.Size([56, 83]),
 'attention_mask': torch.Size([56, 83])}
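
The second dimension (83 here) is the length of the longest example in this particular batch, so it varies from batch to batch. A minimal pure-Python sketch of this dynamic padding, assuming pad id 0 (BERT's `[PAD]`) and omitting `token_type_ids` and `labels` for brevity; `pad_batch` is a hypothetical stand-in for what the collator does, not its actual code:

```python
def pad_batch(examples, pad_token_id=0):
    """Pad every example to the longest sequence in this batch."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": []}
    for ex in examples:
        n_real = len(ex["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch


examples = [
    {"input_ids": [101, 2027, 102]},
    {"input_ids": [101, 2027, 2018, 2405, 102]},
]
padded = pad_batch(examples)
print(padded["input_ids"])       # [[101, 2027, 102, 0, 0], [101, 2027, 2018, 2405, 102]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```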

### Load Models, Optimizer for Fine-Tuning

Here we load a pre-trained BERT model and initialize it with its trained weights. `AutoModelForSequenceClassification` adds a classification head for paraphrase classification on top of BERT; that head is newly initialized (hence the warning below), and the whole model is then fine-tuned on the MRPC data set.

In [72]:
from transformers import AutoModelForSequenceClassification
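from torch.optim import AdamW # transformers.AdamW is deprecated; the PyTorch optimizer is the recommended replacement
from transformers import get_scheduler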

checkpoint = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
num_training_steps = num_epochs*len(train_dataloader) # How many gradient updates will the model make over the entire training?

lr_scheduler = get_scheduler(
    "linear",
    optimizer = optimizer,
    num_warmup_steps = 0,
    num_training_steps = num_training_steps,
)
print(num_training_steps)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

198
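
The 198 steps follow from the numbers already shown: 3,668 training examples in batches of 56 give 66 batches per epoch (the last one partial), and 3 epochs give 198 updates. With zero warmup steps, the `"linear"` schedule decays the learning rate from its initial value to 0 over those steps. The sketch below reproduces the arithmetic and the schedule in plain Python; `linear_lr` is a hypothetical helper written for illustration, not the library function.

```python
import math

num_train_examples = 3668
batch_size = 56
num_epochs = 3
base_lr = 5e-5

# The DataLoader keeps the final partial batch by default, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
num_training_steps = num_epochs * steps_per_epoch


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` updates: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


print(steps_per_epoch, num_training_steps)         # 66 198
print(linear_lr(0, base_lr, num_training_steps))    # 5e-05
print(linear_lr(99, base_lr, num_training_steps))   # 2.5e-05
print(linear_lr(198, base_lr, num_training_steps))  # 0.0
```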




### Sanity-Check the Model's Forward Pass on One Batch

In [73]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.6918, grad_fn=<NllLossBackward0>) torch.Size([56, 2])
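
A loss of about 0.69 is what one would expect before any fine-tuning: the classification head is newly initialized, so the two classes get roughly equal probability, and the cross-entropy of a 50/50 prediction is ln 2 ≈ 0.6931. A small self-contained check of that number, using a hand-written `cross_entropy` helper (an illustration, not the PyTorch loss):

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


# Equal logits mean the model assigns probability 0.5 to each class.
uniform_loss = cross_entropy([0.0, 0.0], label=1)
print(round(uniform_loss, 4))  # 0.6931
```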


### Training Loop

The loop below runs on a single device (one GPU if available, otherwise the CPU), so the extra GPUs of a multi-GPU premium runtime go unused.

In [74]:
import torch
from tqdm.auto import tqdm 

# GPU
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") # USE GPU if available else CPU
model.to(device)
device


# TRAINING
progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs): 
  for batch in train_dataloader:
    batch = {k: v.to(device) for k,v in batch.items()} # Move the batch to the same device as the model
    outputs = model(**batch) # Forward pass; the loss is included because the batch contains "labels"
    loss = outputs.loss # Loss comparing the predictions with the labels
    loss.backward() # Backpropagation: compute gradients for all parameters (no weights change yet)

    optimizer.step() # Update the weights using the gradients
    lr_scheduler.step() # Advance the learning-rate schedule
    optimizer.zero_grad() # Reset gradients to zero so they don't accumulate across steps
    progress_bar.update(1) # Advance the progress bar


  0%|          | 0/198 [00:00<?, ?it/s]
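
The order of `backward`, `step`, and `zero_grad` in the loop above matters because PyTorch adds each new gradient into the existing `.grad` buffer rather than overwriting it. The toy example below mimics that cycle in plain Python for a single weight. It is only an analogy under simplified assumptions (plain SGD on a squared error with made-up data), not the AdamW update used above.

```python
# Toy data generated from y = 3 * x; the "model" is a single weight w.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = 0.0
grad = 0.0  # plays the role of the parameter's .grad buffer
lr = 0.05

for epoch in range(200):
    for x, y in data:
        error = w * x - y          # forward pass
        grad += 2 * error * x      # "loss.backward()": adds into the buffer
        w -= lr * grad             # "optimizer.step()": uses whatever is in the buffer
        grad = 0.0                 # "optimizer.zero_grad()": clear before the next batch

print(round(w, 4))  # 3.0
```

Without the reset on the last line of the loop body, each step would also apply the gradients of all earlier batches.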

### Evaluation Loop

In [75]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.8504901960784313, 'f1': 0.891651865008881}
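
The two numbers reported for MRPC are accuracy (here 347 of the 408 validation pairs classified correctly) and the F1 score of the paraphrase class (label 1). A minimal sketch of how both are computed from predictions and references, on a made-up six-example input; `accuracy_and_f1` is a hypothetical helper, not the `evaluate` implementation:

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Accuracy over all examples and F1 for the positive class."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up predictions and labels, for illustration only.
preds = [1, 1, 0, 1, 0, 1]
refs = [1, 0, 0, 1, 1, 1]
scores = accuracy_and_f1(preds, refs)
print(scores)
```

In this toy input, four of six predictions are correct, and precision and recall for label 1 are both 3/4, so the F1 score is 0.75.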

### Using the Accelerate Library to Speed Up Training with Multiple GPUs

In [76]:
from accelerate import Accelerator
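from torch.optim import AdamW # transformers.AdamW is deprecated; the PyTorch optimizer is the recommended replacement
from transformers import AutoModelForSequenceClassification, get_scheduler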

accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)


  0%|          | 0/198 [00:00<?, ?it/s]